Acknowledgements
I would like to thank my supervisor Francis Palma for introducing me to the topic of sentiment analysis and for his time, feedback and guidance throughout this project. Secondly, I would like to thank Victor Heijler for his support and valuable comments, which helped me improve my report greatly. I would also like to thank my mother for always being by my side and supporting me during this journey. Finally, I would like to thank my friends and everyone who participated in the labelling process of my dataset.
Contents

1 Introduction
1.1 Background
1.2 Motivations
1.3 Problem Statement
1.4 Research Questions & Objectives
1.5 Contributions
1.6 Target groups
1.7 Report Structure
2 Background
2.1 Theoretical background
2.1.1 Traditional sentiment classification techniques
2.1.2 Text sentiment classification with deep learning
2.1.3 Recurrent Neural Networks
2.1.4 LSTM
2.1.5 Bidirectional RNN
2.1.6 Convolutional Neural Networks
2.2 Related work
2.2.1 Related work discussion
3 Method
3.1 Data collection
3.1.1 Video collection
3.1.2 Comment collection
3.2 Data cleanup
3.3 Data labelling
3.4 Data processing
3.4.1 Removal of nonessential data
3.4.2 Simplification of data
3.4.3 Word vector generation
3.5 Model implementation
3.6 Model evaluation
3.7 Reliability and Validity
3.7.1 Reliability
3.7.2 Construct validity
3.7.3 Internal validity
3.7.4 External validity
3.8 Ethical considerations
4 Implementation
4.1 Environment setup
4.2 Data collection and clean up
4.3 Dataset construction
4.3.1 Web application for comment labelling
4.3.2 Data processing
4.4 Implementation, training and evaluation of LSTM-based models
4.4.1 Data preparation
4.4.2 Hyperparameter tuning
4.4.3 Implementation of models and their training
4.4.4 Evaluation of model performance
5 Results
5.1 YouTube comment dataset
5.2 YouTube video statistics
5.3 Model performance
5.3.1 Cross-validation
5.3.2 Evaluation on testing subset
5.4 Performance evaluation on IMDB dataset
6 Discussion
6.1 Accuracy of sentiment prediction
6.2 LSTM-based model performance
6.3 LSTM-based model performance on IMDB dataset
6.4 Relationship between video sentiment and users’ preferences
7 Conclusions and Future Work
References
A Appendix 1
A.1 Video statistics and predicted sentiment
1 Introduction
Emotions and opinions play an essential role in human behaviour, as we often make decisions based on other people’s experiences and thoughts. Since everyone’s opinions are subjective and affected by their personal experiences and beliefs, learning about them helps to gain a broader perspective. Public opinion has long been of interest to organisations, companies and politicians. The collective opinion of people can help predict who will win an election and generate knowledge about future trends and topics. It also helps determine the general opinion about products and services, which helps marketing teams decide on a marketing strategy, improve an existing product or develop a new one, and improve customer support [1]. Therefore, recognising sentiment in various fields is of the essence.
Initially, people’s opinions were collected through surveys or questionnaires and reviewed manually. However, as internet usage increased, people started expressing their opinions online more frequently. The boom of social media platforms in recent years has allowed users to share all kinds of information and has introduced a wider variety of ways to express opinions. Blogs, discussion forums, reviews, comments and micro-blogging services, such as Twitter or Facebook, serve as rich data sources composed of audio files, videos, images and opinions.
One such platform, YouTube, is currently one of the most popular social media platforms in the world1. It allows its users to share their videos or view videos uploaded by other users and provide feedback. A user can support or reject a video by rating it positively or negatively, respectively. Users may also provide textual feedback in the form of comments posted to videos. Each video can hold thousands of comments, making YouTube a platform of interest for text classification and categorisation.
This project focuses on building a YouTube comment dataset, developing a sentiment classifier and performing sentiment analysis on the dataset. To help understand the fundamentals of this topic, the terms sentiment analysis, text classification and text sentiment classification are described in the next section.
1.1 Background
Sentiment analysis - sometimes referred to as opinion mining - is a natural language processing (NLP) technique used to determine the sentiment polarity of textual data. Over the years, it has become one of the most popular research areas in NLP [2]. It analyses people’s sentiment and attitude towards products, digital content, organisations, events, issues and more [2]. While large volumes of opinion data can provide an in-depth understanding of overall sentiment, they require a lot of time to process. Not only is it time-consuming and challenging to review large quantities of text, but some texts might also be long and complex, expressing reasoning for different sentiments, making it hard to grasp the overall sentiment quickly. Currently, YouTube allows comments of up to 10 000 characters in length, which is considerable and can result in long texts. Naturally, this amount of data calls for tools to ease and automate the process of classification and sentiment extraction from text.
1 https://www.alexa.com/topsites
Text classification is the process of categorising texts into groups. It is applied in various domains, such as the organisation of news articles, opinion mining of product reviews, spam filtering, and document organisation in digital libraries and social feeds, among others [3]. Text can be classified in different ways - by its topic, by whether it meets specific criteria, or by its sentiment [3].

Text sentiment classification is the process of determining the overall sentiment of a given text [4]. This problem can be framed as either binary classification or multi-label classification, where binary classification produces two outputs, positive and negative, and multi-label classification produces more than two outputs, for example very negative, negative, neutral, positive and very positive [4]. Common techniques for text sentiment classification use traditional machine learning algorithms or deep-learning-based approaches. Convolutional neural networks (CNN) and recurrent neural networks (RNN), such as Long Short-Term Memory (LSTM) and their bidirectional variants, like Bidirectional LSTM (BiLSTM), have been able to successfully perform text and sentiment classification tasks [5]. Section 2.1 describes these concepts in more detail.
1.2 Motivations
It is common to publicly post opinions on social media platforms such as Twitter, Facebook or YouTube, and these opinions can provide essential data on what people think about products and services. Furthermore, these opinions might influence other people’s opinions and affect their decisions on whether or not they would be interested in a particular product, organisation or topic. This user feedback might, in turn, inform about general trends such as popularity and quality.

YouTube is primarily a platform where users can upload their personal videos, but it is also a platform for companies to advertise their products, for news outlets to share information on current events, and more. YouTube offers a partner program for its users, where content creators can monetise their videos and earn advertisement revenue [6]. The amount of money a creator can earn is directly related to the amount of interest, in the form of views, ratings and comments, that they can generate with their content [7]. Furthermore, popular videos might gather important user feedback, such as comments, where automatic ways to discern sentiment polarity are desirable. The video authors’ dependence on their viewers’ approval raises the importance of this feedback, which also makes it attractive to study the connection between the various forms of user feedback.
the classifier with different text patterns of positive and negative sentiment, which can only be done by using comments from videos of a variety of categories. In addition, the semantics of languages differ, therefore it is unlikely that classifiers trained on other languages and in single domains would generalise to perform equally well on English YouTube video comments from multiple domains. Lastly, based on the research done, there seems to be no sentiment-labelled English YouTube comment dataset available publicly online.

This thesis addresses the gap in available research on YouTube video comment classification across a variety of video categories with various LSTM-based techniques, as well as the lack of a publicly available sentiment-labelled English YouTube comment dataset.
1.4 Research Questions & Objectives

RQ1 What is the highest accuracy that an LSTM-based model can achieve in predicting YouTube video sentiment?

RQ3 How accurate are the LSTM-based models with another sentiment dataset, such as the IMDB movie review dataset?
The four types of LSTM-based models are evaluated against the Internet Movie Database (IMDB) movie review dataset [8], which consists of 50 000 highly polar movie reviews.
RQ4 What is the relationship between the video sentiment and users’ preferences?

The viewers of a YouTube video can comment and can express their opinion by rating the video with either a like or a dislike. A comparison between the overall sentiment of the comments and the positive and negative video ratings is performed. In addition, the relationship between a video’s view count and its sentiment is examined.

In order to answer the research questions, the objectives listed in Table 1.1 are achieved.
O1 Construct an English YouTube comment dataset covering several domains.
O2 Construct a YouTube video statistics dataset.
O3 Implement four LSTM-based models: LSTM, BiLSTM, CNN-LSTM and CNN-BiLSTM.
O4 Evaluate the performance of the four LSTM-based models using accuracy, precision, recall, F1 score and confusion matrix.
O5 Analyse inconsistencies in different forms of user feedback.
The results of objective O1 are used to complete objectives O3 and O4, which are needed to answer research questions RQ1, RQ2 and RQ3. Objectives O1 and O2 have to be completed to achieve objective O5, which helps answer research question RQ4.
1.5 Contributions
This project produces four novel contributions. The following section (Section 1.6) discusses the importance of these contributions.
1.7 Report Structure
In this section, a brief background for the project was presented, research questions and objectives were defined, and the contributions and their use were described. The rest of the report describes the project in more depth. Background (Section 2) provides a more detailed description of the background and discusses related work. Method (Section 3) details the approach used to answer the research questions; each step is described in more detail, and the reasoning behind each decision is given. Implementation (Section 4) explains the more technical details of the method used and describes the environment setup for the whole project. Results (Section 5) contains the results of the data collection and model evaluations. Discussion (Section 6) presents a more in-depth analysis of the results and the dataset. Finally, Conclusions and Future Work (Section 7) summarises the findings of this project and presents suggestions for future work.
2 Background
This section begins by describing the theoretical background, where traditional and deep-learning sentiment techniques are covered (Section 2.1), followed by a review of related work (Section 2.2).
data to a higher level [15]. Deep learning is often used for pattern recognition, analysis and classification, and feature extraction [16]. Some of the most well-known deep learning architectures include feed-forward neural networks, convolutional neural networks and recurrent neural networks. CNNs, RNNs and their variants have achieved good results in various NLP tasks, especially sentiment classification [5]. Sections 2.1.3, 2.1.4 and 2.1.5 provide a more in-depth introduction to RNNs, while Section 2.1.6 covers CNNs.
Figure 2.1: An "unfolded" visual representation of an RNN over time, with inputs X1, X2, X3, hidden layers, and outputs Y1, Y2, Y3.
Since the input from the past is added to the memory after each input in a sequence is processed, RNNs do not require a fixed input length and can handle arbitrarily sized inputs without the model growing in size. In addition, the output length can vary as well, so a single output can be produced for multiple inputs. These qualities allow RNNs to predict the next word in a text when solving text-related problems.

While RNNs are, in theory, excellent for problems with sequential data, a regular RNN suffers from vanishing (too low) and exploding (too high) gradient problems [17, 18]. These problems, especially the vanishing gradient, are mitigated by introducing Long Short-Term Memory [18].
2.1.4 LSTM
LSTM counters the vanishing and exploding gradient and long-term dependency problems by introducing hidden units [18]. LSTM works similarly to RNN - information about the past input is used when processing the next input; however, each LSTM cell has gates that decide whether information should be kept or deleted from the memory [18]. Memory manipulation is based on how important that information is. LSTM has three information-controlling gates - the input, output and forget gates. In addition, a cell state Cn is used to carry the information [19]. The cell state is affected by vector operations, such as addition or multiplication [19]. Figure 2.2 visualises the LSTM network unfolded over time.
Figure 2.2: An "unfolded" visual representation of LSTM network over time. (1)
forget gate, (2) input gate and (3) output gate.
The forget gate decides which information should be disposed of. The sigmoid function determines this using the current input and the information from the previous input, and produces a number between 0 and 1, where 0 indicates that the information should be dropped and 1 that it should be kept [19].

The input gate determines which new information should be kept. The sigmoid function is responsible for updating the relevant values that describe the sequence, while the tanh function creates a vector of new candidate values that could be added to the memory [19]. The old cell state is then updated by multiplying it with the values generated by the forget gate and adding the values generated by the input gate. This results in an updated cell state, and the new information is carried to the upcoming iteration.

Lastly, the output gate determines the output for the given input. This is done by first applying the sigmoid function to the current input and the previous output [19]. The tanh function is then applied to the updated cell state and multiplied with the sigmoid function’s output [19].
2.1.5 Bidirectional RNN
At its core, an RNN preserves only information about the past and is therefore unidirectional. In the textual analysis domain, it is sometimes essential to look both forwards and backwards to understand the context and meaning. A bidirectional RNN (BiRNN) trains two RNN layers concurrently, where the first layer receives the given input and the second one receives the reversed input [20]. Such an approach makes use of both past and future information of the given data. Figure 2.3 shows a visual representation of the bidirectional RNN.
Figure 2.3: A visual representation of a bidirectional RNN, with forwards and backwards hidden layers whose outputs are combined for each time step.
BiLSTM works on the same concept as BiRNN but with an LSTM layer instead of a traditional RNN layer.
Figure 2.4: A convolutional layer. The dashed lines indicate that the kernel is applied several times, producing values in the feature map.
Day and Lin [28] used a bidirectional LSTM-based model to classify the sentiment of Google Play reviews. They used several dictionaries to calculate and generate the features - HowNet, iSGoPaSD and NTUSD. For the word embedding, CBOW and Skip-gram, proposed by Mikolov et al. [29], were used. The authors compared BiLSTM’s performance against that of the SVM and NB algorithms. BiLSTM outperformed the traditional machine learning approaches and reached 94% accuracy.
Zhou et al. [30] attempted to solve the sentiment classification problem by applying a combination of a convolutional neural network and LSTM, calling this model C-LSTM. The experiments were performed on the Stanford Sentiment Treebank, a dataset consisting of movie reviews. The authors used one-dimensional convolutions to extract n-gram feature maps. These feature maps are then rearranged as feature representations and fed into the LSTM layer. The C-LSTM model outperformed the SVM algorithm on both binary and fine-grained classification experiments.
Other works have also combined CNN and LSTM to solve the sentiment classification problem. Hassan and Mahmood [31] combined CNN and LSTM, feeding the model pre-trained word vectors. They merged convolutional and LSTM layers into a single model, which outperformed the NB and SVM algorithms and CNNs while reaching the highest binary classification accuracy of 88.3%.
Huang et al. [32] used word embeddings pre-trained on their own dataset and combined CNN and LSTM, where two LSTM layers were stacked on top of the CNN. Their proposed model outperformed standalone CNN, LSTM, a combination of CNN and single-layer LSTM, and SVM, with an accuracy of 87.2%.
Cunha et al. [33] built a convolutional neural network for sentiment classification of comments posted on Brazilian political videos on YouTube. They evaluated the performance of the proposed classifier by separating user feedback into several categories - what the user thinks about the content creator and the video, the user’s opinion on how well the topic is covered, and the user’s opinion on how relevant the video is. Their classifier classified the first category of comments best, reaching 84%.
An LSTM-based approach to sentiment classification has been studied on the previously mentioned IMDB data as well. Yenter and Verma [34] combined several CNN-LSTM kernels for sentiment detection. They did not use pre-trained word embeddings. Their multi-kernel CNN-LSTM model achieved 89.5% accuracy.

Mathapati et al. [35] analysed the performance of several deep learning techniques, including LSTM, CNN and CNN-LSTM. They used NB as a baseline algorithm, which was outperformed by all the deep learning techniques. The deep learning models were trained with pre-trained word vectors, and CNN-LSTM performed best, achieving 88.3% accuracy.
view data. Reviews tend to be written more formally and in proper language. This suggests that the sentiment classifiers proposed in the studies reviewed in Section 2.2 would not perform satisfactorily on the YouTube sentiment classification task using video comments. The importance of having comments from videos of different categories is mentioned by Cunha et al. [33] as well, where the authors suggest that future work should include comments from videos of several domains. This thesis project deals with diverse text data, including proper speech, internet slang, and discussions of various topics. In order to cover several video domains, videos of several categories are selected when building the YouTube comment dataset (see Section 3.1.1).
In addition, only one of the reviewed studies uses pre-trained word embeddings [32]; however, its word vectors are pre-trained on the authors’ own dataset. Pre-training word vectors on one’s own dataset poses a risk of the vectors not correctly representing a word based on its context, especially if the dataset is not large and the word appears only a few times. This thesis project uses pre-trained word vectors covering 1.2 million words, allowing better word context representation and more accurate predictions.

Lastly, this project uses deep learning model architectures similar to those in the studies reviewed in Section 2.2, such as LSTM, BiLSTM and CNN-LSTM. In addition, a CNN-BiLSTM architecture is evaluated.
3 Method
The following section describes the research methodology for data collection, data cleanup, data labelling and data processing, followed by an explanation of how the model implementation and evaluation are performed. The methodology is visualised in Figure 3.6. The section concludes by discussing the reliability, validity and ethical considerations of the chosen methods.
Figure 3.6: The methodology pipeline: (1) data collection produces the YouTube video dataset and the YouTube comment dataset; (2) data cleanup produces a clean YouTube comment dataset; (3) data labelling produces a labelled YouTube comment dataset; (4) data processing produces processed and labelled YouTube comments; (5) model implementation produces the LSTM-based models; and (6) model evaluation produces the model performance results. Each artifact is used for the step that follows.
3.1.1 Video collection
Before the YouTube video comments can be collected, a list of videos has to be defined. Since it is unclear whether the overall video comment sentiment differs from the video ratings or correlates with view count, the videos are collected based on upload date and a variety of categories. An initial list of videos is collected using mixed data sources - a public dataset of trending YouTube video statistics [36] and user-constructed playlists of popular videos. The base collection of videos is then refined using the following inclusion criteria:

• A video must have at least 100 000 views and at least 2000 comments
• The intent of the video can not be to ask a question of the audience
• The author of the video can not be encouraging the audience to rate the video negatively

A video must have comments and ratings enabled, be public, and have a certain number of views and comments to ensure that the video contains all the data needed. The date criterion ensures that the video is not actively undergoing changes in statistics (views, ratings) and limits the timespan of the collected videos in case there are changes to how ratings and views are counted on videos. Given that changes in YouTube’s platform implementation details, such as the algorithms used for view counting, are unknown, this measure attempts to minimise their effect on the data. To prevent data manipulation, each video is then reviewed to exclude videos that ask questions of the audience (e.g. "Respond in the comments below") or videos where the author encourages the audience to rate the video negatively [37]. Finally, it is ensured that the creator of the selected video is not recently deceased, as this could cause the most recent comments to discuss the death of the author rather than the contents of the video.

YouTube does not provide functionality to search videos by the year they were uploaded or by comment and view numbers, therefore the video collection is done manually.
The video selection process is crucial as it has a significant impact on the later construction of the YouTube comment dataset. The final YouTube comment dataset must have an equal number of positive and negative sentiment comments. If a significantly larger portion of either sentiment is present, the LSTM-based models might be biased towards one of the sentiments.

The selection of videos with the criteria mentioned above was finalised on February 02, 2021 and resulted in a collection of 49 videos. The selected videos cover the following categories:
• Entertainment
• People & Blogs
• Music
• Gaming
• Comedy
• Sports
The list of selected videos and their statistics can be found in this project’s GitHub repository2.
An example row of the collected comment data:

body: "You could roast so many chestnuts with this"
positive: 0, negative: 0, neutral: 0, rated: 0
comment_id: Ugxz_FdO-AA0mEz8FMJ4AaABAg
video_id: ItYOdWRo0JY
date: 2021-01-31T20:24:17Z
Figure 3.9: The comment labelling tool.
3.4.2 Simplification of data
Commonly, YouTube comments contain acronyms and emojis, which introduce inconsistencies into the data. To prevent this, a list of the most commonly used emojis4 and acronyms5 is collected, and these inconsistencies are automatically replaced with their full terms and simplified textual meanings. The list of used acronyms and emojis can be found in the project’s GitHub repository6.
According to Angiani et al., repeating vowels and consonants should also be removed [38]. Hence, if a vowel or consonant is repeated more than two times in a word, the occurrences of that letter are reduced to just two (for example, in the sentence "this is so coooooool!" the word "coooooool" has the vowel "o" repeated more than two times, so it is replaced by two occurrences of "o", making the sentence "this is so cool!").

In addition, Angiani et al. claim that ignoring negation words, or the variety of negation words, is one of the leading causes of misclassification in sentiment classification tasks [38]. The English language has several different negation words, such as "don’t", "won’t", "can’t" and "shouldn’t", and to make the text more consistent, any negation word is replaced with the word "not", as in the sketch below.
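A minimal sketch of these two simplification steps (the negation list shown is an illustrative subset, not the project’s full list):

    import re

    # Illustrative subset of English negation words; the project's full list may differ.
    NEGATIONS = {"don't", "won't", "can't", "shouldn't", "isn't", "didn't"}

    def simplify(text: str) -> str:
        # Reduce any letter repeated more than twice to exactly two occurrences,
        # e.g. "coooooool" -> "cool".
        text = re.sub(r"(\w)\1{2,}", r"\1\1", text)
        # Normalise negation words to the single token "not".
        return " ".join("not" if w.lower() in NEGATIONS else w for w in text.split())

    print(simplify("this is so coooooool!"))   # -> this is so cool!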
Moreover, text tokenisation is performed on the comment dataset. Text tokenisation is the process of turning text into a sequence of integers, where each integer represents a word. A dictionary of all unique words is generated, and the integers used in the tokenisation are the indices of these words in the dictionary. This results in a dataset of sequences of numbers, making it usable for deep learning model training.

Finally, the generated sequences of integers are padded. Padding ensures that all the sequences in the dataset are of the same length. The longest sequence is not padded, but every shorter sequence has one or several values inserted before it to match the longest sequence’s length. The first index of a word in the dataset vocabulary generated during tokenisation is 1, therefore 0 is used to pad the integer sequences. Figure 3.10 visualises this process.
Integer sequence: 1, 30  →  padded integer sequence: 0, 0, 0, 1, 30

Figure 3.10: Example of a tokenised text with padding applied.
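As a hedged illustration of both steps with the Keras utilities used in this project (see Section 4.3.2), with made-up example texts:

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    texts = ["this is so cool", "cool"]   # made-up example comments
    tokenizer = Tokenizer()               # word indices start at 1, leaving 0 for padding
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    padded = pad_sequences(sequences)     # default "pre" padding inserts 0s in front
    print(padded)                         # [[2 3 4 1], [0 0 0 1]]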
The labelled and processed dataset can be found in the GitHub repository7. In this dataset, a rating of 0 indicates negative sentiment, 1 positive and -1 neutral.
4 https://emojipedia.org/people/
5 https://www.netlingo.com/acronyms.php
6 https://github.com/indrer/ytcsc/blob/main/util/acronyms_smileys.py
7 https://github.com/indrer/ytcsc/blob/main/datasets/processed/processed_full.csv
3.4.3 Word vector generation
Word vectors (also known as word embeddings or a word embedding matrix) are distributed representations of words as vectors in a vector space. Such a representation allows words that have a different meaning in another context to maintain their meaning and helps with the detection of synonyms. In addition, it can be used to predict possible follow-up words. This project uses pre-trained GloVe8 word vectors trained on 2 billion Twitter tweets with a vocabulary of 1.2 million words [39]. Preferably, word vectors trained on YouTube comment data should be used, because the vocabulary of YouTube comments can differ from that of Twitter tweets; however, no such pre-trained word vectors exist yet, and the dataset compiled during this project does not contain enough data to train accurate word vectors.
the data. Three-fold cross-validation is chosen due to time limitations, as a larger number of folds takes longer to execute. Evaluation on unseen data, on the other hand, simulates real-world use, where the model is trained on the entire training and validation subset and predicts on unseen data. During cross-validation, the training and validation subsets are joined and randomised and then, in turn, split into three groups, where some data in each group is used for training and some for evaluation. This process is illustrated in Figure 3.11. The model evaluation on unseen data produces metrics that aid in evaluating each model’s performance, including accuracy, precision, recall, F1 score, confusion matrix and ROC curve. These metrics are chosen because they not only show the accuracy of a model but also provide different insights into how the model is performing, such as whether it is biased towards some sentiment, how accurate it is in guessing a specific sentiment, or how well it does in comparison to other models (the metrics are explained in more depth in Section 4.4.4). The goal of analysing a model’s performance on the testing data is to see how well it performs on unseen data and to assess whether it over-fits during the training process. Higher model performance on unseen data suggests a more reliable model when applied in real-life use cases.
Figure 3.11: The three-fold cross-validation process: in each iteration a different part of the data is held out for testing while the rest is used for training; the metrics of the three iterations are averaged.
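A minimal sketch of this loop with scikit-learn (`build_model`, `X` and `y` are placeholders for the project’s model factory and the joined training and validation data):

    import numpy as np
    from sklearn.model_selection import KFold

    kfold = KFold(n_splits=3, shuffle=True, random_state=42)   # fixed seed: same splits
    scores = []
    for train_idx, test_idx in kfold.split(X):
        model = build_model()                                  # fresh model per fold
        model.fit(X[train_idx], y[train_idx], epochs=30, verbose=0)
        loss, accuracy = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        scores.append(accuracy)        # assumes the model is compiled with an accuracy metric
    print("average accuracy:", np.mean(scores))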
3.7.1 Reliability
Reliability concerns how reproducible the approach of this project is and whether it will yield the same results as presented in the report. During this project, various steps were taken to ensure that the data and the results can be reproduced as closely to the described numbers as possible. However, there are some particular considerations that should be emphasised.

Attempting to compile YouTube comments and YouTube video statistics will most likely not result in exactly the same data as collected during this project. Some videos are still viewed and interacted with occasionally, therefore the comments, view count and rating count might differ. Further, the authors of the videos used in this project can make them unlisted or private, rendering them inaccessible.
Moreover, due to the nature of deep learning models, randomness plays a large part in the model training process, as deep neural networks are stochastic. Therefore, using the same architecture will not yield exactly the same results. The same random seed was used throughout all the experiments to reduce randomness; however, some randomness cannot be controlled. To address this, all the models’ weights and their implementations are saved in separate files. To reproduce the results achieved during this project, the models can be loaded and evaluated with the provided YouTube comment dataset, which should result in the accuracies described in the report.

Finally, following the data processing steps is essential for good model performance, therefore the same processing as described in this report should be applied to any new data to achieve similar or identical accuracy metrics.
3.7.4 External validity
External validity refers to whether the research generalises to the broader field or to other situations.

Concerns regarding the external validity of the project are primarily whether the finished models are applicable to YouTube videos and comments of different categories, different texts or other contexts, and with what accuracy. To address these concerns, the four LSTM-based models are evaluated on the IMDB dataset. As shown in the model evaluation (Section 6.3), the model performance is acceptable even for a dataset of movie reviews, which indicates that the models are generalisable to some extent.
4 Implementation
This section describes the technical details and implementation of the methods mentioned in Section 3. The implementation choices and their reasons are clarified. Further, the environment setup (Section 4.1), data collection and clean up (Section 4.2) and final YouTube dataset construction (Section 4.3) are explained, followed by details about the implementation of the four models used in this project (Section 4.4). All the code is available in a GitHub repository9.
1. Google API Client (1.12.5)10 - Data retrieval from YouTube Data API
The computations for model training and evaluation are done on the GPU (NVIDIA GeForce GTX 1660 Ti). PostgreSQL (version 12.6) is used to temporarily store video statistics and comments. NodeJS (version 14.15.3) and Nginx (version 1.18.0), together with ExpressJS (version 4.17.1), are used to develop and run a web tool for comment labelling.
a request to YouTube’s Data API is made for each video. Each request specifies which video resource properties17 should be retrieved. For this project, the properties of interest are snippet and statistics, which contain the information needed to construct the dataset - title, id, dislikeCount (negative ratings), likeCount (positive ratings), viewCount, commentCount and publishedAt. The video data of the 49 videos is then saved into the PostgreSQL database.

After the essential video information is retrieved, the video comments are collected. This is again done by making requests to the YouTube Data API. Each request is provided with a video ID, a comment page token, a specification of which comment properties to retrieve (snippet) and the order (newest first). Only the newest top-level comments are collected. A response provides only 100 comments, therefore 10 requests need to be made per video to collect 1000 comments. The comment page token ensures that no comments from the same page are fetched twice. The following information is stored for each comment - textOriginal (body of the comment), id, video_id and updatedAt (when the comment was posted).
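A hedged sketch of these requests using the Google API Client (API_KEY and VIDEO_ID are placeholders):

    from googleapiclient.discovery import build

    youtube = build("youtube", "v3", developerKey=API_KEY)

    # Video statistics: the snippet and statistics resource parts described above.
    video = youtube.videos().list(part="snippet,statistics", id=VIDEO_ID).execute()

    # Newest top-level comments, 100 per page; the page token advances the paging.
    comments, page_token = [], None
    for _ in range(10):  # 10 pages of 100 comments -> up to 1000 comments
        response = youtube.commentThreads().list(
            part="snippet", videoId=VIDEO_ID, maxResults=100,
            order="time", pageToken=page_token,
        ).execute()
        comments.extend(response["items"])
        page_token = response.get("nextPageToken")
        if page_token is None:
            break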
Before the comments are stored in the database, they are cleaned up. First, any duplicate comments are removed. Then, comments consisting solely of URLs or non-alphanumeric characters are identified using regular expressions and removed. Afterwards, a language identification model [40, 41], together with the fastText library18, is used to identify and remove non-English comments. Many YouTube comments contain misspelt words and slang, so the language identification model cannot remove all non-English comments; such comments are later removed manually during the comment labelling process.
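A hedged sketch of this filtering step (lid.176.bin is fastText’s published language identification model; the exact model file used in the project is an assumption):

    import fasttext

    # Load fastText's language identification model.
    model = fasttext.load_model("lid.176.bin")

    def is_english(comment: str) -> bool:
        # predict() returns a tuple of labels and their probabilities.
        labels, _ = model.predict(comment.replace("\n", " "))
        return labels[0] == "__label__en"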
Finally, the cleaned-up comments are saved into the PostgreSQL database. The database is made up of two tables - comment and video. The primary key of the video table is the video’s URL, while the primary key of the comment table is the comment’s ID. Each comment references the video it belongs to with a foreign key pointing to the video table. The entity-relationship diagram for the database can be seen in Figure 4.12.
Comment                        Video
comment_id  varchar (PK)       url           varchar (PK)
body        text               title         varchar
positive    int8               upvote        int8
negative    int8               downvote      int8
neutral     int8               views         int8
rated       int8               commentcount  int8
video_id    varchar (FK)       date          varchar
date        varchar

Figure 4.12: The entity-relationship diagram of the comment and video tables.
17 https://developers.google.com/youtube/v3/docs/videos/list
18 https://fasttext.cc/
4.3 Dataset construction
The YouTube comment dataset construction has two main steps: labelling the data
and processing the labelled data. This section describes the implementation of both
of the steps.
Figure 4.13: The architecture of the comment labelling tool: the client (web browser) connects through an Nginx reverse proxy to the web server, which communicates with the database.
Nginx19 is used as a reverse proxy for security, easier logging and an encrypted connection. Nginx handles the HTTP requests first and passes them to the web server. The web server is implemented with NodeJS and ExpressJS and is responsible for updating a comment’s sentiment in the database, fetching an unlabelled comment from the database, and displaying the comment to be labelled to the user.
to change the desired data. In addition, DataFrames allow the storage of different data types. The YouTube comments are processed using the natural language preprocessing methods mentioned in Section 3.4 before they are passed into the model. The positive, negative and neutral sentiment data columns are transformed into a single column, where positive sentiment is marked with the integer 1, negative with 0 and neutral with -1. The Natural Language Toolkit21 library is used to get a list of stopwords; then, the thirty-five most frequently occurring stopwords in the comment dataset are removed, as sketched below. The dataset is saved into a CSV file.
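A sketch of the stopword step (illustrative; `comments` stands in for the dataset as a list of token lists):

    from collections import Counter
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")
    stop_words = set(stopwords.words("english"))

    # Count stopword occurrences across all comments, then drop the 35 most frequent.
    counts = Counter(w for comment in comments for w in comment if w in stop_words)
    top_35 = {w for w, _ in counts.most_common(35)}
    comments = [[w for w in c if w not in top_35] for c in comments]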
The next step is to generate the word vectors. The pre-trained GloVe word vectors trained on Twitter data are used. They are loaded using the Gensim22 Python library. The word vector matrix is constructed by including only the words that occur in the comment dataset’s vocabulary.
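A hedged sketch of building such a matrix (the 100-dimensional GloVe variant and the `tokenizer` object are assumptions):

    import numpy as np
    import gensim.downloader as api

    # Load pre-trained GloVe Twitter vectors; the report does not state the
    # exact dimension used, so 100 is an assumption.
    glove = api.load("glove-twitter-100")

    # Build an embedding matrix covering only the dataset vocabulary;
    # `tokenizer` is assumed to be the fitted Keras Tokenizer from Section 3.4.
    vocab_size = len(tokenizer.word_index) + 1    # index 0 is reserved for padding
    embedding_matrix = np.zeros((vocab_size, glove.vector_size))
    for word, index in tokenizer.word_index.items():
        if word in glove:
            embedding_matrix[index] = glove[word]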
Lastly, the comments are tokenised using the Tokenizer provided by the Keras library; the process is described in more depth in Section 3.4. Three datasets of the processed and labelled comments are created - positive, negative and mixed, where the mixed dataset consists of positive, negative and neutral comments. All the datasets have the following columns: comment_id, video_id, rating, date and body. These datasets can be found in the GitHub repository, in the datasets/processed folder.
Figure 4.14: The model implementation process: each of the four LSTM-based model types is trained twice, once with pre-trained word vectors and once without.
Figure 4.15: The model evaluation process: each trained variant (with and without word vectors) is evaluated on the testing subset, producing accuracy, precision, recall, F1 score, confusion matrix and ROC curve.
Figure 4.14 illustrates the simplified process of how the models are implemented. Each of the four model types is trained on the YouTube comment training and validation subsets, with or without word vectors in the embedding layer. This produces a history of model accuracy and loss values.

Figure 4.15 illustrates the process of model evaluation. The testing subset of the comment dataset is used for model evaluation, resulting in several more metrics that help determine how a model performs on unseen data. The produced metrics are accuracy, precision, recall, F1 score, confusion matrix and ROC curve.

This section describes the implementation, training and evaluation of the four LSTM-based models.
Dataset          Number of comments
Training set     6041
Validation set   863
Testing set      1726
The same subsets are used for all experiments and hyperparameter tuning. The dataset subsets can be found in the GitHub repository23.
Figure 4.16: The architectures of four LSTM-based models. Green nodes are input
layers, white nodes are hidden layers and purple nodes are output layers.
All four models’ architectures are defined using Keras’ methods. Each model is
built as a Sequential model, which is a group of linearly stacked layers. All four
neural networks share the same input and output layers but are different in their
hidden layers. The architecture of each model is visualised in Figure 4.16.
23 https://github.com/indrer/ytcsc/tree/main/datasets/split
Input layer. The input layer is implemented as a Keras Embedding layer. If not specified otherwise, the embedding layer is initialised with random weights and turns integer-represented word sequences into dense vectors [43]. Models with two types of embedding layers are evaluated - with initialised weights (word vectors) and without. The input dimension of the embedding layer depends on whether word vectors are used - if they are, the input dimension is the size of the word vectors’ vocabulary; otherwise, it is the maximum integer index of the tokenised set + 1. The output dimension is either the word vector size (if word vectors are used) or 32.
Output layer. The output layer produces a single value for all the models, representing the sentiment of a comment. It is a Dense layer with a single unit and a sigmoid activation function. The produced value is between 0 and 1; if the value is 0.5 or higher, the comment is classified as positive, and if it is lower than 0.5, the comment is classified as negative.
Hidden layers. The hidden layers of an individual model are what make each model unique. The LSTM model has a single LSTM layer with 128 units followed by a fully connected layer with 32 units. The BiLSTM model has a single bidirectional LSTM layer with 128 units. The CNN-LSTM model has two one-dimensional convolutional layers with 128 and 64 filters and a kernel size of 2 for both layers, followed by a max-pooling layer with a pool size of 2 and an LSTM layer with 64 units. Lastly, the CNN-BiLSTM model has the same configuration of convolutional and max-pooling layers as the CNN-LSTM model, but these layers are followed by a bidirectional LSTM layer with 128 units.
Model compilation. The models are compiled using binary cross-entropy as the loss function together with the Adam optimiser. The learning rate for the optimiser is selected based on the hyperparameter tuning results.
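As a hedged illustration, a Keras definition of the CNN-BiLSTM variant following the layer description above (`vocab_size`, `embedding_matrix` and `max_len` are assumed to come from the data preparation; drop the `weights` argument for the variant without word vectors):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                         Bidirectional, LSTM, Dense)

    model = Sequential([
        Embedding(input_dim=vocab_size,
                  output_dim=embedding_matrix.shape[1],
                  weights=[embedding_matrix],
                  input_length=max_len),
        Conv1D(128, kernel_size=2, activation="relu"),
        Conv1D(64, kernel_size=2, activation="relu"),
        MaxPooling1D(pool_size=2),
        Bidirectional(LSTM(128)),
        Dense(1, activation="sigmoid"),   # single output in [0, 1]
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])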
Model training. The models are trained with the best performing hyperparameters for their type and 30 epochs. The number of epochs can be arbitrarily large, since the early stopping method is used, which stops model training as soon as no improvement in the loss is observed; a sketch follows below.
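A minimal sketch of such a training call (the monitored quantity and patience value are assumptions, not the project’s exact settings):

    from tensorflow.keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor="val_loss", patience=3,
                               restore_best_weights=True)
    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        epochs=30, callbacks=[early_stop])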
After each model is trained, its weights and the architecture are saved in a folder
to be loaded and used later. Saved models can be found in the GitHub repository, in
the models/models folder.
as negative. Then, accuracy, precision, recall, F1 score, confusion matrix and ROC curve are calculated using the scikit-learn methods accuracy_score(), precision_score(), recall_score(), f1_score(), confusion_matrix() and roc_curve(), respectively. The performance measurement values are calculated by classifying prediction results as true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). TP and TN are correctly predicted values for the positive and negative classes respectively, while FP and FN are values where the model predicted one class, but the true class was the other.
Accuracy is a ratio of how many predictions were correct out of all the predictions and can be calculated as:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Precision shows how well the model avoids labelling a sample as positive when it is actually negative [44]:

\[ \text{Precision} = \frac{TP}{TP + FP} \]
Recall shows how many of the actual positive samples were predicted as positive [45]:

\[ \text{Recall} = \frac{TP}{TP + FN} \]
The F1 score is a weighted average of precision and recall [46]. The value is between 0 and 1, and higher values indicate better performance:

\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
The confusion matrix displays the number of TP, TN, FP and FN predictions.

Finally, ROC curves are generated. The ROC curves are constructed using true positive rates (TPR) and false positive rates (FPR). The TPR is the rate of correct positive predictions out of all positive samples, while the FPR is the rate of incorrect positive predictions out of all negative samples. The generated curve provides insight into the trade-off between the model’s sensitivity and specificity, and the closer the curve is to the top left of the graph, the more accurate the classifier is [47]. A random guess would produce points along the diagonal of the graph, therefore the diagonal line is used as a baseline to see whether the model is doing better or worse than random predictions.
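A sketch of this evaluation step with scikit-learn (`y_true` and `y_prob` are placeholders for the testing labels and a model’s sigmoid outputs):

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, confusion_matrix, roc_curve)

    y_pred = (y_prob >= 0.5).astype(int)    # threshold at 0.5, as described above

    print("accuracy :", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall   :", recall_score(y_true, y_pred))
    print("f1       :", f1_score(y_true, y_pred))
    print(confusion_matrix(y_true, y_pred))

    # The ROC curve is computed from the raw probabilities, not the thresholded labels.
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)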
5 Results
This section describes the final labelled YouTube comment dataset (Section 5.1), followed by an introduction to the video statistics (Section 5.2). Then, the results of model performance on the YouTube comment dataset are presented (Section 5.3). Lastly, the model performance on the IMDB movie review dataset is presented (Section 5.4).
5.1 YouTube comment dataset

Figure 5.17: The sentiment distribution of the finalised YouTube comment dataset (10 492 comments, 52%; 5 144 comments, 26%; 4 315 comments, 22%).
5.2 YouTube video statistics

Video ID      Upvotes   Downvotes   Views       Sentiment
xuCn8ux2gbs   4467927   71547       112060703   0.869
WeA7edXsU40   129840    2236        2826612     0.766
HVjlcUtuENM   25551     31465       1741401     0.123
u_J0Ng5cUGg   57341     642191      4679316     0.062
ffxKSjUwKdU   7036447   316928      977824338   0.953

Table 5.3: An example of simplified video data (title, comment count and date removed). This video data was collected on 2021-02-04.
5.3 Model performance

5.3.1 Cross-validation
Table 5.4 shows the accuracy scores of the models over the three iterations of cross-validation; the final column displays the average accuracy each model achieved. Cross-validation for each model was performed using the same random seed to ensure that the splits made on the dataset are identical for all the models. From the results, the BiLSTM model trained with word vectors achieved the highest average accuracy of 0.8782 (87.8%). Cross-validation was performed using the training and validation subsets; the testing subset was used for the model evaluation described in the next section (Section 5.3.2).
Model             Iteration 1   Iteration 2   Iteration 3   Average
LSTM              0.8662        0.8696        0.8644        0.8667
LSTM + WV         0.8858        0.8627        0.8757        0.8747
BiLSTM            0.861         0.8648        0.8618        0.8625
BiLSTM + WV       0.8905        0.867         0.877         0.8782
CNN-LSTM          0.8566        0.8557        0.8462        0.8528
CNN-LSTM + WV     0.8814        0.864         0.8722        0.8725
CNN-BiLSTM        0.8627        0.8605        0.8527        0.8586
CNN-BiLSTM + WV   0.8853        0.8696        0.8727        0.8759
the models were trained with and without word vectors.

Table 5.5 displays the accuracy scores achieved by all the models. A higher accuracy indicates that the model made a larger number of correct predictions. The BiLSTM model trained with word vectors achieved the highest accuracy of 0.8905 (or 89%); moreover, all the models trained with word vectors perform better than the same models with no initial word embedding matrix specified.
Model             Accuracy
LSTM              0.876
LSTM + WV         0.8835
BiLSTM            0.8766
BiLSTM + WV       0.8905
CNN-LSTM          0.8691
CNN-LSTM + WV     0.8824
CNN-BiLSTM        0.8621
CNN-BiLSTM + WV   0.883
Table 5.6 displays the recall scores achieved by the four models trained with and without pre-trained word vectors. The recall score describes how accurately a model predicted the positive class out of all actual positive samples. The results show that the CNN-LSTM model with word vectors recognised the positive class best of all the models, with a recall score of 0.9027.
Model             Recall
LSTM              0.8436
LSTM + WV         0.8714
BiLSTM            0.8783
BiLSTM + WV       0.8864
CNN-LSTM          0.8749
CNN-LSTM + WV     0.9027
CNN-BiLSTM        0.8737
CNN-BiLSTM + WV   0.883
Precision scores for the models trained with and without word vectors can be seen in Table 5.7. Precision refers to how accurate a model was in predicting the positive class out of all positive class predictions. The results show that the LSTM model had the largest share of positive class predictions that were actually positive, achieving a precision score of 0.9021.
Model             Precision
LSTM              0.9021
LSTM + WV         0.8931
BiLSTM            0.8753
BiLSTM + WV       0.8937
CNN-LSTM          0.8648
CNN-LSTM + WV     0.8675
CNN-BiLSTM        0.8539
CNN-BiLSTM + WV   0.883
F1 scores for all four models with and without word vectors can be seen in Table 5.8. The F1 score provides a balanced metric of precision and recall. The highest F1 score, 0.8901, was achieved by the BiLSTM model trained with word vectors.
Model             F1 score
LSTM              0.8719
LSTM + WV         0.8821
BiLSTM            0.8768
BiLSTM + WV       0.8901
CNN-LSTM          0.8698
CNN-LSTM + WV     0.8847
CNN-BiLSTM        0.8637
CNN-BiLSTM + WV   0.883
The correct and incorrect predictions are displayed in confusion matrices, which show how many predictions were correct for both the positive and negative classes. Tables 5.9, 5.10, 5.11 and 5.12 display the confusion matrices for LSTM, BiLSTM, CNN-LSTM and CNN-BiLSTM, respectively.
(a) Trained without word vectors:

                     True Negative   True Positive
Predicted Negative   755             108
Predicted Positive   105             758

(b) Trained with word vectors:

                     True Negative   True Positive
Predicted Negative   772             91
Predicted Positive   98              765

Table 5.10: Confusion matrices of the BiLSTM model trained (a) without and (b) with word vectors.
Finally, the models’ prediction abilities are displayed in ROC curve graphs, which can be seen in Figures 5.18, 5.19, 5.20 and 5.21. The dashed diagonal line indicates random predictions. The area value is the area under the ROC curve (AUC), which expresses the probability that the model will rank a random positive comment higher than a random negative comment.
Figure 5.18: The ROC curve of the LSTM model trained without (AUC = 0.941) and with (AUC = 0.951) word vectors.
Figure 5.19: The ROC curve of the BiLSTM model trained without (AUC = 0.946) and with (AUC = 0.959) word vectors.
Figure 5.20: The ROC curve of the CNN-LSTM model trained without (AUC = 0.939) and with (AUC = 0.956) word vectors.
Figure 5.21: The ROC curve of the CNN-BiLSTM model trained without (AUC = 0.941) and with (AUC = 0.957) word vectors.
was achieved by the LSTM model, and the highest F1 score was achieved by the LSTM model as well.

Lastly, the ROC curves of all the models are generated to make comparing the models’ performance easier. The ROC curve graph can be seen in Figure 5.22.
6 Discussion
During this research, a dataset of YouTube comments with their sentiment was constructed, and four LSTM-based models were built to predict the sentiment of YouTube videos based on their comments. The performance of the four models was evaluated, and the results of the evaluation were presented in the previous section. In this section, the results are analysed, discussed and compared with previous research, and the research questions presented in Section 1.4 are answered.
• RQ1 What is the highest accuracy that an LSTM-based model can achieve in predicting YouTube video sentiment?

In order to answer this question, objectives O1, O3 and, partially, O4 were completed. Objective O4 was only partially required because RQ1 concerns only the accuracy of the models. Two approaches are used to measure the accuracy of the four LSTM-based models - cross-validation and model evaluation on the testing subset. The results are presented in Table 5.4 for cross-validation and in Table 5.5 for the model evaluation on the testing subset.

Looking at the achieved accuracy scores, specifying the word embedding matrix in the embedding layer results in either better or similar performance compared to the model without specified word vectors. Overall, the average model accuracy in cross-validation is lower than the model accuracy on the testing subset. Several reasons can cause this difference in accuracy:

1. Models are trained on smaller subsets during cross-validation than models trained and evaluated on the testing subset. Smaller subsets might not provide enough data to train on, resulting in lower accuracy during model evaluation; the training and validation subset sizes should therefore be increased.

2. Models trained during cross-validation do not use the early stopping method, therefore they can overfit the data they are trained on.
The cross-validation results show that, on average, BiLSTM with word vectors achieved the highest accuracy, followed by CNN-BiLSTM with word vectors and LSTM with word vectors. The evaluation on the testing subset shows the same three model architectures at the top of the accuracy scores, with BiLSTM with pre-trained word vectors reaching the highest accuracy of 89%. Therefore, RQ1 is answered as follows: the highest accuracy achieved by an LSTM-based model in predicting YouTube video sentiment was 89% (0.8905), reached by the BiLSTM model trained with pre-trained word vectors.
The accuracy results achieved in this research differ from other studies, where LSTM models with convolutional layers outperformed LSTM networks without them in binary sentiment classification problems [32, 48, 35]. However, those studies evaluated their models on different datasets and used different data preprocessing techniques.

Lastly, previous work on YouTube data has used traditional machine learning techniques. Benkhelifa and Laallam assessed the performance of an SVM classifier in predicting the sentiment of YouTube cooking videos based on comments and achieved 95.3% accuracy [12]; however, it is unclear how their classifier would perform on comments from videos of other categories. In this project, the LSTM-based models were exposed to comments from various domains, possibly covering a larger variety of terms and vocabulary.
CNN-LSTM with pre-trained word vectors achieved the highest recall, meaning that out of all the models it detected the most positively labelled comments correctly. The LSTM model reached the highest precision score, meaning that if this model classified a comment as positive, it had the highest chance of being truly positive. However, BiLSTM with pre-trained word vectors achieved the highest F1 score, meaning that it has the best balance between precision and recall of all the models. In addition, BiLSTM with word vectors achieved the highest accuracy in cross-validation and in the model evaluation on unseen data.
Lastly, it is worth mentioning that CNN-LSTM has the highest recall but low precision (second lowest of all the models). Such results could suggest that CNN-LSTM tends to classify comments as positive more often; even if a comment is negative, it might still classify it as positive. In addition, LSTM has the lowest recall but the highest precision, suggesting that the model is least likely to produce false-positive classifications. Finally, BiLSTM with word vectors achieved the highest accuracy in cross-validation and the highest F1 score, indicating a balance between recall and precision. These results suggest that BiLSTM with word vectors is the model that performed best overall on the YouTube comment dataset.
• RQ3 How accurate are the LSTM-based models with another sentiment dataset, such as the IMDB movie review dataset?

To answer the third research question, all four LSTM-based models were evaluated on the IMDB dataset. Pre-trained word vectors were not used. The results of the evaluation can be seen in Table 5.13. The highest accuracy was achieved by the BiLSTM model (87.87%). The highest recall score was reached by the LSTM model (0.8896), and the highest precision was achieved by the CNN-LSTM model (0.8996). The LSTM model reached the highest F1 score (0.8803). Looking at the ROC curves (Figure 5.22), CNN-LSTM shows the worst performance with the lowest AUC, while BiLSTM and LSTM have the highest and similar AUC values and the shortest distance to the top left corner.

Therefore, the third research question is answered as follows:
All four models achieved an average of 87% accuracy on the dataset - 2% less than the accuracy attained on the YouTube comment dataset. This indicates that the models can perform on different datasets.

Lastly, some previous studies have used LSTM-based models to predict IMDB movie review sentiment. Mathapati et al. reached an accuracy score of 88.3% with a CNN-LSTM model [35]; however, their model was trained with pre-trained word vectors, meaning that the four LSTM-based models might have achieved higher accuracy scores had word vectors been used. Yenter and Verma implemented a multi-kernel CNN-LSTM model, which achieved 89.5% accuracy on the IMDB dataset [34]. This suggests that if the architectures of the four LSTM-based models used in this project had consisted of joint kernels, the accuracy might have been higher. Finally, Peng compared the performance of RNN and LSTM and their bidirectional variants [49] and discovered that the bidirectional variants outperform regular LSTM and RNN on IMDB data. In this research, the bidirectional variant of the LSTM model reached a higher accuracy on IMDB data but underperformed on the other metrics. In addition, looking back at the predictions on the YouTube comment dataset, the bidirectional variants achieved slightly higher accuracy there as well.
• RQ4 What is the relationship between the video sentiment and users’ preferences?

To answer the final research question, objective O2 was completed, and box plots were generated to visualise how the sentiment relates to the number of video upvotes, downvotes and views. Furthermore, ANOVA tests were performed between positive and negative video statistics to see whether there is any significant difference between the groups.
[Box plot: the number of upvotes for videos grouped by overall sentiment (positive, negative).]