
MASTER THESIS

FAKE NEWS DETECTION AND ANALYSIS

Abstract
The main objective of this thesis is to explore a crucial problem of Natural Language
Processing, namely false content detection, and how it can be solved as a machine learning
classification task. A linguistic approach is taken, experimenting with different types of
features and models to build accurate fake news detectors. The experiments are structured in
three main steps: text pre-processing, feature extraction, and classification itself. In
addition, they are conducted on a real-world dataset, LIAR, to offer a good overview of which
model best handles day-to-day situations. Two approaches are chosen: multi-class and binary
classification.
In both cases, we show that out of all the experiments, a simple feed-forward network
combined with fine-tuned DistilBERT embeddings reports the highest accuracy: 27.30% on
6-label classification and 63.61% on 2-label classification. These results emphasize that
transfer learning brings important improvements to this task. In addition, we demonstrate that
classic machine learning algorithms like Decision Tree, Naïve Bayes, and Support Vector
Machine perform similarly to state-of-the-art solutions, even outperforming some recurrent
neural networks such as LSTM and BiLSTM. This clearly confirms that more complex solutions do
not guarantee higher performance. Regarding features, we confirm that the connection between
the degree of veracity of a text and the frequency of its terms is stronger than the
connection with their position or order. Yet, context proves to be the most powerful aspect of
the feature extraction process. Also, indices that describe the author's style must be
carefully selected to provide relevant information.
Objectives
The main objective of this thesis is to explore the natural language problem of false content
identification in a specific field, namely media news. The problem is framed as a
classification task with six classes and, alternatively, with two. In addition, we intend to
analyze various approaches to find out why certain techniques and models achieve higher
performance, highlighting their strengths and weaknesses. This objective is accomplished
mainly through experimentation with different types of models, features, and pre-processing
operations to improve accuracy.

State of the Art


Resources
First, we require complete and representative resources. Based on this idea, numerous people
started providing data sources that reflect different real-life situations of false content
dissemination. Some of the most frequent topics are political news, medical news,
advertising, and celebrity journalism. Interest in the subject is so great that it has led to
shared tasks and important academic competitions in the NLP community. We collected some of
the most famous ones in the following sections. Our focus was on English, as it is the most
common foreign language and is widely used in international news writing and social media.
Besides that, we mention a couple of Spanish resources resulting from a well-known
competition on the same theme that took place for several consecutive years.
Competitions
Datasets
Analysis of News Content
Detecting an article's truthfulness based on its content and meaning is a difficult current
NLP problem. Despite being widely used, the solutions available nowadays are very specific
and do not work in general cases. However, they produce more reliable results than their
predecessors because they focus on the problem itself, not on its source. There are three
major complementary directions that lead to promising results:

- Knowledge-based approach – uses a priori knowledge
- Machine Learning approach – uses automatic learning of linguistic patterns extracted from
news content
- Hybrid approach – combines knowledge-based and machine learning techniques

3 Method
This thesis explores and compares different solutions for inauthentic content detection. Our
approach is divided into three main stages: pre-processing the data, extracting features, and
classifying the input. In the next sections, we present the different methods we tested for
each of these steps.
3.1 Dataset
We focus on applying supervised learning techniques. For that, we use an annotated body of
text, the LIAR dataset (Wang, 2017). It contains short English statements extracted from the
PolitiFact.com website, each labeled with a certain level of trust based on a detailed
explanation. Since the data is public and accessible, any necessary information is one click
away. Moreover, the dataset can be enlarged at any time by collecting more data from this
fact-checking site, which can only benefit our analysis.
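The LIAR release is distributed as tab-separated files. The following is a minimal loading sketch; the two sample rows and the column subset are invented for illustration, so the exact column layout should be checked against the official release.

```python
import csv
import io

# Invented sample rows mimicking LIAR's TSV layout (id, label, statement, ...).
SAMPLE_TSV = (
    "1.json\tfalse\tSays the economy shrank last year.\teconomy\tjane-doe\n"
    "2.json\thalf-true\tCrime rates dropped in the capital.\tcrime\tjohn-roe\n"
)

# Assumed subset of the real column names, for illustration only.
COLUMNS = ["id", "label", "statement", "subject", "speaker"]

def load_liar(tsv_text):
    """Parse LIAR-style TSV text into a list of record dicts."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader]

records = load_liar(SAMPLE_TSV)
for rec in records:
    print(rec["label"], "->", rec["statement"])
```

In practice the same function would be pointed at the released train/validation/test files rather than an inline string.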
3.2 Text Pre-processing
As mentioned before, in this document we analyze different methods of fake news detection
from a machine learning point of view. At the base of this process is the way the machine
interprets the text, meaning how words are converted to a numerical format. In general, to
obtain a good feature-vector representation, certain pre-processing operations are required.
Their exact combination is debatable, depending on the problem itself and the available data,
as in some cases an inappropriate operation may lead to loss of information.
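A sketch of typical pre-processing steps is shown below: lowercasing, punctuation removal, tokenization, and stop-word filtering. The stop-word list is a tiny illustrative subset, not the one used in the experiments.

```python
import re

# Tiny illustrative stop-word list (a real pipeline would use a full list).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "and"}

def preprocess(text):
    """Normalize a statement into a list of content tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and symbols
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The economy SHRANK, in 2020!"))
# -> ['economy', 'shrank', '2020']
```

Steps such as stemming or lemmatization could be added to this chain, with the caveat noted above that an ill-chosen operation may discard useful signal.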
3.3 Features
Once the text is in a normalized form, different features can be extracted. Extracting
characteristics from the news is an incremental process: it aims to select various pieces of
information and combine them so as to best describe the original text. While experimenting
with different variants to find the solution offering the best accuracy for the problem, four
directions were tested: bag of n-grams with TF-IDF, static word embeddings (Word2Vec, GloVe),
contextualized word embeddings (BERT, DistilBERT), and stylometric features.
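To make the first of these directions concrete, here is a minimal bag-of-n-grams TF-IDF sketch (unigrams plus bigrams) written in plain Python for clarity; the experiments themselves would normally rely on a library vectorizer rather than this hand-rolled version.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """All n-grams of the token list up to length n_max, as strings."""
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf(docs):
    """docs: list of token lists -> list of {ngram: tf-idf weight} dicts."""
    gram_docs = [ngrams(d) for d in docs]
    df = Counter(g for gd in gram_docs for g in set(gd))  # document frequency
    n_docs = len(docs)
    vectors = []
    for gd in gram_docs:
        tf = Counter(gd)
        vectors.append({
            g: (tf[g] / len(gd)) * math.log(n_docs / df[g]) for g in tf
        })
    return vectors

docs = [["economy", "shrank"], ["economy", "grew"]]
vecs = tfidf(docs)
print(vecs[0])
```

Note how a term occurring in every document ("economy" here) receives zero weight, which is exactly the behavior that lets TF-IDF emphasize discriminative terms.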
3.4 Classical Machine Learning Models
3.5 Neural Networks
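Since the best-performing model reported in this thesis is a simple feed-forward network over sentence embeddings, a minimal forward-pass sketch is given below in plain Python. The layer sizes are illustrative only; in the experiments the input would be a sentence embedding (for DistilBERT, a 768-dimensional vector) and the output one score per class.

```python
import math
import random

random.seed(0)  # deterministic toy weights

def linear(x, w, b):
    """Affine layer: one output per (weight row, bias) pair."""
    return [sum(xi * wij for xi, wij in zip(x, row)) + bj
            for row, bj in zip(w, b)]

def relu(v):
    return [max(0.0, vi) for vi in v]

def softmax(v):
    m = max(v)                                # subtract max for stability
    exps = [math.exp(vi - m) for vi in v]
    s = sum(exps)
    return [e / s for e in exps]

in_dim, hidden, n_classes = 8, 4, 6           # toy sizes, not the real ones
w1 = [[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(hidden)]
b1 = [0.0] * hidden
w2 = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(n_classes)]
b2 = [0.0] * n_classes

x = [random.uniform(-1, 1) for _ in range(in_dim)]  # stand-in embedding
probs = softmax(linear(relu(linear(x, w1, b1)), w2, b2))
```

Training (backpropagation, optimizer, loss) is omitted here; the sketch only shows how an embedding is mapped to a probability distribution over the six veracity classes.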

4 Results
For each solution, we build a multi-class classifier with the original labels from the
dataset and a binary one that treats a simpler case. Therefore, we follow the same pattern as
the papers from the state of the art and group the classes "true", "mostly-true", and
"half-true" as "TRUE", and "false", "pants-fire", and "barely-true" as "FAKE". We evaluate
each model on both the validation and the test partitions to see whether there are major
differences. We also used the validation split to tune the hyper-parameters of each model,
where applicable. The results obtained in both cases are presented in the following tables,
grouped according to the type of text features used.
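The grouping described above can be sketched as a small mapping function from the six original labels to the binary scheme:

```python
# The two groups follow the grouping used in the experiments.
TRUE_GROUP = {"true", "mostly-true", "half-true"}
FAKE_GROUP = {"false", "pants-fire", "barely-true"}

def to_binary(label):
    """Collapse a six-way LIAR label into the binary TRUE/FAKE scheme."""
    if label in TRUE_GROUP:
        return "TRUE"
    if label in FAKE_GROUP:
        return "FAKE"
    raise ValueError(f"unknown label: {label}")

print(to_binary("half-true"))   # TRUE
print(to_binary("pants-fire"))  # FAKE
```

Applying this mapping to the dataset's label column is all that is needed to derive the binary task from the multi-class one.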

5 Discussion
In the previous chapter, we summed up a collection of fake news detection solutions that
learn various patterns of untrustworthy content. Despite having results similar to those from
Section 3, multiple aspects restrict their final performance, and we discuss these further
here. A series of arguments can be put forward to explain the existing limitations, and based
on them we can establish further steps that may diminish their impact.
Firstly, the weak results could be explained by the choice of dataset. A well-balanced,
unbiased, diverse, and real-life-inspired corpus can greatly influence the outcome of the
experiments. In our case, some classes have fewer examples than others, hindering the
learning process. The examples are also very specialized, focusing only on specific real
events. Moreover, LIAR has six classes and contains many short real-world statements from
various contexts and different authors, which makes it very difficult to fit them into a
pattern. Therefore, the models cannot learn general patterns for the fake news detection
task. Further, we analyzed the structure of LIAR and found various limitations, such as typos
and atypically organized or incomplete sentences, that may explain the models' mistakes.
Examples that make the task difficult are:
