
Comparative Analysis of Machine Learning

Techniques on Fake News Detection

Session 2018-2022
By

Atif Ullah
Mohammad Asif Nawaz
Syed Dawood Ali

Bachelor of Science in Software Engineering

Department of Computer Science


City University of Science & Information Technology
Peshawar, Pakistan
October, 2022
Comparative Analysis of Machine Learning
Techniques on Fake News Detection

Session 2018-2022
By

Atif Ullah
Mohammad Asif Nawaz
Syed Dawood Ali

Supervised by
Mr. Saif Ullah Jan

Department of Computer Science


City University of Science & Information Technology
Peshawar, Pakistan
October, 2022
Comparative Analysis of Machine Learning
Techniques on Fake News Detection
By

Atif Ullah (9736)


Mohammad Asif Nawaz (9705)
Syed Dawood Ali (9715)

CERTIFICATE
A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE
REQUIREMENTS FOR THE DEGREE OF BACHELOR OF SCIENCE IN
SOFTWARE ENGINEERING

We accept this dissertation as conforming to the required standards

(Supervisor) (Internal Examiner)


Mr. Saif Ullah Jan

(External Examiner) (Head of the Department)

(Coordinator FYP) (Approved Date)

Department of Computer Science


City University of Science & Information Technology
Peshawar, Pakistan
October, 2022
Dedication
This research thesis is especially dedicated to Mr. Saif Ullah Jan, who helped and guided
us to successfully complete this research work. We would also like to dedicate this
research to our dear parents, who have been wonderful supporters throughout, and to
our beloved ones, who have been encouraging for months.

Declaration
We hereby declare that we are the authors of this thesis. The work submitted in this
thesis is the result of our own research except where otherwise stated.

Atif Ullah
Mohammad Asif Nawaz
Syed Dawood Ali
October, 2022

Acknowledgment
We would like to take this opportunity to first and foremost thank ALMIGHTY ALLAH
for giving us the strength and knowledge to write this thesis.
This project would not have been possible without the support of many people. Many
thanks to our adviser, Mr. Saif Ullah Jan, who read our numerous revisions and helped
make some sense of the confusion. His guidance and experience proved very helpful in
the progress of this project. We would also like to thank all our fellows and teachers for
giving their fruitful suggestions.
Last but not least, we would like to thank our parents for always being supportive of our
education. We also take this opportunity to acknowledge everyone in our large extended
family.

Abstract
The term "fake news" became widely used to characterize the situation, especially in
discussions about articles written solely to generate traffic and clicks despite their
factual inaccuracies and deliberate dissemination of disinformation. Fake news, when
extensively spread, may harm people and society. Due to a lack of access control
measures, fake messages and accounts have spread throughout websites. Because of social
media's unique traits and challenges, typical news detection algorithms are ineffective
or inadequate. False information on the internet has caused worries in politics, sports,
health, and science. The focus of this study is on using the Fake and Real News dataset,
which consists of plain English text and was obtained from Kaggle, to distinguish
between true news and fake news. After comparing the Logistic Regression, Gaussian
Naive Bayes, Extreme Gradient Boosting, Random Forest, Decision Tree, Light Gradient
Boosting Machine, and Stochastic Gradient Descent algorithms, the XGB (Extreme Gradient
Boosting) classifier obtained an accuracy of 81%, the best result among all the compared
algorithms. This study therefore recommends the XGBoost technique.

Table of Contents
Dedication iv

Declaration v

Acknowledgment vi

Abstract vii

1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Scope and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.1 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Literature Review 4
2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

3 Research Methodology 8
3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Pre-processing Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3.1 Special Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3.2 Punctuations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3.3 Stopwords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3.4 Userhandles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4.1 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4.2 Excel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.5 Division of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.6 Features Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.6.1 Semantic Features . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.7 Training Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.8 Machine Learning Models For Fake News Detection . . . . . . . . . . 12
3.8.1 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.8.2 GaussianNB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.8.3 XGBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.8.4 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.8.5 Decision Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.8.6 LGBM (Light Gradient Boosting Machine) . . . . . . . . . . . . . 16
3.8.7 SGD (Stochastic Gradient Descent) . . . . . . . . . . . . . . . . . 16
3.9 Performance Assessment Measures . . . . . . . . . . . . . . . . . . . . . 17
3.9.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.9.2 Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.9.3 Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.9.4 F1 Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4 Evaluation and Discussion 19


4.1 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.1 Why Performance of XGB Classifier is Better? . . . . . . . . . . . 24

Conclusion and Future Work 26

References 27

List of Figures
3.1 Machine learning Model for Fake News Detection . . . . . . . . . . . . . 8
3.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.4 Decision Tree Classification Algorithm . . . . . . . . . . . . . . . . . . . 15

4.1 confusion matrix for binary classifiers . . . . . . . . . . . . . . . . . . . . 20


4.2 Represents confusion matrix for fake news detection . . . . . . . . . . . . 20
4.3 Comparison based on Precision and Recall for fake news detection . . . . 21
4.4 Performance Analysis of each Technique using MCC for fake news detection . . 22
4.5 Performance Analysis of each Technique using Accuracy . . . . . . . . . 23
4.6 Percentage difference of XGBoost with other techniques . . . . . . . . . 24

List of Tables
3.1 Fake and Real News Detection . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Semantic Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4.1 Performance Analysis of each Technique using Precision and Recall. . . . 21


4.2 Performance Analysis of each Technique using MCC and F1-score for fake
news detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3 Performance Analysis of each Technique using Accuracy . . . . . . . . . 23

List of Abbreviations
ANN Artificial Neural Network
AI Artificial Intelligence
CM Confusion Matrix
CNN Convolutional Neural Network
DT Decision Tree
FN False-Negative
FP False-Positive
GNB Gaussian Naive Bayesian
LR Logistic Regression
LGBM Light Gradient Boosting Machine
LSTM Long Short Term Memory
ML Machine Learning
NB Naïve Bayes
NLP Natural Language Processing
PNN Probabilistic Neural Network
PD Percentage Difference
RF Random Forest
SGD Stochastic Gradient Descent
SVM Support Vector Machine
TF-IDF Term Frequency-Inverse Document Frequency
TN True-Negative
TP True-Positive
US United States
XGB Extreme Gradient Boosting

Chapter 1

Introduction
1.1 Overview
There was a time when everyone who required news had to wait for the next day’s
newspaper. However, with the rise of online newspapers that update news practically
quickly, consumers have discovered a better and faster method to stay updated on topics
of interest. Nowadays, social-networking systems, online news portals, and other forms of
online media are the primary sources of news, with fascinating and breaking news being
disseminated at a quick rate. The rise of online social media has drastically simplified
how individuals connect with one another. Users of online social media share information,
interact with others, and stay up to date on current events. However, most of the new
material on social media is suspect and, in some cases, designed to mislead. This type of
information is frequently referred to as false news [1].
A large volume of bogus news on the internet has the potential to generate severe diffi-
culties in society. Many people believe that false news affected the 2016 U.S. presidential
election campaign. Following the election, the word has entered the common language.
Furthermore, it has piqued the interest of industry and academics, who are attempting to
comprehend its origins, spread, and impacts. The capacity to determine whether internet
material is false and meant to mislead is crucial. This is technically difficult for various
reasons. Material is readily created and swiftly shared through social media platforms,
resulting in a massive volume of content to analyse [2].
This work is made more difficult by the fact that online material is quite diversified,
covering a wide range of disciplines. Because computers cannot always determine the
truth and purpose of a remark, attempts must rely on collaboration between people and
technology. Some material, for example, is available that experts believe is incorrect and
meant to mislead. While these resources are few, they can serve as the foundation for
such a collaborative endeavor. Fake news is rapidly being disseminated through social
media platforms such as Twitter and Facebook. According to research on the velocity of
fake news, tweets containing incorrect information reach individuals on Twitter six times
quicker than factual ones [3].
Fake news has fast become a societal issue, being used to spread false or misleading
information in an attempt to influence people's behavior. It has been demonstrated that the
spread of false news had a significant impact on the 2016 US presidential elections. Here
are some facts about fake news in the United States [4]:

• 62% of US citizens get their news from social media.

• Fake news had a greater share on Facebook than mainstream news.

1.2 Background
Early fake news detection efforts relied mostly on manually constructed characteristics taken
from news items or information generated during the news transmission process. Deep
learning models, as opposed to human feature engineering, can extract features from
text automatically [5]. In the past, there was no established method or strategy for
distinguishing between true and fake news. This research therefore conducts a comparative
study of several approaches to determine the true accuracy and precision of their output [6].
We employ supervised classification algorithms for this task.

The Internet has become a perfect breeding ground for propagating fake news, such as
misleading information, fake reviews, fraudulent adverts, rumors, fake political comments,
satires, and so on, due to the rising popularity of online social media. Fake news is now
more popular and frequently disseminated on social media than in traditional media [7].
The topic of internet fake news has received increased attention from both scholars and
practitioners [4]. During the election campaign, fake news has been accused of boosting
political polarization and partisan strife [8], and voters can be easily swayed by false
political comments and claims.

Many of the most recent internet fact-checking systems, such as FactCheck.org and
PolitiFact.com, are based on manual detection procedures by experts, with time lag as
the primary concern. Furthermore, because most existing online fact-checking resources
are primarily focused on the verification of political news, their practical applicability
is limited due to the wide variety of news types and formats, as well as the widespread
and rapid spread of fake information in social networks. Furthermore, a vast volume of
real-time information is generated, commented on, and shared every day via online social
media, making online real-time false news identification much more challenging [9].

1.3 Problem Statement
There are a number of algorithms (LR, GNB, XGB, RF, DT, LGBM, SGD) that are used
for fake news detection. However, the relative accuracy of these algorithms for fake news
detection is not known.

1.4 Scope and Objectives


The scope and objectives of this research are:
1.4.1 Scope
The scope of this research is to:

• Review the existing models discussed in the literature review.

• Evaluate the results of this research using recall, F-measure, precision, and accuracy
as performance measures.

1.4.2 Objectives
Following are the objectives of this research study:

• To perform a comparative analysis of machine learning models for fake news detec-
tion.

• To visualize and analyze the above on the Kaggle news detection dataset.

1.5 Thesis Outline


The rest of the thesis is organized as follows: Chapter 2 covers related work. The
methodology is discussed in Chapter 3. Chapter 4 presents the experimental results, and
the conclusion of this research is given in Chapter 5.

Chapter 2

Literature Review
This research study focuses on the comparison of different machine learning techniques
applied to a dataset of fake and real news. For this purpose, many researchers have worked
on different features and used different techniques, but each technique has its strengths
and weaknesses. Some of the related work on fake news detection is presented here.

2.1 Related Work


The label ”fake news” became widely used to refer to the problem, especially when de-
scribing pieces that were produced primarily in order to generate revenue from page views
but contained factual errors and misinformation [2]. Social media for news consumption
is a two-edged sword. On the one hand, the low cost, easy access, and quick transmission
of information encourage consumers to seek out and consume news via social media. On
the other side, it facilitates the dissemination of ”fake news,” i.e., low-quality news with
purposely incorrect content. The widespread dissemination of false news has the potential
to have tremendously detrimental consequences for both people and society [1].
Many illegitimate messages and accounts have been found to spread across numerous
platforms due to a lack of access control mechanisms. Existing detection algorithms
from conventional news sources are useless or inappropriate for fake news identification
on social media due to its particular characteristics and difficulties. We need to include
auxiliary information, such as user social engagements on social media, to help make
a determination because fake news is first deliberately written to mislead readers into
believing false information. As a result, it is difficult and nontrivial to detect based solely
on news content. Second, making use of this supplementary data is difficult in and of
itself due to the large, sparse, chaotic, and noisy data that users’ social interactions with
fake news create [10].
In 2017, the issue of fake news was examined by analysing the literature in two stages:
characterisation and detection. The characterisation phase introduced the fundamental
ideas and tenets of false news in both conventional and social media. The detection phase
reviewed current fake news detection techniques from a data mining standpoint, including
feature extraction and model building [11]. In the past ten years, there has been a fast
rise in the dissemination of false news, most visibly during the 2016 US elections [12].
A lot of issues have arisen, not just in politics but also in a number of other fields,
including sports, health, and science, as a result of the widespread circulation of
untruthful material online [13]. Fortunately, a variety of computational approaches may
be applied to identify some articles as fraudulent based solely on the text contained
within them [14]. There are several repositories kept by researchers that provide lists
of websites classified as misleading or fraudulent. The majority of these strategies
involve fact-checking websites like "PolitiFact" and "Snopes" [9].
The issue with these tools (PolitiFact and Snopes) is that fraudulent articles and
websites must be identified by human expertise. More significantly, the fact-checking
websites only cover stories from specific fields, like politics, and are not designed to
detect false news from a variety of fields, such as entertainment, sports, and technology.
For the purpose of verifying news, numerous visual and statistical features have recently
been extracted [15]. Visual features include the clarity score, coherence score, similarity
distribution histogram, diversity score, and clustering score. Statistical features include
the count, image ratio, multi-image ratio, hot image ratio, long image ratio, etc. [16].
The Graphical Clustering Toolkit, a clustering application, separates news reports based on
their similarities using a chosen classifier. Clustering is used to evaluate and compare
large amounts of data, employing a k-nearest-neighbour method and agglomerative clustering
to sort a huge number of data points into a small number of clusters [17]. Fake news
travels faster than actual news because viewers find it more appealing, which makes it
more challenging to distinguish between fake and real news. Humans tend to believe that
their understanding of the facts is precise, which results in biased opinions about the
material and makes them a target for "yellow journalism." Any person will weigh the facts
according to their expertise, viewpoint, and prejudices. Social media platforms such as
Twitter, and businesses like Facebook, have previously come under fire for their use as
tools to disseminate misinformation [18].
In 2017, researchers implemented their technique within a Facebook Messenger chatbot and
tested it in a real-world setting, detecting fake news with an accuracy of 81.7%; their
goal was to determine whether a news report was credible or not. However, Shlok Gilda
has noted that while bigram TF-IDF yields highly effective models for detecting fake
news, PCFG features add little to the models' efficiency [19].
As there may be some association between the sentiment of a news story and its genre,
several research studies have also advocated the use of sentiment analysis for fraud
detection. Conroy, Rubin, Chen, and Cornwell proposed that word-level analysis may be
improved by evaluating the usefulness of characteristics like part-of-speech frequency and
semantic categories such as generalizing words and positive and negative polarity
(sentiment analysis) [20]. Mathieu Cliche, in his sarcasm detection blog, has described
the detection of sarcasm on Twitter through the use of n-grams learned from tweets
specifically tagged as sarcastic. His work also includes the use of sentiment analysis as
well as the identification of topics (words that are often grouped together in tweets) to
improve prediction accuracy [21]. The noun-to-verb ratio is a major difference between
the real and fraudulent corpora: the ratio in real news is larger, averaging 4.27,
compared to 2.73 in the fake news corpus. The real news corpus has 20.5 word
characteristics on average, whereas the fake news corpus has 14.3 [22].
On their proposed dataset LIAR, Wang compared the performance of SVM, LR, Bi-LSTM, and
CNN models [23]. Others created a technique for automatically detecting false news on
Twitter by learning to predict accuracy ratings in two credibility-focused Twitter
datasets: PHEME, a dataset of potential rumors on Twitter, and CREDBANK, a crowdsourced
collection of accuracy ratings for events on Twitter [24]. They applied this method to
Twitter content sourced from BuzzFeed's fake news dataset. A feature analysis identified
the features that are most predictive of crowdsourced and journalistic accuracy
assessments, the results of which are consistent with prior work.
Random Forest is a method that builds a more robust model from weaker ones (like DTs)
to prevent overfitting at the lowest possible cost. The well-known bootstrap approach is
used to construct the forest; bootstrapping's major goal is to combine learning models to
improve the overall classification result [25]. A Decision Tree follows the same format
as a typical tree, which has a root system, branches, and leaves: it has a root node,
branches, and leaf nodes. Every internal node holds an attribute test, each branch
carries the test's outcome, and each leaf node contains the resulting class label [26].
A root node, as its name indicates, is the topmost node in a tree and the parent of all
other nodes. A decision tree is thus a tree in which each node represents an attribute,
each link represents a rule, and each leaf represents the result (a category or
continuous value) [26].
Support vector machines (SVMs) are typically capable of producing superior classification
accuracy compared to other data classification techniques, according to a number of
recent studies. SVMs have been used to solve a variety of real-world problems, including
text categorization, handwritten digit identification, tone recognition, object detection
in images, micro-array gene expression data analysis, and general data classification [27].
A Gaussian Naive Bayesian data classification model based on a clustering algorithm was
proposed for fast recognition and classification of unknown continuous data containing a
large amount of non-prior knowledge [28]. XGBoost is a decision tree ensemble based on
gradient boosting, designed to be highly scalable. Like gradient boosting, XGBoost builds
an additive expansion of the objective function by minimizing a loss function. Because
XGBoost uses only decision trees as base classifiers, a variation of the loss function is
used to control the complexity of the trees [29].
Automated identification of fake news is a difficult problem to address, since news
articles nowadays typically include photos and videos (rather than just text), which are
easy to forge. Furthermore, with the growth of social media, false news items are easily
accessible and have a high impact factor. Detecting false news is also tough since there
is no governing body in place to oversee what individuals may read, what carrier they use
to obtain a specific news item, or who is behind a particular news article [30].
To get around this restriction, deep learning models that can distinguish between fake
and legitimate news based on multiple modalities and news text data have been presented.
These models have shown improved detection performance, but the power of deep learning
models is not yet fully unleashed due to the lack of fresh high-quality samples for
training [31]. Fact-checking involves determining whether statements made by prominent
people, such as politicians and commentators, are true. Many researchers do not
distinguish between fake news detection and fact-checking, since both assess the
truthfulness of claims; generally, fake news detection focuses on news events while
fact-checking is broader [32].
There are organizations working to solve issues like author responsibility, such as the
House of Commons and the CrossCheck initiative. However, because they rely on manual
detection by humans, which is neither reliable nor practical in a world where millions of
articles are published or deleted every minute, their breadth is constrained. The
development of a system that gives a reliable automatic index score, or rating, for the
trustworthiness of various publications and the context of the news might be a
solution [33].

Chapter 3

Research Methodology
This chapter deals with the proposed research design, research method, and the methodology
adopted for this research, and discusses the overall operational framework. The framework
starts with the background study and proceeds through problem identification, the
literature review, a description of the proposed model, and, chiefly, the dataset selected
from Kaggle; all of these topics are discussed in this chapter.

3.1 Methodology
Figure 3.1 shows the details of the methodology discussed in this chapter.

Figure 3.1: Machine learning Model for Fake News Detection

3.2 Dataset Description
The dataset utilized in this study is referred to as "Fake and Real News." It was
obtained from Kaggle in November 2021 and contains a total of 2047 records with the three
attributes shown below. We imported all of the required Python libraries (the Anaconda
distribution, NumPy, Pandas, Matplotlib, Seaborn). Table 3.1 shows a few instances of
the dataset.

Table 3.1: Fake and Real News Detection

S.NO | Title | Text | Label
1 | muslims busted they stole millions in govt benefits | print they should pay all the back all the money ... | Real
2 | why did attorney general loretta lynch plead the fifth | why did attorney general loretta lynch plead the fifth ... | Real
3 | intl community still financing protecting terrorists ... | st century wire says wire reported on friday ... | Fake
4 | fbi director comeys leaked memo explains why hes ... | in a stunning turn of events days before ... | Fake
3.3 Pre-processing Steps


Most datasets contain some data that researchers do not require. In this dataset we
remove such unrequired data and clean the dataset properly. Below are the items we
remove:
3.3.1 Special Characters
Much of the text in the data contains acute accent characters (`). To remove these from
the text we use the neattext library in Python.
3.3.2 Punctuations
We know that all documents include punctuation (?, :, !, etc.), which in some programming
languages can cause errors, so we remove it. We use the same neattext library to remove
punctuation.
3.3.3 Stopwords
Stop words are a set of commonly used words in a language, for example "a", "the", "is",
and "are". We remove these using the neattext library in Python.
3.3.4 Userhandles
Our dataset contains user handles (usernames), which it is important to remove; we remove
them using the neattext library.
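The four cleaning steps above can be sketched in a few lines of Python. The thesis uses the neattext library; the snippet below is a minimal standard-library approximation of the same pipeline, with a small illustrative stopword list (an assumption — neattext ships its own, much larger list).

```python
import re
import string

# Placeholder stopword set for illustration; neattext's built-in list is larger.
STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "in", "to"}

def clean_text(text: str) -> str:
    """Apply the four pre-processing steps described above."""
    text = re.sub(r"@\w+", "", text)          # 3.3.4: remove user handles
    text = re.sub(r"[`´’‘]", "", text)        # 3.3.1: remove acute/quote characters
    text = text.translate(str.maketrans("", "", string.punctuation))  # 3.3.2
    tokens = [w for w in text.lower().split() if w not in STOPWORDS]  # 3.3.3
    return " ".join(tokens)

print(clean_text("The FBI is @comey busted: they stole millions!"))
# → fbi busted they stole millions
```

In the actual pipeline the same effect is obtained by chaining neattext's cleaning functions over the text column of the dataset.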

3.4 Tools
Python is used to compare the different classification algorithms.
3.4.1 Python
Python provides simple and easy-to-read code. Python's simplicity helps developers design
efficient systems, even though complex algorithms and flexible workflows lie behind
machine learning and AI. We used Python to analyze text and discover the sentiment hidden
within it, which can be accomplished by combining machine learning and natural language
processing. Sentiment analysis in Python allows us to examine the feeling expressed in a
piece of text.
The following Python libraries have been used, as discussed below.
A. Anaconda
Anaconda is a Python and R programming language distribution for scientific computing.
It is utilised in data science, machine learning, deep learning, and other fields.
B. NumPy
NumPy stands for Numerical Python. NumPy is a library built around multi-dimensional
array objects, and it allows us to conduct mathematical and logical operations on them.
C. Pandas
Pandas is a widely used open-source Python library for data science, data analysis, and
machine learning activities. We used pandas to import data from various formats, such as
CSV files, and for manipulation operations such as merging, selecting, and data cleaning.
D. Matplotlib
Matplotlib is a plotting library for the Python programming language that builds on
NumPy, a numerical handling resource for big data. We used Matplotlib to visualize
different numerical graphs.

E. Seaborn
Seaborn is a free and open-source Python library based on Matplotlib. We used Seaborn to
visualize data, perform exploratory data analysis, and make statistical graphs in Python.
F. TensorFlow
TensorFlow is a free and open-source software library for machine learning and artificial
intelligence. It can be used across a range of tasks but has a particular focus on training
and inference of deep neural networks. The TensorFlow platform helps you implement
best practices for data automation, model tracking, performance monitoring, and model
retraining. Using production-level tools to automate and track model training over the
lifetime of a product, service, or business process is critical to success.
3.4.2 Excel
Excel is used to plot the data in graphical form.

3.5 Division of Data


Our dataset contains 2047 records of real and fake news. We divide it into 50% real news
and 50% fake news to give accurate results.

3.6 Features Extraction


A feature represents a unique property, a recognizable measurement, or a functional
component that can assist classification without the loss of important information [34].
Below we explain the features that we extract from our dataset, which is in plain English.
3.6.1 Semantic Features
For our research work we use 6 different semantic features, which we extract from our
dataset of plain English text. Table 3.2 shows the semantic features.

Table 3.2: Semantic Features

S.NO  Semantic Feature
1     TF (Term Frequency)
2     IDF (Inverse Document Frequency)
3     Ngrams
4     Norms
5     Bigrams
6     Trigrams
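The first two features, TF and IDF, combine into the familiar TF-IDF weight, and the n-gram features are simple sliding windows over the token sequence. The sketch below illustrates both on a toy corpus; the documents and numbers are invented for illustration, not taken from the actual dataset.

```python
import math

# Toy corpus standing in for the cleaned news texts.
docs = [
    "muslims busted they stole millions in govt benefits",
    "attorney general loretta lynch plead the fifth",
    "intl community still financing protecting terrorists",
]

def tf_idf(term, doc, corpus):
    """TF-IDF weight of a term in one document of a corpus."""
    words = doc.split()
    tf = words.count(term) / len(words)               # term frequency
    df = sum(1 for d in corpus if term in d.split())  # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0   # inverse document frequency
    return tf * idf

def ngrams(doc, n):
    """All contiguous n-word windows (n=2 gives bigrams, n=3 trigrams)."""
    words = doc.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(round(tf_idf("busted", docs[0], docs), 3))  # → 0.137
print(ngrams("plead the fifth", 2))               # bigrams
```

"busted" occurs once in an 8-word document (TF = 1/8) and in one of three documents (IDF = ln 3), giving roughly 0.137; terms that appear in every document score zero, which is exactly why IDF is useful for separating distinctive vocabulary.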

3.7 Training Data


The train-test split method is used in this research for training the machine learning
model. The train-test split is a technique used for evaluating the performance of a
machine learning algorithm. We use k-fold cross-validation to train the data, taking k = 10.
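In practice scikit-learn's `train_test_split` and `KFold` utilities would normally be used for this step; the snippet below is a minimal standard-library sketch of 10-fold index generation to illustrate the idea (any remainder records after the integer split stay on the training side only).

```python
import random

def kfold_indices(n_samples, k=10, seed=42):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)   # shuffle once so folds are random
    fold_size = n_samples // k
    for i in range(k):
        test = idx[i * fold_size:(i + 1) * fold_size]
        test_set = set(test)
        train = [j for j in idx if j not in test_set]
        yield train, test

folds = list(kfold_indices(2047, k=10))
print(len(folds), len(folds[0][1]))   # → 10 204
```

With the thesis's 2047 records and k = 10, each model is trained ten times on 1843 records and tested on the held-out 204, and the reported metric is averaged over the ten folds.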

3.8 Machine Learning Models For Fake News Detection

Fake news detection is a supervised machine learning problem concerned with the
identification of real and fake news. For this identification we use the following
machine learning models: Logistic Regression, GaussianNB, the XGBoost classifier,
Random Forest, Decision Tree, the LGBM classifier, and SGD.
3.8.1 Logistic Regression
Logistic Regression (LR) is one of the most important statistical and data mining tech-
niques employed by statisticians and researchers for the analysis and classification of
binary and proportional response data sets. Some of the main advantages of LR are that
it can naturally provide probabilities and extend to multi-class classification problems.
Figure 3.2 shows the explanation of the logistic regression.

Figure 3.2: Logistic Regression

Here f1 to f4 are the features, all of which are connected to a single neuron. The neuron
in this graph performs two operations: first it multiplies the features by the weight vector
w (an n-dimensional vector) and adds the bias b (a real number); the second operation
is the activation function, indicated by σ. Y represents the output of the neuron.
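The two operations of the neuron can be sketched directly, assuming the activation is the logistic sigmoid (the standard choice for logistic regression); the feature and weight values below are made up for illustration:

```python
import numpy as np

def neuron(x, w, b):
    """Single logistic neuron: weighted sum plus bias, then sigmoid."""
    z = np.dot(w, x) + b             # first operation: w . x + b
    return 1.0 / (1.0 + np.exp(-z))  # second operation: sigmoid activation

# Four features f1..f4 feeding one neuron, as in Figure 3.2.
x = np.array([0.5, 1.0, -0.3, 2.0])
w = np.array([0.4, -0.2, 0.1, 0.3])
y = neuron(x, w, b=0.1)
print(y)  # a probability between 0 and 1
```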
3.8.2 GaussianNB
A Gaussian Naive Bayesian data classification model based on a clustering algorithm was
proposed for fast recognition and classification of unknown continuous data with little
a priori knowledge [35]. For some types of probability models, Naive Bayes classifiers
can be trained very efficiently in a supervised learning setting [36]. Equation 3.1 shows
the Bayes rule used by GNB.
P(A|B) = P(A ∩ B) / P(B) = P(A) · P(B|A) / P(B)        (3.1)

where
P(A) = the probability of A occurring
P(B) = the probability of B occurring
P(A|B) = the probability of A given B
P(B|A) = the probability of B given A
P(A ∩ B) = the probability of both A and B occurring
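A small illustrative example of Gaussian Naive Bayes with scikit-learn, on made-up numeric features (e.g. two TF-IDF scores per article), might look like this:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny illustrative feature matrix; labels: 1 = real, 0 = fake.
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y = np.array([1, 1, 0, 0])

# fit() estimates a Gaussian per feature per class; predict_proba()
# applies Bayes' rule (Equation 3.1) to get class probabilities.
clf = GaussianNB().fit(X, y)
probs = clf.predict_proba([[0.85, 0.15]])[0]
print(clf.predict([[0.85, 0.15]]))  # -> [1]
```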

3.8.3 XGBoost
XGBoost is a decision tree ensemble based on gradient boosting, designed to be highly
scalable. Like gradient boosting, XGBoost builds an additive expansion of the objective
function by minimizing a loss function. XGBoost is an implementation of a generalised
gradient boosting algorithm that has become a tool of choice in machine learning
competitions, due to its excellent predictive performance, highly optimised multi-core
and distributed implementation, and ability to handle sparse data [37]. Equation 3.2
shows the objective minimized by XGBoost.

obj(θ) = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k)        (3.2)

where K is the number of trees, each f_k is a function in the functional space F, and F
is the set of possible CARTs; l is the training loss and Ω is a regularization term that
penalizes tree complexity.
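The additive objective above is shared by all gradient-boosting implementations. As an illustration we use scikit-learn's GradientBoostingClassifier as a stand-in (the xgboost package, if installed, offers an analogous fit/predict API through its XGBClassifier); the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the news features.
X, y = make_classification(n_samples=300, n_features=10, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=1)

# Each boosting stage adds a small tree that reduces the loss l(y, y_hat),
# mirroring the additive expansion in Equation 3.2.
model = GradientBoostingClassifier(n_estimators=100, random_state=1)
model.fit(Xtr, ytr)
print(model.score(Xte, yte))  # held-out accuracy
```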
3.8.4 Random Forest
Random Forest was proposed by Leo Breiman (1996) as a development of tree-based
classifiers, and it quickly proved to be one of the most important algorithms in the
machine learning literature [38]. The Random Forest classifier is an ensemble classification
method comprising a collection of tree-like classifiers: the algorithm trains several
classifiers and combines their results through a voting process [39]. Random Forest is an
algorithm that learns from weak models (such as decision trees) to build a more robust
one that avoids over-fitting at minimal cost. The forest is built using the well-known
bootstrap technique; the main idea of bootstrapping is to merge learning models to
increase the overall classification result [40].
At each split, candidate features are chosen at random from the set of t attributes, and
the best split is obtained using the feature-selection criterion of a decision tree, so that
all sub-samples end in leaf nodes. This step is repeated to generate K decision trees,
which together form the final random forest. In the decision formula, H(x) is the
combined classifier, h_i are the individual decision trees, Y is the class label, and
I(h_i(x) = Y) is the indicator function. Figure 3.3 illustrates the Random Forest, and
Equation 3.3 gives the random-forest decision formula.

H(x) = argmax_Y Σ_{i=1}^{k} I(h_i(x) = Y)        (3.3)

Figure 3.3: Random Forest
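The majority-vote rule of Equation 3.3 can be sketched directly; the votes below are hypothetical outputs of five trees:

```python
from collections import Counter

def forest_predict(tree_votes):
    """H(x) = argmax_Y sum_i I(h_i(x) = Y): return the label that
    receives the most votes from the individual trees (Equation 3.3)."""
    return Counter(tree_votes).most_common(1)[0][0]

# Hypothetical predictions of five trees for one article.
votes = ["fake", "real", "fake", "fake", "real"]
print(forest_predict(votes))  # -> fake
```

In practice scikit-learn's RandomForestClassifier performs this voting (via averaged class probabilities) internally.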

3.8.5 Decision Tree


A normal tree includes a root, branches, and leaves, and the same structure is followed
in a Decision Tree: it contains a root node, branches, and leaf nodes. Each internal node
tests an attribute, each branch carries the outcome of the test, and each leaf node holds
a class label as the result [41]. The root node is the parent of all nodes and, as the name
suggests, is the topmost node in the tree. A decision tree is thus a tree where each node
represents a feature (attribute), each link (branch) represents a decision (rule), and each
leaf represents an outcome (a categorical or continuous value) [42]. Figure 3.4 shows the
Decision Tree classification algorithm.

Figure 3.4: Decision Tree Classification Algorithm
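A small illustrative decision tree with scikit-learn, on made-up features, shows the root, the internal test nodes, the branches, and the leaf class labels described above:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Two illustrative features per article (e.g. TF-IDF scores); 1 = real, 0 = fake.
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text prints the learned structure: the root test, the branches
# for each outcome, and the class label at each leaf.
print(export_text(tree, feature_names=["f1", "f2"]))
print(tree.predict([[0.85, 0.15]]))  # -> [1]
```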

3.8.6 LGBM (Light Gradient Boosting Machine)
LGBM stands for Light Gradient Boosting Machine. The LGBM algorithm uses two
concepts: GBDT (Gradient Boosting Decision Tree) and GOSS (Gradient-based
One-Side Sampling). These are widely used machine learning techniques for prediction
[43]. GBDT is a boosting algorithm; boosting is an ensemble process that builds a strong
classifier from a number of weak classifiers.
Consider a training set with n instances x1, ..., xn, where each xi is a vector of
dimension s in the space X^s. In each iteration of gradient boosting, the negative
gradients of the loss function with respect to the output of the model are denoted
g1, ..., gn. In the GOSS method, the training instances are ranked in descending order
by the absolute values of their gradients. The top a × 100% instances with the largest
gradients are kept, giving an instance subset A. From the remaining set A^c, consisting
of the (1 − a) × 100% instances with smaller gradients, a subset B of size b × |A^c| is
sampled at random. Finally, the instances are split according to the estimated variance
gain Ṽ_j(d) over the subset A ∪ B. Equation 3.4 shows the GOSS variance-gain formula.

Ṽ_j(d) = (1/n) [ (Σ_{x_i∈A_l} g_i + ((1−a)/b) Σ_{x_i∈B_l} g_i)² / n_l^j(d)
               + (Σ_{x_i∈A_r} g_i + ((1−a)/b) Σ_{x_i∈B_r} g_i)² / n_r^j(d) ]        (3.4)

The coefficient (1 − a)/b is used to normalize the sum of the gradients over B back to
the size of A^c. Here A_l, B_l and A_r, B_r are the instances of A and B that fall to the
left and right of the split point d on feature j, and n_l^j(d), n_r^j(d) are the
corresponding instance counts.
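The GOSS sampling step described above (keep the top a × 100% instances by absolute gradient, then randomly sample b × 100% of the rest) can be sketched as follows; the gradients here are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

def goss_sample(gradients, a=0.2, b=0.1):
    """GOSS: keep the top a*100% instances by |gradient| (subset A)
    and randomly sample b*100% of the remainder (subset B)."""
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))  # indices, descending |g_i|
    top_k = int(a * n)
    A = order[:top_k]                       # large-gradient instances
    rest = order[top_k:]                    # small-gradient instances
    B = rng.choice(rest, size=int(b * n), replace=False)
    return A, B

g = rng.normal(size=100)   # synthetic per-instance gradients
A, B = goss_sample(g)
print(len(A), len(B))      # 20 instances in A, 10 in B
```

The LightGBM library performs this sampling internally; the sketch only makes the selection rule explicit.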

3.8.7 SGD (Stochastic Gradient Descent)


Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to fitting
linear classifiers and regressors under convex loss functions such as (linear) Support
Vector Machines and Logistic Regression. Even though SGD has been around in the
machine learning community for a long time, it has received a considerable amount of
attention only recently in the context of large-scale learning. SGD has been successfully
applied to large-scale and sparse machine learning problems often encountered in text
classification and natural language processing. Equation 3.5 shows the SGD update rule.

θ_j = θ_j − α ∂J(θ)/∂θ_j        (3.5)

As a review, gradient descent seeks to minimize an objective function J(θ) by selecting
a learning rate α and iteratively updating each parameter θ_j.
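A minimal sketch of the update rule in Equation 3.5, applied to the toy convex objective J(θ) = θ² (whose gradient is 2θ and whose minimizer is 0):

```python
# One gradient-descent step: theta_j := theta_j - alpha * dJ/dtheta_j.
def sgd_step(theta, grad, alpha=0.1):
    return theta - alpha * grad(theta)

grad = lambda t: 2 * t  # derivative of J(theta) = theta**2
theta = 5.0
for _ in range(50):
    theta = sgd_step(theta, grad)
print(theta)  # close to the minimizer 0
```

In the stochastic variant used by scikit-learn's SGDClassifier, the gradient is estimated from one example (or a mini-batch) at a time rather than from the full dataset.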

3.9 Performance Assessment Measures


For performance evaluation we use accuracy, precision, recall, and F1 measures. All the
evaluation metrics are discussed below with their formulas.
3.9.1 Accuracy
Accuracy is a criterion for evaluating classification models: informally, it is the fraction
of predictions the model got right, i.e. the proportion of observations classified correctly.
Equation 3.6 shows the formula for accuracy.

Accuracy = (TP + TN) / (TP + FP + FN + TN)        (3.6)

True positive (TP) = the number of fake news items correctly identified as fake
False positive (FP) = the number of real news items incorrectly identified as fake
True negative (TN) = the number of real news items correctly identified as real
False negative (FN) = the number of fake news items incorrectly identified as real

3.9.2 Precision
Precision, also known as positive predictive value, is the proportion of correctly
predicted positive observations among all predicted positives, i.e. the fraction of relevant
instances among the retrieved instances. Equation 3.7 shows the formula for precision.

Precision = TP / (TP + FP)        (3.7)

It is the percentage of true positives (TP) out of the total predicted positives (TP + FP).

3.9.3 Recall
Recall, also known as sensitivity, hit rate, or true positive rate (TPR), is the proportion
of all relevant instances that were actually retrieved: the number of true positives
divided by the number of true positives plus the number of false negatives. In information
retrieval it is the percentage of relevant documents that are successfully retrieved, and
it can be read as the probability that a query returns a relevant document. Equation 3.8
shows the formula for recall.

Recall = TP / (TP + FN)        (3.8)

It is the percentage of true positives (TP) out of the total actual positives (TP + FN).

3.9.4 F1 Measure
The F1 measure, also known as F-score or F-measure, takes both precision and recall
into account when measuring the performance of an algorithm: mathematically, it is
their harmonic mean. It ranges from a minimum of 0 to a maximum of 1 (perfect
precision and recall) and is a measure of the preciseness and robustness of a model,
commonly used to evaluate binary classification systems. Equation 3.9 shows the
formula for the F1 measure.

F1-measure = (2 × Precision × Recall) / (Precision + Recall)        (3.9)
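All four metrics above can be computed with scikit-learn; the labels below are hypothetical (1 = fake, 0 = real), chosen so that TP = TN = 3 and FP = FN = 1:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Hypothetical ground-truth and predicted labels (1 = fake, 0 = real).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)    # (TP+TN)/(TP+FP+FN+TN), Eq. 3.6
prec = precision_score(y_true, y_pred)  # TP/(TP+FP), Eq. 3.7
rec = recall_score(y_true, y_pred)      # TP/(TP+FN), Eq. 3.8
f1 = f1_score(y_true, y_pred)           # harmonic mean, Eq. 3.9
print(acc, prec, rec, f1)  # all 0.75 for this example
```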

Chapter 4

Evaluation and Discussion


This chapter presents the outcomes of each technique: its accuracy, its precision on real
and fake news, its recall on real and fake news, and its F-measure on real and fake news.
The results are then discussed to identify which technique performed best and why.

4.1 Evaluation Results


This section presents the outcomes obtained through the analysis of the classification
algorithms. Seven classifiers, namely LR, GNB, XGBoost, RF, DT, LGBM, and SGD,
were tested on the Kaggle dataset using k-fold cross-validation with k = 10. The
classifiers were evaluated on four metrics, accuracy, precision, recall, and F-measure, to
analyze which algorithm works best for fake news detection.
The outcomes for each measure are obtained with the help of the Confusion Matrix
(CM). The CM is a very popular measure for classification problems and can be applied
to binary as well as multiclass classification. A sample CM is presented in Figure 4.1,
where each cell shows one of the values TP, FP, FN, and TN. TP (True Positive) is the
number of positive examples classified accurately; FP (False Positive) is the number of
actual negative examples classified as positive; FN (False Negative) is the number of
actual positive examples classified as negative; and TN (True Negative) is the number
of negative examples classified accurately.
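A small sketch of computing such a confusion matrix with scikit-learn, on hypothetical labels (1 = fake, 0 = real):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# With the default label order [0, 1], rows are actual classes and
# columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)
```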

Figure 4.1: Confusion matrix for binary classifiers

Figure 4.2 presents the confusion matrix for fake news detection. This is a binary
classification confusion matrix.

Figure 4.2: Represents confusion matrix for fake news detection

Table 4.1 shows the comparative outcomes for precision and recall. The analysis shows
the best performance from XGB, with values of 81% for both precision and recall; the
worst performance is from GNB, with 67% for both. Figure 4.3 illustrates this analysis
graphically for better understanding.

Table 4.1: Performance Analysis of each Technique using Precision and Recall.

S.NO  Technique  Precision  Recall
1     LR         75%        75%
2     GNB        67%        67%
3     XGB        81%        81%
4     RF         80%        79%
5     DT         74%        74%
6     LGBM       77%        77%
7     SGD        73%        74%

Figure 4.3: Comparison based on Precision and Recall for fake news detection

Table 4.2: Performance Analysis of each Technique using MCC and F1-score for fake
news detection

S.NO  Technique  F1-Score  MCC
1     LR         75%       0.30
2     GNB        67%       0.18
3     XGB        81%       0.25
4     RF         78%       0.74
5     DT         72%       0.48
6     LGBM       76%       0.32
7     SGD        73%       0.32

The MCC analysis for real and fake news using each employed ML algorithm is presented
in Table 4.2. Here the performance of RF is better than the other employed models,
with an MCC value of 0.74, while GNB shows the weakest performance on MCC for
fake news detection. Table 4.2 also shows the F1-score, where XGBoost outperforms the
others with a value of 81% and GNB performs worst with 67%. Figure 4.4 illustrates
this analysis graphically for better understanding.

Figure 4.4: Performance Analysis of each Technique using MCC for fake news detection

Table 4.3: Performance Analysis of each Technique using Accuracy

S.NO Technique Accuracy


1 LR 75.4%
2 GNB 66.6%
3 XGB 81.3%
4 RF 78%
5 DT 74.01%
6 LGBM 77%
7 SGD 73.5%

The accuracy analysis for fake news detection is shown in Table 4.3 and Figure 4.5.
This analysis shows the better performance of XGBoost, which achieves 81.3% accuracy
for fake news detection. Based on this analysis, we propose the XGBoost model for fake
news detection on the English-language fake and real news dataset, as compared to the
other employed ML models.

Figure 4.5: Performance Analysis of each Technique using Accuracy

4.2 Discussion
This section critically discusses the results achieved by each of the discussed ML
algorithms for fake news detection on the proposed English-language fake and real
dataset. Figure 4.6 presents the percentage difference between XGBoost and each other
model. The percentage difference measures how far apart two experimental values are
when one is higher than the other: it is defined as the absolute value of the difference
between two numbers divided by their average, expressed as a percentage. It can be
defined as:

PD = |n1 − n2| / ((n1 + n2) / 2)        (4.1)

where n1 represents the value of XGBoost from Figure 4.6 and n2 represents the value
of the other technique. In our case, we can see that there is a very slight difference
between XGBoost and RF, while XGBoost and GNB differ considerably. In every
comparison for fake news detection, the performance of XGBoost is better.

Figure 4.6: Percentage difference of XGBoost with the other techniques

The main focus of this study was to analyze seven well-known ML classification
algorithms on the data we assembled, collected from the Kaggle fake and real news
dataset of the English language. It can be observed that XGBoost performs best among
all the models for fake news detection.

4.2.1 Why Performance of XGB Classifier is Better?


XGBoost is among the most widely used algorithms in machine learning, whether the
problem is classification or regression, and it is known for its good performance compared
to other machine learning algorithms. XGBoost is a decision-tree-based ensemble machine
learning algorithm that uses a gradient boosting framework: like gradient boosting
machines, it is an ensemble tree method that applies the principle of boosting weak
learners using the gradient descent architecture.

Summary
This chapter discussed the outcomes of XGBoost on the Kaggle dataset in comparison
to the other classification models. The evaluation metrics used for benchmarking are
accuracy, precision, recall, and F-measure. After prediction, comparisons were performed
among all the techniques to find the best one. The results conclude that XGBoost
achieved the highest accuracy among the classifiers.

Conclusion and Future Work
Conclusion
In this study, we describe a model for detecting false news utilising a range of machine
learning methods. According to the literature review, different studies have produced
strategies for false news detection, but increasing accuracy remains a difficult challenge;
despite much research in this area, there is still space for improvement. This research
focuses on increasing the accuracy of false news identification. The dataset was chosen
from Kaggle. Performance is assessed using accuracy together with precision (real),
precision (fake), recall (real), recall (fake), F-measure (real), and F-measure (fake). The
accuracies of all algorithms used in this study are as follows: LR (75.4%), Gaussian NB
(66.6%), XGB classifier (81.3%), RF (78%), DT (74.01%), LGBM (77%), SGD classifier
(73.5%). Overall, the XGB classifier outperforms the Random Forest, obtaining
accuracies of 81.3% and 78% respectively. This study therefore recommends the XGBoost
and Random Forest techniques for detecting fake news.

Future Work
For future work, these machine learning algorithms may be tested on more enriched
dataset. And this work can make it possible to offer real-time classification over websites
by improving the identification of fake and real news.

References
[1] Lai, Chun-Ming, et al. "Fake news classification based on content level features."
Applied Sciences 12.3 (2022): 1116.

[2] Mugdha, Shafayat Bin Shabbir, Sayeda Muntaha Ferdous, and Ahmed Fahmin.
"Evaluating machine learning algorithms for Bengali fake news detection." 2020 23rd
International Conference on Computer and Information Technology (ICCIT). IEEE, 2020.

[3] Dizikes, Peter. "Study: On Twitter, false news travels faster than true stories." MIT
News, 2018. http://news.mit.edu/2018/study-twitterfalse-news-travels-faster-true-stories-0308
(retrieved 20.04.2018).

[4] Horne, Benjamin D., and Sibel Adali. "This just in: Fake news packs a lot in title,
uses simpler, repetitive content in text body, more similar to satire than real news."
Eleventh International AAAI Conference on Web and Social Media. 2017.

[5] Ahmad, Iftikhar, et al. "Fake news detection using machine learning ensemble
methods." Complexity 2020 (2020).

[6] Qian, Feng, et al. "Neural user response generator: Fake news detection with
collective user intelligence." IJCAI. Vol. 18. 2018.

[7] Balmas, Meital. "When fake news becomes real: Combined exposure to multiple
news sources and political attitudes of inefficacy, alienation, and cynicism."
Communication Research 41.3 (2014): 430-454.

[8] Riedel, Benjamin, et al. "A simple but tough-to-beat baseline for the Fake News
Challenge stance detection task." arXiv preprint arXiv:1707.03264 (2017).

[9] Asr, F. T., and M. Taboada. "MisInfoText: a collection of news articles, with false
and true labels." (2019).

[10] Shu, Kai, et al. "Fake news detection on social media: A data mining perspective."
ACM SIGKDD Explorations Newsletter 19.1 (2017): 22-36.

[11] Shu, Kai, et al. "Fake news detection on social media: A data mining perspective."
ACM SIGKDD Explorations Newsletter 19.1 (2017): 22-36.

[12] Holan, Angie. "The media's definition of fake news vs. Donald Trump's." First
Amendment Law Review 16 (2017): 121.

[13] Lazer, David M. J., et al. "The science of fake news." Science 359.6380 (2018):
1094-1096.

[14] Conroy, Nadia K., Victoria L. Rubin, and Yimin Chen. "Automatic deception
detection: Methods for finding fake news." Proceedings of the Association for
Information Science and Technology 52.1 (2015): 1-4.

[15] Jin, Zhiwei, et al. "Novel visual and statistical image features for microblogs news
verification." IEEE Transactions on Multimedia 19.3 (2016): 598-608.

[16] Shu, Kai, et al. "Fake news detection on social media: A data mining perspective."
ACM SIGKDD Explorations Newsletter 19.1 (2017): 22-36.

[17] Han, Wenlin, and Varshil Mehta. "Fake news detection in social networks using
machine learning and deep learning: Performance evaluation." 2019 IEEE International
Conference on Industrial Internet (ICII). IEEE, 2019.

[18] Bali, Arvinder Pal Singh, et al. "Comparative performance of machine learning
algorithms for fake news detection." International Conference on Advances in Computing
and Data Sciences. Springer, Singapore, 2019.

[19] Gilda, Shlok. "Notice of Violation of IEEE Publication Principles: Evaluating
machine learning algorithms for fake news detection." 2017 IEEE 15th Student
Conference on Research and Development (SCOReD). IEEE, 2017.

[20] Rubin, Victoria L., et al. "Fake news or truth? Using satirical cues to detect
potentially misleading news." Proceedings of the Second Workshop on Computational
Approaches to Deception Detection. 2016.

[21] Das, Dipto, and Anthony J. Clark. "Sarcasm detection on Flickr using a CNN."
Proceedings of the 2018 International Conference on Computing and Big Data. 2018.

[22] Marquardt, Dorota. "Linguistic indicators in the identification of fake news."
Mediatization Studies 3 (2019).

[23] Wang, William Yang. "'Liar, liar pants on fire': A new benchmark dataset for fake
news detection." arXiv preprint arXiv:1705.00648 (2017).

[24] Buntain, Cody, and Jennifer Golbeck. "Automatically identifying fake news in
popular Twitter threads." 2017 IEEE International Conference on Smart Cloud
(SmartCloud). IEEE, 2017.

[25] Morales-Molina, Carlos Domenick, et al. "Methodology for malware classification
using a random forest classifier." 2018 IEEE International Autumn Meeting on Power,
Electronics and Computing (ROPEC). IEEE, 2018.

[26] Qian, Feng, et al. "Neural user response generator: Fake news detection with
collective user intelligence." IJCAI. Vol. 18. 2018.

[27] Qian, Feng, et al. "Neural user response generator: Fake news detection with
collective user intelligence." IJCAI. Vol. 18. 2018.

[28] Patel, Harsh H., and Purvi Prajapati. "Study and analysis of decision tree based
classification algorithms." International Journal of Computer Sciences and Engineering
6.10 (2018): 74-78.

[29] Thomas, Vincentius Westley Dimitrius, and Fitrah Rumaisa. "Analisis sentimen
ulasan hotel bahasa Indonesia menggunakan support vector machine dan TF-IDF."
Jurnal Media Informatika Budidarma 6.3 (2022): 1767-1774.

[30] Chen, Yimin, Nadia K. Conroy, and Victoria L. Rubin. "News in an online world:
The need for an 'automatic crap detector'." Proceedings of the Association for
Information Science and Technology 52.1 (2015): 1-4.

[31] Ma, Jing, et al. "Detecting rumors from microblogs with recurrent neural networks."
(2016): 3818.

[32] Vlachos, Andreas, and Sebastian Riedel. "Fact checking: Task definition and dataset
construction." Proceedings of the ACL 2014 Workshop on Language Technologies and
Computational Social Science. 2014.

[33] Khanam, Z., et al. "Fake news detection using machine learning approaches." IOP
Conference Series: Materials Science and Engineering. Vol. 1099. No. 1. IOP Publishing,
2021.

[34] Cordeiro, Gauss M., and Peter McCullagh. Journal of the Royal Statistical Society,
Series B: Methodological 53.3 (1991): 629-643.

[35] Bi, Zeng-Jun, et al. "Gaussian naive Bayesian data classification model based on
clustering algorithm." 2019 International Conference on Modeling, Analysis, Simulation
Technologies and Applications (MASTA 2019). Atlantis Press, 2019.

[36] Agarwal, Shikha, Balmukumd Jha, Tisu Kumar, Manish Kumar, and Prabhat
Ranjan. "Hybrid of naive Bayes and Gaussian naive Bayes for classification: a MapReduce
approach." 2019.

[37] Mitchell, Rory, and Eibe Frank. "Accelerating the XGBoost algorithm using GPU
computing." PeerJ Computer Science 3 (2017): e127.

[38] Zhang, Zhenyu, and Xiaoyao Xie. "Research on AdaBoost.M1 with random forest."
2010 2nd International Conference on Computer Engineering and Technology. Vol. 1.
IEEE, 2010.

[39] Amini, Saeid, Saeid Homayouni, and Abdolreza Safari. "Semi-supervised
classification of hyperspectral image using random forest algorithm." 2014 IEEE
Geoscience and Remote Sensing Symposium. IEEE, 2014.

[40] Morales-Molina, Carlos Domenick, et al. "Methodology for malware classification
using a random forest classifier." 2018 IEEE International Autumn Meeting on Power,
Electronics and Computing (ROPEC). IEEE, 2018.

[41] Qian, Feng, et al. "Neural user response generator: Fake news detection with
collective user intelligence." IJCAI. Vol. 18. 2018.

[42] Jadhav, Sayali D., and H. P. Channe. "Efficient recommendation system using
decision tree classifier and collaborative filtering." International Research Journal of
Engineering and Technology 3.8 (2016): 2113-2118.

[43] Dhiman, Gaurav, et al. "Federated learning approach to protect healthcare data
over big data scenario." Sustainability 14.5 (2022): 2500.
