
Ain Shams Engineering Journal 14 (2023) 102166


Exploratory data analysis and deception detection in news articles on social media using machine learning classifiers

Anu Sharma (a,*), M.K. Sharma (b), Rakesh Kr. Dwivedi (c)

(a) Uttarakhand Technical University, Dehradun, Uttarakhand, India
(b) Amrapali Institute, Haldwani, India
(c) CCSIT, Teerthanker Mahaveer University, Moradabad, UP, India

Article info

Article history:
Received 31 July 2022
Revised 13 November 2022
Accepted 10 January 2023
Available online 21 January 2023

Keywords:
Fake news
Fake news detection
Machine learning classifiers
Random Forest
Naïve Bayes
Passive Aggressive

Abstract

This paper investigates realistic ways to automatically identify fake news on digital platforms. To begin, a massive number of current and correlated works were surveyed in an attempt to incorporate all possible features for detecting fake news, followed by exploratory data analysis to identify sources that frequently publish fake news and to determine the most frequently occurring words in the title and body of fake and genuine news. Our findings indicate that the suggested computer models possess an advantageous discriminative potential for detecting fake news transmitted via digital channels. In this paper, we classify documents into fake/real news categories using Random Forest (RF), Naive Bayes (NB), and Passive Aggressive (PA) machine learning classifiers with and without text processing (TP). Our results are computed from the confusion matrix, and classifier performance is reported using accuracy, precision, recall, and F1 score metrics for fake news detection.

© 2023 THE AUTHORS. Published by Elsevier BV on behalf of Faculty of Engineering, Ain Shams University. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Abbreviations: RF, Random Forest; NB, Naive Bayes; PA, Passive Aggressive; TF-IDF, Term Frequency-Inverse Document Frequency; LIWC, Linguistic Inquiry and Word Count; RST, Rhetorical Structure Theory; VSM, Vector Space Model; gCLUTO, Graphical CLUstering TOolkit; RNN, Recurrent Neural Network; CNN, Convolutional Neural Network; SVM, Support Vector Machine; NN, Neural Network; API, Application Programming Interface; POS, Part-of-speech; MFD, Moral Foundations Dictionary; TP, Text Processing; TP, True Positive; TN, True Negative; FP, False Positive; FN, False Negative; ACC, Accuracy; PRE, Precision; REC, Recall.

* Corresponding author. E-mail address: er.anusharma20@gmail.com (A. Sharma). Peer review under responsibility of Ain Shams University. Production and hosting by Elsevier.

https://doi.org/10.1016/j.asej.2023.102166
2090-4479/© 2023 THE AUTHORS. Published by Elsevier BV on behalf of Faculty of Engineering, Ain Shams University. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1. Introduction

About a third of the world's populace uses digital platforms, such as social media sites and messaging programs [1]. These platforms have fundamentally altered how users engage and communicate online, enabling a flood of new apps and reshaping established information ecosystems. Digital platforms, in particular, have radically changed the way news is generated, delivered, and consumed, causing both unexpected opportunities and complex issues. The flood of manufactured medical information transmitted via digital media is a catastrophe in terms of health [3]. For example, a cancer patient searching for medically reliable data followed an online announcement for an experimental cancer cure, leading to his death [4]. Moreover, the spread of rumors and conspiracies through social media platforms has increased during the COVID-19 pandemic [5,6]. In less than two months, the International Fact-Checking Network (IFCN) discovered almost 3,500 incorrect statements about COVID-19 [7]. As a result, at least 800 individuals may have died worldwide in the first three months of 2020 due to coronavirus-related misinformation.

The 2016 presidential election in the United States of America is still renowned for a misinformation war waged primarily on Twitter and Facebook. The infamous case featured an attempt by Russia to exert influence through targeted advertising [8]. According to a recent study, 88 % of the most prevalent photos posted in the month preceding the Brazilian elections were bogus or deceptive [9]. In India, fake tales circulated via the web service were also blamed for several lynchings and societal turmoil [10,11].

Identifying fake news automatically is not a simple feat. To begin with, humans are naturally unable to distinguish between true and false news [2], much more so when it comes to delicate matters such as politics and well-being. Additionally, news items

are generated by various sources, each with its own content style and inherent favouritism. They are transmitted in multiple ways through numerous locations, complicating the challenge of identifying fake news. As such, we plan to research characteristics and solutions that remain relevant in various contexts and investigate ways to detect fake news propagated on digital platforms.

A. Types of bogus news

Social science academics have examined fake news from various angles and classified it broadly, as Rubin et al. did in their latest work [12]. This classification is summarised here.

- Visual-based: Visual-based false news relies heavily on graphical representation as content, which may embrace altered video, images, or a blend of the two [13].
- User-based: The user-based kind is aimed at a specific audience through phony accounts, and its target viewers may exemplify particular age groups, genders, or cultures.
- Post-based: Post-based fake news is mainly intended to be disseminated via social media platforms.
- Network-based: Such news is targeted at specific members of an organization who are related.
- Knowledge-based: Such bogus news articles include precise or rational explanations for unresolved situations.
- Style-based: Style-based journalism emphasizes how information is presented to its viewers.
- Stance-based: While this kind is similar to the style-based type stated above, it is distinct in that it focuses on how statements are expressed in an artifact.

B. False news recognition approaches

False news recognition approaches and their key characteristics, advantages, and limitations are discussed underneath:

i. Linguistic Features-based Methods

Language techniques focus on utilizing/extracting essential linguistic characteristics from fake news, as discussed underneath:

N-grams: Unigrams/bigrams are derived from a story's assemblage of words. These are preferably preserved as "Term Frequency-Inverse Document Frequency (TF-IDF)" values for information gathering. TF-IDF is an arithmetical indicator used to indicate the prominence of a word in the document in which it is utilized [52,53].

Punctuation: Punctuation can assist the algorithm in distinguishing between misleading and authentic texts.

Psycho-linguistic features: Some researchers advised using the LIWC lexicon ("Linguistic Inquiry and Word Count") to extract psycho-linguistic aspects. The system may therefore decide the tenor of the language (e.g., lovely sentiments, sensorial processes), the statistics of texts (e.g., number of words), and the portion of the language category (e.g., sentences, verbs) [14].

Readability: This includes extracting content characteristics such as character quantity, multifaceted words, lengthy words, number of syllables, kind of word, and number of paragraphs [14]. By incorporating these qualities into the content, we may calculate readability measures.

Syntax: This technique extracts several features from a context-free grammar. These features depend considerably on lexicalized production rules, in conjunction with their parent and grandparent nodes. To obtain information, the functions of this set are encoded as TF-IDF values.

ii. Deception Modelling-based Methods

RST ("Rhetorical Structure Theory") and VSM ("Vector Space Model") are the two approaches for clustering false and factual stories [13].

RST: RST procedural analysis captures a story's logic regarding functional relationships between various relevant text components and defines a hierarchical structure for each story [15].

VSM: VSM is used to connect rhetorical structures in RST sets. Because VSM scans all news items as vectors in high-dimensional space, it is essential to represent the extracted text computationally [15,16].

iii. Clustering-based Methods

Clustering is a well-known technique for comparing and contrasting vast amounts of data; Robin et al. [15] used the gCLUTO (Graphical CLUstering TOolkit) clustering software to aid in the differentiation of news items based on their resemblance according to the clustering algorithm used. The co-ordinate distance notion measures this model's ability to identify deception in a new story. For Robin et al. [15], new accounts were appraised for misleading values using Euclidean distances from deceptive cluster centers. The author claims this method works well on large data sets, with a 63 % success rate. One obvious disadvantage is that this approach may not produce an accurate result if a recent false news item is used, as similar news story sets may not be accessible.

iv. Predictive Modelling-based Methods

Robin et al. [15] used 100 out of 132 news reports as training data. Affirmative coefficients enhance the likelihood of integrity, while destructive coefficients increase the likelihood of deception.

v. Content Cues-based Methods

According to Chen et al. [17], the Content Cues-based approach is founded on what journalists enjoy writing for users and what consumers enjoy reading (choice gap). Including certain content in a news report encourages readers to read on. Contaminated news items frequently urge interaction and encouragement, which attracts users.

i. Lexical and Semantic Analysis: The word choice is critical in persuading readers to accept the story. Automated approaches can be utilized to excerpt stylometric features from the text that can be exploited to distinguish between two journalistic genres.
ii. Syntactic and Pragmatic Levels of Analysis: The pragmatic function of headlines is to call attention to upcoming discourse sections [17]. This is accomplished by referring to open areas of the news story. Headlines are created to occupy vacant thinking with the leveraging material that follows. This analysis also compares news sites with a higher share activity to those that make significantly more news content.

vi. Non-Text Cues-based Methods

Non-text cues refer to news stories' non-text content, as Chen et al. addressed [17]. The non-textual aspect of a news article is crucial for polluted news trust. Images are recognized to have a powerful impact and are typically the most eye-catching component of news articles. This method includes two analyses:


i. Image Exploration: The tactical usage of visuals is a recognized technique for manipulating an observer's emotion. Given that many readers react to news stories based on the title and a picture, the image (multimedia) is critical in encouraging readers to trust the subject matter.
ii. User Behaviour Exploration: This is a content-independent technique that is particularly beneficial for determining how readers engage with the news after being enticed by the article. News organizations must attract traffic to their primary website via various channels. Understanding user behavior and utilizing teaser photos are critical components of gaining traction on social media.

Table 1 illustrates which sort of fake news revealing approach is successful for various types of false news material.

C. Fake news datasets

Popular data sets for detecting fake news include:

Buzz Feed News [18]: Buzz Feed News is a compilation of headlines and links to actual news stories or posts deemed false news. This data set is valuable for testing linguistic approaches.

LIAR [19]: LIAR is a benchmarking system developed by researchers at the University of California, Santa Barbara. This data collection, like Buzz Feed News, is linguistic in nature and comprises just text data.

BS detector [20]: This data comes from a browser addon called BS detector, which was created to check the trustworthiness of news. It checks every link on a page for suspicious sources and compares them to a manual list of domains.

CREDBANK [21]: This is the only data collection that contains social media data and enables users to analyze it. This data set qualifies for all categories except visual data. While it lacks multimedia data, it remains a very tempting option for academics interested in detecting fake news in media.

Buzz Face [22]: This data collection is derived by supplementing the BuzzFeed data set with comments on Facebook news articles. There are 2263 news stories and 1.6 million words in the data collection.

Facebook Hoax [23]: This data set contains information about postings from Facebook pages devoted to scientific news (non-hoax) and conspiracy words (hoax), which were gathered via the Facebook Graph API.

Table 2 summarises the various data sets in terms of the types of attributes upon which they are based.

D. Characteristics for detecting bogus news

When we evaluate activities relating to information credibility, rumor detection, and news dissemination, the literature is quite extensive. A reasonable study of these efforts to ascertain the characteristics is summarized in Table 3.

2. Related studies

Scholars from various disciplines, including journalism, communication, and political science, have long studied the news media. Nonetheless, computer scientists became interested when news organizations began to utilize the Web and digital platforms. With the advent of the digital era, news organizations began dissemination in digital format. To address these concerns, computer scientists have looked at the news ecosystem on digital platforms but with different goals and purposes. Although detecting false news is not a new issue, recent attempts have concentrated on better understanding the phenomenon of fake news on digital platforms [1–48]. Vosoughi et al. [38] demonstrate that fake news spreads quicker than actual news. Lazer et al. [18,49] advocate forming an interdisciplinary task group to address this problematic issue. However, several intrinsic aspects of digital platforms lead to the proliferation of fake news in these environments. Table 4 is a comparative examination of similar work on fake news.

Table 1
Bogus news types and their detection methods.

| Bogus News Types | Linguistic Modelling | Deception Modelling | Clustering | Predictive Modelling | Content Cues | Non-Text Cues |
| ---------------- | -------------------- | ------------------- | ---------- | -------------------- | ------------ | ------------- |
| Visual-based     | –                    | –                   | –          | –                    | –            | ✓             |
| User-based       | –                    | –                   | –          | ✓                    | ✓            | ✓             |
| Post-based       | ✓                    | ✓                   | ✓          | ✓                    | –            | ✓             |
| Network-based    | –                    | –                   | –          | –                    | ✓            | –             |
| Knowledge-based  | –                    | –                   | ✓          | –                    | –            | –             |
| Style-based      | ✓                    | –                   | –          | ✓                    | ✓            | –             |
| Stance-based     | –                    | –                   | –          | –                    | –            | ✓             |

Table 2
Evaluation of data sets containing bogus news.

| Data set       | Linguistic | Visual | User | Post | Response | Network | Spatial | Temporal |
| -------------- | ---------- | ------ | ---- | ---- | -------- | ------- | ------- | -------- |
| Buzz Feed News | ✓          | –      | –    | –    | –        | –       | –       | –        |
| LIAR           | ✓          | –      | –    | –    | –        | –       | –       | –        |
| BS detector    | ✓          | –      | –    | –    | –        | –       | –       | –        |
| CREDBANK       | ✓          | –      | ✓    | ✓    | ✓        | –       | –       | ✓        |
| Buzz Face      | ✓          | –      | ✓    | ✓    | ✓        | –       | –       | –        |
| Facebook Hoax  | ✓          | –      | ✓    | ✓    | ✓        | –       | –       | –        |
| Fake News Net  | ✓          | ✓      | ✓    | ✓    | ✓        | ✓       | ✓       | ✓        |

(Linguistic and Visual describe news content; User, Post, Response, and Network describe social context; Spatial and Temporal describe spatiotemporal information.)


Table 3
Summary of the features for detecting bogus news.

| Inference from...          | Characteristic Set              | Used Approach                                                                                                | References        |
| -------------------------- | ------------------------------- | ------------------------------------------------------------------------------------------------------------ | ----------------- |
| News Content               | Language Structures             | Sentence-level features (e.g. n-grams, bag-of-words, POS tagging)                                            | [2,24–27]         |
|                            | Lexical Features                | Word-level and character-level features                                                                      | [2,27–31]         |
|                            | Moral Foundation Cues           | Moral foundation features                                                                                    | [32,33]           |
|                            | Visual (Images/Videos)          | Manipulation indicators and image distributions                                                              | [34,35]           |
|                            | Psycho-linguistic Cues          | Signals of persuasive language (e.g. anger, grief, and other emotions, as well as evidence of slanted language) | [36–38]           |
|                            | Semantic Structure              | Contextual information                                                                                       | [39–41]           |
|                            | Subjectivity Cues               | Subjectivity score, sentiment analysis, opinion lexicons                                                     | [42–44]           |
| News Source                | Bias Cues                       | Favouritism indicators (for example, politics), polarisation                                                 | [31]              |
|                            | Credibility and Trustworthiness | Estimation of the user's opinion of the credibility                                                          | [2,38,45]         |
| Environment (Social Media) | Engagement                      | Page visits/Likes/Retweets                                                                                   | [8,22,37,40,43,46] |
|                            | Network Structure               | Complex network metrics, friendship network                                                                  | [2,30,47]         |
|                            | Temporal Patterns and Novelty   | Metrics for time series, propagation, and uniqueness                                                         | [2,25,30,47]      |
|                            | User Information                | Profiles and attributes of users at the individual and group levels (e.g. their friends and followers)       | [2,25,26–47]      |

Table 4
Comparative analysis of related work.

| Authors                 | Proposed Approach                                      | Model                        | Dataset                                      | Features                                                           |
| ----------------------- | ------------------------------------------------------ | ---------------------------- | -------------------------------------------- | ------------------------------------------------------------------ |
| Volkova et al. [37]     | Identified tweets in which rumours were endorsed       | RNN, CNN, Logistic Regression | Tweets                                       | Content-based, network-based, Twitter-specific memes               |
| Carvalho et al. [33]    | Natural language processing                            | NB, J48, RC, RF              | Brazilian Portuguese MFD                     | Using moral aspects to discern between credible and unreliable sources |
| Helmstetter et al. [43] | Analysis of Twitter content                            | NB, DT, SVM, NN              | Twitter API and DMOZ catalogue               | User- and tweet-level features                                     |
| Chen et al. [12,15,17]  | Compared the coherence of false and true news          | VSM                          | News samples from NPR's "Bluff the Listener" | Discourse                                                          |
| Rubin et al. [15,16,17] | Merging linguistic and network-based behaviour data    | Linguistic, network models   | Simple text sentences                        | Bag of words, n-gram                                               |
| Conroy et al. [12]      | Satire detection model                                 | SVM                          | US and Canadian national newspapers          | Punctuation, grammar, negative affect                              |
| Ahmed et al. [28]       | n-gram based classifier to identify bogus ads          | Linear SVM                   | News articles                                | TF-IDF                                                             |
| Kapusta et al. [41]     | TF-IDF, POSF-IDF, PosF merge                           | Linguistic models            | COVID-19 Infodemic                           | N-gram extraction from POS tags                                    |
3. Materials and methods

a) Dataset and data source

We conduct in-depth research on the BuzzFeed news dataset [18,49]. This dataset contains an exhaustive sample of news articles published on Facebook by nine news organizations during the week leading up to the 2016 United States presidential election.

b) Methodology

We are primarily concerned with the source of fake news and the language utilized in the fake news. We are particularly interested in identifying sites that spread fake news and in identifying phrases that are more closely related to one category than another. This analysis's primary objective is to determine the difference between fake and true news. This paper is broken into two sections, Data Exploration and Classification. The first section analyses real and fake news datasets to identify sites that frequently publish fake news and the most frequently used words in the title and body of fake and real news. The second section's objective is to develop a classifier capable of predicting and sorting documents into real/fake news categories using three different classifiers [Random Forest (RF), Naive Bayes (NB), and Passive Aggressive (PA)] with and without text processing (TP). The overall proposed methodology is shown in Fig. 1.

Before categorization, we perform the following pre-processing on the data:

Step 1: Lowercase text conversion
Step 2: Eliminating numerals from the corpus of text
Step 3: Eliminating punctuation from the corpus of text
Step 4: Eliminating special characters from the text corpus
Step 5: Elimination of English stop words
Step 6: Stemming words to their root forms
Step 7: Elimination of unnecessary whitespaces from the text corpus

i. Data Splitting

In machine learning, it is usual practise to divide the data into two distinct sets, known as the train set and the test set. Through this method, we are able to assess the generalisation performance and determine the model's hyper-parameters.

Fig. 1. Proposed Methodology.
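The seven pre-processing steps and the stratified 70:30 split described in this section can be sketched as follows. This is a minimal illustration: the tiny stop-word set and the one-line suffix "stemmer" are stand-ins, since the paper does not name its exact stop-word list or stemming algorithm, and the labelled sentences are invented.

```python
import re
import string

from sklearn.model_selection import train_test_split

# Illustrative stand-in for a full English stop-word list
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def clean_text(text: str) -> str:
    text = text.lower()                                               # Step 1: lowercase
    text = re.sub(r"\d+", "", text)                                   # Step 2: drop numerals
    text = text.translate(str.maketrans("", "", string.punctuation))  # Steps 3-4: punctuation/special chars (ASCII only here)
    tokens = [t for t in text.split() if t not in STOP_WORDS]         # Step 5: stop words
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]       # Step 6: toy suffix "stemmer"
    return " ".join(tokens)                                           # Step 7: tidy whitespace

# Invented toy corpus with labels 1 = fake, 0 = real
texts = ["Breaking: 3 Secrets!!", "The debates are over.", "A cure doctors hate?", "News of the day."]
labels = [1, 0, 1, 0]

cleaned = [clean_text(t) for t in texts]

# Stratified 70:30 split, as described in the Data Splitting subsection
X_train, X_test, y_train, y_test = train_test_split(
    cleaned, labels, test_size=0.30, stratify=labels, random_state=42)

print(clean_text("Breaking: 3 Secrets!!"))  # breaking secret
```

Passing `stratify=labels` makes scikit-learn use a stratified shuffle split, so both subsets keep the original fake/real class proportions, which is exactly the guarantee the text attributes to stratified random sampling.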

The primary goal when determining the splitting ratio is to ensure that both sets retain the original dataset's overall trend. Our deployed models are just estimators that study the data for patterns and predict future outcomes. Consequently, it is crucial that the learning data and the validation or test data have statistical distributions that are as close as feasible. Data samples were chosen at random using stratified random sampling under well-outlined constraints, which guarantees that the information is fairly split across the training and testing sets. The dataset has been split into a train set and a test set with a 70:30 split, the exact proportions of which depend on the size of the original data pool. As a rule, when splitting into parts, one part is used for training the model and the other is utilised for testing.

4. Data exploration and BuzzFeed news dataset analysis using Python

We used the BuzzFeed News dataset [18,49] for data exploration and analysis. It contains a fake news dataset and a real news dataset with 1627 observations and 12 features (main features: Id, Title, Text, Source, Images, Movies, etc.). We divided the data into 70:30 subsets for training and testing purposes. For results and analysis, we selected the Python programming language, run on a Windows 10 operating system with an 8th Generation i7 processor, 8 GB RAM, and the Anaconda platform.

a. Real versus Fake News Source Analysis

Fig. 2 demonstrates that politi.co reports the most authentic news, followed by cnn.it [18,49].

Fig. 3 reveals that the right-wing news site reports the most fake news. Additionally, fake news sources outnumber legitimate news sources [18,49].

As seen in Fig. 4, there are seven common sources of both legitimate and erroneous news. Interestingly, these sources report fake news more frequently than legitimate news. The right-wing news website publishes the most fake news. However, it also includes some legitimate news. Around two-thirds of all news

Fig. 2. Sources of Real News.

Fig. 3. Sources of Publishing Maximum Fake News.

reported by right-wing outlets is bogus. On the other side, freedom daily, the second most significant source of fake news, publishes very little true news. addictinginfo.org is the only common source that reports more actual news than fake news, although the overall number of stories it reports is minimal [18,49].

b. Analysis of Title and Body of News Articles

(I) Analysis of News Title

We process news story titles and then extract the top 20 most often occurring words in the headlines for both authentic and fake news, described in Fig. 5. Fig. 6 demonstrates that while specific terms such as hillari, clinton, freedom, and obama represent the titles of false news, others such as trump, clinton, donald, and debate represent the titles of real news [18,49].

(II) Analysis of News Body

Following the title analysis, we examine the body text of news stories. We are looking for the top 30 representative terms in the body of bogus and legitimate news. The most often used words in the news body are trump and clinton. Fig. 7 illustrates how some terms, such as clinton, hillari, and trump, indicate false news, while others, such as trump, said, and clinton, are indicative of true news.

(III) Analysis of Title Length

Following the examination of the words in the headline and body of the news, we want to determine whether the length of the title is also a distinguishing feature between fake and

Fig. 4. Common Sources of Publishing Both Real and Fake News.

Fig. 5. Sources including movies and images in the news, and their integrity.

true news. As illustrated in Fig. 8, the title length of fake news is slightly longer than that of legitimate news. The distribution of title lengths in real news is centred around 60 characters, while the distribution of title lengths in fake news is slightly skewed, around 80 characters, as described in Fig. 8.

5. Experimental setup and results evaluation for fake/real news detection

We describe the experimental setup and classify documents into real/fake news categories using three different classifiers [Random Forest (RF), Naive Bayes (NB), and Passive Aggressive (PA)] with and without text processing (TP). We also provide implementation details of our explored approaches in fake/real news detection.

I. Random Forest (RF) classifier

This classifier algorithm solves regression problems and calculates the mean square error over the data branches of each connected node in the social media network. Equation (1) calculates the distance between each node's predicted and actual values, which decides which branch is better for the forest:

MSE = (1/N) * Σ_{i=1}^{N} (P_i - A_i)^2    (1)

Here, MSE is the mean square error, N the number of data points of the social media network, P_i the value returned by the model, and A_i the actual value of data point i for the decision tree.

This classifier is an advanced form of the decision tree, containing a large number of decision trees whose votes predict the output for a particular class. In our paper, this classifier is trained using different numbers of estimators to predict the best model outcome with high accuracy. The Gini index cost function is used in our paper for random forest classification to split the BuzzFeed News dataset. The Gini index, calculated using Equation (2), is found for each node's branch to determine the more likely split between the nodes:

GiniIndex = 1 - Σ_{i=1}^{N_C} (F_i)^2    (2)

Here, F_i is the observed relative class frequency in the dataset and N_C the total number of classes in the dataset.

In the random forest classifier, entropy is also considered to find node branches in the decision tree and the probability of the outcomes, creating new branches of the nodes using Equation (3).

Fig. 6. Frequency of words in the title of news.

Entropy_RFC = - Σ_{i=1}^{N_C} F_i * log2(F_i)    (3)

Here, Entropy_RFC is the entropy for the random forest algorithm, N_C the total number of classes in the dataset, and F_i the observed relative class frequency.

Pseudocode 1 is used to perform prediction with the trained random forest algorithm; it passes the selected features through the rules formed for each randomly created tree.

Pseudocode 1 - Random Forest (RF) pseudocode:
Input: Predefined classes of training BuzzFeed News dataset (T)
Output: Final predicted class of BuzzFeed News dataset
Begin
Step 1: Collect the dataset.
Step 2: Randomly extract some features (K) from the total features (M) or text (T1, T2, T3, ..., Tn), with K < M.
Step 3: Store and predict the result of each randomly created decision tree.
Step 4: Calculate the node D from the selected features using best split points and determine the best split point for node D.
Step 5: Split the node into two child nodes using the optimal split point.
Step 6: Repeat Steps 2 to 5 until the total number of nodes is reached.
Step 7: Repeat Steps 2 to 6 d times to build a forest of d trees.
Step 8: Determine the final outcome by majority voting, counting the votes of each predicted target.
End

II. Naïve Bayes (NB) classifier

Naïve Bayes follows the Bayes theorem, which gives an event's probability when the probability of another, related event is known. This classifier assumes that the presence of one feature in a class is unconnected to the occurrence of any other feature. The probability of event A happening, knowing that event B has already happened, is given in Equation (4):

P(A|B) = P(A and B) / P(B)    (4)

and for event B when event A has happened, in Equation (5):

P(B|A) = P(A and B) / P(A)    (5)

Putting these two equations together, we get Equation (6), the Bayes theorem, to determine the posterior probability (shown with its notations in Fig. 9):

P(B|A) = P(A|B) * P(B) / P(A)    (6)

Here, P(B|A) is the posterior, P(A|B) the likelihood, P(B) the prior, and P(A) the marginal probability.

Equation (7) is used to classify BuzzFeed News posts as fake or real from the words in the network:

P(F|W) = P(W|F) * P(F) / (P(W|F) * P(F) + P(W|T) * P(T))    (7)

Here, F is the fake class, W the word (data), and T the true class.
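Equation (7) can be checked with a small numerical example. The sketch below uses invented word-frequency estimates (purely illustrative, not values measured on the dataset):

```python
def fake_posterior(p_w_given_fake, p_fake, p_w_given_true, p_true):
    """Equation (7): P(F|W) = P(W|F)P(F) / (P(W|F)P(F) + P(W|T)P(T))."""
    numerator = p_w_given_fake * p_fake
    return numerator / (numerator + p_w_given_true * p_true)

# Toy estimates: the word appears in 8% of fake posts and 2% of true
# posts, and the two classes are equally likely a priori.
posterior = fake_posterior(0.08, 0.5, 0.02, 0.5)
print(round(posterior, 3))  # 0.8
```

A word four times more common in fake posts than in true posts, with equal priors, yields a posterior of 0.04 / (0.04 + 0.01) = 0.8 that the post is fake, matching Equation (7) term by term.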


Fig. 7. Frequency of words in the body (text) of news.

Fig. 8. Distribution of Title Length for Real and Fake News.


III. Passive Aggressive (PA)Classifier

This classifier is the online learning algorithms that includes


both classification and regression parts [50,51] and used for
large-scale learning. In this classifier, if classification turns out to
be correct, called Passive process and called Aggressive process if
classification is wrong. In our research paper we have considered
Passive Aggressive for better outcome for fake news detection. This
Fig. 9. Bayes Theorem Notations of various variables. algorithm performance is superior as compared to other classifiers.
This algorithm is similar to the perception classifier which do not
The complete pseudocode of the Naïve Bayes algorithm that we have considered in our research paper is given in Pseudocode 2. These steps determine the final outcome of the greatest likelihood using Equation (4), Equation (5), Equation (6) and Equation (7).

Pseudocode 2 - Naïve Bayes (NB) pseudocode:
Input: Training BuzzFeed News Dataset (T); P = (p1, p2, p3, ..., pn) // predictor variable values in the testing BuzzFeed News Dataset
Output: Testing BuzzFeed News Dataset class with maximum probability value
Begin
Step 1: Collect the data.
Step 2: Read, separate and summarize the training dataset by class.
Step 3: Convert the dataset into a frequency table.
Step 4: Calculate statistical values such as the mean and standard deviation of each predictor variable in each class.
Step 5: Repeat until the probability of each predictor variable (p1, p2, p3, ..., pn) is calculated: calculate the probability of pi using the Gaussian probability function of each class.
Step 6: Find P(A|B), P(B|A), P(A) and P(B), i.e. the likelihood, posterior, prior and marginal probability respectively, for each class.
Step 7: Calculate Naïve Bayes -> P(F|W) = P(W|F) * P(F) / (P(W|F) * P(F) + P(W|T) * P(T)) and find the greatest likelihood.
End

The Passive Aggressive classifier does not require the learning rate but contains the regularization parameter C. The working principle of this kind of classifier is similar to that of the Perceptron classifier; however, unlike the Perceptron, it does not need a learning rate and instead includes the regularization parameter C. The complete pseudocode of the Passive Aggressive classifier is described in Pseudocode 3.

Pseudocode 3 - Passive Aggressive (PA) classifier pseudocode:
Input: BuzzFeed News Dataset (T); D = (X, Y), with X the training instances and Y the class labels; weight vector Weight_i, initialized to (0, ..., 0), for i = 1, 2, 3, ...
Output: Correctly classifying weight vector
Begin
Step 1: Passive: if the prediction is correct, do nothing.
Step 2: Aggressive: if the prediction is wrong, minimally update the weights so that the example is correctly classified:

Weight_(t+1) = argmin_Weight ||Weight_t - Weight||^2  subject to  L(Weight; X_n, Y_n) = 0   (8)

L(Weight; X_n, Y_n) = 0 if Y_n * Weight^T * X_n >= 1, and 1 - Y_n * Weight^T * X_n otherwise   (9)

End
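As a minimal illustration (not the paper's implementation), Pseudocode 2 can be sketched in Python assuming Gaussian class-conditional likelihoods as in Step 5; the function names and the small epsilon guard on the standard deviation are our own:

```python
import math
from collections import defaultdict

def train_gaussian_nb(X, y):
    """Steps 2-4: summarize the training data by class
    (per-feature mean and standard deviation, plus the class prior)."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    stats = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        stds = [max(1e-9, math.sqrt(sum((v - m) ** 2 for v in col) / n))
                for col, m in zip(zip(*rows), means)]
        stats[label] = (means, stds, n / len(X))
    return stats

def gaussian_pdf(x, mean, std):
    """Step 5: Gaussian probability of one predictor value."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

def predict_nb(stats, row):
    """Steps 6-7: posterior proportional to prior times the product of the
    per-feature likelihoods; return the class with the greatest likelihood."""
    posteriors = {}
    for label, (means, stds, prior) in stats.items():
        p = prior
        for x, m, s in zip(row, means, stds):
            p *= gaussian_pdf(x, m, s)
        posteriors[label] = p
    total = sum(posteriors.values()) or 1.0
    return max(posteriors, key=posteriors.get), {k: v / total for k, v in posteriors.items()}
```

Normalizing by the sum of the two unnormalized posteriors is exactly the two-class formula of Step 7, with the denominator playing the role of the marginal probability.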

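The constrained minimization in Equation (8) with the hinge loss of Equation (9) has a closed-form solution; the sketch below follows the PA-I variant of the online Passive Aggressive algorithm [51], in which the step size is capped by the regularization parameter C. The two toy data points are invented for illustration:

```python
import numpy as np

def pa_step(w, x, y, C=1.0):
    """One Passive Aggressive (PA-I) update; y is +1/-1."""
    loss = max(0.0, 1.0 - y * w.dot(x))   # hinge loss, Eq. (9)
    if loss == 0.0:
        return w                          # passive: margin already >= 1, do nothing
    tau = min(C, loss / x.dot(x))         # aggressive: smallest step solving Eq. (8)
    return w + tau * y * x

# Toy run on two linearly separable points.
w = np.zeros(2)
for x, y in [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]:
    w = pa_step(w, x, y)
print(w)  # prints [ 1. -1.]
```

After the two updates the weight vector classifies both points with the required unit margin, which is why a third pass over the data would leave it unchanged (the passive branch).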
Fig. 10. Workflow of the proposed approach for the BuzzFeed news dataset using RF, NB and PA classifiers.


4. Proposed workflow for training and testing of BuzzFeed news using classifiers

We conduct an in-depth study on the BuzzFeed news dataset [18,49]. This dataset contains an exhaustive sample of news articles published on Facebook by nine news organizations during the week leading up to the 2016 United States presidential election.

The complete workflow of our approach is shown in Fig. 10 for result evaluation to determine whether the news is fake or real. In this workflow, we first collected the BuzzFeed news dataset from [18,49] and then applied the data cleaning and exploration process. In the next steps, various features are selected from the fake/real dataset, and the dataset is divided into 70:30 subsets for training and testing purposes, as mentioned in the previous section. Finally, based on the user's query about the news, RF, NB, and PA are used to find the accuracy, precision, recall, and F1 score with and without text processing (TP).

Table 6
Fake/Real News Detection Based on News Body.

Algorithms                    ACC    PRE    REC    F1
RF with Pre-Processing        0.8    0.8    0.8    0.8
RF without Pre-Processing     0.76   0.77   0.76   0.76
NB with Pre-Processing        0.69   0.72   0.69   0.69
NB without Pre-Processing     0.67   0.77   0.67   0.65
PA with Pre-Processing        0.76   0.77   0.76   0.76
PA without Pre-Processing     0.91   0.91   0.91   0.91

Table 7
Fake/Real News Detection Based on News Title.

Algorithms                    ACC    PRE    REC    F1
RF with Pre-Processing        0.67   0.69   0.67   0.67
RF without Pre-Processing     0.62   0.62   0.62   0.62
NB with Pre-Processing        0.6    0.63   0.6    0.59
NB without Pre-Processing     0.58   0.63   0.6    0.56
PA with Pre-Processing        0.64   0.66   0.64   0.63
PA without Pre-Processing     0.56   0.58   0.56   0.55
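The workflow just described (70:30 split, text featurization, then RF, NB and PA with and without text processing) can be sketched with scikit-learn. The paper does not name its libraries, so the use of scikit-learn, the TfidfVectorizer featurization, the default hyperparameters and the four-document toy corpus below are all our assumptions, not the authors' code:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the BuzzFeed article bodies/titles.
texts = [
    "president wins election by wide margin",
    "celebrity secretly controls world banks",
    "senate passes new budget bill today",
    "aliens endorse candidate in leaked tape",
] * 10
labels = ["real", "fake", "real", "fake"] * 10

# 70:30 train/test split, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42)

# Assumed featurization: TF-IDF over the raw strings.
vec = TfidfVectorizer()
Xtr = vec.fit_transform(X_train)
Xte = vec.transform(X_test)

for name, clf in [("RF", RandomForestClassifier(random_state=42)),
                  ("NB", MultinomialNB()),
                  ("PA", PassiveAggressiveClassifier(C=1.0, random_state=42))]:
    clf.fit(Xtr, y_train)
    pred = clf.predict(Xte)
    pre, rec, f1, _ = precision_recall_fscore_support(
        y_test, pred, average="weighted", zero_division=0)
    print(name, round(accuracy_score(y_test, pred), 2),
          round(pre, 2), round(rec, 2), round(f1, 2))
```

The "with pre-processing" variants in Tables 6-8 would additionally clean the strings (stop-word removal, stemming) before vectorization.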
5. Performance measure and evaluation metrics

The performance of all the classifiers can be estimated after applying them to the training and test sets of the BuzzFeed News dataset. The performance measures accuracy, precision, recall, and F1 score of the RF, NB and PA classifiers for the fake news detection problem are determined using four parameters: True Positive (TP), where predicted fake news is actually fake news; True Negative (TN), where predicted true news is actually true news; False Positive (FP), where predicted fake news is actually true news; and False Negative (FN), where predicted true news is actually fake news, for class C. The result is determined and calculated using the confusion matrix shown in Table 5 to detect fake/real news. The confusion matrix is a 2 x 2 matrix that compares the predicted classification with the actual classification.

By formulating this as a classification problem and based on the confusion matrix, we can evaluate the classifier's performance by defining the accuracy, precision, recall, and F1 score metrics for fake news detection given in Equation (10), Equation (11), Equation (12) and Equation (13).

Classifier Accuracy = (|TP| + |TN|) / (|TP| + |TN| + |FP| + |FN|)   (10)

Classifier Precision = |TP| / (|TP| + |FP|)   (11)

Classifier Recall = |TP| / (|TP| + |FN|)   (12)

Classifier F1 Score = 2 x (Classifier Precision x Classifier Recall) / (Classifier Precision + Classifier Recall)   (13)

These are the most commonly used metrics for evaluating the performance of classifiers in machine learning. In our paper, accuracy measures how well fake or real news in the BuzzFeed news dataset is predicted; precision is used to detect actually fake news and is important for determining fake news; recall measures the sensitivity, i.e. the fraction of fake news that is correctly classified as fake news; and the F1 score combines precision and recall into an overall performance measure for fake news detection. For better results and performance, the values of these four metrics should be high.

The accuracy, precision, recall, and F1 score of the RF, NB and PA classifiers for fake news detection using the news body, the title, and both as features are illustrated in Tables 6, 7, and 8, respectively.

Table 8
Fake/Real News Detection Based on Both Body and Title of News.

Algorithms                    ACC    PRE    REC    F1
RF with Pre-Processing        0.82   0.84   0.82   0.82
RF without Pre-Processing     0.75   0.75   0.75   0.75
NB with Pre-Processing        0.65   0.69   0.65   0.64
NB without Pre-Processing     0.64   0.75   0.64   0.6
PA with Pre-Processing        0.73   0.73   0.73   0.73
PA without Pre-Processing     0.87   0.87   0.87   0.87

Table 6 shows the classification results for detecting fake/real news based on the news body; in this case, the Passive Aggressive classifier without pre-processing is superior, with Accuracy (ACC) 91%, Precision (PRE) 91%, Recall (REC) 91% and F1 Score (F1) 91%. The Random Forest classifier also performs well with pre-processing.

Table 7 shows that the Random Forest classifier performs better than the other two classifiers for fake/real news detection based on news titles. The values are Accuracy (ACC) 67%, Precision (PRE) 69%, Recall (REC) 67% and F1 Score (F1) 67%.

Finally, Table 8 illustrates that if we combine the body and title of the news to determine fake/real news, the Passive Aggressive classifier outperforms the others, with Accuracy (ACC) 87%, Precision (PRE) 87%, Recall (REC) 87%, and F1 Score (F1) 87% for the BuzzFeed News dataset [18,50].

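Equations (10)-(13) can be computed directly from the four confusion-matrix counts of Table 5; the counts in the example below are invented purely for illustration:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 score per Equations (10)-(13)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (10)
    precision = tp / (tp + fp)                          # Eq. (11)
    recall = tp / (tp + fn)                             # Eq. (12)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (13)
    return accuracy, precision, recall, f1

# Invented counts for a 2x2 confusion matrix as in Table 5.
acc, pre, rec, f1 = metrics(tp=45, tn=40, fp=5, fn=10)
print(round(acc, 2), round(pre, 2), round(rec, 2), round(f1, 2))  # prints 0.85 0.9 0.82 0.86
```

Note that the F1 score is the harmonic mean of precision and recall, so it always lies between the two and is pulled toward the smaller value.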
6. Comparative result analysis using machine learning RF, NB and PA classifiers

Table 5
Confusion matrix.

                     Predicted true (1)                Predicted false (0)
Actual true (1)      True positives (TPs):             False negatives (FNs):
                     correctly classified records      incorrect rejection of classified records
Actual false (0)     False positives (FPs):            True negatives (TNs):
                     incorrectly classified records    correct rejection of classified records

Fig. 11 reveals that combining titles with the body does not improve the accuracy of the Passive Aggressive models without pre-processed data. Also, for NB, combining titles with the body does not improve the accuracy with pre-processed data. Regarding the pre-processing, it should be mentioned that although some

Fig. 11. Comparative Analysis of RF, NB and PA Classifiers.

phrases and bigrams should be cleaned and pre-processed, removing stop words and stemming might not be a good idea in this specific dataset, as we might lose some language information. Therefore, we conclude that the Passive Aggressive model on the combined feature matrix without text pre-processing is the best classification model in our analysis and can categorize real and fake news with maximum accuracy.

7. Conclusion

While disinformation, spin, falsehoods, and deception have existed for centuries, the emergence of digital platforms may have accelerated the transmission of misinformation and thus elevated the problem of fake news to a global scale, where the absence of scalable fact-checking procedures is particularly concerning. We conducted a rigorous exploratory data analysis on the real and fake news datasets in this article. We identified the most frequently occurring words in the title or body of fake news and legitimate news. We developed binary classifiers to distinguish fake news from legitimate news based on the words in the article's title, body, or both. We employed three different classifiers: Random Forest, Naive Bayes, and Passive Aggressive. Tables 6, 7 and 8 illustrate and measure the performance of the RF, NB and PA classifiers and report the four major metric values, Accuracy (ACC) 87%, Precision (PRE) 87%, Recall (REC) 87%, and F1 Score (F1) 87%, for the BuzzFeed News dataset, and Fig. 11 describes the comparative accuracy analysis of the RF, NB and PA classifiers with and without pre-processing of text only, title only, and both title and text. Passive Aggressive was the best model in this study for the body text feature matrix without pre-processing, accurately predicting fake and authentic news with 91% accuracy. The Passive Aggressive classifier outperformed both the Random Forest and Naive Bayes classifiers.

While the suggested method does a better job of distinguishing between bogus and real news, it may still face two major obstacles. The first issue is that the suggested technique may suffer in performance if it has to deal with a high number of social media posts in real time. Second, the proposed predictive model cannot identify malicious URLs (containing fake news) embedded in social media posts. The first problem may be solved by giving the model the ability to scale up by adding more resources or by replicating itself numerous times. Deploying the model in the cloud using an overlay architecture that enables establishing numerous instances of the prediction algorithm on-the-fly is one approach to solving the dynamic scaling problem. The second obstacle may be overcome by determining how to selectively remove people who are very vital to the repost chain. Combining these approaches is another potential area of exploration for this study.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Vuorinen, J., Koivula, A. and Koiranen, I., 2020. The confidence in social media platforms and private messaging. In International Conference on Human-Computer Interaction (pp. 669-682). Springer, Cham.
[2] Shu K, Sliva A, Wang S, Tang J, Liu H. Fake news detection on social media: a data mining perspective. ACM SIGKDD Explorations Newslett 2017;19(1):22–36.
[3] Pérez-Curiel C, Molpeceres AMV. Impact of political discourse on the dissemination of hoaxes about Covid-19. Influence of misinformation in public and media. Rev Lat Comun Soc 2020;78:65–96.
[4] Dai, E., Sun, Y. and Wang, S., 2020, May. Ginger cannot cure cancer: Battling fake health news with a comprehensive data repository. In Proceedings of the International AAAI Conference on Web and Social Media (Vol. 14, pp. 853-862).
[5] Love JS, Blumenberg A, Horowitz Z. The parallel pandemic: Medical misinformation and COVID-19: Primum non nocere. J Gen Intern Med 2020;35(8):2435.
[6] Shahsavari S, Holur P, Wang T, Tangherlini TR, Roychowdhury V. Conspiracy in the time of corona: automatic detection of emerging COVID-19 conspiracy theories in social media and the news. J Comput Social Sci 2020;3(2):279–317.
[7] Poynter (2020). Fighting the infodemic: The coronavirusfacts alliance. https://www.poynter.org/coronavirusfactsalliance/.
[8] Ribeiro, F.N., Saha, K., Babaei, M., Henrique, L., Messias, J., Benevenuto, F., Goga, O., Gummadi, K.P. and Redmiles, E.M., 2019, January. On microtargeting socially divisive ads: A case study of Russia-linked ad campaigns on Facebook. In Proceedings of the Conference on Fairness, Accountability, and Transparency (pp. 140-149).
[9] Tardáguila C, Benevenuto F, Ortellado P. Fake news is poisoning Brazilian politics. WhatsApp can stop it. The New York Times 2018;17(10).
[10] Vasudeva F, Barkdull N. WhatsApp in India? A case study of social media related lynchings. Social Identities 2020;26(5):574–89.
[11] Arun C. On WhatsApp, rumours, lynchings, and the Indian Government. Econ Pol Wkly 2019;54(6).


[12] Rubin, V.L., Chen, Y. and Conroy, N.K., 2015. Deception detection for news: three types of fakes. Proceedings of the Association for Information Science and Technology, 52(1), pp. 1-4.
[13] Parikh, S.B. and Atrey, P.K., 2018, April. Media-rich fake news detection: A survey. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR) (pp. 436-441). IEEE.
[14] Pérez-Rosas, V., Kleinberg, B., Lefevre, A. and Mihalcea, R., 2017. Automatic detection of fake news. arXiv preprint arXiv:1708.07104.
[15] Rubin, V.L., Conroy, N.J. and Chen, Y., 2015, January. Towards news verification: Deception detection methods for news discourse. In Hawaii International Conference on System Sciences (pp. 5-8).
[16] Rubin VL, Lukoianova T. Truth and deception at the rhetorical structure level. J Assoc Inf Sci Technol 2015;66(5):905–17.
[17] Chen, Y., Conroy, N.J. and Rubin, V.L., 2015, November. Misleading online content: recognizing clickbait as "false news". In Proceedings of the 2015 ACM Workshop on Multimodal Deception Detection (pp. 15-19).
[18] "Buzzfeednews: 2017-12-fake-news-top-50," https://github.com/BuzzFeedNews/2017-12-fake-news-top-50.
[19] Wang, W.Y., 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.
[20] BS Detector: https://gitlab.com/bs-detector/bs-detector.
[21] Mitra, T. and Gilbert, E., 2015, April. Credbank: A large-scale social media corpus with associated credibility annotations. In Ninth International AAAI Conference on Web and Social Media.
[22] Santia, G.C. and Williams, J.R., 2018, June. Buzzface: A news veracity dataset with Facebook user commentary and egos. In Twelfth International AAAI Conference on Web and Social Media.
[23] Tacchini, E., Ballarin, G., Della Vedova, M.L., Moret, S. and de Alfaro, L., 2017. Some like it hoax: Automated fake news detection in social networks. arXiv preprint arXiv:1704.07506.
[24] Conroy NK, Rubin VL, Chen Y. Automatic deception detection: methods for finding fake news. Proc Assoc Inf Sci Technol 2015;52(1):1–4.
[25] Kwon S, Cha M, Jung K. Rumor detection over varying time windows. PLoS One 2017;12(1):e0168344.
[26] Rubin, V.L., Conroy, N., Chen, Y. and Cornwell, S., 2016, June. Fake news or truth? Using satirical cues to detect potentially misleading news. In Proceedings of the Second Workshop on Computational Approaches to Deception Detection (pp. 7-17).
[27] Wei, W. and Wan, X., 2017. Learning to identify ambiguous and misleading news headlines. arXiv preprint arXiv:1705.06031.
[28] Ahmed, H., Traore, I. and Saad, S., 2017, October. Detection of online fake news using n-gram analysis and machine learning techniques. In International Conference on Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments (pp. 127-138). Springer, Cham.
[29] Bhattacharjee, S.D., Talukder, A. and Balantrapu, B.V., 2017, December. Active learning based news veracity detection with feature weighting and deep-shallow fusion. In 2017 IEEE International Conference on Big Data (Big Data) (pp. 556-565). IEEE.
[30] Kumar, S., West, R. and Leskovec, J., 2016, April. Disinformation on the web: Impact, characteristics, and detection of Wikipedia hoaxes. In Proceedings of the 25th International Conference on World Wide Web (pp. 591-602).
[31] Ribeiro, M.H., Calais, P.H., Almeida, V.A. and Meira Jr, W., 2017. "Everything I Disagree With is #FakeNews": Correlating political polarization and spread of misinformation. arXiv preprint arXiv:1706.05924.
[32] Reis, J.C., Correia, A., Murai, F., Veloso, A. and Benevenuto, F., 2019, June. Explainable machine learning for fake news detection. In Proceedings of the 10th ACM Conference on Web Science (pp. 17-26).
[33] Carvalho, F., Okuno, H.Y., Baroni, L. and Guedes, G., 2020, November. A Brazilian Portuguese moral foundations dictionary for fake news classification. In 2020 39th International Conference of the Chilean Computer Science Society (SCCC) (pp. 1-5). IEEE.
[34] Cao J, Qi P, Sheng Q, Yang T, Guo J, Li J. Exploring the role of visual content in fake news detection. Disinformation, Misinformation, and Fake News in Social Media 2020:141–61.
[35] Qi, P., Cao, J., Yang, T., Guo, J. and Li, J., 2019, November. Exploiting multi-domain visual information for fake news detection. In 2019 IEEE International Conference on Data Mining (ICDM) (pp. 518-527). IEEE.
[36] Zhang X, Ghorbani AA. An overview of online fake news: characterization, detection, and discussion. Inf Process Manag 2020;57(2):102025.
[37] Volkova, S., Shaffer, K., Jang, J.Y. and Hodas, N., 2017, July. Separating facts from fiction: Linguistic models to classify suspicious and trusted news posts on Twitter. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 647-653).
[38] Vosoughi S, Roy D, Aral S. The spread of true and false news online. Science 2018;359(6380):1146–51.
[39] Stein RA, Jaques PA, Valiati JF. An analysis of hierarchical text classification using word embeddings. Inf Sci 2019;471:216–32.
[40] Vogel, I. and Meghana, M., 2020, September. Fake news spreader detection on Twitter using character n-grams. In CLEF (Working Notes).
[41] Kapusta J, Drlik M, Munk M. Using of n-grams from morphological tags for fake news classification. PeerJ Comput Sci 2021;7:e624.
[42] Alonso MA, Vilares D, Gómez-Rodríguez C, Vilares J. Sentiment analysis for fake news detection. Electronics 2021;10(11):1348.
[43] Helmstetter, S. and Paulheim, H., 2018, August. Weakly supervised learning for fake news detection on Twitter. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (pp. 274-277). IEEE.
[44] Jeronimo CL, Marinho LB, Campelo CE, Veloso A, Melo ASDC. Characterization of fake news based on subjectivity lexicons. J Data Intell 2020;1(4):419–41.
[45] Dabbous A, Aoun Barakat K, de Quero Navarro B. Fake news detection and social media trust: a cross-cultural perspective. Behav Inform Technol 2021:1–20.
[46] Geeng, C., Yee, S. and Roesner, F., 2020, April. Fake news on Facebook and Twitter: Investigating how people (don't) investigate. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (pp. 1-14).
[47] Alassad M, Hussain MN, Agarwal N. Finding fake news key spreaders in complex social networks by using bi-level decomposition optimization method. Cham: Springer; 2019. p. 41–54.
[48] Resende, G., Melo, P., Sousa, H., Messias, J., Vasconcelos, M., Almeida, J. and Benevenuto, F., 2019, May. (Mis)information dissemination in WhatsApp: Gathering, analyzing and countermeasures. In The World Wide Web Conference (pp. 818-828).
[49] Lazer DM, Baum MA, Benkler Y, Berinsky AJ, Greenhill KM, Menczer F, et al. The science of fake news. Science 2018;359(6380):1094–6.
[50] Amadou Olabi, Fopa Yuffon, Namba, Mikayilou and Moctar, Mohamadou, 2020. Fake news detection: A machine learning approach using automated-text analysis technique.
[51] Crammer K, et al. Online passive-aggressive algorithms. Journal of Machine Learning Research, March 2006.
[52] Lal N, Singh M, Pandey S, Solanki A. "A Proposed Ranked Clustering Approach for Unstructured Data from Dataspace using VSM," 2020 20th International Conference on Computational Science and Its Applications (ICCSA), Cagliari, Italy, 2020, pp. 80-86, DOI: 10.1109/ICCSA50381.2020.00024.
[53] Lal N, Pathak B. "Information Retrieval from Heterogeneous Data Sets using Moderated IDF-Cosine Similarity in Vector Space Model," International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS 2017), pp. 3793-3799, IEEE Xplore, 2018.

Ms. Anu Sharma received the B.Tech degree from N.C College of Engineering, Israna, Panipat in 2006 and the M.Tech degree from MMU, Mullana, Ambala, Haryana in 2011. She is an Assistant Professor in CCSIT, TMU, Moradabad, UP, India, and is pursuing a Ph.D in CSE from Uttarakhand Technical University, Dehradun. Her research interests include data mining, neural networks and machine learning. She has more than 12 years of experience and has published more than 11 research papers in international, UGC-approved, Scopus-indexed journals.

Dr. M.K Sharma received the Ph.D from Kumaun University. He is a Professor at Amrapali Institute, Haldwani, Uttarakhand, India. His research interests include neural networks and machine learning. He has more than 20 years of experience and has published more than 35 research papers in international, UGC-approved, Scopus-indexed journals.

Dr. Rakesh Kr. Dwivedi received the Ph.D from IIT Roorkee in Remote Sensing. He is the Director of CCSIT, TMU, Moradabad, UP, India. He has published more than 45 research papers in international/national conferences and more than 30 research papers in international journals. His research interests include remote sensing, digital image processing, machine learning, deep learning and sensor networks.

