Fake News Detection Based On Word and Document Embedding Using Machine Learning Classifiers
1 Computer Science Department, Faculty of Computers and Artificial Intelligence, Beni-Suef University, Egypt.
2 Information Technology Department, Faculty of Computers and Artificial Intelligence, Beni-Suef University, Egypt.
Email: 1 Ibrahim_desoky@fcis.bsu.edu.eg, 2 faly@msa.eun.eg
ABSTRACT
Fake news is a problem that has a major effect on our lives. Fake news detection is considered an interesting research area, but the available resources are limited. In this research, we propose a classification model capable of detecting fake news based on both Doc2vec and Word2vec embeddings as feature extraction methods. First, we compare the two approaches using different classification algorithms. According to the applied experiments, classification based on the Doc2vec model provided promising results with more than one classifier: the Support Vector Machine achieved the best accuracy with 95.5%, followed by Logistic Regression with 94.7%, while the Long Short Term Memory produced the lowest accuracy. On the other hand, classification based on the Word2vec embedding model achieved high accuracy only with the Long Short Term Memory classifier, at 94.3%. Second, the classification models based on the proposed Doc2vec features outperformed a corresponding model based on TF-IDF on the same dataset using the Support Vector Machine and Logistic Regression classifiers.
Keywords: Fake News Detection; Word2Vec; Doc2Vec; Machine Learning; Deep Learning
news receives a huge boost and becomes viral around the country and worldwide. The suggested method helps in assessing the credibility of news; if the news is not accurate, the system would recommend a relevant, credible news article to the reader [5].
In this paper, we propose using both word embedding (Word2vec) and document embedding (Doc2vec) as feature extraction methods for building several classifiers: logistic regression, support vector machines, random forests, multilayer perceptron neural networks and long short-term memory networks. The classifiers are used to differentiate between misleading and real news, and they are trained and evaluated on a web-based dataset.
This paper is organized as follows: Section 2 covers related work, Section 3 discusses the phases of the proposed model, Section 4 describes the dataset used, the applied experiments and the obtained results, and Section 5 provides the conclusion and future work.

2. RELATED WORK

In 2018, three students from Vivekananda Education Society Institute of Technology, Mumbai, released three research papers on fake news identification. They noted that the social media era began in the 20th century and that, as use of the web expands, postings and publications increase. They used diverse methods and tools to identify fake news, such as natural language processing (NLP) techniques, artificial intelligence (AI) and machine learning (ML) [6]. Nguyen Vo, a student of Ho Chi Minh City University of Technology (HCMUT), investigated and applied false news identification in 2017, using a Bi-Directional GRU in his project. Yang et al. used several deep learning algorithms besides other deep learning models, such as Auto-Encoders, CNN and GAN [7].
Rubin et al. [8] recommended a model for detecting satire and comedy in news stories. They reviewed and inspected 360 satirical news articles, primarily in four domains: civic, science, business and soft news. They suggested a support vector machine classification model with five key features built on the basis of their study of satirical news: Absurdity, Grammar, Humor, Punctuation and Negative Affect. An overall accuracy of 90 percent was reached with a combination of just three features: Absurdity, Punctuation and Grammar.
Researchers in [9] explored the term's uses, especially regarding fake news. Six broad definitions of "fake news" were found: news satire, fabrication, news parody, manipulation (for instance, of photographs), advertisement (for example, advertising presented as journalism) and propaganda. What these types have in common is that they appropriate "the look and feel of real news".
Fallis [10] explores how disinformation has been described, concluding that "disinformation is misleading information with a misleading function". In [8], Rubin splits false news into three separate categories: extreme fabrication, large-scale falsification and humorous falsehoods. They do not explain why they chose these types rather than any other grouping; however, they discuss in detail what each group contains and how to discriminate between the groups. They further highlight the absence of a corpus for performing such a study and underline nine criteria for the creation of such a resource, including 'digital textual format', 'verifiable ground truth', 'predefined time frame', 'language and culture', 'the manner of delivering news', 'pragmatic concerns' and 'homogeneity of lengths and writing matter'. Working on the classification challenge, Kai Shu, Amy Sliva, Suhang Wang and Jiliang Tang [11] suggested features such as the number of characters per word, the number of words, phrase and sentence rates, and part-of-speech tags (i.e., n-gram and bag-of-words approaches).
The authors of [12] discussed fake news detection with a benchmark dataset named 'LIAR', and an apparent increase in the efficiency of false message/news identification was reported. The authors argued that the corpus can also be used for stance identification, opinion mining, rumor identification and related NLP analyses.
The researchers of [13] addressed the need to spot hoaxes. They used an ML approach that integrates content-based features with social signals, and stated that the study performed well compared to the cited literature. They implemented it as a Facebook Messenger chatbot and evaluated it on three datasets of Italian Facebook news posts. Boolean crowdsourcing algorithms and content-oriented approaches were implemented together with social and content signals. Tabloidization in the form of clickbaiting was examined by the authors in [14]. Clickbaiting was described as a method for the fast online propagation of rumors and misinformation, and because it serves as a means of spreading misinformation, scholars have explored possible methods to automatically detect clickbait.
Content metrics, including lexical and semantic features, were implemented by the authors. In [15], the authors studied and analyzed the respective scope, performance and value of the tools and algorithms used to distinguish falsified and fabricated news stories. The paper also described the difficulties of such studies, caused by the unknown characteristics of fake news and the complex ties between news stories, writers and topics.
The authors of [16] discussed the Fake-Detector paradigm for automatic false news inference. It depends on textual identification and provides a wide-ranging network paradigm for concurrently learning the representations of news stories, writers and topics. Fake-Detector tackles two key components, feature learning representation and reputation mark inference, which together form the deeply diffused Fake-Detector network model.
Another research effort, by William Yang Wang in [17], was motivated by the lack of fake news data. He introduced a new dataset, 'LIAR', a freely accessible information set for false news identification, and used it with a wide variety of methods such as logistic regression, support vector machine and deep learning models (Bidirectional LSTM and Convolutional Neural Networks). The findings showed that the CNN models performed best. Himank Gupta et al. [18] built a system focused on a particular machine learning approach that solves a number of problems, including precision shortages, time lag (Bot-Maker) and fast computation for thousands of tweets in one second. First, 400,000 tweets from the HSpam14 dataset were obtained. Then, 150,000 spam and 250,000 non-spam tweets were further characterized. They also extracted a few lightweight features alongside the top 30 words that give the Bag-of-Words model the highest information gain. They reached 91.65% accuracy and exceeded the previous approach by almost 18%. The authors referred to Kai Shu et al. [11] for a more detailed survey of false news identification work, while concentrating on a lightweight clickbait identification system focused on high-level title features.

3. PROPOSED MODEL

3.1 Data Preprocessing
Before creating a vector model, the data must undergo some refinements such as elimination of stop words, tokenization, lower-case conversion and removal of punctuation. This helps to decrease the size of the raw data by eliminating unnecessary content. For each document, we built a standardized procedure to eliminate punctuation and non-letter characters.
Stop Word Removal: Stop words in a language are low-content words that produce noise when used as text classification features. They are words widely used in phrases to link thoughts or to structure an expression, such as articles, prepositions, conjunctions and pronouns. Popular terms were removed from each document, such as: about, a, an, as, are, at, by, be, from, for, in, is, on, how, or, the, that, this, these, was, when, where, who, what, will, too, etc. After that, the processed documents were preserved and passed to the next stage.
Stemming: The next step after tokenization is to change the tokens into a regular form. Stemming essentially reduces terms to their original form and thereby reduces the number of word types or classes in the data. For example, the terms 'running', 'ran' or 'runner' are reduced to their stems. Stemming is used to make classification easier and more effective. We use the Porter stemmer because of its accuracy; it is the most widely used stemming algorithm.
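A minimal sketch of this preprocessing pipeline is shown below. The paper does not name the tooling, so the use of NLTK's English stop-word list and Porter stemmer, and the exact cleaning rule, are illustrative assumptions rather than the authors' reported setup.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # one-time download of the stop-word list

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(document):
    """Lower-case, drop punctuation/non-letter characters, remove stop words, stem."""
    text = re.sub(r"[^a-z\s]", " ", document.lower())    # keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [STEMMER.stem(t) for t in tokens]              # e.g. 'running' -> 'run'

if __name__ == "__main__":
    print(preprocess("The runners were running, and they ran again!"))
```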
3.2 Feature Extraction
High dimensionality is one of the challenges of text categorization. Documents contain a large number of terms, phrases and words, which makes the learning process computationally expensive. In addition, classifier precision and output may be affected by irrelevant and redundant features, so it is preferable to reduce the size of the feature space. The embeddings for the majority of our models are generated with the Word2Vec and Doc2Vec models.
3.2.1 Word2vec embedding:
Word2vec was developed by Tomas Mikolov and a team at Google in 2013 for efficiently computing continuous vector representations of words from very large datasets [19]. Their aim was to produce models that learn faster and more simply than the neural network language models that were common at the time. The authors proposed two predictive models, the CBOW (Continuous Bag of Words) and the Skip-gram. These predictive models train their vectors to boost their predictive capacity, so that better representations are learned.
Continuous bag-of-words (CBOW): In this model the context words are taken, their vectors are combined, and the relation between the words in a sentence is used to predict the missing word by applying the same projection window. The reason it is called bag-of-words is that the order of the words is not important.
Skip-gram: The reverse of the CBOW model. The input layer contains the target word and the output layer contains the context words. But how can the context be guessed from a single word? First, the vocabulary of the training documents is needed. Second, the word we want to use is encoded as a one-hot vector, with 1 for the chosen word and 0 for the rest. After that, the one-hot vector is multiplied by a weight matrix, and the resulting vector identifies the predicted context words.
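As a sketch of how such Word2vec features might be produced, the snippet below uses the gensim library, which is not named in the paper; the vector size, window and the averaging of word vectors into one document vector are illustrative assumptions rather than reported settings.

```python
import numpy as np
from gensim.models import Word2Vec

# tokenized_docs: token lists produced by the preprocessing step (tiny example here).
tokenized_docs = [["fake", "news", "spread", "fast"],
                  ["official", "report", "confirm", "event"]]

# sg=0 trains the CBOW architecture; sg=1 would train Skip-gram instead.
w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5,
               min_count=1, sg=0, epochs=20, seed=1)

def document_vector(tokens, model):
    """Average the Word2vec vectors of the tokens contained in one document."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

X = np.vstack([document_vector(doc, w2v) for doc in tokenized_docs])
print(X.shape)   # (number of documents, 100)
```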
3.2.2 Doc2vec embedding:
Doc2vec is a natural language processing technique that describes a whole text with a single vector and is an extension of Word2vec embedding. Two models were suggested for Doc2vec: the Distributed Bag of Words (DBOW) and the Distributed Memory (DM) model. DBOW is the Doc2vec analogue of the Word2vec Skip-gram model, while the DM model acts as a memory that remembers what is missing from the current context, or the topic of the paragraph; it reduces the problems of the bag-of-words representation to a minimum [20].
Doc2VecC consists of an input layer, a projection layer and an output layer that forecasts the target word, identical to Word2Vec. The embedded words of neighbouring text provide a local context, while the whole document is represented by a vector as a global context. Unlike Paragraph Vectors, which explicitly learns a unique vector for each article, Doc2VecC represents a document as the average of the embeddings of words sampled randomly from the text.
In the DM model, each paragraph is mapped to a single vector represented by a column of a matrix D, and each word is mapped to a unique vector represented by a column of a matrix W. The paragraph vector and the word vectors are averaged or concatenated in order to predict the next word in the context. Concatenation is used as the method for combining vectors in our experiments. The paragraph token can be thought of as another word; it serves as a memory that recalls what is missing from the current context, or the subject of the paragraph.
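A comparable sketch for the DM variant of Doc2vec follows, again assuming gensim; the hyperparameters and the use of infer_vector for unseen documents are illustrative, with dm_concat=1 reflecting the concatenation option mentioned above.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tokenized_docs: token lists from the preprocessing step (tiny example here).
tokenized_docs = [["fake", "news", "spread", "fast"],
                  ["official", "report", "confirm", "event"]]

# Each training document receives a unique tag so that its paragraph vector can be learned.
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(tokenized_docs)]

# dm=1 selects the Distributed Memory model (dm=0 would select DBOW);
# dm_concat=1 concatenates the paragraph vector with the context word vectors.
d2v = Doc2Vec(documents=tagged, vector_size=100, window=5, min_count=1,
              dm=1, dm_concat=1, epochs=40, seed=1)

train_vectors = [d2v.dv[i] for i in range(len(tagged))]        # learned paragraph vectors
test_vector = d2v.infer_vector(["new", "unseen", "document"])  # vector for an unseen document
print(len(test_vector))   # 100
```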
3.3 Classification
Features are extracted using two different methods (Doc2vec and Word2vec). Five classifiers were used for detecting the class of a new document: Support Vector Machine (SVM), Long Short Term Memory (LSTM), Logistic Regression (LR), a Multi-Layer Perceptron (MLP) artificial neural network (ANN) and Random Forest (RF). Word2vec and the DM model of Doc2vec are used as the feature extraction step for the SVM, LSTM, MLP, LR and RF; the extracted features are used to train the classifiers, and each model is then tested against the test set.

3.3.3 Multilayer perceptron (MLP)
A multilayer perceptron (MLP) is a feedforward artificial neural network model in which the input data are mapped to a collection of suitable outputs. It comprises three types of layers: the input layer, the output layer and the hidden layers. The input layer receives the signal to be processed, while the output layer performs the required task, such as prediction or classification. The true processing engine of the MLP consists of an arbitrary number of hidden layers between the input and output sides. As in a feed-forward neural network, the data in an MLP flow in the forward direction from the input layer to the output layer. The MLP neurons are trained with the back-propagation algorithm. MLPs were developed to approximate any continuous function and to solve problems that are not linearly separable. Pattern classification, recognition, prediction and approximation are the key applications of the MLP [22].
The ANN classifier consists of nodes arranged in an input layer (x1, x2, ..., xn), an optional hidden layer and an output layer (y), as shown in Figure 3. The ANN's goal is to learn a set of weights (w) between the input nodes, hidden nodes and output nodes that minimizes the total sum of squared errors. Choosing the amount of training data is known to be difficult when using an ANN, which may then operate poorly on unseen data. If there are too many hidden nodes, the network may overfit the current data, while too few hidden nodes prevent it from fitting the input accurately. In addition, it is important to pick a stopping criterion; this can involve stopping when the network's cumulative loss falls below a threshold or when a set number of epochs (iterations) has been completed.
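As one hedged reading of the stopping-criterion discussion above, the sketch below trains an MLP on document vectors with scikit-learn and stops early when the validation score stops improving; the hidden-layer size, placeholder data and other settings are assumptions, not the paper's configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Placeholder features/labels standing in for document vectors and real/fake labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Back-propagation-trained feedforward network; early_stopping holds out a validation
# split and stops when the validation score has not improved for n_iter_no_change epochs.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                    early_stopping=True, n_iter_no_change=10, random_state=0)
mlp.fit(X_train, y_train)
print("MLP accuracy:", mlp.score(X_test, y_test))
```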
3.3.4 Long short-term memory (LSTM)
Long short-term memory (LSTM) networks [16] are a specialized type of recurrent neural network with the ability to selectively record long-lasting patterns. They are an excellent choice for modeling sequential data and thus for studying complex dynamics such as human behavior. The cell state is considered the long-term memory; since the cells are recursive in nature, prior information can be stored in them. The forget gate below the cell state is used to change the cell state: it outputs values that determine which information to forget by multiplying the corresponding positions in the state by zero, while information is preserved within the cell when the output of the forget gate is one. The input gate specifies which information should enter the cell state. Finally, the output gate determines which information should be passed to the hidden state.
LSTM is commonly used in multi-modal applications such as image captioning. It also improves on vanilla recurrent neural networks (RNNs): it is designed to capture long-term dependencies and to substantially reduce the vanishing-gradient problem of RNNs. Given a sequence of words {x1, x2, ..., xn}, the LSTM reads the words one by one and maintains a memory state mt and a hidden state ht, both of dimension D [23]. Figure 4 shows the LSTM diagram.
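A minimal sketch of an LSTM classifier over word sequences is shown below using Keras; initializing the embedding layer from the trained Word2vec vectors is one plausible reading of how Word2vec feeds the LSTM here, and the vocabulary size, sequence length and layer sizes are illustrative assumptions.

```python
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.models import Sequential

# Hypothetical vocabulary and embedding matrix; in practice row i would hold the
# Word2vec vector of the word with index i (row 0 reserved for padding).
vocab_size, embed_dim, max_len = 5000, 100, 200
embedding_matrix = np.random.rand(vocab_size, embed_dim)

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embed_dim,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),          # Word2vec vectors used as fixed embeddings
    LSTM(128),                           # reads the sequence, keeping memory/hidden states
    Dense(1, activation="sigmoid"),      # probability that the document is fake
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Documents as zero-padded sequences of word indices, with labels 0 = real, 1 = fake.
X_seq = np.zeros((2, max_len), dtype="int32")
X_seq[0, :3] = [4, 17, 256]
X_seq[1, :4] = [12, 7, 89, 1033]
y = np.array([0, 1])
model.fit(X_seq, y, epochs=2, batch_size=2, verbose=0)   # tiny run for illustration only
```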
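For the classifiers trained directly on Doc2vec paragraph vectors, the sketch below shows how SVM and Logistic Regression models might be fitted with scikit-learn and how accuracy and sensitivity (recall) values of the kind tabulated below could be computed; the placeholder data and default hyperparameters are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder Doc2vec paragraph vectors and labels (0 = real, 1 = fake).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 100))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("LR", LogisticRegression(max_iter=1000))]:
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    # Sensitivity is the recall on the positive (fake) class.
    print(name, "accuracy:", accuracy_score(y_test, pred),
          "sensitivity:", recall_score(y_test, pred))
```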
Table 3. Word2vec
Evaluation     LR     RF     MLP    SVM    LSTM
Sensitivity    0.622  0.691  0.589  0.965  0.649

[Figure: comparison of the Doc2vec and Word2vec based classification results]
We introduced the Word2vec embedding in a way suitable for the LSTM, preserving the order information of the words, and the resulting accuracy was 94.3%, which is reasonable and promising. In addition, the model based on Doc2vec was compared with another model that uses TF-IDF with N-grams as feature extraction. The SVM and LR classifiers based on the Doc2vec model achieved higher accuracy than when they were based on TF-IDF. In the future, the models can be applied to different datasets with more documents, and hybridization between more than one classifier may further enhance the classification accuracy.
REFERENCES:

[1] Welbers, K. and M. Opgenhaffen, Social media gatekeeping: An analysis of the gatekeeping influence of newspapers' public Facebook pages. New Media & Society, 2018. 20(12): p. 4728-4747.
[2] Jankowski, N.W., Researching fake news: A selective examination of empirical studies. Javnost-The Public, 2018. 25(1-2): p. 248-255.
[3] Carlson, M., Fake news as an informational moral panic: the symbolic deviancy of social media during the 2016 US presidential election. Information, Communication & Society, 2020. 23(3): p. 374-388.
[4] Sabeeh, V.T., Machine Learning and Semantic Knowledge Assisted Fake News Detection Models. 2020, Oakland University.
[5] Molina, M.D., et al., "Fake news" is not simply false information: a concept explication and taxonomy of online content. American Behavioral Scientist, 2019: p. 0002764219878224.
[6] Dubey, A.K., A. Singhal, and S. Gupta, Rumor Detection System using Machine Learning. International Research Journal of Engineering and Technology (IRJET), 2020. 7(5).
[7] Lan, N.P.H., et al., Phenotypic and genotypic characteristics of ESBL and AmpC producing organisms associated with bacteraemia in Ho Chi Minh City, Vietnam. Antimicrobial Resistance & Infection Control, 2017. 6(1): p. 1-9.
[8] Rubin, V.L., et al., Fake news or truth? Using satirical cues to detect potentially misleading news. In Proceedings of the Second Workshop on Computational Approaches to Deception Detection, 2016.
[9] Tandoc Jr, E.C., Z.W. Lim, and R. Ling, Defining "fake news": A typology of scholarly definitions. Digital Journalism, 2018. 6(2): p. 137-153.
[10] Fallis, D., What is disinformation? Library Trends, 2015. 63(3): p. 401-426.
[11] Shu, K., et al., dEFEND: Explainable fake news detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019.
[12] Meel, P. and D.K. Vishwakarma, Fake news, rumor, information pollution in social media and web: A contemporary survey of state-of-the-arts, challenges and opportunities. Expert Systems with Applications, 2019: p. 112986.
[13] Ozbay, F.A. and B. Alatas, Fake news detection within online social media using supervised artificial intelligence algorithms. Physica A: Statistical Mechanics and its Applications, 2020. 540: p. 123174.
[14] Collins, B., et al., Trends in combating fake news on social media - a survey. Journal of Information and Telecommunication, 2020: p. 1-20.
[15] Mackey, T.K. and G. Nayyar, A review of existing and emerging digital technologies to combat the global trade in fake medicines. Expert Opinion on Drug Safety, 2017. 16(5): p. 587-602.
[16] Halder, M., M. Suresh, and S. Rao, Impact of Discourse in Modern AI Based NLP Applications. Institute of Scholars (InSc), 2020.
[17] Wang, W.Y., "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648, 2017.
[18] Gupta, H., et al., A framework for real-time spam detection in Twitter. In 2018 10th International Conference on Communication Systems & Networks (COMSNETS), 2018. IEEE.
[19] Al-Ansari, K., Survey on Word Embedding Techniques in Natural Language Processing.
[20] Le, Q. and T. Mikolov, Distributed representations of sentences and documents. In International Conference on Machine Learning, 2014. PMLR.
[21] Pranckevičius, T. and V. Marcinkevičius, Comparison of Naive Bayes, random forest, decision tree, support vector machines, and logistic regression classifiers for text reviews classification. Baltic Journal of Modern Computing, 2017. 5(2): p. 221.