
CSE1015 – Machine Learning Essentials

J Component Report

A project report titled


Fake News Detection in English and Hindi
Languages
By
20BAI1020 Ajay Rajkumar
20BAI1083 Manieesh KR
20BAI1182 Arjun V K

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
(WITH SPECIALIZATION IN AI AND ML)

Submitted to

Dr. R. Rajalakshmi
School of Computer Science and Engineering

April 2022

DECLARATION BY THE CANDIDATE

I hereby declare that the report titled “Fake News Detection in English and
Hindi Languages” submitted by me to VIT Chennai is a record of bona-fide
work undertaken by me under the supervision of Dr. R. Rajalakshmi,
Associate Professor, SCOPE, Vellore Institute of Technology, Chennai.

Signature of the Candidate

ACKNOWLEDGEMENT

We wish to express our sincere thanks and deep sense of gratitude to our
project guide, Dr. R. Rajalakshmi, School of Computer Science and
Engineering for her consistent encouragement and valuable guidance offered to
us throughout the course of the project work.

We are extremely grateful to Dr. R. Ganesan, Dean, School of Computer


Science and Engineering (SCOPE), Vellore Institute of Technology, Chennai,
for extending the facilities of the School towards our project and for his
unstinting support.

We express our thanks to our Head of the Department for his support
throughout the course of this project.

We also take this opportunity to thank all the faculty of the School for their
support and their wisdom imparted to us throughout the courses.

We thank our parents, family, and friends for bearing with us throughout
the course of our project and for the opportunity they provided us in undergoing
this course in such a prestigious institution.

BONAFIDE CERTIFICATE

Certified that this project report entitled “Fake News Detection in English and
Hindi Languages” is a bona-fide work of Ajay Rajkumar K (20BAI1020),
Manieesh KR (20BAI1083) and Arjun V K (20BAI1182), who carried out the
“J” project work under my supervision and guidance for CSE1015 – Machine
Learning Essentials.

Dr. R. Rajalakshmi
SCOPE

TABLE OF CONTENTS

1. Introduction
2. Literature Survey
3. Proposed Methodology
4. Results and Discussion
5. Conclusion
6. References

ABSTRACT

With the exponential growth of the World Wide Web, the amount of fake news has been
rapidly increasing, spreading misinformation and changing people’s opinions on a wide
array of topics such as companies, individuals and politics. The amount of news produced
every day is massive, and a manual fact-checking process is impossible. Therefore, in this
paper we propose a machine learning approach to detect and classify fake news. First, an
English dataset is taken from Kaggle, and Naïve Bayes, Decision Tree and Passive
Aggressive classifiers are trained and evaluated. Then, a Hindi news dataset is collected
from BBC News and the Boom Live fact checker. Pre-processing steps such as stemming
and stop-word removal are applied, followed by Natural Language Processing (NLP)
techniques such as TF-IDF vectorization to convert the text into numerical features.
Finally, multiple models, including Logistic Regression, several Naive Bayes variants, the
Passive Aggressive Classifier, KNN, Support Vector Machine, Multi-Layer Perceptron
and the ensemble methods Random Forest and AdaBoost, are trained and tested. The
Complement Naive Bayes classifier performs the best, with an accuracy of 0.87.

INTRODUCTION

With the exponential growth of the World Wide Web in the last decades, the need for
traditional media sources such as television or newspapers has diminished. Social media
has become the primary news source for many people. Anyone can create and share
anything from their smartphone, and this information spreads in an instant across the
whole world. This opens the door to malicious use, with some people deliberately creating
misleading information for personal gain.

Fake news can be used to spread misinformation online and manipulate public perception
of any topic, from a person to a company. The simplicity and accessibility of social media
and hand-held devices only worsen this problem. The spread of these rumours can cause
massive amounts of damage and confusion. At its worst, fake news can sway the votes of
people, which can ultimately decide the fate of an entire country.

Classification of misinformation is challenging. Even human experts cannot always judge
the truthfulness of an article, since no single person can know everything happening
around the world. Therefore, a machine learning approach has been taken to solve this
problem using various algorithms. Many papers have already explored English datasets,
but in this paper we not only explore an English dataset but also focus on a Hindi dataset
obtained from Boomlive and BBC News. The details of this dataset are explored further in
the paper.

This paper explores the performance of different classifiers on a Hindi news dataset. The
paper is organized as follows: 1. Introduction to the problem statement; 2. Literature
Survey of existing studies and models used to detect fake news; 3. Proposed Methodology,
describing the dataset chosen, the pre-processing steps taken and the classification models
used; 4. Results and Discussion, containing the accuracy scores and confusion matrices of
all models; 5. Conclusion, with final remarks and the observations this paper has
uncovered.

LITERATURE SURVEY

Fake news is a problem growing every year, affecting every corner of the world. The main
technique used to fact-check news articles is to consult repositories maintained by
researchers, such as ‘PolitiFact’ and ‘Snopes’. But even these repositories are not one
hundred percent accurate, since they are ultimately maintained by humans. This means the
process is not automated and requires human expertise to function properly. With the rise
of machine learning, many papers have been published in recent years on fake news
detection, mainly using Natural Language Processing (NLP) techniques to analyse news
datasets.

In “Supervised Learning for Fake News Detection”, J. C. S. Reis et al. [1] extracted many
features, mainly language features from the news text itself, but also features of the news
outlet, such as bias, credibility, location and engagement. KNN, Naive Bayes, Random
Forest, SVM and XGBoost were trained and tested; XGBoost and Random Forest
obtained the best F1-score of 0.81.

Improving upon this, Iftikhar Ahmad et al. [2] applied various ensemble methods to the
ISOT Fake News Dataset and two other datasets available on Kaggle. The LIWC2015 tool
was used for feature extraction, and hyperparameters were tuned. The best accuracy, 94%,
was obtained by a bagging classifier built on decision trees.

Fake news detection in multiple languages was explored by Pedro H. A. Faustini et al. [3]
with datasets in three language groups: Germanic, Latin and Slavic. The paper concluded
that text feature extraction with techniques such as bag-of-words and Word2Vec is largely
independent of the language. Support Vector Machines outperformed the other models,
with a maximum accuracy of 94% on the btv lifestyle dataset.

Additionally, fake news detection has been performed on a Hindi news dataset by
Sudhanshu Kumar and Thoudam Doren Singh [4]. In this work, Hindi news articles were
collected from various news sources. Pre-processing was applied to the dataset, and
algorithms such as Naive Bayes, Logistic Regression and Long Short-Term Memory
(LSTM) were used, with Term Frequency–Inverse Document Frequency (TF-IDF) for
feature extraction. LSTM achieved the best accuracy of 92.36%.

PROPOSED METHODOLOGY

The basic NLP pipeline includes preprocessing, vectorizing, training and evaluating.

Dataset Description:

The English dataset is obtained from Kaggle [6] and has 20,387 training samples with a
50:50 distribution of true and fake news. It contains the features id, title, author, text and
label. The text feature is used to train the model with the label, 1 being fake news and 0
being true news.

The paper also uses a Hindi news dataset [7] collected from the BBC News website and
the Boomlive fact-checking website. It contains 1,250 samples of fake news and 893
samples of true news across 5 columns: name, url, short description, full title and long
description. Statements are collected from a diverse range of topics and contexts. The
dataset is missing many long descriptions and full titles, so the short description is used
for model training.

Preprocessing:

Preprocessing is often the most challenging part of a natural language processing problem,
and fake news classification is no exception. First, null values were identified in the true
news and fake news datasets. Because the long description and full title columns contain
many null values, they are dropped. The URL of the news has no bearing on its
truthfulness in our case, since the articles come from a fact-checking website, so the URL
column is dropped. The name column does not exist in the fake news dataset, so it is also
dropped from the true news dataset. Finally, we are left with the short description of each
sample, with no null values.
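The column-dropping steps above can be sketched with pandas. The DataFrame below is a tiny illustrative stand-in with invented values; only the column names follow the dataset description:

```python
import pandas as pd

# Tiny stand-in for the Hindi dataset; the column names follow the dataset
# description above, but the values are invented for illustration.
df = pd.DataFrame({
    "name": ["BBC", "Boomlive", None],
    "url": ["https://example/a", "https://example/b", "https://example/c"],
    "short description": ["sample one", "sample two", "sample three"],
    "full title": [None, "a title", None],
    "long description": [None, None, "full text"],
})

# Drop columns that are mostly null or have no bearing on truthfulness.
df = df.drop(columns=["long description", "full title", "url", "name"])

# Remove any rows whose remaining text is null.
df = df.dropna(subset=["short description"])

print(df.columns.tolist())  # ['short description']
```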

Now the text is cleaned through multiple steps. First, all special characters are removed.
Next, stemming is applied to the dataset. Stemming is the process of reducing a word to
its stem or root form, and is an important part of NLP. It allows the model to extract
meaningful information from the dataset and prevents the vocabulary from being
overwhelmed by multiple words with the same meaning. Stemming is done by keeping a
list of common suffixes: each sample is split into words, and any suffixes found in those
words are removed.
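A suffix-stripping stemmer of the kind described above can be sketched as follows. The suffix list here is a small English example chosen for illustration, not the (Hindi) list used in the project:

```python
# Illustrative English suffix list; the project's actual suffix list is assumed.
SUFFIXES = ["ing", "ly", "ed", "s"]

def stem(word: str) -> str:
    """Strip the first matching suffix, trying longer suffixes first."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Require a reasonably long remaining stem to avoid over-stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in "running quickly jumped cats".split()])
# ['runn', 'quick', 'jump', 'cat']
```

Note that this crude approach maps "running" to "runn" rather than "run"; production stemmers such as the Porter stemmer use more elaborate rewrite rules.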

Next, stop words are removed. Stop words are the most common words, which do not
provide any meaningful information about the data; for example, ‘the’, ‘a’ and ‘an’ are
stop words. A list of Hindi stop words is obtained, and these words are filtered out of the
dataset.
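Stop-word filtering can be sketched as below; a small English stop-word set is used for illustration in place of the Hindi list the project relies on:

```python
# Small illustrative English stop-word set; the project uses a Hindi list.
STOP_WORDS = {"the", "a", "an", "is", "of"}

def remove_stop_words(text: str) -> str:
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(remove_stop_words("The spread of a rumour is fast"))  # "spread rumour fast"
```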

Finally, to obtain equal class sizes of true and fake news, 760 samples are taken from each
and combined to form the final dataset. A label of 1 refers to true news and a label of 0 to
fake news. Machine learning models cannot read the words in the news dataset directly;
the text has to be converted into numerical inputs before being fed to the models. This is
discussed in the next section.
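The balancing step can be sketched with pandas; the frames below are synthetic stand-ins with the class sizes reported above:

```python
import pandas as pd

# Synthetic stand-ins with the class sizes described above.
true_df = pd.DataFrame({"short description": [f"true sample {i}" for i in range(893)]})
fake_df = pd.DataFrame({"short description": [f"fake sample {i}" for i in range(1250)]})

# Take 760 samples from each class and attach labels (1 = true, 0 = fake).
n = 760
balanced = pd.concat(
    [
        true_df.sample(n, random_state=42).assign(label=1),
        fake_df.sample(n, random_state=42).assign(label=0),
    ],
    ignore_index=True,
)

print(len(balanced))  # 1520
```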

TF-IDF vectorization:

Vectorizing is the process of turning text tokens into meaningful numbers. The
transformer used in this paper is TfidfVectorizer from the scikit-learn library. TF-IDF is a
technique that weights terms according to their importance within the document
collection.

Term Frequency (TF) is the number of times each term appears in a document divided by the
total number of words in the document. Inverse Document Frequency (IDF) is the log of the
number of documents divided by the number of documents that contain the word. This
increases the weight of rare words. 

A matrix of shape (number of documents × vocabulary size) is obtained, which is huge.
However, it is a sparse matrix, meaning the majority of its entries are zeros. An n-gram
range of 1 to 2 is used, so both single words and pairs of consecutive words are
considered. This matrix is used to train the models.
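The vectorization step can be sketched with scikit-learn's TfidfVectorizer using the same n-gram range; the three documents below are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented documents standing in for the news samples.
docs = [
    "breaking news about election",
    "election results announced today",
    "celebrity wedding photos leaked",
]

# Unigrams and bigrams, as in the paper; the result is a sparse matrix
# of shape (number of documents, vocabulary size).
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

print(X.shape)  # (3, 20): 11 unique unigrams + 9 unique bigrams
```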

Models:

Logistic Regression: A supervised classification algorithm. It fits an ‘S’-shaped logistic
function that maps inputs to a value between the two extremes, 0 and 1.

Gaussian Naive Bayes: A variant of Naive Bayes that assumes a Gaussian (normal)
distribution and supports continuous data. The model is fit by finding the mean and
standard deviation of the points within each label.

Multinomial Naive Bayes: A popular approach for Natural Language Processing. It
considers a feature vector in which term frequencies are given. It does not perform well
on imbalanced data.

Complement Naive Bayes: An adaptation of the standard Multinomial Naive Bayes
classifier. Instead of calculating the probability of an item belonging to a particular class,
we calculate the probability of the item not belonging to each class and choose the
smallest value.

Passive Aggressive Classifier: An online classifier in which the model is trained by
feeding it instances incrementally, in small groups. It responds passively to correct
classifications and aggressively to incorrect ones.

K Nearest Neighbours: A simple algorithm that classifies samples based on a similarity
measure such as a distance function. A case is classified by a majority vote of its ‘k’
nearest neighbours. However, it does not work well with high-dimensional data.

Support Vector Machine: It works by creating the best decision boundary, called a
hyperplane, that can segregate n-dimensional space into classes. It chooses the extreme
points (support vectors) and maximizes the margin to find the hyperplane.

Multi-Layer Perceptron: An artificial neural network composed of multiple layers of
perceptrons. It contains an input layer, hidden layers and an output layer, and each node
uses a nonlinear activation function. It is trained using back-propagation.

Random Forest Classifier: It is an ensemble technique that fits a number of decision tree
classifiers on various sub-samples of the dataset and uses averaging to improve the predictive
accuracy and control overfitting.

AdaBoost Classifier: An ensemble technique that begins by fitting a classifier on the
original dataset and then fits additional copies of the classifier on the same dataset, with
the weights of incorrectly classified instances adjusted. This helps the subsequent
classifiers focus on the difficult cases.
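A minimal sketch of training and comparing several of the classifiers above with scikit-learn. Synthetic non-negative features stand in for the TF-IDF matrix here (Complement Naive Bayes requires non-negative input):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB

# Synthetic stand-in for the TF-IDF matrix, shifted to be non-negative.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X = X - X.min()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Complement NB": ComplementNB(),
    "Passive Aggressive": PassiveAggressiveClassifier(random_state=0),
}

# Fit each model and score it on the held-out test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2f}")
```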

RESULTS AND DISCUSSION

For evaluation, four basic counts are used: True Positive (TP), True Negative (TN), False
Positive (FP) and False Negative (FN).

Accuracy is the percentage of correctly predicted results:
Accuracy = (TP + TN) / (TP + FP + TN + FN)

Recall is the ratio of true positives to all actual positive samples:
Recall = TP / (TP + FN)

Precision is the ratio of true positives to all predicted positives:
Precision = TP / (TP + FP)

F1-score is the harmonic mean of precision (P) and recall (R):
F1-score = (2 * R * P) / (R + P)
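The four metrics follow directly from the confusion-matrix counts; the counts below are made up for illustration:

```python
# Invented confusion-matrix counts, for illustration only.
TP, TN, FP, FN = 80, 70, 20, 30

accuracy = (TP + TN) / (TP + FP + TN + FN)
recall = TP / (TP + FN)
precision = TP / (TP + FP)
f1 = 2 * recall * precision / (recall + precision)

print(accuracy)          # 0.75
print(precision)         # 0.8
print(round(recall, 3))  # 0.727
print(round(f1, 3))      # 0.762
```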
For English Dataset:

Classifier            Accuracy (in %)
Logistic Regression   94.92
Multinomial NB        69.71
Decision Tree         89.17
Passive Aggressive    96.94

For Hindi Dataset:

Classifier                                  Accuracy  Precision  Recall  F1-score
Gaussian NB                                 0.81      0.81       0.81    0.81
Multinomial NB                              0.86      0.87       0.86    0.86
Complement NB                               0.87      0.87       0.87    0.87
Bernoulli NB                                0.84      0.85       0.84    0.84
Passive Aggressive                          0.85      0.85       0.85    0.85
Logistic Regression                         0.86      0.87       0.86    0.86
K Nearest Neighbours                        0.81      0.81       0.81    0.81
Support Vector Machine (linear kernel)      0.84      0.85       0.84    0.84
Support Vector Machine (rbf kernel)         0.85      0.87       0.85    0.85
Random Forest                               0.81      0.81       0.81    0.81
Multi-Layer Perceptron                      0.86      0.87       0.86    0.86
AdaBoost                                    0.81      0.81       0.81    0.81

CONCLUSION

Classification is performed on English and Hindi news datasets to separate fake news
from true news. Even though plenty of papers have been published on fake news
classification, there is still much to be explored with different datasets, languages and
models. On the English dataset, the Passive Aggressive Classifier gives the best accuracy
of 96.94%. This paper also evaluates many different models on a Hindi news dataset, such
as Logistic Regression, Naive Bayes classifiers and even ensemble methods such as
Random Forest. The results show that the Complement Naive Bayes classifier performs
the best across all metrics, with 87% accuracy. It is followed closely by the Multinomial
Naive Bayes classifier, Logistic Regression and the Multi-Layer Perceptron, with 86%
accuracy. More fine-tuning of hyperparameters in the MLP may yield better results.
Finally, the K Nearest Neighbours classifier performs the worst, with 81% accuracy, due
to the high dimensionality of the data.

REFERENCES

1. J. C. S. Reis, A. Correia, F. Murai, A. Veloso and F. Benevenuto, "Supervised Learning
for Fake News Detection," IEEE Intelligent Systems, vol. 34, no. 2, pp. 76-81, March-
April 2019, doi: 10.1109/MIS.2019.2899143.

2. Iftikhar Ahmad, Muhammad Yousaf, Suhail Yousaf, Muhammad Ovais Ahmad, "Fake
News Detection Using Machine Learning Ensemble Methods", Complexity, vol. 2020,
Article ID 8885861, 11 pages, 2020. https://doi.org/10.1155/2020/8885861

3. Pedro Henrique Arruda Faustini, Thiago Ferreira Covões, “Fake news detection in
multiple platforms and languages”, Expert Systems with Applications, Volume 158, 2020,
113503, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2020.113503.

4. Sudhanshu Kumar, Thoudam Doren Singh, “Fake News Detection on Hindi News
Dataset”, Global Transitions Proceedings, 2022, ISSN 2666-285X,
https://doi.org/10.1016/j.gltp.2022.03.014.

5. A. Jain, A. Shakya, H. Khatter and A. K. Gupta, "A Smart System for Fake News
Detection Using Machine Learning," 2019 International Conference on Issues and
Challenges in Intelligent Computing Techniques (ICICT), 2019, pp. 1-4, doi:
10.1109/ICICT46931.2019.8977659.

6. https://www.kaggle.com/code/gauravduttakiit/fake-news-classifier-with-ml-using-nlp-1/
data?select=train.csv

7. https://github.com/Jelwin13afc/FakeNewsDetection

APPENDIX
Implementation / Code

-Uploaded as separate file-
