CSE1015 - Machine Learning Essentials: J Component Report
J Component Report
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
(WITH SPECIALIZATION IN AI AND ML)
Submitted to
Dr. R. Rajalakshmi
School of Computer Science and Engineering
April 2022
DECLARATION BY THE CANDIDATE
I hereby declare that the report titled “Fake News Detection in English
and Hindi Language” submitted by me to VIT Chennai is a record of bona-fide
work undertaken by me under the supervision of Dr. R. Rajalakshmi,
Associate Professor, SCOPE, Vellore Institute of Technology, Chennai.
ACKNOWLEDGEMENT
We wish to express our sincere thanks and deep sense of gratitude to our
project guide, Dr. R. Rajalakshmi, School of Computer Science and
Engineering for her consistent encouragement and valuable guidance offered to
us throughout the course of the project work.
We express our thanks to our Head of the Department for his support
throughout the course of this project.
We also take this opportunity to thank all the faculty of the School for their
support and their wisdom imparted to us throughout the courses.
We thank our parents, family, and friends for bearing with us throughout
the course of our project and for the opportunity they provided us in undergoing
this course in such a prestigious institution.
BONAFIDE CERTIFICATE
Certified that this project report entitled “Fake News Detection in English and
Hindi Language” is a bona-fide work of Ajay Rajkumar K (20BAI1020),
Manieesh KR (20BAI1083) and Arjun V K (20BAI1182), who carried out the “J”
component project work under my supervision and guidance for CSE1015 – Machine
Learning Essentials.
Dr. R. Rajalakshmi
SCOPE
TABLE OF CONTENTS
Ch. No   Chapter
1        Introduction
2        Literature Survey
3        Proposed Methodology
4        Results and Discussion
5        Conclusion
6        References
ABSTRACT
With the exponential growth of the World Wide Web, the amount of fake news has been
rapidly increasing, spreading misinformation and changing people’s opinions on a wide array
of topics such as companies, individuals and politics. The amount of news produced every
day is massive, and manual fact checking is impossible at that scale. Therefore, in this paper
we propose a machine learning approach to detect and classify fake news. First, an English
dataset is taken from Kaggle, and Naïve Bayes, Decision Tree and Passive Aggressive
classifiers are trained and evaluated. Then, a Hindi news dataset is collected from the BBC
News website and the Boom Live fact checker. Pre-processing steps such as stemming and
stop-word removal are applied, followed by Natural Language Processing (NLP) techniques
such as TF-IDF vectorization to convert the text into numerical data. Finally, multiple models,
including Logistic Regression, several Naive Bayes variants, Passive Aggressive Classifier,
KNN, Support Vector Machine, Multi-Layer Perceptron, and the ensemble methods Random
Forest and AdaBoost, are trained and tested. The Complement Naive Bayes classifier
performs best, with an accuracy of 0.87.
INTRODUCTION
With the exponential growth of the World Wide Web over the last decades, the need for
traditional media sources such as television and newspapers has diminished, and social media
has become the primary news source for many. Anyone can create and share anything from a
smartphone, and this information spreads to the whole world in an instant. This opens the
door to malicious use, with some people deliberately creating misleading information for
their own personal gain.
Fake news can be used to spread misinformation online and manipulate public perception of
any topic, from a person to a company. The simplicity and accessibility of social media and
hand-held devices only worsen this problem. The spread of such rumours and fake news can
cause massive amounts of damage and confusion. At its worst, fake news can sway people’s
votes, which ultimately decides the fate of their whole country.
This paper explores the performance of different classifiers on English and Hindi news
datasets. The organization of this paper is as follows: 1. Introduction presents the problem
statement; 2. Literature Survey reviews existing studies and models used to detect fake news;
3. Proposed Methodology describes the datasets chosen, the pre-processing steps taken and
the classification models used; 4. Results and Discussion contains the accuracy scores and
confusion matrices of all models; 5. Conclusion includes the final remarks and the
observations this paper has uncovered.
LITERATURE SURVEY
Fake news is a problem that grows every year, affecting every corner of the world. The main
technique used to fact check news articles is consulting repositories maintained by
researchers, such as ‘PolitiFact’ and ‘Snopes’. But even these repositories are not one
hundred percent accurate, since they are ultimately maintained by humans. This means the
process is not automated and requires human expertise to function properly. With the rise of
machine learning, many papers on fake news detection have been published in recent years,
mainly using Natural Language Processing (NLP) techniques to analyse news datasets.
In Supervised Learning for Fake News Detection, J. C. S. Reis et al. [1] extracted many
features, mainly language features, i.e., the news text itself, but also features of the news
outlet such as bias, credibility, location and engagement. KNN, Naive Bayes, Random
Forest, SVM and XGBoost were trained and tested; XGBoost and Random Forest obtained
the best F1-score of 0.81.
Improving upon this, Iftikhar Ahmad et al. [2] applied various ensemble methods to the
ISOT Fake News Dataset and two other datasets available on Kaggle. The LIWC2015 tool
was used for feature extraction and hyperparameters were tuned. The best accuracy, 94%,
was obtained by a bagging classifier of decision trees.
Fake news detection in multiple languages was studied by Pedro H. A. Faustini et al. [3]
using datasets from three language groups: Germanic, Latin and Slavic. The paper concluded
that extracting text features is mostly independent of the language, with techniques such as
bag-of-words and Word2Vec. Support Vector Machines outperformed the other models, with
a maximum accuracy of 94% on the btv lifestyle dataset.
Additionally, fake news detection on a Hindi news dataset was done by Sudhanshu Kumar
and Thoudam Doren Singh [4]. In this work, Hindi news articles from various news sources
were collected, the dataset was pre-processed, and algorithms such as Naive Bayes, Logistic
Regression and Long Short-Term Memory (LSTM) were applied, with Term Frequency
Inverse Document Frequency (TF-IDF) for feature extraction. LSTM achieved the best
accuracy of 92.36%.
PROPOSED METHODOLOGY
The basic NLP pipeline includes preprocessing, vectorizing, training and evaluating.
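The four stages can be chained with scikit-learn's Pipeline. A minimal sketch, using hypothetical toy samples rather than the paper's data:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB

# Hypothetical stand-in samples for the preprocessed news text.
texts = ["minister inaugurates new hospital wing",
         "miracle cure discovered overnight",
         "court releases official verdict",
         "secret celebrity scandal exposed"]
labels = [1, 0, 1, 0]  # 1 = true news, 0 = fake news

# Vectorizing and training chained into one estimator;
# evaluation is then a single score()/predict() call.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", ComplementNB()),
])
pipe.fit(texts, labels)
pred = pipe.predict(["official verdict released by court"])
```

The same pipeline object can then be scored on held-out samples, keeping the preprocessing and the model in sync.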
Dataset Description:
The English dataset is obtained from Kaggle [6] and has 20,387 training samples with a
50:50 distribution of true and fake news. It contains the features id, title, author, text and
label. The text feature is used to train the model against the label, 1 being fake and 0 being
true news.
The paper also uses a Hindi news dataset [7] collected from the BBC News website and the
Boomlive fact-checking website. It contains 1250 samples of fake news and 893 samples of
true news. It has 5 columns: name, url, short description, full title and long description.
Statements are collected from a diverse range of topics and contexts. The dataset is missing
many long descriptions and full titles, so the short description is used for model training.
Preprocessing:
Preprocessing is the most challenging part of any natural language processing problem, and
fake news classification is no exception. First, all null values in the true news and fake news
datasets were located. Because of the many null values in long description and full title,
those columns are dropped. The URL of the news has no bearing on its truthfulness in our
case, since all articles come from a fact-checking website, so the URL is also dropped. The
name column does not exist for fake news, so it is dropped from the true news dataset.
Finally, we are left with the short description of the samples, without any null values.
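Assuming the dataset is loaded into a pandas DataFrame with the five columns named above, the column dropping and null handling could look like this (the toy frame is illustrative, not the real data):

```python
import pandas as pd

# Hypothetical frame mirroring the Hindi dataset's columns.
df = pd.DataFrame({
    "name": ["a", None, "c"],
    "url": ["u1", "u2", "u3"],
    "short description": ["news one", "news two", None],
    "full title": [None, None, "t"],
    "long description": [None, "d", None],
})

# Drop the columns described above: mostly-null or irrelevant.
df = df.drop(columns=["url", "full title", "long description", "name"])

# Keep only rows whose short description is present.
df = df.dropna(subset=["short description"]).reset_index(drop=True)
```

After these two steps only the short description column remains, with no null values.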
Now the text is cleaned in multiple steps. First, all special characters are removed. Next,
stemming is performed on the dataset. Stemming is the process of reducing a word to its
stem or root, and it is an important part of NLP. It allows the model to extract meaningful
information from the dataset and prevents it from being overwhelmed by multiple words
with the same meaning. Stemming is done by keeping a list of common suffixes: each
sample is split into its words, and any of these suffixes found in a word are removed. Next,
stop words are removed. Stop words are the most common words, which do not provide any
meaningful information about the data; for example, ‘the’, ‘a’ and ‘an’ are stop words. A
list of Hindi stop words is obtained and those words are filtered out of the dataset.
Finally, to have equal class sizes, 760 samples are taken from each of the true and fake news
sets and combined to make the final dataset. A label of 1 refers to true news and a label of 0
to fake news. Since machine learning models cannot read raw words, the text has to be
converted to numerical input before being fed to the models. This is discussed in the next
section.
TF-IDF vectorization:
Vectorizing is the process of turning text tokens into meaningful numbers. The transformer
used in this paper is TfidfVectorizer from the scikit-learn library. TF-IDF is a technique that
weights terms according to their importance within the document.
Term Frequency (TF) is the number of times each term appears in a document divided by the
total number of words in the document. Inverse Document Frequency (IDF) is the log of the
number of documents divided by the number of documents that contain the word. This
increases the weight of rare words.
A matrix of shape (number of documents × vocabulary size) is obtained, which is huge. But
it is a sparse matrix, meaning the majority of its entries are zeros. An n-gram range of 1 to 2
is used, meaning both single words and pairs of consecutive words are considered. This
matrix is used to train the models.
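A small illustration of the vectorizer settings described above (the three documents are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import scipy.sparse as sp

docs = ["fake news spreads fast",
        "true news spreads slowly",
        "fake claim spreads"]

# Unigrams and bigrams, matching the n-gram range of 1 to 2.
vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)

print(X.shape)          # (number of documents, vocabulary size)
print(sp.issparse(X))   # True: stored as a sparse matrix
```

Here the vocabulary holds 7 unigrams plus 7 bigrams, so X has shape (3, 14); on the real corpus the vocabulary is far larger, which is why sparse storage matters.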
Models:
Gaussian Naive Bayes: It is a variant of Naive Bayes that assumes a Gaussian (normal)
distribution and supports continuous data. The model is fit by finding the mean and standard
deviation of the points within each label.
Complement Naive Bayes: It is an adaptation of standard Multinomial Naive Bayes
classifier. Instead of calculating the probability of an item belonging to a particular class, we
calculate the probability of the item not belonging to it and choose the smallest value.
Support Vector Machine: It works by finding the best decision boundary, called a
hyperplane, that segregates n-dimensional space into classes. It chooses the extreme points
(support vectors) and maximizes the margin around the hyperplane.
Random Forest Classifier: It is an ensemble technique that fits a number of decision tree
classifiers on various sub-samples of the dataset and uses averaging to improve the predictive
accuracy and control overfitting.
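Training several of these models on TF-IDF features can be sketched as follows; the corpus and labels are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Invented stand-in corpus; the real run used the preprocessed Hindi dataset.
texts = ["shocking secret cure revealed",
         "minister opens new hospital",
         "aliens endorse local candidate",
         "court publishes final verdict"]
labels = [0, 1, 0, 1]  # 1 = true news, 0 = fake news

X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

models = {
    "Complement NB": ComplementNB(),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (rbf)": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    # Note: GaussianNB would need X.toarray(), since it expects dense input.
    model.fit(X, labels)
    print(name, model.score(X, labels))
```

In the actual experiments each model is fit on the training split and scored on a held-out test split rather than on the training data.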
RESULTS AND DISCUSSION
For evaluation, four basic counts are used: True Positive (TP), True Negative (TN), False
Positive (FP) and False Negative (FN).
Accuracy is the fraction of correctly predicted results:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall is the ratio of true positives to all actual positive samples:
Recall = TP / (TP + FN)
Precision is the ratio of true positives to all predicted positive samples:
Precision = TP / (TP + FP)
F1-score is the harmonic mean of precision (P) and recall (R):
F1-score = (2 × P × R) / (P + R)
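These formulas can be computed directly from confusion-matrix counts and cross-checked against scikit-learn on a small made-up example:

```python
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Made-up labels and predictions for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# ravel() of the 2x2 matrix yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

acc = (tp + tn) / (tp + tn + fp + fn)
prec = tp / (tp + fp)
rec = tp / (tp + fn)
f1 = 2 * prec * rec / (prec + rec)

# Hand computation agrees with scikit-learn's metrics.
assert acc == accuracy_score(y_true, y_pred)
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-9
print(acc, prec, rec, f1)
```

With TP=3, TN=3, FP=1 and FN=1 here, all four metrics come out to 0.75.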
For the English dataset:
Model            Accuracy (%)
Multinomial NB   69.71
For the Hindi dataset (accuracy, precision, recall, F1-score):
Model                                      Acc    Prec   Rec    F1
Support Vector Machine (linear kernel)     0.84   0.85   0.84   0.84
Support Vector Machine (rbf kernel)        0.85   0.87   0.85   0.85
CONCLUSION
Classification is performed on English and Hindi news datasets to separate fake news from
true news. Even though plenty of papers have been published on fake news classification,
there is still much to be explored with different datasets, languages and models. On the
English dataset, Passive Aggressive Classifier gives the best accuracy of 96%. This paper
also covers many different models on a Hindi news dataset, such as Logistic Regression,
Naive Bayes Classifier and even ensemble methods such as Random Forest. The results show
that Complement Naive Bayes Classifier performs the best across all metrics with 87%
accuracy. It is followed closely by Multinomial Naive Bayes Classifier, Logistic Regression
and Multilayer Perceptron, each with 86% accuracy. Further tuning of the MLP
hyperparameters may yield better results. Finally, the K Nearest Neighbour classifier
performs the worst, with 81% accuracy, owing to the high dimensionality of the data.
REFERENCES
1. J. C. S. Reis, A. Correia, F. Murai, A. Veloso and F. Benevenuto, "Supervised Learning
for Fake News Detection," IEEE Intelligent Systems, vol. 34, no. 2, pp. 76-81, 2019.
2. Iftikhar Ahmad, Muhammad Yousaf, Suhail Yousaf, Muhammad Ovais Ahmad, "Fake
News Detection Using Machine Learning Ensemble Methods", Complexity, vol. 2020,
Article ID 8885861, 11 pages, 2020. https://doi.org/10.1155/2020/8885861
3. Pedro Henrique Arruda Faustini, Thiago Ferreira Covões, “Fake news detection in
multiple platforms and languages”, Expert Systems with Applications, Volume 158, 2020,
113503, ISSN 0957-4174, https://doi.org/10.1016/j.eswa.2020.113503.
4. Sudhanshu Kumar, Thoudam Doren Singh, “Fake News Detection on Hindi News
Dataset”, Global Transitions Proceedings, 2022, ISSN 2666-285X
https://doi.org/10.1016/j.gltp.2022.03.014.
5. A. Jain, A. Shakya, H. Khatter and A. K. Gupta, "A smart System for Fake News
Detection Using Machine Learning," 2019 International Conference on Issues and
Challenges in Intelligent Computing Techniques (ICICT), 2019, pp. 1-4, doi:
10.1109/ICICT46931.2019.8977659.
6. https://www.kaggle.com/code/gauravduttakiit/fake-news-classifier-with-ml-using-nlp-1/
data?select=train.csv
7. https://github.com/Jelwin13afc/FakeNewsDetection
APPENDIX
Implementation / Code