Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 46

SPAM REVIEW DETECTION USING MACHINE LEARNING

AND DEEP LEARNING CLASSIFIERS


Report submitted to the SASTRA Deemed to be University
as the requirement for the course

CSE 300 - MINI PROJECT

Submitted by

AMMISETTY BHANU VAMSI PRIYA


(Reg. No: 222003008, B. Tech CSE)
APPIREDDY SAI PRASADA REDDY
(Reg. No: 2220030613, B. Tech CSE)
MALLINENI MEGHANA
(Reg. No: 222003063, B. Tech CSE)

June 2022

SRINIVASA RAMANUJAN CENTRE


KUMBAKONAM, TAMIL NADU, INDIA – 612 001

i
SRINIVASA RAMANUJAN CENTRE
KUMBKONAM – 612 001

Bonafide Certificate

This is to certify that the report titled “Spam review detection using machine Learning

and Deep Learning classifiers” submitted as a requirement for the course, CSE 300 - MINI

PROJECT for B.Tech. is a bonafide record of the work done by Ms. A Bhanu Vamsi Priya (Reg.

No. 222003008, B. Tech-CSE), Mr. A Sai Prasada Reddy (Reg.No. 222003013, B. Tech -CSE),

Ms. M Meghana (Reg. No. 222003079, B. Tech-CSE) during the academic year 2021-22, in the

Srinivasa Ramanujan Centre, under my supervision.

Signature of Project Supervisor :


Name with Affiliation : Mrs. Umamaheswari. P
Date : 10.01.2022

Mini Project Viva voce held on 29-06-2022

Examiner 1 Examiner 2

ii
ACKNOWLEDGEMENT

We pay our sincere obeisance to the God Almighty for his grace and infinite mercy and for showing
on us his choicest blessings.

We would like to express our thanks to our Chancellor Prof. R. Sethuraman, Vice Chancellor Dr. S.
Vaidhyasubramaniam and Registrar Dr. R. Chandramouli for having given us an opportunity to be
a student of this esteemed institution.

We express our deepest thanks to Dr. V. Ramaswamy, Dean and Dr. A. Alli Rani, Associate
Dean,Srinivasa Ramanujan Centre for their constant support and suggestions when required without
any reservations.

We exhibit our pleasure in expressing our thanks to Mrs. Umamaheswari. P, Assistant Professor,
Department of CSE, our guide for her ever encouraging spirit and meticulous guidance for the
completion of the project.

We would like to thank our panel members Mr. Varahaswamy R, and Mrs. Menaga A for correcting
our mistakes in the project and for their marvelous support to complete the project successfully.

We would like to place on record the benevolent approach and painstaking efforts of guidance and
correction of Dr. V. Kalaichelvi, Smt. S. Hemamalini, the project coordinators and all department
staff to whom we owe our hearty thanks for ever.

Without the support of our parents and friends this project would never have become reality. We
dedicate this work to our well-wishers, with love and affection.

iii
List of Figures

Figure No. Title Page No.

1.11 Block diagram of the proposed methodology 3

1.12 Proposed CNN model architecture 4

1.13 Proposed RNN model architecture 5

1.14 Number of ham and spam messages from dataset 21

1.15 Length distribution of ham and spam messages 21


before and after cleaning the data
1.16 Word count distribution of ham and spam 21
messages before and after cleaning the data
1.17 Performance comparision graph for SVC and 22
MultinomialNB
1.18 Confusion Matrix for MultinomialNB before 22
and after normalization
1.19 Confusion Matrix for CNN before and after 23
normalization
1.2 Confusion Matrix for RNN before and after 23
normalization
1.21 Performance Evaluation Graph for CNN, 24
RNN, SVC and MultinomialNB

iv
List of Tables

Table No. Title Page No.


2.1 Input dataset 20
2.2 Dataset after removing unwanted columns 20
2.3 Dataset after preprocessing 20

v
Abbreviations

CNN Convolutional Neural Networks

RNN Recurrent Neural Networks

LSTM Long Term Short Memory

SVM Support Vector Machine

SVC Support Vector Classifier

vi
Abstract

In today's environment, a powerful and reliable method for detecting spam reviews is critical in order to
acquire things without getting duped by internet companies. In many cases, There are choices for
publishing reviews on web sites, and thus  allowing for the creation of fraudulent paid or untruthful
reviews. These fabricated evaluations have the potential to deceive the general public and  put them in a
confusion about whether or not to believe the review. Machine learning approaches that are well-known
have been introduced  to deal with the issue of spam review detection. Most recent research focuses on
supervised learning. When it comes to online updates,In this project, To achieve this, an Deep learning
strategies such as CNN, Recurrent Neural Networks (RNN) - Long Short Term Memory (LSTM) and
other traditional machine learning algorithms such as Vector Support Machine (SVM) and -Multinomial
Naive are developed
.The majority of recent research has focused on supervised learning.  When it comes to online reviews, In
this project, we'll look at how to spot any text reviews that are dishonest. In order to accomplish this, we
proposed some deep learning techniques like CNN and a variant of Recurrent Neural Networks (RNN) - 
Long Short Term Memory(LSTM) and also some traditional machine learning algorithms like Support
Vector Machine (SVM) and Multinomial Naive bayes. Lastly,we have done the comparision of the
accuracy for both machine learning and deep learning methods discussed.

Keywords: Spam Review Detection, SVM, Multinomial NB, CNN, LSTM

vii
Table of Contents
Title Page No.
Bonafide Certificate ii
Acknowledgements iii
List of Figures iv
List of Tables v
Abbreviations vi
Abstract vii
1 Summary of the base paper 1

2 Merits and Demerits of the base paper 4

3 Source Code 6

4 Snapshots 13

5 Conclusion and Future Plans 15

6 References 16

7 Appendix -Base Paper 18

8 Appendix-Originality Report 31

viii
CHAPTER 1

SUMMARY OF THE BASE PAPER

Title: Detection of spam reviews through a hierarchical attention architecture with N-gram CNN and Bi-
LSTM.

Journal name: Information Systems

Publisher: Elsevier

Year: 2022

Introduction:

The word spam usually refers to unwanted text that is sent or received through social media
sites like Facebook, Twitter, YouTube, e-mail, etc. Produced by spam senders to divert the attention of
users of the social media platform for marketing and distribution purposes other malware etc. Spam emails
are sent in bulk to various users, via with the intent of deceiving them into clicking on the fake ads and
distributing malicious software their tools. Spam messages provide a good source of income for spam
senders, and therefore, they continue to spread it quickly. Fight spam in email, many strategies are
involved, but the spam content continues to grow. These spam messages cause financial loss to business
email clients and for regular email users.
Spam is common on social networking sites such as YouTube, and it is especially
inclusive comments and links to pornographic websites and videos. This comment is sometimes created
automatically by bots. Although the definition of spam on online video game sharing services is
controversial, message -filled conditions , group joining requests , copyright infringement etc are
sometimes called spam. Usually comments are accompanied by links to commercial websites. Some
splogs have no different content and content items posted on other websites.
Neural-based methods have also been introduced to detect spam with great success for in-
depth learning in the native language process in recent years. These neural network models have been
shown to be effective in capturing hidden semantic information in spam updates that are difficult to detect
using various standard manual features. These in-depth learning-based approaches can reduce the
complexity of the engineering aspect and gain better performance than conventional speculation methods
such as Vector Support Machine or Naive Bayes. However, as there is a serious flaw again abbreviations
created by spam are usually meaningless, these methods are complex to learn the representation of high
level, updates from the multi-granularity angle. Other than that, these methods cannot model interactions
between sentences in a review, which is important to reflect the global contexts of the review.

Proposed Method:

The authors of the base paper proposed CNN(Convolution Neural Networks) and
LSTM(Long Short Term Memory) for detection of spam reviews. They proposed these deep learning-
based methods because they can reduce manual feature engineering complexity and achieve better
performance than traditional prediction methods such as Support Vector Machine or Naïve Bayes.
Additionally, Multinomial Naïve Bayes is also implemented for performance comparison and confusion
matrix is constructed for better understanding.
They constructed spam review detection in two hierarchial levels. Firstly word2sent-Level which
uses the N-Gram CNN to bring out the multigranularity and important information based on word
embedding. Secondly,Sent2doc level is performed with a combination of convolutional structure and Bi-
1
LSTM to take out the valuable information depending on the representation of a sentence. Finally spam
reviews are identified using softmax classification to the documents.

Methodology:

The goal of this project is to find ways and classify spam content in an online review system. There
are several sources of text data in fake reviews. Various methods have been used to detect and control
spam text. Our efforts are primarily motivated by a desire to learn more about the detection of distinctive
spam text and classification algorithms. This section discusses the process we used to perform our detailed
spam detection review.

Database Selection :
 The dataset in this project contains tagged SMS Messages that are gathered for SMS Spam
Research.
 The dataset contains two attributes one with text SMS messages with 5573 messages according to
being ham or spam in english.
 Classification with MultinomialNB, LSTM, CNN (99%) | Kaggle

Several processes need to be done in order to perform the required task of spam detection and
classification. Data is gathered as the first step. After data collection, a preliminary processing work begins,
which incorporates a few Indigenous Language Processing (NLP) methods to extract unwanted or
unimportant data. In the next section features are extracted from text data using methods such as Term
Frequency-Inverse Document Frequency (TF-IDF), N-grams, and Word embedding.Words/text is converted
into numerical vector that can be used for classification with the help of this feature extraction
approaches .The last step is to detect the spam, where the Machine Learning(ML) techniques : Support
Vector Machine(SVM) , Multinomial Naïve Bayes and Deep Learning (DL) techniques : Convolutional
Neural Networks(CNN) ,Long Term Short Memory(LSTM) are used which divides text and identifies the
category it fits in as spam or ham(non-spam).

2
Fig 1.1 Block diagram of proposed methodology

3
Pre Processing of textual data is done by using the techniques like tokenization for dividing
the sentences into smaller pieces known as tokens, Stemming to reduce the words to their root words
with basic meaning using natural language toolkit library ,Lemmatization lemmatizes the text words to
their origin of root words or lemmas,Normalization to reduce the number of different tokens in a text by
reducing the name to a simple version,stop words removal removes the words like “a,” “The,” “an,” and
“so” which are frequently used as it reduces the size of the database.
As most of the machine learning algorithms rely on numerical data rather than text,
feature extraction is performed to convert the text input into number vectors using count vectorizer
module. Distance plots are generated for the text message distribution and word count distribution before
and after cleaning the data to show the variation.The dataset is divided for training and testing.
Machine Learning models are implemented to detect the spam reviews and the model
performance graph is plotted to observe the comparision. The Multinomial Naïve Bayes gives the best
results and the confusion matrix is plotted to analyse the performance in more detail.The CNN and
LSTM models are created and and the dataset is trained generating the score values. Finally all the
machine learning and deep learning classifiers implemented are compared in terms of their performance
in detecting spam reviews or messages.

To work with CNN for text classification we implemented word embedding layer and one-dimensional
convolutional network.The embedding maps the words that are similar .It maps the words into an
embedding space and converts the text into numeric values.Uncommon words get a large index value.A
text CNN model is created and the architecture diagram for the proposed CNN model in this project is as
followed in Fig1.2.

Fig 1.2 Proposed CNN model architecture

In LSTM, multiple word strings can be used to determine which class it belongs to. An LSTM model is
defined with appropriate layers like embedding,LSTM,dropout,dense etc. and more accurate output is
obtained.The created LSTM model is trained with the train data and validated with the help of test data.
The architecture for the designed model can be observed in Fig 1.3.

4
Fig 1.3 Proposed LSTM(RNN) model architecture

Formulae used to calculate the performance measures:

 F1 score = 2∗ ( precision∗recall ) ¿( precision+ recall ¿


 Recall = TP /(TP+FN)
 Precision = TP/(TP+FP)
 Where, TP = True positive
FN = False Negative
FP = False Positive
 1D,2D Convolution:
 Feature map output calculation = (w-f+2p/s) +1
Where w = length of array, f = filter, p = pooling, s = strides

5
CHAPTER 2

MERITS AND DEMERITS

Literature review:

[1]. Ajay Rastogi, Monica Mehrotra, Syed Shafat Ali proposed method analyzes the effectiveness of two
types of features-Behavioural and Textual in Opinion spam detection. The methods discussed here can be
enhanced to perform well for all the data sets not only for YelpZip and YelpNYC of Yelp.com.

[2]. Works done by Tiago A. Almeida, Jurandy Almeida, Akebo Yamakami involves Comparitive study of
term selection techniques with respect to Naïve Bayes classifiers to evaluate the quality of spam filters.
While the filter tries to evolve its prediction capacity, the spammers try to evolve their spam messages and
create noise. By avoiding this problem probability estimation can be made easy.

[3]. Shirin Noekhah , Naomie binti Salim , Nor Hawaniah Zakaria proposed method that analyzes to what
extent the model accurately detects spammed entities. The MGSD model was applied to the Amazon dataset
in order to classify the entities into spam and non-spam basedon their spamicity scores. As the existing
features cannot capture spam increase in computational resources will be helpful

[4]. Work on LSTM has been proposed by Kasem Khalil, Omar Eldash, Ashok Kumar, Magdy Bayoumi
with variations to make it simple and improve its performance of accuracy. It can be improved so that it can
work effectively on larger number of computation units.

[5]. Mehul Bhundiya, Maulik Trivedi proposed method involves SVM Classification method. More
parameters can be added to the unsupervised detection process to boost the systems overall performance.

[6]. G . M. Shahariar, Swapnil Biswas, Faiza Omar, Faisal Muhammad Shah, Samiha Binte Hassan
proposed method involves Multilayer perceptron (MLP), KNN, LSTM. Overfitting problem might be
decreased by certain improvements. Overfitting problem might effect the detection accuracy due to the
small amount of data. They can use more dimensions for word embeddings.

[7]. Avinash Chandra Pandey, Dharmveer Singh Rajpoot proposed method involves Cuckoo Search


Clustering and Fermat Spiral methods. More feature selection techniques and optimization algorithms can
be explored for better accuracy

[8]. Yuliya Kontsewayaa, Evgeniy Antonova, Alexey Artamonov proposed method involves Comparision
of the performance machine learning algorithms - Naive Bayes, K-Nearest Neighbors, SVM, Logistic
regression, Decision tree, Random forest – in spam classification. Logistic regression and NB give highest
level of accuracy.The results can be more improved by combining algorithms and filtering methods.

[9]. For spam detection in Arabic opinion texts, Radwa M.K. Saeed, Sherine Rady, Tarek F. Gharib
proposed a method that uses Rule based, classical machine learning, majority voting ensemble and stacking
ensemble classification methods. Enhancement of the Arabic spam reviews detection methods, deep
learning methods can be studied. We can improve these methods for detecting the spam other languages
also.

[10]. Draško Radovanović, Božo Krstajić proposes a brief overview of spam detection methods that uses
different datasets that yield different results. A lack of gold standard dataset and combining content based
and
6
reviewer

7
Merits:

 The methods proposed in this project model gave more accurate results than compared to
other methods.
 The implementation of CNN and LSTM made the categorization and detection of spam in
the text more detailed and accurate..
 LSTM have a good hold in memorizing large patterns as it have multiple hidden layers.All
the unwanted and unimportant information gets dumped as it proceeds through every layer
one by one.

Demerits:

 The detection of spam in the review messages in this project is confined to prearranged data.
 A hybrid CNN-RNN model can be preferred for further more works.

8
9
10
11
12
13
14
15
16
17
18
19
20
CHAPTER -4

OUTPUT SNAPSHOTS

Table 2.1 Input dataset

Table 2.2 Dataset after removing unwanted columns

Table 2.3 Dataset after preprocessing

21
Fig 1.4 Number of ham and spam messages from dataset

Fig 1.5 Length distribution of ham and spam messages before and after cleaning the data

Fig 1.6 Word count distribution of ham and spam messages before and after cleaning the data

22
Fig 1.7 Performance comparision graph for SVC and MultinomialNB

Fig 1.8 Confusion Matrix for MultinomialNB before and after normalization

23
Fig 1.9 Confusion matrix for CNN before and after normalization

Table 2.5
RNN Summary Table

Fig 2.0 Confusion matrix for RNN before and after normalization

24
Fig 2.1 Performance Evaluation Graph for CNN ,RNN ,SVC and MultinomialNB

25
CHAPTER 5

CONCLUSION AND FUTURE PLANS

Conclusion:

Many of the techniques for spam detection are detailed in systematic review of spam content
and classification. The authors of the base paper tested their method for three datasets and achieved
accuracy of 83% for Hotel reviews, 87.1% for Restaurant reviews, and 91% for Doctor reviews. SMS spam
reviews dataset is used consisting of 5573 messages .Many different feature sets are tested and trained to
perform text categorization . For final accuracy score all extracted features were taken into consideration
and proposed method has achieved accuracy of about 98.38%,98.83% for SVC, and Multinomial NB
classification algorithms respectively. CNN and LSTM algorithms epocs are constructed, accuracy and loss
are calculated. This proves proposed method to be a promising one when all features taken in consideration
even for large datasets. The study also looked at different pre-processing techniques, feature extraction, and
spam classification. This research will assist researchers in conducting research in the field of spam
detection as it highlights some of the best work done in this field. The various previous activities in the
previous spam analysis, feature removal, and editing will assist researchers in determining the most
appropriate techniques for their research in this area.

Future plans:

In the future, there are numerous opportunities to improve our work. Most importantly, we
have worked only on the predefined data but we can extend this work for user defined text using Natural
Language Preprocessing (NLP). Second, by incorporating more advanced Deep Learning algorithms into
the data labelling process, the efficiency can be enhanced in spam review detection. The model's
performance can be assessed using GloVe and other pre-trained word representation vectors. Fourth we
exclusively worked with text reviews and did not include review spammers in our research. Review
spammer identification, as well as review spam detection, may be included in the future.
Finally, further CNN and RNN versions, as well as a hybrid CNN-RNN model, can be developed in
detecting the spam reviews more efficiently.

26
CHAPTER 6
REFERENCES

[1] . Ajay Rastogi , Monica Mehrotra , Syed Shafat Ali . Effective Opinion Spam Detection: A Study on
Review Metadata Versus Content . Journal of Data and Information Science, vol. 5, no. 2, 2020, pp. 76–110 .

[2] . Tiago A. Almeida , Jurandy Almeida , Akebo Yamakami . Spam filtering: how the dimensionality reduction
affects the accuracy of Naive Bayes classifiers . J Internet Serv Appl (2011) 1: 183–200,

[3] . Shirin Noekhah , Naomie binti Salim , Nor Hawaniah Zakaria . Opinion spam detection: Using multi-iterative
graph-based model Information Processing and Management 57 (2020) 102140

[4] . Kasem Khalil , Omar Eldash , Ashok Kumar , Magdy Bayoumi . Economic LSTM Approach for Recurrent Neural
Networks . IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 66, NO. 11,
NOVEMBER 2019

[5] . Mehul Bhundiya , Maulik Trivedi . An Efficient Spam Review Detection Using Active Deep Learning Classifiers
2021 International Conference on Artificial Intelligence and Machine Vision (AIMV)

[6] . G . M. Shahariar , Swapnil Biswas , Faiza Omar , Faisal Muhammad Shah , Samiha Binte Hassan. Spam Review
Detection using Deep Learning IEEE Xplore ,2019 .

[7] . Avinash Chandra Pandey , Dharmveer Singh Rajpoot . Spam review detection using Spiral cuckoo search
clustering method . Evolutionary Intelligence (2019) 12:147–164

[8] . Yuliya Kontsewayaa , Evgeniy Antonova , Alexey Artamonov . Evaluating the Effectiveness of Machine
Learning Methods for Spam Detection . The scientific committee of the 2020 Annual International Conference on
Brain-Inspired Cognitive Architectures for Artificial Intelligence: Eleventh Annual Meeting of the BICA Society
10.1016/j.procs.2021.06.056 .

[9] . Radwa M.K. Saeed, Sherine Rady , Tarek F. Gharib . An ensemble approach for spam detection in Arabic opinion
texts. R.M.K. Saeed et al. / Journal of King Saud University – Computer and Information Sciences 34 (2022) 1407–
1416 .

[10] . Draško Radovanović, Božo Krstajić . Review Spam Detection Using Machine learning. 2018 23rd International
Scientific-Professional Conference on Information Technology (IT) .

[11] . Dima Suleiman , Ghazi Al , Naymat . Fake Review Detection using machine learning Techniques. Machine
Learning with Applications 4 (2021) 100032

[12] . Junaed Younus Khan, Md. Tawkat Islam Khondaker, Sadia Afroz , Gias Uddin , Anindya Iqbal . Online
Products Fake Reviews Detection System Using Machine Learning. Dima Suleiman et al. / Procedia Computer Science
113 (2017) 154–161.

[13] . Naveed Hussain, Hamid Turab Mirza , Ibrar Hussain ,Faiza Iqbal , Imran Memon. Spam Review Detection
Using the Linguistic and Spammer Behavioral Methods. N. Hussain et al.: SRD Using the Linguistic and Spammer
Behavioral Methods .

[14] . Elshrif Elmurngi , Abdelouahed Gherbi . An empirical study on detecting fake reviews using machine learning
techniques. The Seventh International Conference on Innovative Computing Technology (INTECH 2017) .

[15] . Chuhao Yao , Jiahong Wang , Eiichiro Kodama . A Spam Review Detection Method by Verifying Consistency
Among Multiple Review Sites. 2019 IEEE 21st International Conference on High Performance Computing and
Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data
Science and Systems.

27
[16] J. Li, M. Ott, C. Cardie, E. Hovy, Towards a general rule for identifyingdeceptive opinion spam, in: Proceedings of
the 52nd Annual Meeting ofthe Association for Computational Linguistics (Volume 1: Long Papers),2014, pp. 1566–
1576.

[17] N. Jindal, B. Liu, Opinion spam and analysis, in: Proceedings of the 2008International Conference on Web Search
and Data Mining, 2008, pp.219–230.

[18] S. Kennedy, N. Walsh, K. Sloka, A. McCarren, J. Foster, Fact or factitious?Contextualized opinion spam
detection, in: Proceedings of the 57th AnnualMeeting of the Association for Computational Linguistics: Student
ResearchWorkshop, 2019, pp. 344–350.

[19] Y.-R. Chen, H.-H. Chen, Opinion spam detection in web forum: a real casestudy, in: Proceedings of the 24th
International Conference on World WideWeb, 2015, pp. 173–183.

[20] Z. Wang, R. Hu, Q. Chen, P. Gao, X. Xu, Collueagle: Collusive reviewspammer detection using markov random
fields, 2019, arXiv preprintarXiv:1911.01690.

28
29
30
31
32
33
34
35
36
37
CHAPTER-8

Originality Report

38

You might also like