
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/364956956

Classification and Analysis of Fake Product Review using Ai

Article · September 2022


All content following this page was uploaded by Prasad Joghee on 01 November 2022.



TELEMATIQUE Volume 21 Issue 1, 2022
ISSN: 1856-4194 4970 – 4977

Classification and Analysis of Fake Product Review using Ai

M. Kasiselvanathan1*, J. Dhanasekar2, J. Prasad3


1* Assistant Professor, Department of Electronics and Communication Engineering,
Sri Ramakrishna Engineering College, Coimbatore, India
2 Assistant Professor, Department of Electronics and Communication Engineering,
Sri Eshwar College of Engineering, Coimbatore, India
3 Assistant Professor, Department of Electronics and Communication Engineering,
KPR Institute of Engineering and Technology, Coimbatore, India

Email id: 1kasiselvanathan@gmail.com, 2dhanasekar.j@sece.ac.in, 3prdece@gmail.com

Received 24/08/2022; Accepted 15/09/2022

Abstract:
In today's e-commerce, recommender systems play an essential part in decision-making.
Customers, for example, check product or store reviews before deciding what to purchase,
where to buy it, and whether or not to buy it. Because there is a monetary reward in
submitting fake or fraudulent reviews, there has been a large surge in hard-to-detect opinion
spam on online review platforms. In essence, an untruthful review is a phoney, fraudulent, or
opinion spam review. Positive ratings of a specific item can attract more consumers and improve
sales, while bad evaluations can reduce demand and sales. In recent years, fake review detection
has received a lot of attention. However, most review sites still do not openly screen bogus
reviews; Yelp has been an exception over the past few years. The detection of phoney internet
reviews has become a prominent research topic as a result of the increasing use of fake reviews.
Despite earlier research efforts to detect phoney reviews, the concerns of imbalanced data and
feature trimming remain unaddressed. This paper presents an ensemble approach for
detecting false online reviews to fill these gaps.
Keywords: Classification, Dataset, Logistic function, Machine Learning, Fake Review

1. INTRODUCTION

Today, with the development of e-commerce, online shopping is becoming more and more
common [1-3]. Researchers have shown that online reviews significantly influence consumer
purchase decisions and product sales [4]. Unfortunately, some sellers or consumers
manipulate product reviews by writing fake reviews designed to mislead consumers in
their purchasing decisions. Studies show that there are many fake reviews on the internet.
For example, one study estimated that 16% of restaurant reviews on Yelp (one of the most
popular review sites in the US) are spam [5-6]. The proliferation of fake reviews is a serious
problem because it distorts consumers' purchasing decisions and their outcomes [7]. It is also
a major obstacle to the sustainable development of online rating systems [8-9].
Some websites allow customers to report reviews that are suspected to be fake [10]. However,
some fake reviews are carefully written and look real, making it difficult for consumers to
spot them [11]. Since fake reviews are difficult to detect manually, finding an automatic
detection method is a major goal of related research. Among the different types of fake review
detection methods, machine learning techniques have been widely used. However, some issues
remain unresolved [12-14].

LITERATURE REVIEW
A. E-Commerce: Purchasing and Selling Online
E-commerce plays an important role worldwide in how products are bought and sold. Because
many options exist for any given product, and because the procedure differs from purchasing a
product in person, online sites need to recognize false reviews, which are a form of fraud.
Each product cannot be verified individually by hand, so a program attempts to classify each
review submitted by customers instead of checking the reviews manually.

B. Implementation of fake product review monitoring system


Mupparam Sowjanya (2020) notes that many people need accurate information about products
online. Before spending money on a particular product, they check the various reviews on the
site, but they cannot decide whether a review is fake or real. In general, some of the reviews
on a site are favorable because the company's own staff add them to promote their products;
these staff are typically part of the company's social media team. Online shoppers cannot
identify such fake reviews among the reviews on the site. That study used an SVM classifier to
detect fake ratings based on IP addresses, which helps users find the right rating for online
products. The authors report an accuracy of 98.79% and a 10% increase in F1 score.

C. Implementation of fake product review monitoring system


Aiswarya K (2021) notes that, in today's environment, data on the web is growing
exponentially. Social networks generate large amounts of data such as reviews, comments and
ratings on a daily basis. This huge amount of user data is of little value unless it is
processed. Because there are so many fake reviews, any analysis of the data needs to detect
spam in order to produce genuine recommendations. Today, many people rely on opinions from
social media platforms when deciding whether to purchase products or services. Opinion spam
is a deep and complex problem, with many false and fake opinions created by organizations and
individuals for a variety of purposes. They write fake reviews to mislead readers or automatic
detection systems, promoting or demoting specific products in order to influence their
reputation. The proposed method combines an ontology, tracking of location and IP address,
and a spam keyword list classified using naive Bayes, together with further detection and
control steps.

D. Intelligent Interface for Fake Product Review Monitoring


Ata-Ur-Rehman (2019) observes that, as online stores become more popular every day, people
are increasingly interested in buying the products they need online. Such a purchase does not
take much of the customer's time: customers enter the online store, search for the items
they need and place an order. However, a common difficulty when buying from online stores is
that the products may be of poor quality. Customers decide to place an order simply by
viewing reviews and reading comments about a particular product. These opinions of others
are a source of reassurance for new buyers. A single negative review can change a customer's
mind about buying a product, and that review may be fake. To filter out such fake reviews and
give users genuine product reviews and ratings, the authors provide an intelligent interface
that takes a product link (URL) from Amazon, Flipkart or Daraz and analyzes the reviews to
give customers original ratings. The unique feature of the proposed system is that it
analyzes not only reviews written in English, but also reviews written in Urdu and Roman Urdu,
by linking with three e-commerce sites. Previous work on fake reviews does not support
languages such as Urdu and Roman Urdu and cannot handle reviews from multiple e-commerce
sites. Using an intelligent learning method, the proposed system is reported to detect fake
reviews written in English with an accuracy of 87%, which is higher than that of the existing
system.

E. Fake Review Monitoring System


Mohan Rao.NSC (2020) observes that most customers buy products based on product reviews.
Customers go through the ratings or reviews of a product, but while reading them they may
not be able to tell whether a review is real or fake. Some companies post their own reviews
to boost demand for their products and their company rating. To resolve this problem and
detect fake reviews on a website, the “Fake Review Monitoring” system is introduced. This
system verifies reviews by reference to the IP address and then separates them into spam and
non-spam reviews.

2. METHODOLOGY

The following figure shows the block diagram of the overall process. Data are taken from the
dataset and pre-processed, then supplied to the machine learning model. The trained classifier
then classifies the test samples to produce the desired result.

Figure 1. Block Diagram

F. Dataset
A dataset is a collection of data arranged in a specific order. Datasets can contain anything
from arrays to database tables. Tabular datasets can be thought of as database tables or
arrays, where each column corresponds to a specific variable and each row corresponds to a
record in the dataset. The most widely supported file format for tabular datasets is the
comma-separated values (CSV) file. However, JSON files can store "tree-like data" more
efficiently.
Types of data:
● Numerical data: such as rating.
● Categorical data: such as Yes/No, True/False.
● Ordinal data: similar to categorical data, but the values can be ordered by comparison.
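
As a small illustration of the tabular layout described above, the following sketch parses a CSV snippet in which each row is one review record and each column is one variable. The column names and values are made up for illustration and are not taken from the dataset used in this paper.

```python
import csv
import io

# Hypothetical CSV content: one row per review, one column per variable.
CSV_TEXT = """rating,verified_purchase,satisfaction,review_text
5,Y,high,Great phone and fast delivery
1,N,low,Stopped working after two days
3,Y,medium,Average quality for the price
"""

def load_reviews(csv_text):
    """Parse CSV text into a list of dicts, converting the numeric column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["rating"] = int(row["rating"])  # numerical data
    return rows

# An explicit order is what makes the ordinal column comparable.
SATISFACTION_ORDER = {"low": 0, "medium": 1, "high": 2}

reviews = load_reviews(CSV_TEXT)
ranked = sorted(reviews, key=lambda r: SATISFACTION_ORDER[r["satisfaction"]])
print([r["satisfaction"] for r in ranked])
```

Here `rating` is numerical, `verified_purchase` is categorical, and `satisfaction` is ordinal because its values only become sortable once an order is defined.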


G. Data Pre-Processing
Data preprocessing is the process of preparing raw data and making it suitable for machine
learning models. It is the first and a crucial step in building a machine learning model.
When developing a machine learning project, the data are rarely clean and well organized, so
organizing and formatting the data are essential.

H. Data Pre-Processing With NLTK


The Natural Language Toolkit, or NLTK, is a set of symbolic and statistical natural
language processing libraries and programs for the English language, written in the Python
programming language. It includes common algorithms such as tokenization, part-of-speech
tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition, some
of which we use in the proposed model.

i) Lower Casing
Lowercasing is an important text preprocessing step that makes variants such as America,
america, and AMERICA equivalent by converting the text to a single case, preferably all
lowercase. This is useful for frequency-based text featurization methods such as TF-IDF,
because it prevents the same word in different cases from being counted as different terms.

ii) Tokenization
Tokenization is the process of dividing text into parts called tokens. A body of text can be
converted to sentence, word, or character tokens. This is a requirement for many NLP
operations, so we usually convert text to word tokens during preprocessing.

iii) Removing Stop words


Stop words are trivial words like "I", "me" and "you" that appear frequently in texts, so they
can distort many NLP operations without adding much valuable information. Therefore, as part
of preprocessing, stop words should almost always be removed from the corpus.

iv) Removal of Tags


When scraping data from other websites, removing HTML tags becomes an important step as
part of preprocessing.
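
The four steps above can be sketched in a few lines. To keep the example self-contained it uses only the Python standard library rather than NLTK: the regular expressions and the tiny stop-word set below are illustrative assumptions, not NLTK's tokenizer or its stop-word corpus.

```python
import re

# Small illustrative stop-word set (an assumption; NLTK ships a much larger list).
STOP_WORDS = {"i", "me", "you", "the", "a", "an", "is", "it", "this", "and", "to"}

TAG_RE = re.compile(r"<[^>]+>")    # crude HTML tag matcher
TOKEN_RE = re.compile(r"[a-z]+")   # runs of lowercase letters only

def preprocess(text):
    """Strip tags, lowercase, tokenize into words, and drop stop words."""
    text = TAG_RE.sub(" ", text)       # iv) removal of tags
    text = text.lower()                # i) lower casing
    tokens = TOKEN_RE.findall(text)    # ii) tokenization into word tokens
    return [t for t in tokens if t not in STOP_WORDS]  # iii) stop-word removal

print(preprocess("<p>This phone is GREAT and I love it!</p>"))
```

Tags are stripped first so that tag names such as `p` do not survive as tokens; the order of the remaining steps is a design choice of this sketch.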

I. Logistic Regression
Logistic regression is one of the most popular supervised machine learning algorithms.
In this technique, the logistic function is used to model the probabilities of the possible
outcomes of a single trial. Logistic regression was designed for classification and is
particularly useful for understanding how several independent variables influence a single
outcome variable. Linear regression is used for regression problems, whereas logistic
regression is used for classification problems. The main limitations of the algorithm are that
it only works when the predicted variable is binary, it assumes that all predictors are
independent of each other, and it assumes that the data have no missing values.
Logistic regression can be used to classify observations using different types of data and makes
it easy to determine the most effective variables to use for classification. Figure 2 shows
the logistic function.


Figure 2. Logistic Function Graph

J. Logistic Function (Sigmoid Function)


A sigmoid function is a mathematical function used to map predicted values to
probabilities. It maps any real value to a value in the range 0 to 1. The output of logistic
regression must lie between 0 and 1 and cannot exceed these limits, so it forms an S-shaped
curve. This S-shaped curve is called the sigmoid function or logistic function. Logistic
regression uses a threshold to turn the predicted probability into a class label of 0 or 1.
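
A minimal sketch of the logistic function and the thresholding step is given here. The threshold of 0.5 is a common default and is assumed for illustration; the paper does not state which threshold was used.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use an equivalent form that avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)

def predict_label(z, threshold=0.5):
    """Turn the predicted probability into a class label (1 or 0)."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 4), predict_label(z))
```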

K. Logistic Regression Equation


Logistic regression equations can be derived from linear regression equations. The
mathematical steps to obtain the logistic regression equation are:

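Equation of the straight line:

\[ y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \]

In logistic regression y can only be between 0 and 1, so we form the ratio of y to (1 - y),
which is 0 for y = 0 and tends to infinity as y approaches 1:

\[ \frac{y}{1-y} \]

But we need a range between -\infty and +\infty, so we take the logarithm of this ratio, and
the equation becomes:

\[ \log\left[\frac{y}{1-y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n \]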
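
The log-odds transform log(y / (1 - y)) maps a probability between 0 and 1 onto the whole real line, and the sigmoid function maps it back. The following numeric check illustrates this for a single feature x; the coefficients `b0` and `b1` are hypothetical values chosen for the example, not fitted to any data.

```python
import math

def logit(y):
    """Log-odds of a probability y, defined for 0 < y < 1."""
    return math.log(y / (1.0 - y))

def sigmoid(z):
    """Inverse of logit: maps a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for a single feature x (not fitted to any data).
b0, b1 = -1.0, 2.0

for x in (0.0, 0.5, 2.0):
    z = b0 + b1 * x      # linear part: can be any real number
    y = sigmoid(z)       # probability between 0 and 1
    print(x, round(y, 4), round(logit(y), 4))  # logit(y) recovers z
```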

L. Procedure involved

The first step is to gather a labeled dataset of reviews. Various datasets that were used in
past research are available online, but finding a labeled dataset of reviews was a difficult
task. Fortunately, we were able to find a labeled dataset on Kaggle provided by Amazon. The
dataset contains a total of 21,000 reviews, of which half are fake reviews and half are
genuine reviews.
After finding the dataset, the next stage is preprocessing the data. The dataset cannot be
used directly to train the classifier model because the model cannot handle raw text.
Preprocessing includes removing stop words and punctuation, stemming, lemmatization, and so
on. Preprocessing is discussed in detail in the following section.
In the feature extraction stage, Tf-idf (Term Frequency-Inverse Document Frequency) and
CountVectorizer were used. Tf-idf increases the weight of uncommon words and decreases the
weight of common words, and produces a vector of features. CountVectorizer builds a vector of
features based on the frequency of the words in the review. Our experiments show that
CountVectorizer outperforms Tf-idf, which may be because the dataset has reviews of many
different products, resulting in many uncommon words.
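
To illustrate how Tf-idf raises the weight of uncommon words relative to common ones, here is a minimal plain-Python sketch. It uses raw term counts and a smoothed IDF, which is one common variant; the paper does not state which variant was used, and the toy reviews are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses raw term counts for TF and a smoothed IDF,
    idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is one common variant.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

# Toy tokenized reviews (made up for illustration).
docs = [
    ["good", "phone", "good", "battery"],
    ["good", "charger"],
    ["good", "phone", "case"],
]
weights = tfidf(docs)
print(weights[1])
```

In the second review, "charger" appears in only one document and therefore receives a higher weight than "good", which appears in all three.
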
Following the feature extraction stage, we trained our logistic regression model on the
set of features selected in the feature extraction stage. Training the model alone is
not sufficient to classify the reviews well. For better accuracy we need to tune the
hyperparameters of the model, viz. C, penalty, solver, and so on. The process of
hyperparameter tuning is discussed in the Results and Discussion section.
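
The role of the hyperparameter C can be illustrated with a small, self-contained sketch. This is not the sklearn solver and not the configuration used in the paper: it is plain batch gradient descent with an L2 penalty on made-up toy data, where C is the inverse regularization strength (as in sklearn's LogisticRegression). The feature names, learning rate, epoch count and candidate C values are all assumptions.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logreg(X, y, C=1.0, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent with an L2 penalty.

    C is the inverse regularization strength (larger C = weaker penalty).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            # Mean log-loss gradient plus the gradient of ||w||^2 / (2 * C * n).
            w[j] -= lr * (grad_w[j] / n + w[j] / (C * n))
        b -= lr * grad_b / n   # the bias is not penalized
    return w, b

def predict(X, w, b):
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]

# Toy, already-scaled features: [rating deviation, review length] (made up).
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.6]]
y = [1, 1, 1, 0, 0, 0]   # 1 = fake, 0 = genuine

for C in (0.1, 1.0, 10.0):   # a tiny manual search over C
    w, b = train_logreg(X, y, C=C)
    acc = sum(p == t for p, t in zip(predict(X, w, b), y)) / len(y)
    print(C, [round(v, 3) for v in w], acc)
```

In practice the candidate values would be compared on held-out validation data rather than on the training set, as done here only to keep the example short.
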
After the model is trained and tuned, it needs to be evaluated to understand its
performance. The model is evaluated by testing it on unseen test data and calculating the
accuracy, precision and recall. Results are discussed in the Results and Discussion
section.

M. Data Pre-Processing and Feature Extraction


A variety of features have been proposed and used separately by supervised approaches
to identify fake reviews. We used two methods to find fake reviews.

i) Rating
Customers can rate the product from 1 to 5 stars, expressing satisfaction or dissatisfaction
with the product. This feature can be used to check that the review text and the rating given
by the reviewer point in the same direction and do not contradict each other. Ratings
attached to fake reviews usually deviate from the average rating of the product, which helps
in classifying the fake reviews.
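
The observation that ratings of fake reviews tend to deviate from the product average can be turned into a simple numeric signal. The sketch below computes the absolute deviation of each rating from its product's mean rating; this derived feature and the sample data are illustrative assumptions, since the paper itself feeds the rating to the model.

```python
from collections import defaultdict

def rating_deviation(reviews):
    """Return |rating - product average| for each (product_id, rating) pair."""
    totals = defaultdict(lambda: [0, 0])   # product_id -> [sum, count]
    for product_id, rating in reviews:
        totals[product_id][0] += rating
        totals[product_id][1] += 1
    return [abs(rating - totals[pid][0] / totals[pid][1])
            for pid, rating in reviews]

# Hypothetical (product_id, star rating) pairs.
reviews = [("p1", 5), ("p1", 5), ("p1", 4), ("p1", 1), ("p2", 3), ("p2", 3)]
print(rating_deviation(reviews))
```

The 1-star review of product `p1` stands out with the largest deviation from that product's average of 3.75.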

ii) Review Classification


The type of a review can be determined by examining the review text or comments on the
product. After stop words are removed and the comments are pre-processed, the common words in
good reviews and comments are identified, and then the bad reviews and comments are taken
into account.
Good products will have more good comments than fake reviews, so the common words in good
comments are extracted and fed to the model, which helps the model distinguish good products
from bad products. The review text must be processed before it is passed to the model. The
first processing step removes all characters and expressions other than letters, because the
model cannot make use of punctuation and similar expressions.
The review text is then split into a list of words, and each word is converted to its base
form by stemming, followed by removal of stop words. Stop words are frequently occurring
words that are useful grammatically and syntactically but add little value to the model. A
corpus of the remaining words is created.
The CountVectorizer class provided by sklearn in Python is used to represent the corpus as a
sparse matrix in which each word is a column and each review is a row, keeping the 1400 most
frequent words in the corpus. This sparse matrix of the 1400 most frequent words is used as a
feature vector for the model, together with the verified purchase flag, rating and review
length of the product.
Similarly, we also create a feature vector of 1400 words using Tf-idf (Term
Frequency-Inverse Document Frequency).
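
The construction of the feature vector can be sketched in plain Python as follows. This is a simplified stand-in for sklearn's CountVectorizer (it builds dense lists rather than a sparse matrix), the toy reviews are made up, and measuring review length in tokens is an assumption, since the paper does not say how length was measured.

```python
from collections import Counter

def build_vocabulary(tokenized_reviews, max_features):
    """Keep the max_features most frequent words across the corpus."""
    counts = Counter(w for review in tokenized_reviews for w in review)
    # Sort by descending count, then alphabetically so ties are deterministic.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    words = sorted(word for word, _ in top)
    return {word: col for col, word in enumerate(words)}

def featurize(tokens, vocab, verified, rating):
    """Word counts over the vocabulary, followed by three extra features."""
    row = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            row[vocab[tok]] += 1
    return row + [1 if verified else 0, rating, len(tokens)]

reviews = [["great", "phone", "great", "price"], ["bad", "phone"], ["great", "case"]]
vocab = build_vocabulary(reviews, max_features=3)   # the paper uses 1400
print(vocab)
print(featurize(reviews[0], vocab, verified=True, rating=5))
```

With `max_features=1400` the same construction would yield the 1403-dimensional vector described in this paper: 1400 word counts plus verified purchase, rating and review length.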

3. RESULTS AND DISCUSSION


This section describes our experiments. The dataset cannot be passed directly to the model
for training. First, important feature columns such as rating and verified purchase were
extracted. Then, categorical features were converted into numerical features using
LabelEncoder. Text features were extracted using CountVectorizer and Tf-idf
(max_features = 1400). This was followed by computing the length of each review. Finally, a
feature vector of size 1403 (1400 features from the review text, plus rating, verified
purchase, and review length) was generated. An 80-20% train-test split was performed on the
data. The data were then scaled using the MinMaxScaler class from the sklearn library.
MinMaxScaler transforms features by scaling each feature to a given range. This range can be
set with the feature_range parameter (default (0, 1)).
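
The encoding, splitting and scaling steps can be sketched without sklearn as follows. The helper functions are simplified stand-ins for LabelEncoder and MinMaxScaler, the sample values are made up, and computing the minima and maxima on the training rows only is an assumption about how the scaler was fitted.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

def min_max_scale(train_rows, test_rows, feature_range=(0.0, 1.0)):
    """Scale each column using minima and maxima computed on the training rows."""
    lo, hi = feature_range
    cols = list(zip(*train_rows))
    mins = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by zero

    def scale(rows):
        return [[lo + (v - m) / s * (hi - lo) for v, m, s in zip(r, mins, spans)]
                for r in rows]

    return scale(train_rows), scale(test_rows)

# Hypothetical columns: verified purchase flag, star rating, review length.
verified, classes = label_encode(["Y", "N", "Y", "Y", "N"])
ratings = [5, 1, 4, 3, 2]
lengths = [120, 40, 80, 60, 200]
rows = [list(r) for r in zip(verified, ratings, lengths)]

split = len(rows) * 80 // 100            # 80-20 split (no shuffling here)
train, test = rows[:split], rows[split:]
train_scaled, test_scaled = min_max_scale(train, test)
print(classes, train_scaled[0], test_scaled[0])
```

Because the scaling parameters come from the training rows, a test value can fall outside the target range, as the review length of 200 does here.
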
A logistic regression model was trained on the training data for both CountVectorizer and
Tf-idf features using LogisticRegression in sklearn.
The last phase of the experiment is the evaluation of the model to understand its performance.
Binary classification involves classifying the data into two groups, e.g. yes/no, true/false,
fake/genuine. Target variables in such problems are not continuous; instead, the model
predicts the probability of each class. Such models are evaluated using a confusion
matrix. Using the confusion matrix, we calculated accuracy, precision and recall for
logistic regression with Tf-idf, both with and without the “verified purchase” feature. The
results reported are those obtained with Tf-idf.
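
For reference, accuracy, precision and recall follow directly from the four cells of the confusion matrix. The sketch below shows the calculation on made-up labels, with 1 standing for fake and 0 for genuine; the numbers are illustrative and are not results from this paper.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) counts for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up labels: 1 = fake, 0 = genuine.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
counts = confusion_matrix(y_true, y_pred)
print(counts, metrics(*counts))
```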

4. CONCLUSION

Recommender systems play an essential part in decision-making in e-commerce. An
untruthful review is a phoney, fraudulent, or opinion spam review. Positive ratings of a specific
item can attract more consumers and improve sales, while bad evaluations can reduce demand
and sales. Fake review detection has received a lot of attention. In this study, the proposed
approach detects false online reviews in order to minimize fraudulent reviews. This will help
customers who check product or store reviews before deciding what to purchase, where to
buy it, and whether or not to buy it.

5. REFERENCES

[1]. Sifat Ahmed, Faisal Muhammad Shah, “Using Boosting Approaches to Detect Spam
Reviews”, 2019 1st International Conference on Advances in Science, Engineering and
Robotics Technology (ICASERT), 2019.
[2]. Faisal Khurshid, Yan Zhu, Zhuang Xu, Mushtaq Ahmad, Muqeet Ahmad, “Enactment
of Ensemble learning for Review Spam Detection on Selected Features”, International
Journal of Computational Intelligence System, Vol 12(1), pp. 387-394, ISSN:
1875-6891, 2019.
[3]. Faliang Huang, Guoqing Xie, Ruliang Xiao, “Research on Ensemble Learning”,
International Conference on Artificial Intelligence and Computational Intelligence,
2009.
[4]. Brian Heredia, Taghi M. Khoshgoftaar, Joseph D. Prusa, Michael Crawford,
“Improving detection of untrustworthy online reviews using ensemble learners
combined with feature selection”, Springer, Article number: 37, 2017.
[5]. Ioannis Dematis, Eirini Karapistoli, Athena Vakali, “Fake Review Detection via
Exploitation of Spam Indicators and Reviewer Behavior Characteristics”. Springer
International Publishing AG pp.581-595, 2018.


[6]. Anna V. Sandifer, Casey Wilson, Aspen Olmsted, “Detection of fake online hotel
reviews”, IEEE Internet Technology and Secured Transactions (ICITST), International
Conference, 2017.
[7]. Xinyue Wang, Xianguo Zhang, Chengzhi Jiang, Haihang Liu, “Identification of Fake
Reviews Using Semantic and Behavioral Features”, 2018 4th IEEE International
Conference on Information Management, 2018.
[8]. Amitkumar B. Jadhav, Vijay U. Rathod and Dr. Hemantkumar B. Jadhav, “Improving
Performance of Fake Reviews Detection in Online Review's using Semi-Supervised
Learning”. International Research Journal of Engineering and Technology (IRJET),
Volume: 06 Issue: 06 | June 2019.
[9]. Min-Yuh, Chih-Chien Wang, Chien-Chang Chen and Shao-Chieh Yang, “Exploring
Review Spammers by Review Similarity: A Case of Fake Review in Taiwan”,
Proceedings of the Third International Conference on Electronics and Software Science
(ICESS2017), Takamatsu, Japan, 2017.
[10]. Fathima Nada, Bariya Firdous Khan, Aroofa Maryam, Nooruz-Zuha, Zameer Ahmed.
“Fake news detection using Logistic Regression” International Research Journal of
Engineering and Technology (2019): pISSN: 2395-0072
[11]. Pankaj Chaudhary, Abhimanyu Tyagi, Santosh Mishra. “Fake Review Detection
through Supervised Classification.” Indian Conference on Recent Innovations in
Emerging Technology & Science (2018):ISSN: 2320-2882
[12]. Muhammad Syahmi Mokhtar, Yusmadi Yah Jusoh, Novia Admodisastro, Noraini Che
Pa, Amru Yusrin Amruddin. “Fakebuster: Fake News Detection System Using Logistic
Regression Technique in Machine Learning.” International Journal of Engineering and
Advanced Technology (2019): ISSN: 2249 – 8958.
[13]. Huaxun Deng, Linfeng Zhao, Ning Luo, Yuan Liu, Guibing Guo, Xingwei Wang,
Zhenhua Tan, Shuang Wang and Fucai Zhou, “Semi-supervised Learning based Fake
Review Detection”, 2017 IEEE International Symposium on Parallel and Distributed
Processing with Applications and 2017 IEEE International Conference on Ubiquitous
Computing and Communications (ISPA/IUCC), 2017.
[14]. SP. Rajamohana, Dr. K. Umamaheswari, M. Dharani and R. Vedackshya, “A Survey
On Online Review Spam Detection Techniques”, IEEE International Conference on
Innovations in Green Energy and Healthcare Technologies (ICIGEHT’17), 2017.
