Classification and Analysis of Fake Product Review Using AI
3 authors, including:
Prasad Joghee
KPR INSTITUTE OF ENGINEERING AND TECHNOLOGY
All content following this page was uploaded by Prasad Joghee on 01 November 2022.
Abstract:
In today's e-commerce, recommender systems play an essential part in decision-making.
Customers, for example, check product or store reviews before deciding what to
purchase, where to buy it, and whether or not to buy it. Because there is a monetary reward in
submitting fake or fraudulent reviews, there has been a large surge in deceptive opinion spam on
online review platforms. In essence, an untruthful review is a fake, fraudulent, or opinion
spam review. Positive ratings of a specific item can attract more consumers and improve
sales; negative evaluations might reduce demand and sales. In recent years, fake review detection
has received a lot of attention. However, most review sites still do not openly screen fake
reviews; Yelp has been an exception over the past few years. The detection of fake online
reviews has become a prominent research topic as a result of the increasing use of fake reviews.
Despite the efforts of earlier research to detect fake reviews, the concerns of imbalanced data and
feature pruning remain unaddressed. This paper presents an ensemble approach for
detecting fake online reviews to fill these gaps.
Keywords: Classification, Dataset, Logistic function, Machine Learning, Fake Review
1. INTRODUCTION
Today, with the development of e-commerce, online shopping is becoming more and more
common [1-3]. Researchers have shown that online reviews significantly influence consumer
purchase decisions and product sales [4]. Unfortunately, some sellers or consumers
manipulate product reviews by writing fake reviews designed to mislead consumers in their
purchasing decisions. Studies show that there are many fake reviews on the internet.
For example, one study estimated that 16% of restaurant reviews on Yelp (one of the most
popular review sites in the US) are spam [5-6]. The proliferation of fake reviews is a serious
problem because it distorts consumers' purchasing decisions and their outcomes [7]. It is also a major
detriment to the sustainable development of online rating systems [8-9].
Some websites allow customers to report reviews that are suspected to be fake [10]. However,
some fake reviews are carefully written and look real, making it difficult for consumers to spot them
[11]. Since fake reviews are difficult to detect manually, finding an automatic detection method
is a major goal of related research. Among the different types of fake review detection methods,
TELEMATIQUE Volume 21 Issue 1, 2022
ISSN: 1856-4194 4970 – 4977
machine learning techniques have been widely used. However, some issues remain unresolved
[12-14].
LITERATURE REVIEW
A. E-Commerce: Purchasing and Selling Online
Online marketplaces worldwide play an important role in the sale of products. Because there are
many options for each product, and because buying online differs from the procedure followed
when purchasing a product in person, an online site needs to recognize false reviews, which are
a form of fraud. This matters because the customer cannot check each product directly. The
program therefore tries to identify each sample among the reviews and ratings submitted by
customers, rather than having them checked manually.
proposed system is that it analyzes not only reviews written in English, but also reviews written
in Urdu and Roman Urdu by linking with three e-commerce sites. Previous works on fake
reviews do not support analysis of reviews written in languages such as Urdu and Roman Urdu,
and cannot handle reviews from multiple e-commerce sites. The proposed work reports an accuracy
of 87% in detecting fake reviews written in English using a machine learning method, an
improvement over the existing system.
2. METHODOLOGY
The following figure shows the block diagram of the process. Data is taken from the dataset
and pre-processed, the pre-processed data is used to train the machine learning model, and the
resulting classifier then classifies the test sample data to produce the desired result.
F. Dataset
A dataset is a collection of data arranged in a specific order. Datasets can
contain anything from arrays to database tables. A tabular dataset can be thought of as a database
table or matrix, where each column corresponds to a specific variable and each row
corresponds to a record in the dataset. The most widely supported file format for tabular datasets is
the comma-separated values (CSV) file. JSON files, however, are better suited to storing
"tree-like data".
Types of data:
● Numerical data: such as a rating.
● Categorical data: such as Yes/No or True/False.
● Ordinal data: similar to categorical data, but the values have a meaningful order and can
be compared with one another.
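As a minimal sketch of the three data types above, the following Python snippet parses a small CSV table into typed records. The column names, the sample rows, and the `SIZE_ORDER` mapping are hypothetical and chosen only for illustration; they are not the schema of the dataset used in this work.

```python
import csv
import io

# Hypothetical sample rows; column names are illustrative only.
RAW = """rating,verified_purchase,size,review_text
5,Y,large,Great product
1,N,small,Terrible quality
4,Y,medium,Works as expected
"""

# Ordinal values carry an order, so they map to ranks rather than arbitrary ids.
SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}


def load_rows(text):
    """Parse CSV text into typed records: one dict per row, one key per column."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "rating": int(rec["rating"]),                          # numerical
            "verified_purchase": rec["verified_purchase"] == "Y",  # categorical
            "size": SIZE_ORDER[rec["size"]],                       # ordinal
            "review_text": rec["review_text"],                     # free text
        })
    return rows


rows = load_rows(RAW)
```

Each row becomes one record and each column one variable, which matches the tabular layout described above.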
G. Data Pre-Processing
Data preprocessing is the process of preparing raw data and making it suitable for machine
learning models. It is the first and a crucial step in building a machine learning model.
When developing a machine learning project, you do not always come across clean and
well-organized data, so cleaning and formatting the data before working with it is essential.
i) Lower Casing
Lower casing is an important text preprocessing step that makes variants such as america,
America, and AMERICA equivalent by converting the text to a single case, preferably all
lowercase. This is useful for frequency-based text featurization methods such as TF-IDF,
because it prevents the same word from being counted separately in different cases.
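The effect of lower casing on term counts can be illustrated with a short Python sketch. The helper `term_counts` is a hypothetical name introduced here, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter


def term_counts(text, lowercase=True):
    """Count whitespace-separated terms, optionally lower-casing the text first."""
    if lowercase:
        text = text.lower()
    return Counter(text.split())


sample = "america America AMERICA"
raw_counts = term_counts(sample, lowercase=False)  # three distinct terms
folded_counts = term_counts(sample)                # one term counted three times
```

Without lower casing the three spellings are treated as three different terms; after lower casing they collapse into a single term, so any frequency-based weight is computed on the combined count.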
ii) Tokenization
Tokenization is the process of dividing text into parts called tokens. A body of text can be
split into sentence, word, or character tokens. This is a requirement for many NLP operations,
so we usually convert text to word tokens during preprocessing.
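A minimal sketch of the three tokenization levels, using only regular expressions from the Python standard library. The function names and the splitting rules are assumptions for illustration; dedicated NLP toolkits handle abbreviations, contractions, and other edge cases more carefully.

```python
import re


def sentence_tokens(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokens(text):
    """Lower-case the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def char_tokens(text):
    """Every character, including spaces and punctuation, is one token."""
    return list(text)


review = "Great phone. Battery lasts long!"
sentences = sentence_tokens(review)
words = word_tokens(review)
```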
I. Logistic Regression
Logistic regression is one of the most popular supervised machine learning algorithms.
In this technique, a logistic function is used to model the probabilities of the possible
outcomes of a single trial. Logistic regression is a classification method and is
particularly useful for understanding how several independent variables affect a single outcome
variable. Linear regression is used for regression problems, whereas logistic regression is used for
classification problems. The main drawbacks of the algorithm are that it works only when the
predicted variable is binary, it assumes that all predictors are independent of each other, and it
assumes that the data has no missing values.
Logistic regression can be used to classify observations using different types of data, and it makes
it easy to determine the most effective variables to use for classification. Figure 2 shows
the logistic function.
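For reference, the logistic function maps any real number to a value between 0 and 1. A minimal Python sketch follows; the 0.5 decision threshold in `predict_label` is a common default and is assumed here rather than taken from the text.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)


def predict_label(z, threshold=0.5):
    """Turn the probability into a class label (1 or 0); threshold is an assumption."""
    return 1 if logistic(z) >= threshold else 0
```

Large positive inputs give values close to 1, large negative inputs give values close to 0, and an input of 0 gives exactly 0.5.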
In logistic regression, y can lie only between 0 and 1, so we divide the above equation
by (1 - y):
However, we need a range from -∞ to +∞, so we take the logarithm of the equation, which
becomes:
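The equations referred to in this derivation are not reproduced in this copy. As a sketch, assuming the usual linear predictor with coefficients b_0, ..., b_n and features x_1, ..., x_n (these symbols are assumptions, not taken from the text), the steps can be written as:

```latex
\begin{align}
z &= b_0 + b_1 x_1 + \dots + b_n x_n,
  && z \in (-\infty, +\infty) \\
\frac{y}{1-y} &\in [0, +\infty)
  && \text{for } y \in [0, 1) \\
\log\!\left(\frac{y}{1-y}\right) &= z
  && \text{both sides now range over } (-\infty, +\infty) \\
y &= \frac{1}{1 + e^{-z}}
  && \text{solving for } y \text{ gives the logistic function}
\end{align}
```

Dividing y by (1 - y) gives the odds, which are 0 when y = 0 and grow without bound as y approaches 1. Taking the logarithm of the odds extends the range to the whole real line, so it can be equated with the linear predictor.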
L. Procedure involved
The first step is to gather a labeled dataset of reviews. Various datasets used in past
research are available online, but finding a labeled dataset of reviews was a difficult
task. Fortunately, we were able to find a labeled dataset on Kaggle provided by Amazon.
The dataset contains a total of 21,000 reviews, of which half are fake reviews and half are
genuine reviews.
After finding the dataset, the next stage involves preprocessing the data. The
dataset cannot be used directly to train the classifier model, as the model cannot handle
raw text data. Preprocessing includes removing stop words and punctuation,
stemming, lemmatization, and so on. Preprocessing is discussed in detail in the following
section.
In the feature extraction stage, Tf-idf (Term Frequency and Inverse Document Frequency) and
CountVectorizer were used. Tf-idf increases the weight of rare words and
decreases the weight of common words, and produces a vector of features.
CountVectorizer produces a vector of features based on the frequency of the words in the
review. Our analysis shows that CountVectorizer outperforms Tf-idf, which might be
because the dataset contains reviews of many different products, resulting in many unique
words.
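The difference between the two feature extractors can be sketched in plain Python. This is a textbook formulation (term frequency multiplied by log(N / document frequency)) and not the sklearn implementation; library versions typically add smoothing and normalisation, so their values will differ. The helper names and the three sample reviews are hypothetical.

```python
import math
from collections import Counter


def build_vocab(docs):
    """Sorted list of all distinct lower-cased words in the corpus."""
    return sorted({w for d in docs for w in d.lower().split()})


def count_vector(doc, vocab):
    """Raw word counts of one document, in vocabulary order."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]


def tfidf_vector(doc, docs, vocab):
    """Textbook tf-idf: (count / doc length) * log(N / document frequency)."""
    tokens = doc.lower().split()
    if not tokens:
        return [0.0] * len(vocab)
    counts = Counter(tokens)
    vec = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for d in docs if w in d.lower().split())  # >= 1 for vocab words
        vec.append(tf * math.log(len(docs) / df))
    return vec


docs = ["good phone good price", "bad phone", "good camera"]
vocab = build_vocab(docs)
counts0 = count_vector(docs[0], vocab)
tfidf0 = tfidf_vector(docs[0], docs, vocab)
```

In the first review, "good" has the highest raw count, but "price" receives the highest tf-idf weight because it occurs in only one of the three reviews.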
After the feature extraction stage, we trained our logistic regression model on the
set of features selected in the feature extraction stage. Training the model alone is
not sufficient to classify the reviews accurately. For better accuracy we need to
tune the hyperparameters of the model, viz. C, penalty, solver, and so on. The process of
hyperparameter tuning is discussed in detail in the Experiment section.
After the model is trained and tuned, it needs to be evaluated to understand its
performance. The model is evaluated by testing it on unseen data and
calculating the accuracy, precision and recall. Results are discussed in the Experiment
section.
i) Rating
Customers can rate the product from 1 to 5 stars to express satisfaction or dissatisfaction with it.
This feature can be used to check that the written review and the rating given
by the reviewer point in the same direction and do not contradict each other. Ratings
corresponding to fake reviews generally deviate from the average rating of the product,
which helps in classifying the fake reviews.
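One way to turn this observation into a feature is to measure how far each rating lies from the average rating of the same product. The sketch below assumes reviews are given as hypothetical `(product_id, rating)` pairs; it illustrates the idea and is not the feature definition used in this work.

```python
from statistics import mean


def rating_deviation(reviews):
    """For each (product_id, rating) pair, return |rating - mean rating of that product|."""
    by_product = {}
    for product_id, rating in reviews:
        by_product.setdefault(product_id, []).append(rating)
    average = {pid: mean(ratings) for pid, ratings in by_product.items()}
    return [abs(rating - average[pid]) for pid, rating in reviews]


# Hypothetical ratings: three 5-star reviews and one 1-star review of product "p1".
reviews = [("p1", 5), ("p1", 5), ("p1", 5), ("p1", 1)]
deviations = rating_deviation(reviews)
```

The 1-star review stands out with the largest deviation, which is the kind of signal the paragraph above describes.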
This section describes the analysis carried out in our research. The dataset cannot be passed directly for
training. First, important feature columns such as rating and verified
purchase were extracted. Then, categorical features were converted into
numerical features using LabelEncoder. Text
information was extracted using CountVectorizer and Tf-idf (max_features = 1400). This
was followed by calculating the length of each review. Finally, our data, with a feature
vector of size 1403 (1400 features of review content, plus rating, verified purchase, and review length), was
produced. A train-test split of 80-20% was carried out on the data. This was followed by
scaling the data using the MinMaxScaler class from the sklearn library. MinMaxScaler
transforms features by scaling each feature to a given range. This range can be set by
specifying the feature_range parameter (default (0, 1)).
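The encoding, splitting, and scaling steps above can be sketched without any library. The helpers `label_encode`, `min_max_scale`, and `split_train_test` are hypothetical stand-ins that mimic the roles of the sklearn classes named in the text; they are not those classes, and details such as the fixed random seed are assumptions.

```python
import random


def label_encode(values):
    """Map each distinct category to an integer id, assigned in sorted order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def min_max_scale(column, feature_range=(0.0, 1.0)):
    """Linearly rescale one feature column so that min -> a and max -> b."""
    lo, hi = min(column), max(column)
    a, b = feature_range
    if hi == lo:  # constant column: avoid division by zero
        return [a for _ in column]
    return [a + (x - lo) * (b - a) / (hi - lo) for x in column]


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices with a fixed seed and hold out test_fraction of the rows."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(rows) * test_fraction))
    test = [rows[i] for i in idx[:n_test]]
    train = [rows[i] for i in idx[n_test:]]
    return train, test


codes, classes = label_encode(["Y", "N", "Y"])
scaled_lengths = min_max_scale([1, 3, 5])
train, test = split_train_test(list(range(10)))
```

In practice the minimum and maximum used for scaling would be taken from the training split and then applied to the test split, so that no information from the test data leaks into training.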
A logistic regression model was trained on the training data for both CountVectorizer and Tf-idf
features using the LogisticRegression class in sklearn.
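To show what training such a model involves, here is a minimal pure-Python sketch that fits logistic regression by batch gradient descent on a toy dataset. It is not the sklearn solver used in the experiment; the learning rate, epoch count, `l2` penalty weight, and the two toy features are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logreg(X, y, lr=0.5, epochs=500, l2=0.0):
    """Batch gradient descent on the mean log-loss, with an optional L2 penalty on w."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Label a row 1 (fake) when the predicted probability is at least 0.5."""
    return [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
            for xi in X]


# Toy rows: [scaled rating deviation, verified purchase flag]; label 1 = fake.
X = [[0.9, 0.0], [0.8, 0.0], [0.1, 1.0], [0.2, 1.0]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
predictions = predict(X, w, b)
```

The `l2` argument plays a role comparable to regularisation strength; in sklearn, the `C` hyperparameter is the inverse of that strength, so a smaller `C` corresponds to a larger penalty.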
The last phase of the experiment is the evaluation of the model to understand its performance.
Binary classification involves classifying the data into two groups, e.g. yes/no, true/false,
fake/genuine, etc. Target variables in such problems are not continuous; instead, the model predicts the
probability of each class (yes/no, etc.). Such models are evaluated using a
confusion matrix. Using the confusion matrix, we calculated accuracy, precision and recall for
logistic regression with Tf-idf, both with and without the “verified purchase” feature. These are
the results for Tf-idf.
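The three metrics follow directly from the four cells of the confusion matrix. A minimal sketch in Python, treating label 1 as the positive (fake) class; the sample labels are made up for illustration and are not results from this work.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) counts for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def scores(y_true, y_pred):
    """Accuracy, precision and recall computed from the confusion matrix."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 1 = fake, 0 = genuine.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
result = scores(y_true, y_pred)
```

Precision answers "of the reviews flagged as fake, how many were fake", while recall answers "of the fake reviews, how many were flagged".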
4. CONCLUSION
5. REFERENCES
[1]. Sifat Ahmed, Faisal Muhammad Shah, “Using Boosting Approaches to Detect Spam
Reviews”, 2019 1st International Conference on Advances in Science, Engineering and
Robotics Technology (ICASERT), 2019.
[2]. Faisal Khurshid, Yan Zhu, Zhuang Xu, Mushtaq Ahmad, Muqeet Ahmad, “Enactment
of Ensemble learning for Review Spam Detection on Selected Features”. International
Journal of Computational Intelligence System 2019, Vol 12(1); pp.387-394; ISSN:
1875-6891; 2019.
[3]. Faliang Huang, Guoqing Xie, Ruliang Xiao, “Research on Ensemble Learning”,
International Conference on Artificial Intelligence and Computational Intelligence,
2009.
[4]. Brian Heredia, Taghi M. Khoshgoftaar, Joseph D. Prusa, Michael Crawford,
“Improving detection of untrustworthy online reviews using ensemble learners
combined with feature selection”, Springer, Article number: 37, 2017.
[5]. Ioannis Dematis, Eirini Karapistoli, Athena Vakali, “Fake Review Detection via
Exploitation of Spam Indicators and Reviewer Behavior Characteristics”. Springer
International Publishing AG pp.581-595, 2018.
[6]. Anna V. Sandifer, Casey Wilson, Aspen Olmsted, “Detection of fake online hotel
reviews”, IEEE Internet Technology and Secured Transactions (ICITST), International
Conference, 2017.
[7]. Xinyue Wang, Xianguo Zhang, Chengzhi Jiang, Haihang Liu, “Identification of Fake
Reviews Using Semantic and Behavioral Features”, 2018 4th IEEE International
Conference on Information Management, 2018.
[8]. Amitkumar B. Jadhav, Vijay U. Rathod and Dr. Hemantkumar B. Jadhav, “Improving
Performance of Fake Reviews Detection in Online Review's using Semi-Supervised
Learning”. International Research Journal of Engineering and Technology (IRJET),
Volume: 06 Issue: 06 | June 2019.
[9]. Min-Yuh, Chih-Chien Wang, Chien-Chang Chen and Shao-Chieh Yang, “Exploring
Review Spammers by Review Similarity: A Case of Fake Review in Taiwan”.
Proceedings of the Third International Conference on Electronics and Software Science
(ICESS2017), Takamatsu, Japan, 2017.
[10]. Fathima Nada, Bariya Firdous Khan, Aroofa Maryam, Nooruz-Zuha, Zameer Ahmed.
“Fake news detection using Logistic Regression” International Research Journal of
Engineering and Technology (2019): pISSN: 2395-0072
[11]. Pankaj Chaudhary, Abhimanyu Tyagi, Santosh Mishra. “Fake Review Detection
through Supervised Classification.” Indian Conference on Recent Innovations in
Emerging Technology & Science (2018):ISSN: 2320-2882
[12]. Muhammad Syahmi Mokhtar, Yusmadi Yah Jusoh, Novia Admodisastro, Noraini Che
Pa, Amru Yusrin Amruddin. “Fakebuster: Fake News Detection System Using Logistic
Regression Technique in Machine Learning.” International Journal of Engineering and
Advanced Technology (2019): ISSN: 2249 – 8958.
[13]. Huaxun Deng, Linfeng Zhao, Ning Luo, Yuan Liu, Guibing Guo, Xingwei Wang,
Zhenhua Tan, Shuang Wang and Fucai Zhou, “Semi-supervised Learning based Fake
Review Detection”, 2017 IEEE International Symposium on Parallel and Distributed
Processing with Applications and 2017 IEEE International Conference on Ubiquitous
Computing and Communications (ISPA/IUCC), 2017.
[14]. SP. Rajamohana, Dr. K. Umamaheswari, M. Dharani and R. Vedackshya, “A Survey
On Online Review Spam Detection Techniques”. IEEE International Conference on
Innovations in Green Energy and Healthcare Technologies (ICIGEHT’17), 2017.