Fake Product Review Detection and Elimination Using Opinion Mining

2023 World Conference on Communication & Computing (WCONF)

Raipur, India. July 14-16, 2023

A. Thilagavathy, P. R. Therasa, J. Jeno Jasmine, Sneha M, Shree Lakshmi R, Yuvanthika S
2023 World Conference on Communication & Computing (WCONF) | 979-8-3503-1120-4/23/$31.00 ©2023 IEEE | DOI: 10.1109/WCONF58270.2023.10234996

Department of Computer Science and Engineering, R.M.K. Engineering College, Tamil Nadu, India
atv.cse@rmkec.ac.in, prt.cse@rmkec.ac.in, jje.cse@rmkec.ac.in, sneh19407.cs@rmkec.ac.in, shre19405.cs@rmkec.ac.in,
yuva19441.cs@rmkec.ac.in

Abstract - This paper addresses the identification of fake reviews and their removal from a given dataset using supervised machine learning algorithms and natural language processing techniques based on a wide variety of features. We trained two independently developed machine learning models on a fake-review dataset to assess the extent to which the information provided in a review is genuine. Fake product reviews, which can be found on numerous online retail sites, strongly influence customers' decisions to buy products, and the profit made on a product likely depends on its reviews. These fake reviews must therefore be detected so that large e-commerce companies such as Meesho, Amazon, Flipkart, and Nykaa can remove fraudsters and fraudulent reviewers and sustain users' trust in shopping sites. The approach may also be used by websites and apps with relatively few consumers to estimate the authenticity of reviews so that online businesses can respond to them suitably. The model is developed using Naïve Bayes, Support Vector Machine (SVM), and the TF-IDF (term frequency-inverse document frequency) vectorizer. These models can be used to detect spam reviews on a website or application instantly. However, effectively countering spammers requires a sophisticated model trained on a large dataset of millions of reviews. In this work, the limited "Reviews of 20 Hotels in Chicago" hotel dataset is used to train the models on a small scale, but the dataset can be expanded to achieve greater accuracy and authenticity in the reviews.

Index Terms – Opinion Mining, Data Preprocessing, Supervised Machine Learning Algorithm.

I. INTRODUCTION

Writing reviews for products bought online has become an everyday activity, and consumers rely on this feedback when buying products through e-commerce websites. But when the reviews are counterfeit, consumers have no way of knowing the authenticity of what they read, and they are manipulated into buying products that are not trustworthy. Identifying such reviews by hand is straightforward but time-consuming, because each review must be read and marked as counterfeit or ambiguous to identify the true cause of the issue. This issue can be solved by training a machine learning model that flags each review in the review section as genuine or spam. The intriguing part is that this method can be used to catch spammers who did not use the product. Spam reviews, possibly posted under different customer IDs, may be used to skew a product's reviews and give it a high rating. Such reviews can be filtered by looking at how often words like "awesome," "so good," and "fantastic" are used. This encourages us to create a system that uses a review's text and rating information to identify fake customer reviews of a product.

The credibility score of a potentially fraudulent review is calculated using machine learning models. By deriving topics and sentiments from online reviews, an automated system could monitor consumer reviews and additionally block fraudulent ones. Fake review identification and removal takes a lot of training data to be effective, along with additional domain knowledge, such as the sarcastic sentences users employ to convey their displeasure with a product. In some cases, the product being reviewed may be good but its delivery or shipping may not be, which affects the review classification. Plain sentiment analysis may misclassify such reviews as negative ratings, so an NLP method is used to detect them. Data preprocessing is used to remove irrelevant or outdated product reviews. Because the number of users on these websites and applications is growing daily, sentiment analysis is used by businesses like Twitter, WhatsApp, and Facebook to recognize fake news and harmful or disparaging messages and to ban perpetrators.

The primary objective of this research project is to build a platform for internet shopping in which users can be confident that the goods they purchase are genuine and that consumer testimonials are accurate and often validated by the company itself. In addition, businesses in the e-commerce (Walmart, Amazon), logistics, travel, job search (LinkedIn, Glassdoor, Indeed), and food (Shopsy, Swiggy, Zomato) sectors use algorithms to combat fraudsters who trick customers into purchasing subpar goods and services by posting false reviews. Users need not worry about such fake users, since they will be alerted to scammers through markers like "not verified listings." Manual labeling of reviews consumes a lot of effort and is inefficient. As a result, the reviews are assigned labels using a supervised learning algorithm rather than by hand. The Naïve Bayes, SVM, and TF-IDF vectorizer methods are used to identify and remove fake reviews. The fake review identification problem is thereby addressed, helping consumers view authenticated reviews.


Authorized licensed use limited to: China University of Petroleum. Downloaded on April 10,2024 at 07:35:52 UTC from IEEE Xplore. Restrictions apply.
II. LITERATURE REVIEW

In [1], a preliminary study is conducted toward building a smart LMS (Learning Management System) for distance learning that employs techniques from NLP. This study is based on a Systematic Literature Review (SLR) focused on Recommender Systems (RS). In [2], two Machine Learning (ML) models are trained on a dataset of fake reviews in order to predict review authenticity. The work in [3] proposes an algorithm to track customer reviews and extract topic and sentiment information from online reviews. The algorithm is also capable of identifying and blocking fake reviews.

The system proposed in [4], called ICF++, is designed to measure a review's honesty, the reviewer's trustworthiness, and the product's reliability. The work in [5] examines review-centric features proposed for detecting fake reviews, with a particular focus on approaches that employ supervised machine learning techniques. The authors of [6] extend a recently proposed method for detecting opinion spam that relies on n-gram techniques. Their approach introduces feature selection and uses different methods for representing opinions.

Reference [7] presents a new and reliable system for identifying spam reviews. The proposed approach uses three distinct features: (i) the sentiment of the review and its accompanying comments, (ii) content-based factors, and (iii) rating deviation. The purpose of [8] is to serve as a literature review for beginners and a survey for identifying opportunities in the field. The work in [9] proposes a Fake Product Review Monitoring and Removal System (FPRMS) that uses an Intelligent Interface and Uniform Resource Locators (URLs) to remove fake reviews and provide users with genuine reviews and ratings.

The work in [10] analyzes Yelp's filtered reviews to understand its filtering algorithm and concludes that the filtering is reasonable and linked to unusual spamming behaviors. Reference [11] suggests a comprehensive technique called SPEAGLE that makes use of relational data and information including text, timestamps, and ratings to find suspicious reviews, the individuals who write them, and the goods targeted by spam. The purpose of [12] is to develop a machine learning model that can tell the difference between real and false reviews in the Yelp dataset. The work in [13] uses Naïve Bayes and Logistic Regression to classify Twitter reviews and assesses the algorithms' performance based on accuracy, precision, and throughput.

III. METHODOLOGY

The fake review identification system was built using Naive Bayes, the TF-IDF Vectorizer, and feature extraction techniques applied to the dataset. The package contains the kaggle.com dataset, which includes the following characteristics:

Title: the title of a review.
Author: the creator of the review.
Text: the review's text; it may be incomplete.
Label: a label indicating whether the review may be unreliable:
1: Untrustworthy
0: Dependable

Data pre-processing: Data pre-processing cleans the review dataset by removing extraneous and unnecessary information as well as noisy and inaccurate data. Data preparation is a fundamental stage in machine learning methods and is the first step in the proposed approach.

Step 1: Sentence tokenization
The full review is received as input and tokenized into sentences using NLTK. Tokenization is a popular natural language processing technique that serves as a critical first step before any additional preprocessing. This process involves breaking down the text into individual words, which are referred to as tokens. For example, tokenization would separate the sentence "Going to college for attendance" into the following tokens: "Going", "to", "college", "for", and "attendance".

Step 2: Removal of punctuation marks
Leading and trailing punctuation, as well as extra whitespace, are removed from the reviews. Stop words are the most commonly used terms, but they carry little information. Typical examples of stop words are "an", "a", "the", and "this". Before moving on to the procedure of identifying false reviews in this study, all data are cleared of stop words.

Step 3: Word Tokenization
Each individual review is tokenized into terms, which are kept in a list to aid retrieval.

Step 4: Removal of stop words and stemming
After stop words are removed, each remaining word is reduced to its stem, which is free of affixes. For instance, "cooking" has the word "cook" as its stem, and the stemming algorithm is aware that the "ing" suffix can be dropped.

After converting the text strings into numerical representations with the TF-IDF Vectorizer, we initialize a Naive Bayes classifier and fit the model. Finally, the accuracy score and confusion matrix show how well the model performs.

The TF-IDF Vectorizer is a well-known approach for turning text into meaningful numerical units. We assume that a term becomes more important within a given text the more frequently it appears. We normalize the frequency of a term by the size of the document, and we call this term frequency.
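As a rough illustration of Steps 1 to 4, the following sketch uses only the Python standard library rather than NLTK. The sentence splitter, the short stop-word list, and the suffix-stripping stemmer are simplified stand-ins, and all function names are hypothetical. Unlike Step 2 as described, it drops every punctuation character, not only leading and trailing ones.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; NLTK ships a much larger one).
STOP_WORDS = {"a", "an", "the", "this", "to", "for", "is", "was", "and"}

def split_sentences(review):
    """Step 1: naive sentence tokenization on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", review.strip()) if s]

def strip_punctuation(sentence):
    """Step 2: drop punctuation characters and collapse extra whitespace."""
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())

def tokenize_words(sentence):
    """Step 3: split a cleaned sentence into lowercase word tokens."""
    return sentence.lower().split()

def remove_stop_words(tokens):
    """Step 4 (first half): discard tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Step 4 (second half): crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(review):
    """Run all steps and return the list of stemmed tokens for one review."""
    tokens = []
    for sentence in split_sentences(review):
        words = tokenize_words(strip_punctuation(sentence))
        tokens.extend(stem(w) for w in remove_stop_words(words))
    return tokens

print(preprocess("Going to college for attendance. The cooking was fantastic!"))
```

The length guard in `stem` is a design choice of this sketch that keeps very short words such as "going" intact; a production pipeline would more likely rely on an established stemmer.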

Term frequency is calculated as:

$$TF(w) = \frac{\text{doc.count}(w)}{\text{total\_words\_in\_the\_document}}$$

When term frequency is computed, every term is given equal weight. However, words that occur frequently across documents, such as "a" and "the", may be less useful for interpreting the meaning of the text and may dilute the weights of more significant words. To mitigate this effect, TF is discounted by the inverse document frequency (IDF):

$$IDF(w) = \log\left(\frac{\text{total\_number\_of\_documents}}{\text{number\_of\_documents\_containing\_word}}\right)$$

Thus, TF-IDF is obtained by multiplying TF and IDF. The TF-IDF score is higher for more important terms:

$$\text{TF-IDF}(w) = TF(w) \times IDF(w)$$
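To make the TF, IDF, and TF-IDF definitions concrete, here is a minimal sketch that evaluates them directly on a toy corpus of three tokenized reviews. The corpus and function names are invented for illustration. Library vectorizers such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization by default, so their scores will generally differ from these raw values.

```python
import math
from collections import Counter

def term_frequency(word, doc_tokens):
    """TF(w): occurrences of w in the document divided by the document length."""
    return Counter(doc_tokens)[word] / len(doc_tokens)

def inverse_document_frequency(word, corpus):
    """IDF(w): log(total documents / documents containing w)."""
    containing = sum(1 for doc in corpus if word in doc)
    if containing == 0:
        # Assumption of this sketch: unseen words get weight 0 instead of a division by zero.
        return 0.0
    return math.log(len(corpus) / containing)

def tf_idf(word, doc_tokens, corpus):
    """TF-IDF(w) = TF(w) * IDF(w)."""
    return term_frequency(word, doc_tokens) * inverse_document_frequency(word, corpus)

# Toy corpus (hypothetical reviews, already tokenized).
corpus = [
    ["the", "room", "was", "awesome"],
    ["the", "staff", "was", "rude"],
    ["awesome", "awesome", "the", "best"],
]

for word in ("the", "awesome", "rude"):
    print(word, round(tf_idf(word, corpus[0], corpus), 4))
```

Because "the" appears in every document, its IDF is log(1) = 0 and its TF-IDF score vanishes, which is the discounting effect the text describes.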
Naive Bayes classifiers are a family of simple machine learning algorithms. The well-known Naive Bayes algorithm is extended here to multinomial NB and used within a pipeline. The algorithm decides whether a review is genuine and reliable. This is not the only option: other strategies built on similar concepts can also be employed for training such classifiers.

Fig. 1. Flowchart for fake review detection

Naive Bayes is used to determine whether a review is genuine or fake. It is a type of algorithm used to classify texts into categories. After the review text is tokenized, Bayes' theorem is applied to the tokens to estimate whether the review is genuine. Bayes classification relates the prior probability of an event to the evidence currently observed. After the probability of each token is computed, a final calculation establishes the overall likelihood of the review with respect to the dataset. From this overall likelihood, we decide whether the review is real or fake.

Bayes' theorem gives the probability of an event A when event B is true:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (1)$$

where P(A) is the prior probability and P(A|B) is the posterior probability.

For each class B_1 and B_2, the probabilities of the individual tokens A_1, A_2, A_3 are multiplied:

$$P(B_1) = P(A_1 \mid B_1) \cdot P(A_2 \mid B_1) \cdot P(A_3 \mid B_1) \qquad (2)$$

$$P(B_2) = P(A_1 \mid B_2) \cdot P(A_2 \mid B_2) \cdot P(A_3 \mid B_2) \qquad (3)$$

If a probability is 0, it is replaced by the smoothed estimate:

$$P(\text{Word}) = \frac{\text{Word count} + 1}{\text{Total no. of words} + \text{No. of unique words}}$$

As a result, using this approach, one may assess the reviews' authenticity.

Fig. 2. Workflow for identifying and analyzing reviews using different algorithms
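The Naive Bayes scoring with add-one smoothing can be sketched as follows. This is a minimal illustration on invented toy data, not the authors' implementation. It assumes that "total no. of words" means the word count within one class and that "no. of unique words" means the vocabulary size over both classes. It also multiplies in a class prior and sums logarithms instead of multiplying raw probabilities, which are common choices the text does not state. Labels follow the dataset convention (1 = untrustworthy, 0 = dependable).

```python
import math
from collections import Counter

def train(labelled_reviews):
    """labelled_reviews: list of (tokens, label) pairs with label 1 = fake, 0 = genuine."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for tokens, label in labelled_reviews:
        word_counts[label].update(tokens)
        class_counts[label] += 1
    vocabulary = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocabulary

def word_probability(word, label, word_counts, vocabulary):
    """Smoothed P(word | class): (count + 1) / (total words in class + unique words)."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + 1) / (total + len(vocabulary))

def classify(tokens, model):
    """Return the label with the higher log-score; both classes must occur in training."""
    word_counts, class_counts, vocabulary = model
    n_reviews = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        # Class prior (an assumption of this sketch), in log space to avoid underflow.
        score = math.log(class_counts[label] / n_reviews)
        for word in tokens:
            score += math.log(word_probability(word, label, word_counts, vocabulary))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: two fake-looking and two genuine-looking reviews.
training = [
    (["awesome", "fantastic", "awesome", "best"], 1),
    (["so", "good", "fantastic", "awesome"], 1),
    (["room", "clean", "staff", "helpful"], 0),
    (["breakfast", "late", "room", "small"], 0),
]
model = train(training)
print(classify(["awesome", "room", "fantastic"], model))
print(classify(["staff", "clean", "breakfast"], model))
```

Smoothing matters in the first call: "room" never occurs in a fake training review, so without the added 1 the fake-class score would collapse to zero regardless of the other tokens.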

Support Vector Machine (SVM) is a commonly used supervised learning approach for classification and regression problems, although it is primarily used for classification. The SVM algorithm aims to identify the optimal line or decision boundary, known as a hyperplane, that divides n-dimensional space into classes so that new data points can be categorized quickly in the future. The SVM algorithm chooses the extreme points, or vectors, that help define the hyperplane. These extreme cases are called support vectors, which give the algorithm its name. A decision boundary or hyperplane of this kind can be used to separate two different categories.

A linear SVM classifier is used for data that can be separated into two classes by a single straight line. The terms "non-linear data" and "non-linear SVM classifier" refer to data that cannot be separated by a straight line and to the classifier used for such data, respectively.

Fig. 3. Statistical analysis of review

IV. RESULTS & DISCUSSION

Fake reviews have become increasingly common on websites and social media networks. To address this issue, our team used text processing and Naive Bayes to develop a model that can detect fake reviews. By leveraging machine learning tools and drawing upon prior dataset values, we were able to classify reviews as fake or not fake in a shorter amount of time. This provides users with a greater sense of trust in reviews that appear on social media and other sources.

TABLE I. SUMMARY OF THE DATASET

Statistic                       Value
Total number of reviews         5853 reviews
Number of fake reviews          1144 reviews
Number of real reviews          4709 reviews
Number of distinct reviews      102739 words
Total number of tokens          103052 tokens
The maximum review length       875 words
The minimum review length       4 words
The average review length       439.5 words

The spread of erroneous information seriously harms users and the social environment. It is difficult to spot a false review in the first place because it is meant to deceive the user. Many different avenues are used to spread false information, which disturbs society and the lives of its members. Further improvements would include identifying the source of the erroneous information and stopping its spread on social media and online platforms. Such a system would be able to pinpoint the sources of misleading information in order to stop those who seek to deceive the public, and to track down the social media profiles of those spreading rumors and false information so that they can be halted before the information spreads.

V. CONCLUSION

In this work, independently working machine learning models were developed for assessing product reviews. The proposed model was developed using Naïve Bayes, Support Vector Machine (SVM), and the TF-IDF (term frequency-inverse document frequency) vectorizer. The proposed model efficiently detected spam reviews on a website or application instantly. The proposed work was tested using the "Reviews of 20 Hotels in Chicago" hotel dataset and achieved greater accuracy and authenticity in the reviews.

REFERENCES
[1] D. F. Murad, Y. Heryadi, B. D. Wijanarko, S. M. Isa and W. Budiharto, "Recommendation System for Smart LMS Using Machine Learning: A Literature Review," 2018 International Conference on Computing, Engineering, and Design (ICCED), Bangkok, Thailand, 2018, pp. 113-118, doi: 10.1109/ICCED.2018.00031.
[2] S. M. Anas and S. Kumari, "Opinion Mining based Fake Product review Monitoring and Removal System," 2021 6th International Conference on Inventive Computation Technologies (ICICT), Coimbatore, India, 2021, pp. 985-988, doi: 10.1109/ICICT50816.2021.9358716.
[3] P. Jain, K. Chheda and M. Lade, "Fake Product Review Monitoring System," International Journal of Trend in Scientific Research and Development, vol. 3, pp. 105-107, 2019, doi: 10.31142/ijtsrd21644.
[4] E. Wahyuni and A. Djunaidy, "Fake Review Detection From a Product Review Using Modified Method of Iterative Computation Framework," MATEC Web of Conferences, vol. 58, 03003, 2016, doi: 10.1051/matecconf/20165803003.
[5] A. Kashid, A. Lalwani, S. Gaikawad, R. Patil, R. Sonkamble and S. More, "Fake Review Detection System Using Machine Learning," 2021.
[6] R. Patel and P. Thakkar, "Opinion Spam Detection Using Feature Selection," 2014 International Conference on Computational Intelligence and Communication Networks, Bhopal, India, 2014, pp. 560-564, doi: 10.1109/CICN.2014.127.
[7] S. Saumya and J. P. Singh, "Detection of spam reviews: a sentiment analysis approach," CSIT, vol. 6, pp. 137–148, 2018. https://doi.org/10.1007/s40012-018-0193-0
[8] N. Sodera and A. Kumar, "Open problems in recommender systems diversity," 2017 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 2017, pp. 82-87, doi: 10.1109/CCAA.2017.8229776.
[9] Ata-Ur-Rehman et al., "Intelligent Interface for Fake Product Review Monitoring and Removal," 2019 16th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), Mexico City, Mexico, 2019, pp. 1-6, doi: 10.1109/ICEEE.2019.8884529.
[10] A. Mukherjee, V. Venkataraman, B. Liu and N. Glance, "What Yelp Fake Review Filter Might Be Doing?," Proceedings of the International AAAI Conference on Web and Social Media, vol. 7, no. 1, pp. 409-418, 2021.
[11] Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, 2015.

[12] A. Sihombing and A. C. M. Fong, "Fake Review Detection on Yelp
Dataset Using Classification Techniques in Machine Learning," 2019
International Conference on contemporary Computing and
Informatics (IC3I), Singapore, 2019, pp. 64-68, doi:
10.1109/IC3I46837.2019.9055644
[13] A. Prabhat and V. Khullar, "Sentiment classification on big data using
Naïve bayes and logistic regression," 2017 International Conference
on Computer Communication and Informatics (ICCCI), Coimbatore,
India, 2017, pp. 1-5, doi: 10.1109/ICCCI.2017.8117734.
