
2022 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2022 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC)

Semi-supervised Learning based Fake Review Detection

Huaxun Deng, Linfeng Zhao, Ning Luo, Yuan Liu∗, Guibing Guo, Xingwei Wang,
Zhenhua Tan, Shuang Wang and Fucai Zhou

Software College, Northeastern University, Shenyang, Liaoning 110169, China
Email: *liuyuan@mail.neu.edu.cn

Abstract—The impact of product reviews on e-commerce platforms is growing: reviews give consumers more information about products and directly influence their buying decisions. However, the existence of fake reviews prevents consumers from making correct judgments about sellers and degrades the credibility of the platform. It is therefore of practical significance to identify fake reviews on the platform. Manually annotating a data set is difficult, and it is nearly impossible to obtain correct labels by reading only a small portion of comments with a classifier trained in the traditional way. Previous studies have shown that fake comments exhibit characteristics such as high similarity in content and a high concentration of comments. In this paper, we propose a new algorithm to identify fake reviews based on a semi-supervised learning method. Experiments on real data demonstrate that the proposed method achieves the desired performance.

Keywords-Fake review, Similarity, Semi-supervised learning

I. INTRODUCTION

As e-commerce technology continues to grow rapidly, the impact of online reviews increases. For consumers, reviews have become an essential way to obtain more information about product quality and to help them make purchase decisions. For business owners, reviews provide a way to improve and enhance their businesses, since the feedback in customer reviews is helpful for product improvement. However, reviews are not always provided truthfully, and fake reviews are common. Business owners may pay someone to write good reviews about their own products or bad reviews about their competitors' products. These fake reviews lead consumers to make wrong judgments about product quality. Therefore, detecting fake reviews is important and remains a challenging issue in both industry and academia.

Researchers have proposed various fake review detection approaches. The methods can be divided into two categories: review-centric review spam detection and reviewer-centric review spam detection [1]. In the first category, researchers use machine learning techniques to build models from the content and metadata of the reviews. Supervised learning refers to the task of learning from labelled data [2] and is the most prevalent method used for review spam detection [1]. In the second category, spammers are identified through the similar contents of their reviews and the similar times at which they post them; gathering the behavioural features of spammers can detect them well [3].

However, most of the above studies depend on supervised learning and require a labelled data set, which is difficult to build because it is hard to label a huge amount of data accurately. In this paper, we propose a new algorithm based on PU-Learning that considers multi-aspect features and uses a small amount of labelled data together with a large amount of unlabelled data to classify reviews. In the proposed model, features are captured from two aspects: metadata features and review content features. We use an autoencoder for dimensionality reduction and K-means to cluster the data. According to the clustering results, our method can properly label the unlabelled data and obtain a neural network based classifier with the desired performance.

II. RELATED WORK

Previous research on fake review detection can be roughly divided into three categories: methods based on the characteristics of a single comment, methods based on the characteristics of an entire set of comments, and methods based on the comment dataset.

The first type of method is based on the characteristics of a single comment. Deng et al. [4] analyse comments from several aspects, for instance the number of comments, their frequency and their length, and use unsupervised clustering techniques to distinguish fake comments on technology products.

In the second type of method, based on the characteristics of an entire set of comments, Wu et al. [5] collect data on a specific product over a long period, aiming to find the characteristics of its comments. Their analysis concludes that the early comments on a product are quite different from the later ones.

In the third type of method, based on the comment dataset, Jindal and Liu [7] assert that highly similar comments are fake comments. They therefore hire a group of people to label data according to this principle and then apply supervised learning on the labelled data to detect fake reviews. Ott et al. use the Amazon Mechanical Turk crowdsourcing platform to create a golden English data set of fake reviews and then apply traditional text classification techniques to distinguish the fake comments.

III. BUILD REVIEW FEATURE VECTOR

Before introducing our method for detecting fake reviews, we first process the input of our model by quantifying its features. We classify the features into metadata features and review content features. The features of a review are denoted by a vector X = {x_length, x_userLevel, x_isMobile, R}, where the first three values are metadata features and R is a set of features related to the review content.

A. Metadata Feature

The first metadata feature is the length of the review, which is quantified as follows:

x_length = length(review)    (1)

where length(review) indicates the length of the review content.

Different evaluation features often have different scales and units, which can distort the results of the data analysis. To eliminate this influence, we normalize each feature so that it becomes comparable with the other features and suitable for comprehensive evaluation. For normalization, we use min-max normalization:

x* = (x − x_min) / (x_max − x_min)    (2)

where x* is the sample value after normalization, x is the sample value before normalization, x_max is the largest value in the sample dataset and x_min is the smallest value in the sample dataset.

The other two metadata features, x_userLevel and x_isMobile, are processed in a similar way. In the following discussion, x_length, x_userLevel and x_isMobile denote the values after normalization.
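As an illustration, the min-max normalization of Eq. (2) can be applied to each metadata feature with a few lines of NumPy; the sample values below are placeholders, not data from the paper.

```python
import numpy as np

def min_max_normalize(x):
    """Scale a 1-D array of feature values into [0, 1] (Eq. 2)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:              # constant feature: avoid division by zero
        return np.zeros_like(x)
    return (x - x_min) / (x_max - x_min)

# Hypothetical raw values of x_length before normalization.
lengths = [35, 120, 18, 260]
print(min_max_normalize(lengths))   # [0.0702..., 0.4214..., 0.0, 1.0]
```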
B. Review Content Feature

1) Text Segmentation and Content Related Feature Quantification: The review content features are mainly about the text content of the review, and the processing is divided into two steps: text segmentation and construction of the document feature vector. After text segmentation, we use a bag-of-words model in which each review is represented by a feature vector R:

R = (t_1, t_2, t_3, ..., t_i, ..., t_m)    (3)

where m is the total number of words in the bag and t_i records the number of occurrences of the i-th word. When a word does not appear in a review, t_i = 0.
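A minimal sketch of this bag-of-words representation is shown below, using scikit-learn's CountVectorizer; the toy reviews are placeholders, and for the Chinese JD comments a word segmenter would be applied first.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy, already-segmented reviews (placeholders for the segmented JD comments).
reviews = [
    "screen bright battery lasts long",
    "battery bad screen bad",
    "fast delivery battery lasts long",
]

vectorizer = CountVectorizer()          # builds the word bag over all reviews
R = vectorizer.fit_transform(reviews)   # each row is the vector of Eq. (3)

print(vectorizer.get_feature_names_out())
print(R.toarray())                      # t_i = count of the i-th word, 0 if absent
```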
2) Dimensionality Reduction: It is necessary to reduce the dimension of the review content vector, and we use an autoencoder for this purpose [8]. The autoencoder takes the input R and first uses an encoder to map it to a hidden representation Y through an activation function s such as the sigmoid. A decoder then maps Y back to a reconstruction Z of the same shape as R:

Y = f_θ(R) = s(WR + b)    (4)

Z = g_θ*(Y) = s(W*Y + b*)    (5)

The parameters of the encoder are θ = {W, b} and the parameters of the decoder are θ* = {W*, b*}, where W is the weight matrix and W* = W^T. We use the following loss function to make Z as close to R as possible:

L(R, Z) = Σ_{i=1}^{n} KL(R_i || Z_i)    (6)

where R is a matrix of samples consisting of n vectors, and KL(R_i || Z_i) is the Kullback-Leibler divergence between the input vector R_i and the output Z_i, which measures the difference between them. The autoencoder is trained with stochastic gradient descent, and the weight matrix is updated as follows, where l is the update step size [4]:

W → W − l · ∂L(R, Z)/∂W    (7)
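The paper gives no implementation details for this autoencoder; the following PyTorch sketch is one possible reading of Eqs. (4)-(7), assuming tied weights (W* = W^T), sigmoid activations and a KL-divergence reconstruction loss with rows normalized so they can be compared as distributions. The dimensions and training loop are placeholders.

```python
import torch
import torch.nn.functional as F

class TiedAutoencoder(torch.nn.Module):
    """Encoder Y = s(WR + b), decoder Z = s(W^T Y + b*), as in Eqs. (4)-(5)."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(hidden_dim, in_dim) * 0.01)
        self.b = torch.nn.Parameter(torch.zeros(hidden_dim))
        self.b_star = torch.nn.Parameter(torch.zeros(in_dim))

    def forward(self, r):
        y = torch.sigmoid(F.linear(r, self.W, self.b))           # encoder
        z = torch.sigmoid(F.linear(y, self.W.t(), self.b_star))  # tied decoder
        return y, z

def kl_reconstruction_loss(r, z, eps=1e-8):
    """Sum of KL(R_i || Z_i) over the batch (Eq. 6); rows are normalized first."""
    r = r / (r.sum(dim=1, keepdim=True) + eps)
    z = z / (z.sum(dim=1, keepdim=True) + eps)
    return F.kl_div((z + eps).log(), r, reduction='sum')

# Hypothetical usage: R is a batch of bag-of-words rows (placeholder values here).
model = TiedAutoencoder(in_dim=5000, hidden_dim=64)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # update rule of Eq. (7)
R = torch.rand(32, 5000)
y, z = model(R)
loss = kl_reconstruction_loss(R, z)
loss.backward()
optimizer.step()
# y, the hidden representation, is the reduced review-content feature vector.
```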
IV. SEMI-SUPERVISED LEARNING TO DETECT FAKE REVIEW

PU-Learning is a semi-supervised classification technique. It is described as a two-step strategy that addresses the problem of building a two-class classifier with only positive and unlabelled examples [9]. In our model, we build the classifier with reliable negative and unlabelled examples.

According to previous work on fake review detection, similarity is an important feature for detecting fake reviews. We therefore first label as true fake review samples those reviews that are duplicates or near-duplicates of other reviews. Then the following two steps are repeated. In the first step, we use K-means to cluster the whole set of reviews into k groups. In the second step, we calculate the percentage p_i of true fake review examples in each group. We define p_l as the maximum threshold and p_s as the minimum threshold. If p_i is larger than p_l, we consider the remaining reviews in that group to be fake. If p_i is 0 or p_i < p_s, we calculate the distance between the centre of that group and the centres of the groups with p_i > p_l, and the reviews in the group with the largest such distance are taken as normal review examples.

In the next iteration, we adjust k, p_l and p_s and cluster the data again. This process repeats until all the remaining data are labelled as fake reviews or not.
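The following sketch shows one round of this iterative labelling procedure with scikit-learn's KMeans; the function name, the default thresholds and the way the farthest low-p_i cluster is chosen are assumptions filled in from the description above, not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def label_one_round(X, is_fake_seed, k=10, p_l=0.6, p_s=0.05):
    """One K-means round: return boolean masks of newly labelled fake / normal reviews.

    X            : (n_samples, n_features) review feature matrix
    is_fake_seed : boolean array, True for duplicate / near-duplicate seed reviews
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels, centres = km.labels_, km.cluster_centers_

    # p_i: fraction of seed fakes in each cluster.
    p = np.array([is_fake_seed[labels == c].mean() for c in range(k)])
    fake_clusters = np.where(p > p_l)[0]
    low_clusters = np.where((p == 0) | (p < p_s))[0]

    new_fake = np.isin(labels, fake_clusters)      # p_i > p_l: rest of the group is fake
    new_normal = np.zeros(len(X), dtype=bool)
    if len(fake_clusters) and len(low_clusters):
        # Low-p_i cluster farthest from the high-p_i clusters supplies normal examples.
        dist = {c: min(np.linalg.norm(centres[c] - centres[f]) for f in fake_clusters)
                for c in low_clusters}
        farthest = max(dist, key=dist.get)
        new_normal = labels == farthest
    return new_fake, new_normal
```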
V. EXPERIMENTAL EVALUATIONS

A. The Real Dataset

One of the problems with fake review datasets is the absence of a golden dataset. Following the previous research, we use the similarity feature to label reliable negative examples. In our experiment, all the data we use is crawled from JD.com, an e-commerce platform. We selected around one hundred electronic products and collected 48,562 comments. After that, we eliminated useless comments according to the following principles.

Firstly, we eliminate comments whose text is identical, or highly similar, to that of other comments. Secondly, we eliminate comments that do not form a complete sentence or whose content is not meaningful. Thirdly, we eliminate comments whose content is shorter than 20 words. Following these principles, we eliminate 17,994 comments, leaving 30,568 valid comments.
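A rough sketch of such a cleaning pass is shown below; the duplicate test is a simple exact match after whitespace normalization and the sentence check is only a crude placeholder, since the paper does not specify its similarity or semantic measures.

```python
def clean_comments(comments, min_length=20):
    """Drop duplicate, meaningless and very short comments (cleaning principles 1-3)."""
    seen = set()
    kept = []
    for text in comments:
        normalized = "".join(text.split()).lower()
        if normalized in seen:                      # principle 1: duplicate text
            continue
        if not any(ch.isalnum() for ch in text):    # principle 2: no real content (crude proxy)
            continue
        if len(text) < min_length:                  # principle 3: shorter than ~20 words/characters
            continue
        seen.add(normalized)
        kept.append(text)
    return kept
```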
B. Evaluation Results

Our model consists of three processes: building the review feature vector, dividing the data into two classes, and training the classifier with a neural network.
1) Building the Review Feature Vector: The field named 'afterdays' records how many days after the purchase the consumer posted an additional comment. The field named 'score' is the consumer's rating of the product. The field named 'usefulvotecount' is the number of other customers who found the review useful for judging the product. The field named 'length' is the length of the review text. The field named 'viewcount' is the number of other customers who viewed the comment. The field named 'replycount' is the number of replies to the comment. The field named 'userLevel' is the consumer's JD level. The field named 'isMobile' indicates whether the consumer bought the product via a mobile phone. The field named 'days' records how many days after receiving the product the consumer commented on it. The field named 'uselessVoteCount' is the number of customers who found the comment useless.
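For concreteness, collecting these fields from one crawled comment record might look as follows; the dictionary layout and the decision to gather all listed fields (Section III uses only length, userLevel and isMobile as metadata features) are assumptions for illustration.

```python
METADATA_FIELDS = ["length", "userLevel", "isMobile", "score", "afterdays", "days",
                   "usefulvotecount", "uselessVoteCount", "viewcount", "replycount"]

def metadata_vector(record):
    """Turn one crawled comment record (a dict) into a raw metadata feature list."""
    return [float(record.get(field, 0)) for field in METADATA_FIELDS]

# Hypothetical record; each feature would then be min-max normalized (Eq. 2).
example = {"length": 57, "userLevel": 3, "isMobile": 1, "score": 5,
           "afterdays": 12, "days": 4, "usefulvotecount": 2,
           "uselessVoteCount": 0, "viewcount": 40, "replycount": 1}
print(metadata_vector(example))
```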
2) Dividing the Data into Two Classes: In each round we set different values of k, p_s and p_l to divide the data into different sets; the following table shows the parameters set in each round. We choose 2,000 reviews, of which 800 are negative examples.

To illustrate the accuracy of identifying the unlabelled data, we use some of the labelled data as a test set. The following table shows the identification results.
Table I
THE ACCURACY OF IDENTIFYING THE UNLABELLED DATA

Dataset                          Accuracy
600 labelled, 200 test data      87.6%
700 labelled, 100 test data      89.3%
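The third process, training the neural network classifier on the data labelled by the procedure of Section IV, is not detailed in the paper; the sketch below uses scikit-learn's MLPClassifier with assumed layer sizes and placeholder data, just to show where those labels would be consumed.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Placeholder data: in practice X holds the review feature vectors (normalized
# metadata + autoencoder-reduced content features) and y the fake/normal labels
# produced by the PU-Learning procedure of Section IV.
rng = np.random.default_rng(0)
X = rng.random((2000, 67))           # e.g. 3 metadata + 64 content dimensions
y = rng.integers(0, 2, size=2000)    # 1 = fake, 0 = normal

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```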
VI. CONCLUSION

In this paper, we consider both metadata features and content-related features to construct a semi-supervised learning based fake review classifier. Firstly, we use the similarity characteristics of the text to determine a set of true negative cases, i.e. fake reviews, and extract feature vectors from multiple aspects. Then we apply K-means to cluster the comments. We label a comment as a negative case if it is close to the true negative cases, and as a positive case if it is far away from the trusted negative cases. After several iterations, we finally obtain the positive set LP and the negative set LN. Our experiments show that the proposed model can detect fake reviews effectively.

ACKNOWLEDGMENT

This research is partially supported by the National Science Foundation for Distinguished Young Scholars of China under Grant No. 61225012 and No. 71325002; the National Natural Science Foundation of China under Grant No. 61402097 and No. 61602102; the Natural Science Foundation of Liaoning Province of China under Grant No. 20170540319 and No. 201602261; and the Fundamental Research Funds for the Central Universities under Grant No. N162410002, N161704001, N151708005 and N161704004.

REFERENCES

[1] M. Crawford, T. M. Khoshgoftaar, J. D. Prusa, A. N. Richter, and H. Al Najada, "Survey of review spam detection using machine learning techniques," Journal of Big Data, vol. 2, no. 1, p. 23, 2015.

[2] N. Jindal and B. Liu, "Opinion spam and analysis," in Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 219–230, ACM, 2008.

[3] E.-P. Lim, V.-A. Nguyen, N. Jindal, B. Liu, and H. W. Lauw, "Detecting product review spammers using rating behaviors," in Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 939–948, ACM, 2010.

[4] S. Deng, C. Wan, A. Guan, and H. Chen, "Deceptive reviews detection of technology products based on behavior and content," Journal of Chinese Computer Systems, vol. 36, no. 11, p. 2498, 2015.

[5] F. Wu and B. A. Huberman, "Opinion formation under costly expression," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 1, no. 1, p. 5, 2010.

[6] A. Mukherjee, A. Kumar, B. Liu, J. Wang, M. Hsu, M. Castellanos, and R. Ghosh, "Spotting opinion spammers using behavioral footprints," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 632–640, ACM, 2013.

[7] N. Jindal and B. Liu, "Analyzing and detecting review spam," in Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pp. 547–552, IEEE, 2007.

[8] Z. Kaixu and Z. Changle, "Unsupervised feature learning for Chinese lexicon based on auto-encoder," J. Chin. Inf., vol. 27, no. 5, pp. 1–7, 2013.

[9] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu, "Building text classifiers using positive and unlabeled examples," in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pp. 179–186, IEEE, 2003.

