Semi-Supervised Learning Based Fake Review Detection
Huaxun Deng, Linfeng Zhao, Ning Luo, Yuan Liu∗, Guibing Guo, Xingwei Wang,
Zhenhua Tan, Shuang Wang and Fucai Zhou
∗Software College, Northeastern University, Shenyang, Liaoning, China, 110169
Email: liuyuan@mail.neu.edu.cn
Abstract—The impact of product reviews on business platforms is growing: reviews give consumers more information about products and directly influence their buying decisions. However, fake reviews prevent consumers from judging sellers correctly and degrade the credibility of the platform. It is therefore of practical significance to identify fake reviews on the platform. Manually annotating a data set is difficult, and a classifier trained in the traditional way from only a small labelled portion of the comments can hardly annotate the rest correctly. Previous studies have shown that fake comments exhibit characteristics such as high content similarity and high temporal concentration. In this paper, we propose a new algorithm that identifies fake reviews with a semi-supervised learning method. Experiments on real data demonstrate that the proposed method achieves the desired performance.

Keywords-Fake review, Similarity, Semi-supervised learning

I. INTRODUCTION

As e-commerce technology continues to grow rapidly, the impact of online reviews increases. For consumers, reviews have become an essential way to learn more about product quality and to make purchase decisions. For business owners, reviews are an incentive to improve and enhance their businesses, and the feedback in customer reviews is helpful for product improvement. However, reviews are not always truthfully provided, and fake reviews are common. Business owners may pay someone to write good reviews for their own products or bad reviews for their competitors' products. Such fake reviews lead consumers to wrong judgments about product quality. Detecting fake reviews is therefore important, and it remains a challenging issue in both industry and academia.

Researchers have proposed various fake review detection approaches, which can be divided into two categories: review-centric and reviewer-centric review spam detection [1]. In the first category, researchers use machine learning techniques to build models from the content and metadata of the reviews. Supervised learning refers to the task of learning from labelled data [2] and is the most prevalent method used for review spam detection [1]. In the second category, spammers are identified through the similarity of their review contents and of the times at which they post reviews; capturing the behaviour of spammers can detect them well [3].

However, most of the above studies depend on supervised learning, which requires a labelled data set that is difficult to build because it is hard to label a huge amount of data accurately. In this paper, we propose a new algorithm based on PU-Learning that considers multi-aspect features and uses a few labelled data together with a large amount of unlabelled data to classify the reviews. In the proposed model, features are captured from two aspects: metadata features and review content features. We use an autoencoder for dimensionality reduction and K-means to cluster the data. Based on the clustering results, our method can properly label the unlabelled data and obtain a neural-network-based classifier with the designed performance.

II. RELATED WORK

Previous research on fake review detection can be roughly divided into three directions: based on the characteristics of a single comment, based on the characteristics of the whole comment set, and based on the comment dataset.

The first type of method relies on the characteristics of a single comment. Deng et al. [4] analyse comments from multiple aspects, for instance the number of comments, their frequency and their length, and use unsupervised clustering techniques to distinguish fake comments on technology products.

In the second type of method, based on the characteristics of the whole comment set, Wu et al. [5] collect data on a specific product over a long period, aiming to find the characteristics of the comments. Their analysis concludes that the early comments on a product differ markedly from its late comments.

In the third type of method, based on the comment dataset, Jindal and Liu [7] assert that highly similar comments are fake comments. They hire a group of people to label the data according to this principle, and then apply supervised learning to this labelled data to detect fake reviews. Ott et al. use the Amazon Mechanical Turk crowdsourcing platform to create an English gold-standard data set of fake reviews, and then apply traditional text classification techniques to distinguish fake comments.
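The pipeline proposed above, an autoencoder for dimensionality reduction followed by K-means clustering, can be sketched as follows. This is a minimal NumPy illustration on toy data, not the authors' implementation: the network size, learning rate, and the two synthetic feature blobs are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, hidden, epochs=300, lr=0.05):
    """Single-hidden-layer autoencoder (tanh encoder, linear decoder)
    trained with batch gradient descent on the reconstruction MSE.
    Returns the encoder parameters and the loss history."""
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)            # encode to `hidden` dims
        R = H @ W2 + b2                     # linear reconstruction
        err = R - X
        losses.append(float((err ** 2).mean()))
        dH = (err @ W2.T) * (1.0 - H ** 2)  # backprop through tanh
        W2 -= lr * (H.T @ err) / n; b2 -= lr * err.mean(axis=0)
        W1 -= lr * (X.T @ dH) / n;  b1 -= lr * dH.mean(axis=0)
    return W1, b1, losses

def kmeans(Z, k, iters=50):
    """Plain K-means on the encoded features."""
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(iters):
        labels = ((Z[:, None, :] - centers) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):         # avoid emptying a cluster
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

# Toy stand-in for the review feature vectors: two 6-D blobs.
X = np.vstack([rng.normal(0.0, 0.3, (50, 6)), rng.normal(2.0, 0.3, (50, 6))])
W1, b1, losses = train_autoencoder(X, hidden=2)
Z = np.tanh(X @ W1 + b1)                    # 2-D codes
labels = kmeans(Z, k=2)
```

The encoder output `Z` plays the role of the reduced review representation that the clustering step consumes; with real data the input would be the metadata/content feature vectors described later in the paper.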
Firstly, we eliminate comments whose text is identical, or highly similar, to that of another comment. Secondly, we eliminate comments that cannot form a complete sentence or whose content is not meaningful. Thirdly, we eliminate comments whose content is shorter than 20 words. Following these rules, we eliminate 17994 comments, leaving 30568 valid comments.
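A minimal sketch of these three filtering rules in Python (the 0.9 similarity threshold and the crude "no real content" proxy are assumptions; the paper does not specify how either is measured):

```python
from difflib import SequenceMatcher

def clean_reviews(reviews, min_words=20, sim_threshold=0.9):
    """Filter a list of review strings following the three rules in the
    text: drop comments shorter than `min_words` words, drop comments
    with no real sentence content (proxied here by almost no distinct
    words), and drop near-duplicates of already-kept comments."""
    kept = []
    for text in reviews:
        words = text.split()
        if len(words) < min_words:             # rule 3: too short
            continue
        if len(set(words)) <= 2:               # rule 2 (crude proxy)
            continue
        # rule 1: identical or highly similar to a kept comment
        if any(SequenceMatcher(None, text, k).ratio() >= sim_threshold
               for k in kept):
            continue
        kept.append(text)
    return kept
```

Note the pairwise similarity check is O(n²) in the number of kept reviews; on a corpus the size of this one (tens of thousands of comments) a real implementation would use hashing or shingling instead.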
B. Evaluation Results

Our model consists of three processes: building the review feature vector, dividing the data into two classes, and training the classifier with a neural network.

1) Building the Review Feature Vector: The field 'afterdays' records how long after the purchase the consumer posted an additional comment. 'score' is the rating the consumer gives to the product. 'usefulyvotecount' is the number of other customers who found this review useful for judging the product. 'length' is the length of the review text. 'viewcout' is the number of other customers who viewed this comment. 'replycount' is the number of replies to this comment exchanged between the consumer and other customers. 'userLevel' is the consumer's JD level. 'isMobile' indicates whether the consumer bought the product via phone. 'days' records how long after receiving the product the consumer commented on it. 'uelessVoteCount' is the number of customers who found this comment useless.
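Concretely, these metadata fields can be packed into a fixed-order numeric vector. A small sketch follows; the dict-based record layout is an assumption, and the field names are kept exactly as printed above, including their original spellings:

```python
# Field names exactly as listed in the text above.
FIELDS = ["afterdays", "score", "usefulyvotecount", "length",
          "viewcout", "replycount", "userLevel", "isMobile",
          "days", "uelessVoteCount"]

def to_feature_vector(record):
    """Map one raw review record (a dict) to a fixed-order numeric
    vector; booleans become 0/1 and missing fields default to 0."""
    return [float(record.get(name, 0) or 0) for name in FIELDS]

# Hypothetical example record, for illustration only.
example = {"afterdays": 3, "score": 5, "usefulyvotecount": 12,
           "length": 84, "viewcout": 230, "replycount": 2,
           "userLevel": 4, "isMobile": True, "days": 7,
           "uelessVoteCount": 0}
vec = to_feature_vector(example)
```

Vectors built this way (optionally concatenated with content features) are what the autoencoder step would compress before clustering.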
2) Dividing the Data into Two Classes: In each turn we set different values of k, ps and pl to divide the data into different sets. The following table shows the parameters we set in each turn. We choose 2000 reviews, of which 800 are negative examples.

To illustrate the accuracy of identifying the unlabelled data, we hold out some labelled data as a test set. Table I shows the identification results.

Table I
THE ACCURACY OF IDENTIFYING THE UNLABELLED DATA

Dataset                        Accuracy
600-Labelled, 200-Test Data    87.6%
700-Labelled, 100-Test Data    89.3%
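The dividing step can be sketched as an iterative PU-style labelling loop: in each turn, comments closest to the centroid of the trusted negative (fake) set are added to LN, and the farthest ones to LP. The fractions ps and pl below play the role of the per-turn parameters mentioned above, but their actual values and the exact distance rule are assumptions:

```python
import numpy as np

def pu_label(unlabelled, trusted_negative, ps=0.2, pl=0.2, iters=3):
    """Grow a negative (fake) set LN from a trusted negative seed and
    a positive set LP from the points farthest from it. Each turn moves
    a fraction ps of the pool (closest to the LN centroid) into LN and
    a fraction pl (farthest) into LP."""
    LN = list(trusted_negative)
    LP = []
    pool = list(unlabelled)
    for _ in range(iters):
        if not pool:
            break
        center = np.mean(LN, axis=0)   # centroid of current negatives
        order = np.argsort([np.linalg.norm(x - center) for x in pool])
        n_neg = max(1, int(ps * len(pool)))
        n_pos = max(1, int(pl * len(pool)))
        neg_idx = set(order[:n_neg].tolist())               # nearest -> LN
        pos_idx = set(order[-n_pos:].tolist()) - neg_idx    # farthest -> LP
        LN += [pool[i] for i in sorted(neg_idx)]
        LP += [pool[i] for i in sorted(pos_idx)]
        pool = [x for i, x in enumerate(pool)
                if i not in neg_idx and i not in pos_idx]
    return LP, LN
```

Points left in the pool after the final turn stay unlabelled; in the paper's setting the resulting LP and LN would then feed the neural-network classifier.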
VI. CONCLUSION International Conference on, pp. 547–552, IEEE, 2007.
In this paper, we consider both metadata features and content-related features to construct a semi-supervised-learning-based fake review classifier. Firstly, we use the similarity characteristics of the text to determine a set of trusted negative cases, i.e. fake reviews, and extract feature vectors from multiple aspects. Then we apply K-means to cluster the comments. We label a comment as a negative case if it is close to the trusted negative cases, and as a positive case if it is far away from them. After several iterations, we finally obtain the positive set LP and the negative set LN. Our experiments show that the proposed model can detect fake reviews effectively.

ACKNOWLEDGMENT

This research is partially supported by the National Science Foundation for Distinguished Young Scholars of China under Grant No. 61225012 and No. 71325002; the National Natural Science Foundation of China under Grant No. 61402097 and No. 61602102; the Natural Science Foundation of Liaoning Province of China under Grant No. 20170540319 and No. 201602261; and the Fundamental Research Funds for the Central Universities under Grant No. N162410002, N161704001, N151708005 and N161704004.

REFERENCES

[1] M. Crawford, T. M. Khoshgoftaar, J. D. Prusa, A. N. Richter, and H. Al Najada, "Survey of review spam detection using machine learning techniques," Journal of Big Data, vol. 2, no. 1, p. 23, 2015.

[2] N. Jindal and B. Liu, "Opinion spam and analysis," in Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 219–230, ACM, 2008.

[3] E.-P. Lim, V.-A. Nguyen, N. Jindal, B. Liu, and H. W. Lauw, "Detecting product review spammers using rating behaviors," in Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 939–948, ACM, 2010.

[4] S. Deng, C. Wan, A. Guan, and H. Chen, "Deceptive reviews detection of technology products based on behavior and content," Journal of Chinese Computer Systems, vol. 36, no. 11, p. 2498, 2015.

[5] F. Wu and B. A. Huberman, "Opinion formation under costly expression," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 1, no. 1, p. 5, 2010.

[6] A. Mukherjee, A. Kumar, B. Liu, J. Wang, M. Hsu, M. Castellanos, and R. Ghosh, "Spotting opinion spammers using behavioral footprints," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 632–640, ACM, 2013.

[7] N. Jindal and B. Liu, "Analyzing and detecting review spam," in Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 547–552, IEEE, 2007.

[8] Z. Kaixu and Z. Changle, "Unsupervised feature learning for Chinese lexicon based on auto-encoder," J. Chin. Inf., vol. 27, no. 5, pp. 1–7, 2013.

[9] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu, "Building text classifiers using positive and unlabeled examples," in Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003), pp. 179–186, IEEE, 2003.