Fake Review
International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems
Abstract—Considering the limitations of comment text in the study of fake reviews recognition, this paper proposes to build a classification model by integrating the features of comment text and user behavior. However, the comment data obtained in reality are mostly unlabeled. Therefore, this paper proposes the MPINPUL (Mixing Population and Individual Nature PU Learning) model, based on multiple features, to build a fake reviews classification model. The MPINPUL model consists of four steps. Firstly, a constrained k-means algorithm is proposed to compute a set of trusted negative examples. The advantage of constrained k-means is that it can expand the set of positive examples while identifying the trusted negative examples. Then, LDA and k-means are used to calculate multiple representative samples for the positive and negative examples respectively. Next, the idea of population and individuality is used to determine the category label of each sample. Finally, the classifier is established. Experimental results on real data sets show that the recognition rate of the MPINPUL model proposed in this paper is higher under fusion features than under any single feature.

Keywords—fake reviews, fusion feature, PU-Learning, constrained k-means, classification model

I. INTRODUCTION

Fake reviews identification was first proposed by Jindal and Liu[1] in 2008. The difficulty of this research lies in how to effectively extract or represent the features of the comment text and user behavior so as to recognize fake reviews. Although the fabricator tries to make the content appear as truthful as possible, some verbal and behavioral details may still be flawed. Researchers have identified fake reviews in different research scenarios [2].

Ott et al.[3,4] used a support vector machine classifier based on bag-of-words features to identify fake reviews on the gold-standard data set built on the Amazon crowd-sourcing platform, with an accuracy rate of 84%. However, fake reviews deliberately imitate real reviews in terms of language and vocabulary, so the ability to identify fake reviews by bag-of-words features alone is limited. Li et al.[5] also used part-of-speech features on an Amazon data set and found that reviews constructed by crowd-sourcing presented different part-of-speech characteristics from real reviews: crowdsourced fake reviews contained more verbs, adverbs and pronouns, while real reviews contained more nouns, adjectives, prepositions, determiners and conjunctions. Lau et al.[6] believed that fake reviews copy from one another and can therefore be identified by detecting semantically repeated reviews. Mukherjee et al.[7] used an SVM classification model to achieve 67.8% accuracy based on plain text features on Yelp data sets; after the characteristics of the reviewers were integrated, the recognition accuracy increased to 84.8%. Therefore, the behavior characteristics of reviewers have a significant impact on the identification of fake reviews.

This is the state of the art in fake reviews identification. In view of the limitations of text features, this paper additionally constructs multiple behavioral features from the perspective of the reviewers. At the same time, this paper proposes the MPINPUL (Mixing Population and Individual Nature PU Learning[8]) model to address the fact that fake reviews in reality are widely distributed, huge in number, difficult to label manually with accuracy, and contain some misjudged samples.

II. FEATURE CONSTRUCTION

Feature engineering plays an important role in natural language processing research. Due to the concealment and diversity of fake reviews, feature selection is particularly important. In order to integrate multiple features for recognition, this paper studies three major categories of features: text, behavior and relationship.

A. Text Features

The features extracted in this paper include Unigram lexical features, POS features and LDA semantic features. The comment text feature indicators and feature name descriptions are shown in Table I:

TABLE I. COMMENTARY TEXT CHARACTERISTICS

Feature Name   Feature Description
Unigram        N-gram lexical features
POS            Part-of-speech features
LDA            LDA topic features

B. Behavioral Characteristics of Reviewers

Compared with real reviewers, the behaviors of fake reviewers are often abnormal. Therefore, extracting the behavior characteristics of reviewers helps to accurately identify fake reviews. The behavioral characteristics of reviewers extracted in this paper are shown in Table II:
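As a minimal illustration of how the text features of Table I can be fused with behavioral characteristics of the reviewer into one input vector, the sketch below concatenates a Unigram count vector with two hypothetical behavioral indicators. The vocabulary, review text and behavior values are invented for illustration; the paper's real feature set is much larger, and POS/LDA features would require additional NLP tooling.

```python
from collections import Counter

# Hypothetical fusion of the two feature families: a Unigram count vector
# concatenated with behavioral indicators of the reviewer. All names and
# values below are invented for illustration.
VOCAB = ["good", "bad", "amazing", "never"]

def text_vector(review: str):
    """Bag-of-words (Unigram) counts over a fixed vocabulary."""
    counts = Counter(review.lower().split())
    return [counts.get(w, 0) for w in VOCAB]

def fuse(review: str, behavior: dict):
    """Concatenate text features with behavioral features."""
    return text_vector(review) + [behavior["reviews_per_day"],
                                  behavior["avg_stars"]]

x = fuse("amazing amazing good", {"reviews_per_day": 7, "avg_stars": 5.0})
print(x)  # [1, 0, 2, 0, 7, 5.0]
```

A single classifier trained on such concatenated vectors is the simplest form of the feature fusion evaluated later in the experiments.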
Step 1: Prepare the data sets that can be input to the model;

Step 2: Divide the data set into a training set and a test set;

Step 3: Use the training set to train the classification model and output it;

Step 4: Use the test set to evaluate the trained model.

Due to the large amount of unlabeled data obtained in real applications, it is difficult to annotate the data manually. Therefore, this paper uses the MPINPUL model to learn from a small number of labeled samples and a large number of unlabeled samples. The MPINPUL model in this paper takes the set of real comments as the positive example set P and the large amount of unlabeled data as the set U.

A. Extraction of Trusted Negative Examples Based on the Constrained K-Means Algorithm

Based on semi-supervised clustering, this paper proposes to use a constrained k-means algorithm to extract reliable negative examples. The clustering process is guided by prior knowledge, with the must-link constraint steering the clustering. The advantage of the constrained k-means algorithm is that the positive example set is used to initialize the positive cluster center, and the positive example labels serve as must-link constraints during the constrained clustering. It not only marks the reliable negative examples but also expands the set of positive examples.

The constrained k-means algorithm proposed in this paper is based on the following two points:

B. Calculation of Representative Samples

A certain accuracy can be obtained by training the classifier with the set of trusted negative examples RN and the positive examples P obtained in the first stage. However, this ignores the large number of spy samples in the unlabeled data set, resulting in poor classifier performance, even though the spy samples play an important role in improving the performance of the classifier. In order to determine the category labels of the spy samples, it is necessary to find samples that can represent the positive and negative examples respectively. Therefore, the LDA algorithm is first used to obtain the distribution of the trusted negative set RN over different topics; the k-means clustering algorithm is then applied to RN, so that trusted negative examples with similar topic distributions fall into the same category. Finally, a Rocchio classifier is used to calculate 10 representative samples for the positive and negative examples respectively.

C. Determine the Category Label for the Spy Sample

Determining the category label of the spy samples is the most critical step in the MPINPUL model framework. The purpose is to divide the set US into LP and LN. This step uses the 10 representative samples of the positive and negative examples to calculate the category label of each spy sample. A DPMM (Dirichlet Process Mixture Model) is adopted to cluster the spy samples. The idea is to mix the population nature and individual nature of the spy samples, while using a probability model to calculate, for each spy sample, the probability weights of belonging to the two categories. The classification label error of the spy samples can thus be reduced to a certain extent, so as to train a classifier with higher accuracy. The steps are as follows: first, calculate the probability that a single sample belongs to the positive example
and the negative example. When the two probabilities are close, the category tag of the subclass containing the sample is used to determine the category tag of the sample.

The idea of population is that samples in the same subclass should have a high probability of belonging to the same category. Formulas (1) and (2) are used to calculate the

… to learn a better classifier. The SVM optimization is shown in formula (8):

    L(w, b, \xi) = \frac{1}{2}\|w\|^2 + \frac{C}{|LP|}\sum_{x_i \in LP} w_i \xi_i + \frac{C}{|LN|}\sum_{x_j \in LN} w_j \xi_j    (8)

    s.t.  y_i (w \cdot \phi(x_i) + b) \ge 1 - \xi_i,  \xi_i \ge 0
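The close-probability fallback described above, where a spy sample's own probabilities decide its label unless they are nearly equal, in which case the label of its subclass decides, can be sketched as follows. The margin value and all probabilities below are invented for illustration; in the paper they come from the probability model of formulas (1) and (2).

```python
def spy_label(p_pos, p_neg, subclass_label, margin=0.1):
    """Assign a spy sample to LP (+1) or LN (-1).

    p_pos / p_neg  : probability that the sample belongs to the positive /
                     negative class (individual nature).
    subclass_label : majority label of the subclass (cluster) the sample
                     fell into (population nature).
    margin         : hypothetical threshold; if the two probabilities differ
                     by less than this, the individual evidence is
                     inconclusive and the subclass label decides.
    """
    if abs(p_pos - p_neg) < margin:
        return subclass_label          # population nature decides
    return 1 if p_pos > p_neg else -1  # individual nature decides

print(spy_label(0.82, 0.18, subclass_label=-1))  # clear-cut case -> 1
print(spy_label(0.52, 0.48, subclass_label=-1))  # close call -> -1
```

Mixing the two sources of evidence in this way is what reduces the labeling error of the spy samples before the final classifier is trained.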
A. Experiment Data Set

The data included 64,445 reviews from 99 restaurants on Yelp. Among them, 8,035 comments were filtered as fake, while 56,410 were real. In this chapter, all 8,035 fake comments were selected and 8,035 real comments were randomly selected from the 56,410, so that a total of 16,070 comments constitute the data set of the experiment in this chapter. After obtaining the experimental data set, the data was divided into a training set and a test set in a 9:1 ratio by the sampling crossover method. The proportion of positive examples in the training set was fixed at 40%. Ten experiments were conducted in total, and a different test set was selected for each experiment. Each review contains a review ID, user ID, review content, stars, date, "useful" count, whether or not it was filtered, and more.
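The 9:1 sampling-crossover protocol amounts to a ten-fold partition in which each of the ten experiments holds out a different fold as the test set. A minimal index-based sketch (it ignores the fixed 40% positive proportion, which would require stratified sampling on top of this):

```python
def ten_fold_splits(n_samples, n_folds=10):
    """Yield (train_idx, test_idx) pairs: each experiment holds out a
    different fold as the test set and trains on the other nine."""
    folds = [list(range(i, n_samples, n_folds)) for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [j for k, f in enumerate(folds) if k != i for j in f]
        yield train, test

splits = list(ten_fold_splits(16070))        # 16,070 reviews, as in the data set
print(len(splits))                           # 10 experiments
print(len(splits[0][0]), len(splits[0][1]))  # 14463 1607
```

Averaging the metric of interest over the ten held-out folds gives the final figures reported in the result tables.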
2) Results of MPINPUL Model under Different Characteristics

In the experiments of this chapter, the ten-fold cross-validation method is adopted and ten experiments are carried out. A different test set was selected each time, and the average of the results of the 10 experiments was taken as the final experimental result. The value of parameter s was set to 0.15, the value of a to 0.3, and the proportion of positive examples to 40%. Table Ⅴ shows the experimental results of the MPINPUL model on the Yelp data set. It can be seen that the classification model trained on the fusion features is about 10% more accurate than that trained on text features alone. This fully proves that the behavior characteristics of reviewers are helpful for the identification of fake reviews.
TABLE V. EXPERIMENTAL RESULTS OF THE MPINPUL MODEL UNDER DIFFERENT FEATURES

Classification  Features                Accuracy  Non-fake                      Fake
model                                   (%)       Precision  Recall  F1         Precision  Recall  F1
                                                  (%)        (%)     (%)        (%)        (%)     (%)
MPINPUL         Unigram                 77.31     77.89      78.25   78.07      78.93      76.15   77.52
                POS                     76.45     75.48      72.16   73.78      73.68      77.03   75.32
                LDA                     76.82     75.13      78.74   76.89      77.17      74.36   75.74
                Behavioral              83.84     89.21      79.39   84.01      82.67      92.36   87.25
                Behavioral+Relational   82.65     89.68      75.68   82.09      78.14      92.07   84.53
                Fusion feature          87.51     89.92      83.67   86.68      86.32      90.36   88.29
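The Non-fake/fake columns of Table V are standard per-class precision, recall and F1; a minimal sketch of how they derive from confusion counts (the counts below are invented for illustration, not taken from the experiment):

```python
# Per-class Precision / Recall / F1 from true-positive, false-positive
# and false-negative counts of that class.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. treating "fake" as the positive class, with invented counts:
p, r, f1 = prf(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.75 0.818
```

Computing the triple once with "fake" as the positive class and once with "non-fake" yields the two column groups of the table.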
3) Comparison of Experimental Results of Several PU Learning Models

In order to verify the effectiveness of the algorithm proposed in this chapter, two additional mainstream PU learning models, LELC and SPUL, were implemented for comparison. In the fourth stage of the MPINPUL model, two additional multiple kernel learning algorithms, SILP and SimpleMKL, were implemented to train the classifier and compared with the improved SVM classifier. Firstly, the comparison between the MPINPUL model proposed in this chapter and the previous PU algorithms is discussed. Table Ⅵ shows that the MPINPUL model designed in this chapter is superior to the previous PU learning algorithms, with an accuracy as high as 87.51%. Based on the first three stages of the MPINPUL model, the multiple kernel learning algorithms SILP and SimpleMKL were trained in the fourth stage, reaching accuracy rates of 85.16% and 86.57% respectively, which are also higher than those of the traditional PU learning algorithms LELC and SPUL, fully proving the effectiveness of the MPINPUL model designed in this chapter.

[Figure: accuracy comparison of the SVM, XGBoost and MPINPUL models; accuracy axis from 65% to 90%]

The accuracy ranking of the three models is MPINPUL > XGBoost > SVM. This not only proves the importance of integrating text and behavior characteristics for fake review recognition, but also fully reflects the effectiveness of the MPINPUL classification model mixing population and individuality in fake reviews recognition.
MPINPUL were confirmed from the two aspects of features and classification model.

REFERENCES

[1] Jindal N, Liu B. Opinion Spam and Analysis. Proceedings of the 2008 International Conference on Web Search and Data Mining, 2008.
[2] Li Lu-Yang, Qin Bing, Liu Ting. Survey on Fake Review Detection Research[J]. Chinese Journal of Computers, 2018(04) (in Chinese).
[3] Ott M, Cardie C, Hancock J T. Negative deceptive opinion spam. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013.
[4] Ott M, Choi Y, Cardie C, et al. Finding deceptive opinion spam by any stretch of the imagination. Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011.
[5] Li J, Ott M, Cardie C, et al. Towards a General Rule for Identifying Deceptive Opinion Spam. Meeting of the Association for Computational Linguistics, 2014.
[6] Raymond Y. K. Lau, S. Y. Liao, Ron Chi-Wai Kwok, Kaiquan Xu, Yunqing Xia, Yuefeng Li. Text mining and probabilistic language modeling for online review spam detection[J]. ACM Transactions on Management Information Systems (TMIS), 2012(4).
[7] Mukherjee A, Venkataraman V, Liu B, et al. What Yelp Fake Review Filter Might be Doing? Proceedings of the 7th International AAAI Conference on Weblogs and Social Media, 2013.
[8] Ren Ya-Feng, Ji Dong-Hong, Zhang Hong-Bin, et al. Deceptive reviews detection based on positive and unlabeled learning. Journal of Computer Research and Development, 2015, 52(3): 639-648 (in Chinese).
[9] Li X L, Philip S Y, Liu B, et al. Positive unlabeled learning for data stream classification[C]. Proceedings of the 9th SIAM International Conference on Data Mining. Philadelphia, PA: SIAM, 2009.
[10] Xiao Yanshan, Liu Bing, Yin Jie, et al. Similarity-based approach for positive and unlabeled learning[C]. Proceedings of the 22nd International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann, 2011.
[11] Gert L, Nello C, Peter B, et al. Learning the Kernel matrix with semidefinite programming[J]. Journal of Machine Learning Research, 2004.
[12] Alain R, Francis R B, Stephane C, et al. SimpleMKL[J]. Journal of Machine Learning Research, 2008.
[13] Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016: 785-794.