Final Presentation
Introduction
Related work
Problem statement
Research Objective
Proposed Methodology
Experimental Results
Publication
References
INTRODUCTION
Online product reviews occupy a central place in the product evaluation process for a
company and its customers.
Company: improve product quality, plan and monitor products, thereby boosting
productivity and profit.
Product reviews fall into two classes:
Spam product reviews
Non-spam product reviews
Google officially reports on fake reviews and clearly directs users not to purchase
reviews from, or receive payment from, companies that supply fake reviews.
RELATED WORK
The study [1] focused on behavioral features and an n-gram based approach for spam
review classification; behavioral features improved accuracy using an SVM model.
Authors in [2] used Naïve Bayes, Maximum Entropy, Support Vector Machine (SVM) and RF
techniques on an iPhone mobile review dataset collected from Kaggle. Part-of-speech
(POS) tagging and count-vectorizing features were exploited to detect spam reviews. The
best accuracy was given by RF.
The authors in [3] used Naïve Bayes (NB), SVM, KNN, and Decision Tree (DT) to classify
movie product reviews via sentiment analysis, using the review text with or without
stop words as the feature vector. SVM showed the best results.
RELATED WORK (CONT.)
The authors in [4] used Count Vectorizer and TF-IDF features with the SVM classifier on the
Mturk and Yelp Amazon datasets of different product reviews.
The study in [5] used Logistic Regression, Naïve Bayes, RF, SVM and a deep neural network
classifier on a dataset of Amazon product reviews using TF-IDF features.
Authors in [6] used statistical features on two different datasets (English and Malay) with
boosting techniques; GBM performed best on the Malay dataset and XGBoost on the English dataset.
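The TF-IDF-plus-SVM setup that recurs in the studies above can be sketched in a few lines. This is a minimal illustration, not the pipeline of any cited paper: the corpus and labels below are invented for demonstration (1 = spam, 0 = non-spam).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative corpus with invented labels: 1 = spam, 0 = non-spam
reviews = [
    "Best product ever buy now amazing deal",
    "Works as described, battery life is decent",
    "Amazing amazing amazing five stars buy buy",
    "Screen scratches easily but overall acceptable",
]
labels = [1, 0, 1, 0]

# TF-IDF features feeding a linear SVM classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(reviews, labels)
print(model.predict(["unbelievable deal buy now"]))
```

A Count Vectorizer variant is obtained by swapping `TfidfVectorizer` for `CountVectorizer`.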
PROBLEM STATEMENT
For online products, people post a huge number of reviews on a daily basis.
It is hard for a user to scan all the reviews and decide whether a
product review is spam or non-spam.
RESEARCH OBJECTIVES
To evaluate the ensemble model with all features extracted from spam product
reviews in terms of classification accuracy.
To evaluate the effectiveness of the ensemble model with the best features obtained
using three feature selection techniques (Chi-square, Univariate and Information
Gain).
PROPOSED METHODOLOGY (THREE PHASES)
[Flow diagram: Product Reviews → Pre-Processing (Sentence Segmentation, Stop Word
Removal, Word Stemming) → Feature Extraction → Feature Selection (Chi-Square,
Univariate, Information Gain)]
Sentence Segmentation
Commonly, exclamation (!), interrogation (?) and full stop (.) signs are used as indicators
to segment the text.
Tokenization
Sentences are divided into distinct words by splitting them at whitespace characters such
as tabs and blanks, and at punctuation signs, i.e. dot (.), comma (,), semicolon (;), colon (:), etc.
Word Stemming
Changes derived words to their base, stem or root word.
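The three preprocessing steps above can be sketched as follows. This is a simplified illustration: segmentation and tokenization use plain regular expressions, and the stemmer is a crude suffix-stripper standing in for a full algorithm such as the Porter stemmer.

```python
import re

def segment_sentences(text):
    # Split the text at exclamation (!), interrogation (?) and full stop (.) signs
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Divide a sentence into words at whitespace and punctuation signs
    return re.findall(r"[A-Za-z0-9']+", sentence)

def stem(word):
    # Crude suffix stripping; a real system would use e.g. the Porter stemmer
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

review = "This phone is amazing! Battery lasts for days. Totally loving it."
sentences = segment_sentences(review)
tokens = [stem(t.lower()) for s in sentences for t in tokenize(s)]
print(sentences)
print(tokens)
```

Stop-word removal would be one further filter over `tokens` against a stop-word list.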
FEATURE EXTRACTION & BEST FEATURE SELECTION
25 statistical features are extracted from the mobile application reviews of the Yelp dataset.
Some features are valuable and contribute more to model prediction, while others are
less valuable and can seriously harm the effectiveness of the model.
Moreover, relevant and valuable features avoid overfitting, enhance accuracy and
lessen the training time of the predictive model.
Feature Selection:
Chi-square
Univariate
Information gain
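Selecting the best 10 of the 25 features with these techniques can be sketched with scikit-learn's univariate `SelectKBest`. The data below is a synthetic stand-in for the extracted review features, and `mutual_info_classif` is used as a proxy for information gain; this is an illustration, not the thesis implementation.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Synthetic stand-in for the 25 statistical review features (1 = spam, 0 = non-spam)
X, y = make_classification(n_samples=200, n_features=25, n_informative=8, random_state=0)
X = X - X.min(axis=0)  # chi-square requires non-negative feature values

# Univariate selection scored by chi-square and by mutual information
# (mutual_info_classif approximates information gain)
for name, score_fn in [("Chi-square", chi2), ("Information gain", mutual_info_classif)]:
    top10 = SelectKBest(score_fn, k=10).fit(X, y).get_support(indices=True)
    print(name, "-> best 10 feature indices:", sorted(top10))
```

`SelectKBest` with any per-feature scoring function is itself a univariate selection method, covering the "Univariate" technique as well.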
LIST OF ALL FEATURES (recoverable entries)
3. Standard deviation of review text and average review rating
4. Part of speech in review text in ascending order
5. Average cosine similarity in review title
6. Part of speech in review text in descending order
7. Average Levenshtein distance in review text
11. Automated readability index for review text
12. Average number of letters per word in review text
13. Number of unique words in review text
14. Standard deviation for review on application and review rating
15. Average words in frequent review text
17. Standard deviation for number of words and review text title
19. Only review on application
20. Number of unique words in review title
22. Brand names in review title
24. Ratio between unique words and review text body
25. Ratio between unique words and title words
CLASSIFICATION
A Simple Majority Voting Ensemble (voting classifier) has been employed to combine
the predictions from multiple machine learning algorithms (MLP, RF, KNN) in order
to obtain an improved combined result.
RF works by developing a number of decision trees at training time and predicting the
most frequent class decided by the contributing decision trees.
The KNN algorithm works by calculating the distance between a query and all
examples in the data, picking the specified number of examples (K) that are nearest to
the query.
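The majority-voting scheme over MLP, RF and KNN described above can be sketched with scikit-learn's `VotingClassifier`. The data and hyperparameters are placeholders, not the thesis configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the extracted review features (1 = spam, 0 = non-spam)
X, y = make_classification(n_samples=300, n_features=25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Hard voting = simple majority vote over the three base classifiers
ensemble = VotingClassifier(
    estimators=[
        ("mlp", MLPClassifier(max_iter=1000, random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="hard",
)
ensemble.fit(X_tr, y_tr)
print("ensemble accuracy:", ensemble.score(X_te, y_te))
```

With `voting="soft"` the classifier would instead average predicted class probabilities, which requires every base estimator to support `predict_proba`.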
[Bar chart: performance of RF, KNN, MLP, Ensemble (MLP, KNN, RF), GBM Gaussian, XGBoost
and GBM Ada Boost. Recoverable values — RF: 85.75, 86, 99, 92; KNN: 84.02, 85, 99, 92]
[Bar chart: performance of RF, KNN, MLP, Ensemble (MLP, KNN, RF), GBM Gaussian, XGBoost
and GBM Ada Boost. Recoverable values — RF: 85.81, 87, 97, 92; KNN: 84.75, 85, 99, 92;
XGBoost: 85.03, 86, 98, 92]
[Bar chart: performance of RF, KNN, MLP, Ensemble (MLP, KNN, RF), GBM Gaussian, XGBoost
and GBM Ada Boost. Recoverable values — RF: 85.72, 86, 97, 92; KNN: 84.46, 85, 99, 92;
XGBoost: 85.31, 86, 99, 92. Fragment of an accompanying feature list: 6 Reviews rating;
7 Standard deviation of review application rating]
[Bar chart: performance of RF, KNN, MLP, Ensemble (MLP, KNN, RF), GBM Gaussian, XGBoost
and GBM Ada Boost. Recoverable values — RF: 84.90, 86, 99, 92; KNN: 84.75, 85, 99, 92]
[Bar chart comparing RF, KNN, MLP, Ensemble ML, GBM Gaussian, XGBoost and GBM Ada Boost
across four settings: All Features, Best 10 features (Chi2), Best 10 features (Univariate),
Best 10 features (Information Gain); y-axis range 82–90]
CONCLUSION AND FUTURE WORK
The accuracy of the proposed ensemble model improved with the best features obtained
using the Chi-square and Univariate selection techniques.
The accuracy of the GBM Gaussian remained constant on all feature selection techniques.
The accuracy of XGBoost and GBM Ada Boost either remained constant or slightly
degraded on the best features.
The accuracy of the RF and KNN classifiers either improved or slightly degraded on the
best features, while the accuracy of MLP improved or remained constant.
Overall, the classification accuracy of the proposed ensemble model is superior to all
individual models as well as the other boosting approaches.
In future work, we want to explore a deep learning approach, long short-term memory (LSTM)
with weighted TF-IDF embedding, for the task of spam review classification.
REFERENCES
1. N. Jindal and B. Liu, "Opinion spam and analysis," in Proceedings of the 2008 International
Conference on Web Search and Data Mining, 2008, pp. 219-230.
2. G. Wang, S. Xie, B. Liu, and S. Y. Philip, "Review graph based online store review spammer
detection," in 2011 IEEE 11th International Conference on Data Mining, 2011, pp. 1242-1247.
3. J. Ye and L. Akoglu, "Discovering opinion spammer groups by network footprints," in Joint
European Conference on Machine Learning and Knowledge Discovery in Databases, 2015, pp. 267-282.
4. E. Elmurngi and A. Gherbi, "An empirical study on detecting fake reviews using machine
learning techniques," in 2017 Seventh International Conference on Innovative Computing
Technology (INTECH), 2017, pp. 107-114.
5. A. V. Sandifer, C. Wilson, and A. Olmsted, "Detection of fake online hotel reviews," in 2017
12th International Conference for Internet Technology and Secured Transactions (ICITST),
2017, pp. 501-502.