Sentiment Analysis of Restaurant Review - Project Report
Submitted By:
Acknowledgements
First of all, we would like to express our thanks to our guide Dr Monika Mittal,
Assistant Professor at Great Lakes Institute of Management, Chennai, for being an
excellent mentor throughout this project thesis. Her encouragement and
valuable advice have made it possible for us to complete our work.
We would also like to say thanks to some of our friends who devoted their valuable
time in reviewing our work and giving suggestions when required.
Certificate of Completion
I hereby certify that the project titled “Sentiment Analysis of restaurant reviews:
Mining opinion” was undertaken and completed under my supervision by Anish Kumar,
Navaneethan Krishnan, and Praveen Christopher, students of the July 2016 batch of the
Postgraduate Program in Business Analytics (PGPBA Jul 2016).
Contents
ABSTRACT
Introduction
Title & Objective of the Study
Need of the study
Company under Study
Data Source
Tools & Techniques
Limitations
LITERATURE REVIEW
DATA DESCRIPTION AND PREPARATION
Data Collection
Data Cleansing
Text Exploration
EXPLORATORY ANALYSIS
Clustering and Classification Algorithms used
Hierarchical Clustering
K Means Clustering
Naïve Bayes
Support Vector Machine
Predicting polarity of the review comments using Naïve Bayes and SVM
Classification of Review Texts
Feature Extraction
Results and Analysis
Definition of Terms used in the results tables
Classifier Accuracy
Classifier Precision
Classifier Recall
F-measure Metric
Measuring Precision and Recall of a Naive Bayes Classifier
Inference from Results
Aspect Based Analysis of Review text
Extracting Aspects from reviews using Boot-strapping Method
Recommendation and Applications
Recommendation
Applications
CONCLUSIONS
LIST OF REFERENCES
ABSTRACT
Zomato.com is an online restaurant search and discovery service which enables users
to search for a restaurant to dine out at and also lets them share their reviews and
ratings of the restaurant. These reviews and ratings help other users who are
searching for a restaurant to check what is good and what is bad about any
particular restaurant, helping them have a good meal outside. This project deals
with business insights that can be derived from text/opinion mining of the
restaurant reviews shared on Zomato by foodies. The available text data is
unstructured, so identifying what a majority of the crowd looks for when they dine
out can make good business sense for owners. Business owners have to adapt their
operations to customer preferences, and there is no better way to understand
customers, what they need, feel and want changed, than a review that has no
personal agenda.
Capturing the emotion of a customer who has written a review, through the choice of
words they use, is an essential part of improving the overall customer experience
at a restaurant. While business owners benefit from constant improvements to the
various aspects of their restaurants, customers who are willing to invest their
time benefit hugely from online reviews. So, this project looks at how the text
used in a review can influence a potential customer: what works or does not work
for a business, why people frequent a certain restaurant, and what people look for
in a restaurant.
Introduction
“Sentiment Analysis of restaurant reviews: Mining opinion” is the title of this
project. The main goal of this project is to determine the polarity of the review
comments, whether positive or negative, and also to perform aspect-based review
analysis for aspects such as food, service and ambience.
Need of the study
Zomato shares its API with developers through a separate website,
developers.zomato.com, where we need to register and get a unique API key. Through
this API key we can make web service calls to Zomato to collect the relevant data.
The number of web service calls is limited to 1000 per unique API key.
We have mainly used R and Python for data cleaning and for running NLP (Natural
Language Processing) algorithms. Tableau and Excel are mainly used to draw graphs
and charts. Collecting, extracting and cleaning the data for the NLP algorithms was
the most challenging work for us. The basic steps executed after collecting data
from the Zomato API are as follows.
♣ Data integration in R
♣ Data cleaning in R and Python
♣ Handling missing values
♣ Removing unnecessary spacing, punctuation and numbers
♣ Removing stop words
♣ Removing special and junk characters
♣ Running classification and aspect-based models in Python
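The text-cleaning steps above can be sketched in Python as follows. This is an illustrative sketch, not the project's actual script; the tiny STOP_WORDS set here is a stand-in for a full stop-word list (e.g. from NLTK, or the tm package on the R side).

```python
import re

# A small illustrative stop-word list; the project would use a full list.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of"}

def clean_review(text):
    """Apply the cleaning steps listed above to one review string."""
    text = text.lower()                    # normalise case
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation, numbers, junk characters
    tokens = [w for w in text.split() if w not in STOP_WORDS]  # drop stop words
    return " ".join(tokens)                # also collapses extra whitespace

print(clean_review("The food was GREAT!!! 5/5, and the service is fine."))
# food great service fine
```

Running each review through such a function yields the normalised text that the NLP algorithms consume.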
Limitations:
♣ For each restaurant only 20 reviews are shared by the Zomato API.
♣ Data cleaning: most of the review comments don’t have proper content and include
junk characters.
♣ Users have shared images in review comments, which have not been used for this
project.
♣ Restaurant menus are in image format, so it is hard to extract content from them
for aspects.
♣ Review comments contain spelling mistakes, and some are in regional languages as
well.
LITERATURE REVIEW
The existing work on sentiment analysis can be classified from different points of
views: technique used, view of the text, level of detail of text analysis, rating
level, etc. From a technical point of view, we identified machine learning,
lexicon-based, statistical and rule-based approaches. The machine learning method
uses several learning algorithms to determine the sentiment by training on a known
dataset. The lexicon-based approach involves calculating sentiment polarity for a
review using the semantic orientation of words or sentences in the review. The
“semantic orientation” is a measure of subjectivity and opinion in text. The rule-
based approach looks for opinion words in a text and then classifies it based on
the number of positive and negative words. It considers different rules for
classification such as dictionary polarity, negation words, booster words, idioms,
emoticons, mixed opinions, etc. (see “A Study and Comparison of Sentiment Analysis
Methods for Reputation Evaluation”). Statistical models represent each review as a
mixture of latent aspects and ratings. It is assumed that aspects and their ratings
can be represented by multinomial distributions, and such models try to cluster
head terms into aspects and sentiments into ratings. Another classification is
oriented more on the
structure of the text: document level, sentence level or word/feature level
classification. Document-level classification aims to find a sentiment polarity for
the whole review, whereas sentence level or word-level classification can express a
sentiment polarity for each sentence of a review and even for each word. Our study
shows that most of the methods tend to focus on a document-level classification.
Most of the solutions on review classification consider only the polarity of the
review (positive/negative) and rely on machine learning techniques. Solutions that
aim at a more detailed classification of reviews (e.g., three- or five-star
ratings) use more linguistic features.
"When you write a review on the web you're providing a window into your own psyche
– and the vast amount of text on the web means that researchers have millions of
pieces of data about people's mind sets," said Jurafsky, whose co-authors include
Victor Chahuneau, Bryan Routledge and Noah Smith, all from Carnegie Mellon
University.
DATA DESCRIPTION AND PREPARATION
Data Collection
The reviews have been collected from the official developers.zomato.com site, which
offers download of reviews through an API generated from the site. Though the data
is free to download, it is labour-intensive to collect, as the API allows only 10
reviews per call. The extracted reviews, returned in JSON format, were converted to
CSV using an online converter and then consolidated into a single comma-separated-
values file.
Data Cleansing
Data cleaning is viewed as a series of steps where each step increases the ‘value’
of the data: it moves the data from its unorganised raw state and progressively
cleans it up.
Post installation of the necessary packages in R, punctuations, numbers, HTML
links, unnecessary spaces & ‘not applicable’ (NAs), were removed from the review
text, in that order. Then the text was converted to lower case to prepare for
analysis.
The flow chart below shows an overview of a typical data analysis project. Each
rectangle represents data in a certain state while each arrow represents the
activities needed to get from one state to the other. The first state (Raw data) is
the data as it comes in. Raw data files may lack headers, contain wrong data types
(e.g. numbers stored as strings), wrong category labels, unknown or unexpected
character encoding and so on. In short, reading such files into an R data.frame
directly is either difficult or impossible without some sort of pre-processing.
Once this pre-processing has taken place, data can be deemed Technically correct.
That is, in this state data can be read into an R data.frame, with correct names,
types and labels, without further trouble. However, that does not mean that the
values are error-free or complete. For example, an age variable may be reported as
negative, an under-aged person may be registered as possessing a driver's license,
or data may simply be missing. Such inconsistencies obviously depend on the subject
matter.
The negative reviews seem to be more towards poor service than poor food.
Ice creams and desserts are way more reviewed in Hyderabad than in Bangalore,
whereas Italian cuisine has got a much higher rating and reviews in Bangalore than
Hyderabad.
Reviewers have been split into 4 types based on the number of reviews they have
submitted and their regularity. There is no clear bias towards a certain food type:
all levels of foodies have submitted reviews for all types of cuisines.
Eyeballing the data suggests that people who think a restaurant is great, or who
really loved it, are more likely to write a review, which is consistent with the
saying that a disappointed customer walks away quietly and never returns.
Word Cloud formation from the review head and review text:
Figure 6 – Word cloud showing the most frequent words used as review headers
Figure 7 – Word cloud showing the most frequent words used in review text
Hierarchical Clustering
What is the hierarchical clustering algorithm? Initially, each of the n
observations is a data point and each data point is its own cluster. The distance
matrix is computed from each data point, i.e. the distance of a data point from all
other data points. Since each data point is a cluster, the two closest data points
are merged, creating a cluster and leaving n-1 clusters. At this point not every
cluster is a single data point: one cluster contains 2 points. We again merge the
two closest clusters, leaving n-2 clusters, and this is repeated until a single
cluster remains. Because the distance between 2 data points is easier to calculate
than the distance between 2 clusters, the choice of cluster-to-cluster distance
(linkage) is what distinguishes the commonly used variants of the hierarchical
clustering method.
The steps and associated terms: how to merge observations and clusters, and how to
calculate the distance between observations/clusters. After calculating the
distance from one data point to another, the obvious case is first taken care of,
i.e. the distance of a data point from itself is zero, so that is marked and noted.
The Euclidean distance method, the most widely used, is applied to calculate the
distance between 2 data points (the Euclidean distance or Euclidean metric is the
"ordinary", i.e. straight-line, distance between two points in Euclidean space;
with this distance, Euclidean space becomes a metric space, and the associated norm
is called the Euclidean norm). Here the difference between 2 data points is taken
along each coordinate and squared, the squares are summed, and then the square root
is taken. A Euclidean distance is simply a straight-line distance.
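As a worked example, the distance definition above can be computed directly (a minimal sketch in Python):

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the sum of squared
    coordinate differences between points p and q."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean((0, 0), (3, 4)))  # 5.0
```

For the point pair (0, 0) and (3, 4), the squared differences are 9 and 16, and the square root of their sum is 5.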
K Means Clustering
1) Choose the number of clusters, k.
2) Randomly select k data points as the initial cluster centers.
3) Calculate the distance between each data point and each cluster center.
4) Assign each data point to its closest cluster center.
5) Recalculate the cluster centers as the mean of their assigned points, then
recalculate the distance between each data point and the newly obtained cluster
centers.
6) If no data point was reassigned then stop, otherwise repeat from step 3).
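The steps above can be sketched in pure Python. This is a toy illustration on made-up 2-D points; the project's clustering ran on text-derived features, and the function and variable names here are ours.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: pick k centres, assign each point to its nearest
    centre, recompute centres as cluster means, repeat until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 2: random initial centres
    for _ in range(iters):
        # steps 3-4: assign each point to the closest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # step 5: recompute each centre as the mean of its cluster
        new_centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:           # step 6: stop when nothing moved
            break
        centers = new_centers
    return centers, clusters

# Two well-separated toy groups
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centers, clusters = kmeans(points, 2)
```

On these four points the algorithm converges to one cluster around (1, 1.5) and one around (8.5, 8).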
Naïve Bayes:
In simple terms, a naive Bayes classifier assumes that the presence (or absence) of
a particular feature of a class is unrelated to the presence (or absence) of any
other feature, given the class variable. It's called naive because it makes the
assumption that all attributes are independent of each other.
Naïve Bayes has been studied since the 1950s, and is still a very popular method for
text categorization. It is a simple technique for constructing classifiers: models
that assign class labels to problem instances, represented as vectors
of feature values, where the class labels are drawn from some finite set. It is not
a single algorithm for training such classifiers, but a family of algorithms based
on a common principle: all naive Bayes classifiers assume that the value of a
particular feature is independent of the value of any other feature, given the
class variable. For example, a fruit may be considered to be an apple if it is red,
round, and about 10 cm in diameter. A naive Bayes classifier considers each of
these features to contribute independently to the probability that this fruit is an
apple, regardless of any possible correlations between the colour, roundness, and
diameter features.
For some types of probability models, naive Bayes classifiers can be trained very
efficiently in a supervised learning setting. In many practical applications,
parameter estimation for naive Bayes models uses the method of maximum likelihood;
in other words, one can work with the naive Bayes model without accepting Bayesian
probability or using any Bayesian methods.
Despite their naive design and apparently oversimplified assumptions, naive Bayes
classifiers have worked quite well in many complex real-world situations. In 2004,
an analysis of the Bayesian classification problem showed that there are sound
theoretical reasons for the apparently implausible efficacy of naive Bayes
classifiers. Still, a comprehensive comparison with other classification algorithms
in 2006 showed that Bayes classification is outperformed by other approaches, such
as boosted trees or random forests.
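The independence assumption makes training and classification simple. A minimal multinomial Naive Bayes with add-one (Laplace) smoothing, on made-up toy documents, can be sketched as follows; this is an illustration of the principle, not the classifier implementation used in the project.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label). Returns class priors, per-class
    word counts, and the vocabulary."""
    priors, word_counts, vocab = Counter(), {}, set()
    for tokens, label in docs:
        priors[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return priors, word_counts, vocab

def classify_nb(tokens, priors, word_counts, vocab):
    """Pick the label maximising log P(label) + sum of log P(word | label),
    treating every word as independent given the label."""
    total = sum(priors.values())
    best, best_lp = None, -math.inf
    for label in priors:
        counts = word_counts[label]
        n = sum(counts.values())
        lp = math.log(priors[label] / total)
        for w in tokens:
            lp += math.log((counts[w] + 1) / (n + len(vocab)))  # add-one smoothing
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [(["great", "food"], "pos"), (["tasty", "great"], "pos"),
        (["bad", "service"], "neg"), (["slow", "bad"], "neg")]
model = train_nb(docs)
print(classify_nb(["great", "service"], *model))
```

Note how "great" pulls the toy example towards the positive class even though "service" appears only in a negative training document: each word contributes independently.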
Support Vector Machine
“Support Vector Machine” (SVM) is a supervised machine learning algorithm which can
be used for both classification and regression problems. However, it is mostly
used in classification problems. In this algorithm, we plot each data item as a
point in n-dimensional space (where n is the number of features) with the
value of each feature being the value of a particular coordinate. Then, we perform
classification by finding the hyper-plane that best differentiates the two classes
(see the snapshot below).
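The hyper-plane search can be illustrated with a minimal linear SVM trained by sub-gradient descent on the hinge loss (a Pegasos-style sketch in pure Python, with a simple bias update). This is an illustration on toy 2-D data, not the project's implementation, which would use a library SVM.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Minimal linear SVM via sub-gradient descent on the hinge loss.
    X: list of feature tuples, y: labels in {-1, +1}."""
    d = len(X[0])
    w, b, t = [0.0] * d, 0.0, 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)                  # decaying step size
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1 - eta * lam) for wj in w] # regularisation shrink
            if margin < 1:                         # inside the margin: push
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def predict(w, b, x):
    """Side of the learned hyper-plane: sign of w.x + b."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy 2-D data: two linearly separable groups
X = [(1, 1), (2, 1), (7, 8), (8, 7)]
y = [-1, -1, 1, 1]
w, b = train_linear_svm(X, y)
```

After training, points near the lower-left group fall on the negative side of the hyper-plane and points near the upper-right group on the positive side.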
Feature Extraction
As mentioned, training and test data have been collected from the Zomato API. We
have 4000 restaurant reviews from 2 cities, Bengaluru and Hyderabad. We have
divided the reviews into positive and negative based on the ratings given by the
users: reviews with a rating above 3 have been taken as positive and reviews with
a rating below 3 as negative. The training data set has 750 positive reviews and
750 negative reviews; the remaining reviews have been used as the test data set.
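The rating-based labelling described above can be sketched as follows; note that with the stated thresholds, reviews rated exactly 3 fall into neither set.

```python
def label_reviews(reviews):
    """Split (text, rating) pairs into positive (rating > 3) and
    negative (rating < 3) sets; rating == 3 matches neither threshold."""
    positive = [text for text, rating in reviews if rating > 3]
    negative = [text for text, rating in reviews if rating < 3]
    return positive, negative

reviews = [("great food", 5), ("awful service", 1),
           ("okay place", 3), ("loved it", 4)]
pos, neg = label_reviews(reviews)
```

From the toy list above, the 5- and 4-star reviews land in the positive set, the 1-star review in the negative set, and the 3-star review is dropped.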
Both the training and test data must be represented in the same order for learning.
One of the ways that data can be represented is feature-based. By features, it is
meant that some attributes that are thought to capture the pattern of the data are
first selected and the entire dataset must be represented in terms of them before
it is fed to a machine learning algorithm. Different features such as n-gram
presence or n-gram frequency, POS (Part of Speech) tags, syntactic features, or
semantic features can be used. For example, one can use the keyword lexicons as
features. Then the dataset can be represented by these features using either their
presence or frequency.
The feature vector plays a very important role in classification and helps
determine how well the built classifier works. The feature vector also helps in
predicting unknown data samples. There are many types of feature vectors, but in
this process we used the unigram and the bigram approach. Each review's words have
been used to generate the feature vectors. The presence/absence of sentiment-
bearing words helps indicate the polarity of the sentences. We created a Python
script to extract the features from the training data. The code snippet for
extracting features is shown in Figure 7.
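Presence-based unigram and bigram feature extraction of the kind described above can be sketched as follows (an illustrative stand-in, not the project's actual script):

```python
def unigram_features(tokens):
    """Presence-based unigram features: each word maps to True."""
    return {w: True for w in tokens}

def bigram_features(tokens):
    """Presence-based bigram features: each adjacent word pair maps to True."""
    return {pair: True for pair in zip(tokens, tokens[1:])}

tokens = "the food was great".split()
print(unigram_features(tokens))
print(bigram_features(tokens))
```

Each review becomes a dictionary of features that a classifier such as Naive Bayes or an SVM can consume; using True for presence (rather than a count) gives the n-gram-presence representation mentioned above.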
Once we extracted the features from the training data, they were passed to our
classifiers. A script written in Python was used to pass the training sets to each
classifier. Once a classifier is trained, we can check its accuracy by passing it
the testing set. A sample script for training and testing a classifier is shown in
Figure 8.
Figure 8. Sample code for training classifier
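The evaluation step can be sketched as follows. The `cue_classifier` here is a hypothetical stand-in for a trained classifier (the project trained Naive Bayes and SVM models); the point of the sketch is the accuracy computation over a labelled test set.

```python
def accuracy(classifier, test_set):
    """Fraction of test items whose predicted label matches the true label."""
    correct = sum(1 for feats, label in test_set if classifier(feats) == label)
    return correct / len(test_set)

# Hypothetical stand-in classifier: counts positive vs negative cue words.
POS_CUES, NEG_CUES = {"great", "good", "tasty"}, {"bad", "slow", "awful"}

def cue_classifier(feats):
    score = (sum(1 for w in feats if w in POS_CUES)
             - sum(1 for w in feats if w in NEG_CUES))
    return "pos" if score >= 0 else "neg"

test_set = [({"great", "food"}, "pos"),
            ({"slow", "service"}, "neg"),
            ({"bad", "food"}, "neg")]
print(accuracy(cue_classifier, test_set))  # 1.0
```

Swapping `cue_classifier` for a trained model's predict function gives the accuracy figures reported in the results tables.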
The evaluation of the model is done using cross-validation. For cross-validation,
the positive and negative features extracted from the reviews are first combined
and then randomly shuffled. This is done mainly because, without shuffling, the
test sets in cross-validation might contain only negative or only positive review
data. To build test sets with a fairly random distribution of both positive and
negative features, the set is shuffled 5 times. The code below indicates the folds;
n = 5 means 5-fold cross-validation.
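The shuffle-then-fold step can be sketched as follows (an illustrative 5-fold splitter; function and variable names are ours, not the project's):

```python
import random

def five_fold_splits(features, seed=0, n=5):
    """Shuffle the combined pos/neg feature sets, then yield (train, test)
    pairs for n-fold cross-validation."""
    data = list(features)
    random.Random(seed).shuffle(data)   # avoid folds that are all one class
    fold = len(data) // n
    for i in range(n):
        test = data[i * fold:(i + 1) * fold]
        train = data[:i * fold] + data[(i + 1) * fold:]
        yield train, test

# Toy labelled data: 10 alternating pos/neg items
data = [("feats%d" % i, "pos" if i % 2 else "neg") for i in range(10)]
splits = list(five_fold_splits(data))
```

Each of the 5 iterations trains on 4 folds and tests on the held-out fold; accuracy, precision, recall and F-measure are then averaged over the folds.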
unigram

accuracy      precision     recall        f-measure
0.852159468   0.839616773   0.890572305   0.844331002
0.877076412   0.853890082   0.870490547   0.861129469
0.906860707   0.889465812   0.899685505   0.893929583
0.88981289    0.878829495   0.862341099   0.869597673

bigram

accuracy      precision     recall        f-measure
0.810631229   0.807653241   0.855978538   0.803328862
0.893687707   0.875606428   0.879790495   0.877642276
0.887318087   0.864492936   0.897148814   0.875646631
0.895218295   0.890466027   0.862744523   0.87425202

accuracy      precision     recall        f-measure
0.828903654   0.820006641   0.869328053   0.820934229
0.895348837   0.880854236   0.875332141   0.878002413
0.900207900   0.877991014   0.905064000   0.888529565
0.893555093   0.888068313   0.860037066   0.871886306
RATED 4.0
XXXXXXXX->Reviewer Name
45 Reviews , 110 Follower
Visited today. Barbeque nation is a usual place. I have visited several branches
all over India.. the food was as usual good. We ordered cocktails and Mocktails,
which was good either.
We had a very unusual problem today. App in my mobile was not at all working. It
was updated to latest version in iOS. Then o tried calling their toll free number
to reserve. It is connecting with only Gujarat BBQ nation where I lived some years.
I tried hard to get the t nagar branch number to do the booking. I guess the
technological transformation of your system had done serious flaws which you need
to address immediately for better experience. Hope you rectify it soon. There is no
complaint against food, it was great as usual.NO complains about Service.
So I'm reducing the rating for the booking experience not for the food.
Consider a typical restaurant review shown in the above snapshot. This review
discusses multiple aspects of the restaurant, such as the booking experience, food
and service, but the reviewer only gives an overall rating for the restaurant;
without an explicit rating on each aspect, a user would not be able to easily know
the reviewer's opinion on each aspect. Even though the reviewer found the food
great and had no complaints about the service, the booking experience disappointed
him, and because of that the overall rating turned out to be lower.
From such a review, we can extract different aspects and get more insights. For
example, a reviewer who likes conveniences such as valet parking might not
necessarily be price conscious, and users tend to express such preferences in
their comments. So, it is very important to conduct a text-based analysis of the
different aspects to obtain meaningful insights.
Extracting Aspects from reviews using Boot-strapping Method:
Since this is restaurant-review data, we assume that only keywords are required
to describe the specific aspects. We have referred to the boot-strapping method
from the paper below:
https://pdfs.semanticscholar.org/6ff5/05e63ffebf419736d6c65741ee63b3ea720e.pdf
Step 1: Match the aspect keywords in each sentence of X and record the matching
hits for each aspect i in Count(i);
Step 2: Assign the sentence an aspect label by ai = argmax_i Count(i). If there is
a tie, assign the sentence multiple aspects;
Step 3: Calculate the χ² measure of each word (in V);
Step 4: Rank the words under each aspect with respect to their χ² value and join
the top p words for each aspect into their corresponding aspect keyword list Ti;
Step 5: If the aspect keyword lists are unchanged or the iteration count exceeds I,
go to Step 6, else go to Step 1;
Step 6: Output the annotated sentences with aspect assignments.
Figure 12. Algorithm for Aspect based Analysis
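Steps 1 and 2 of the algorithm can be sketched as follows. The seed keyword lists here are hypothetical examples, not the lists used in the project, and a real run would also apply the χ² re-ranking of Steps 3-5.

```python
def assign_aspects(sentence, aspect_keywords):
    """Steps 1-2: count keyword hits per aspect in the sentence and
    label it with the aspect(s) that score highest (ties keep multiple)."""
    tokens = sentence.lower().split()
    counts = {aspect: sum(tokens.count(w) for w in words)
              for aspect, words in aspect_keywords.items()}
    best = max(counts.values())
    if best == 0:
        return []  # no aspect keyword matched this sentence
    return [a for a, c in counts.items() if c == best]

# Hypothetical seed keyword lists for the three aspects used in the report
aspect_keywords = {
    "food": ["food", "pasta", "pizza", "tasty"],
    "service": ["service", "staff", "waiter"],
    "ambience": ["ambience", "decor", "music"],
}
print(assign_aspects("The pasta was tasty but the waiter was slow",
                     aspect_keywords))
```

Here "pasta" and "tasty" give food two hits against one for service, so the sentence is labelled with the food aspect; a tie would return both aspects, as Step 2 specifies.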
Task 1: We ran the above algorithm on our dataset with a defined set of aspect
keywords and the review text; please refer to the screenshot below. We considered
food, ambience and service as the main aspects. Extracting the aspect ‘food’ was a
challenge: since most reviewers quote the name of a dish in their review text, we
would have needed the menus of all 110 restaurants in our analysis, which was not
feasible. We worked around this by including a sample menu only for Italian
cuisine in the ‘food’ aspect.
Figure 13. Snippet of the code applying the algorithm for Aspect based Analysis
After running the code above we get a .csv file containing the aspects and the
review text that fits each aspect, based on the feature words and the weights
passed to the chi-square test. A review text can be mapped not only to one aspect
but also to two or more aspects, depending on the feature words used in the review
comment and how close each sentence is to a particular aspect.
Figure 14. Snippet of the resulted excel file with different aspects mapped
sentences
Output :
Recommendation:
By extracting aspects from each review sentence, we can calculate the weight of
each aspect against the reviewer's rating for that restaurant, in order to score
the overall sentiment for the restaurant across different aspects like ambience,
food and service.
To calculate the weights we can use the Latent Rating Regression Model, and use
the same model to predict the overall weight based on word frequency with given
pre-defined keywords.
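As a greatly simplified illustration of the idea (a frequency-based stand-in, not the Latent Rating Regression Model itself), aspect weights and scores could be derived from how often each aspect is mentioned and the ratings of the reviews that mention it:

```python
def aspect_scores(reviews):
    """reviews: list of (aspects, rating) pairs. For each aspect, report
    the average rating of reviews mentioning it and its mention weight
    (fraction of all reviews that mention it)."""
    totals, counts = {}, {}
    for aspects, rating in reviews:
        for a in aspects:
            totals[a] = totals.get(a, 0) + rating
            counts[a] = counts.get(a, 0) + 1
    n = len(reviews)
    return {a: {"avg_rating": totals[a] / counts[a],
                "weight": counts[a] / n}
            for a in totals}

# Toy input: aspect labels per review plus the reviewer's overall rating
reviews = [(["food"], 5), (["food", "service"], 2), (["ambience"], 4)]
scores = aspect_scores(reviews)
```

A restaurant's profile then shows, per aspect, both how much reviewers talk about it and how they rate the restaurant when they do; the full latent-rating approach would instead learn these weights jointly from the text.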
Applications:
Bibliography: