
Incorporating Social Features into Sentiment Classification of

Yelp Reviews


Hugh Cunningham
Stanford University
hughec@stanford.edu
Andrew Adams
Stanford University
aladams@stanford.edu

Abstract
We investigate the efficacy of incorporating
social edges in a graph of users from the Yelp
Academic Dataset to improve binary sentiment
classification of Yelp reviews. Our model
includes both the preference of an individual
document classifier to place a review within a
given class as well as the preference for reviews
of the same business created by users who are
friends to be placed in the same class. We then
frame this problem in terms of minimizing cost
and solve using minimum cuts in graphs. Using
three approaches to model the preference that
two linked reviews receive the same sentiment
label, our results ultimately cannot determine
whether the incorporation of social graph edges
can improve sentiment classification.
1 Introduction
Ambiguous language may doom
efforts to analyze the sentiment of a
particular segment of text. Similarly,
semantic features of the text such as
negation or entailment may limit the ability
of a classifier based on language features to
correctly analyze sentiment. However,
incorporating non-language features may
mitigate some of the difficulties imposed by
the language itself. In particular,
incorporating the social context in which a
text is situated may ease the task of
sentiment analysis:
Burger was great, but the service
wasn't so great. When thumbing
through my wallet for cash, the owner
rudely interrupted me and told me
they don't accept cards, only cash
payment.

When my friend asked what he would
recommend, he shoved a menu in his
face and said "I'd recommend
looking at a menu".

I fought back the urge to just leave
and go somewhere else and ordered
the Good Ass burger, which was
avocado and bacon, so you can't really
go wrong with that. The fries were
good and the burger was pretty
good.
Looking at the above Yelp review from our
data set, it is plain to see why a sentiment
analysis classifier would have difficulties
classifying it accurately as a negative (one-star) review. There is some genuinely
positive sentiment, and the second
paragraph, a highly negative anecdote, is couched in language likely to be interpreted
by a classifier based upon surface-level
features as positive sentiment. But if a
classifier were to incorporate social context,
it could contextually weigh the following
one-star and highly negative review of the
same business, which is published on the
same day as the above review and written by
a friend of the above user:

Killer burger but I want to k1ll the
owner! I tell the owner (joking) " It's
my 1st time here, what's the one
burger?" He pushes the menu at me
and says "the menu" then looks past
me and yells " anyone here know
what they want!?!? "

The burger Nazi owns two hippies
burger... He was cool after that but
wow! Everything else about the
experience was pure gold though!

Still left a salty taste tho... ONE STAR
FOR YOU! - the review nazi

In this paper, we evaluate the hypothesis that
incorporating both social context and
language features in a sentiment classifier
will improve performance over the use of a
classifier that considers language features
alone in a binary (i.e. negative/positive)
sentiment classification task. We will model
the social context of a review r written by
user u by considering prior reviews r′ of the
same business created by users connected to
u by an edge in a social graph. Given this
social context, we establish a measure of
preference that two such reviews r and r′
receive the same sentiment label and
incorporate it into a model that also
considers language features of the
review. Intuitively, this added social context
should provide additional information that
improves performance in the sentiment
classification task.

2 Previous Work
The existing literature on the
sentiment analysis task contains a
large volume of work and a wide
array of vastly different
approaches. However, our examination of
the literature turned up relatively few
instances in which researchers combined
Social Network Analysis or features
capturing social context with a classifier
based on linguistic features.
Recent work by Tan et al. (2011) addresses a research question most similar to our own: using social network information from Twitter to aid in the classification of user sentiment towards particular topics. More specifically, Tan et al. leverage users with known sentiment towards a given topic (e.g., Barack Obama or Fox News) to classify the sentiment of other socially connected users. Tan et al. model two different types of social connection from Twitter, which they refer to as the "@-network" and the "follower/followee network", and additionally suggest different social phenomena driving each. For the @-network, in which users mutually tweet at one another, Tan et al. posit that homophily (the idea that shared sentiment and social connection tend to co-occur) will yield mutually held beliefs among connected users. In cases where the social connection between users is follower/followee, Tan et al. suggest that the follower's approval of, or desire to pay attention to, the followee may result in the follower adopting the sentiment of the followee. Using these social network connections, Tan et al. exceed the performance of text-based classifiers with a factor graph model. Tan et al. demonstrate that significant performance gains are achievable where the underlying graph shows a strong correlation between user connectedness and shared sentiment, even in the face of sparse graph connections. Our own investigation follows the intuition of Tan et al. regarding homophily, but we aim to use this shared social sentiment to supplement a linguistic classifier rather than to supersede it.
Previous work by Thomas et al. (2006) addresses the problem of supplementing a classifier based on linguistic features with information about links between texts, utilizing a model from Pang and Lee (2004). Thomas et al. successfully use links of agreement between segments of Congressional floor speeches to supplement language features in classifying support for or opposition to Congressional bills. Thomas et al. train one SVM on language features of speech segments and another to detect links of agreement between speech segments. They then adopt the model from Pang and Lee (2004), which determines the optimal classification of a sequence of texts, considering associations between texts, using minimum cuts in graphs. Our own investigation adapts this model, first inspired by Blum and Chawla (2001), which we discuss in detail later in this paper. Pang and Lee's own usage of the cut-based classification model takes physical proximity between sentences as the measure of their association as they model whether given sentences are subjective; a related task, but tangential to our own. Thomas et al. demonstrate that modeling agreement links can yield significant improvements over language-based classifiers alone at the document level, and even at the user level addressed by Tan et al. when they require that all speech segments from a given speaker receive the same label. Our work combines features of both Tan et al. and Thomas et al. by incorporating the type of social connections that Tan et al. model into the framework that Thomas et al. use.

3 Data
To train and test our sentiment classifier, we
used the publicly released Yelp Academic
dataset of 335,022 reviews of businesses in
the Phoenix area, published during 2012. Of
these 335,022 reviews, 244,847 reviews are
from users with at least one friend on the
network, where a friendship on the site consists of a friendship request and its reciprocal approval by another user with a profile on the site. For our purposes, we further narrowed
our data set to the 100,491 reviews that
come from users with at least one friend
who has also reviewed that business; 90,455
of these reviews we used for training, and
10,042 we reserved for testing.
The fact that over 25% of the
reviews are from users without any friends
on the network suggests that Yelp is a
business review site first and foremost, and
as such, the social data of Yelp are much sparser than those of an out-and-out social network.

[Figure: Distribution of friend degree for users in our data set]

The median number of friends (degree) for a
given user in our data set is 3.0, which is
significantly smaller than the median
number of friends for other social networks.
For instance, Ugander et al. (2011) find the
median number of friends for a user of
Facebook to be 99.0, thirty-three times that
of Yelp. Because of the relatively low
friendship degree on Yelp, for a given
review, a user has, on average, fewer than one
(.77) friend who has also reviewed that
business, and even looking at just our subset
of reviews where at least one friend of the
user has also reviewed the same business,
the expected number of friends who have
reviewed the same business increases to
only 1.75.
Furthermore, despite the average ego network (the local network centered on a given user) having a degree thirty-three times smaller on Yelp than on Facebook, the global clustering coefficient, a probabilistic measure of triadic closure for a network, is significantly higher on Facebook than on Yelp. First, a definition of the clustering coefficient:
Formally, given a graph G = (V, E) and a vertex v ∈ V, the local clustering coefficient of v is cc(v) = (number of pairs of neighbors of v connected by an edge) / (number of pairs of neighbors of v), and the global clustering coefficient is the average of cc(v) over all v ∈ V.
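To make the definition concrete, the following minimal sketch computes local and average clustering coefficients for a toy friendship graph. It uses networkx, an assumption of convenience on our part (the reference list names graph-tool as the graph library actually used); the edges themselves are invented for illustration.

```python
import networkx as nx

# Toy friendship graph; real edges would come from the Yelp Academic
# Dataset's user friend lists.
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),  # closed triad
    ("alice", "dave"),                                       # open triad
])

# Local cc: pairs of neighbors connected by an edge / all pairs of neighbors.
local_cc = nx.clustering(G)          # e.g. alice: 1 of C(3, 2) = 3 pairs -> 0.33
# "Global" coefficient as defined above: the mean of the local values.
global_cc = nx.average_clustering(G)

print(local_cc, global_cc)
```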

As 0 ≤ cc ≤ 1, and the number of possible undirected friendship edges among k friends grows quadratically in k (the number of such possible edges being k choose 2), one would expect networks with higher degree to exhibit a lower clustering coefficient. Indeed, in empirical studies of the Facebook and MSN Messenger social graphs, as degree increases for a given ego network, the local clustering coefficient decreases non-linearly, as modeled by an exponential decay function (Ugander et al., 2011; Leskovec and Horvitz, 2008).
However, the global clustering coefficient of Yelp is .08, while the global clustering coefficient of Facebook is .14; furthermore, for users with degree 3 (equal to that of the median Yelp user), the local clustering coefficient of a Facebook ego network is, on average, nearly .50. The local clustering coefficient of ego networks on MSN Messenger is also roughly double what it is on Yelp (Ugander et al., 2011). All of this is to say that the relatively low clustering coefficient of Yelp entails less tightly clustered friendship groups and suggests, perhaps, less shared affinity and sentiment among friends on Yelp than among friends on Facebook and other social networks.
Despite its relative sparsity and relatively low clustering coefficient, homophily, or shared sentiment, is unambiguously expressed in the data. Namely, the probability of a negative review is .26, but for all friendship dyads where both users have reviewed the same business, the probability of a negative review given that a friend's review is negative rises to .37. Likewise, a friend giving a positive review also boosts the conditional probability that a user will review the business positively: P(positive) = .74, P(positive | a friend's review is positive) = .78. These data give reason to hope that incorporating social features into the sentiment classification of Yelp reviews should improve accuracy.
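As a worked illustration of the conditional probabilities above, the sketch below estimates P(negative) and P(negative | a friend's review is negative) from a hypothetical list of dyad labels; the `dyads` variable and its toy values are placeholders of our own, not data from the paper.

```python
# `dyads` is a hypothetical list of (user_label, friend_label) pairs, one per
# friendship dyad in which both users reviewed the same business.
dyads = [("neg", "neg"), ("pos", "neg"), ("pos", "pos"), ("pos", "pos")]

# Unconditional probability of a negative review.
p_neg = sum(1 for user, _ in dyads if user == "neg") / len(dyads)

# Probability of a negative review given that the friend's review is negative.
friend_neg = [pair for pair in dyads if pair[1] == "neg"]
p_neg_given_friend_neg = (
    sum(1 for user, _ in friend_neg if user == "neg") / len(friend_neg)
)

# On the full data set the paper reports .26 and .37; the toy values here
# are illustrative only.
print(p_neg, p_neg_given_friend_neg)
```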

4 Model
As mentioned previously, we adapt
our classification framework from that used
in Thomas et al. (2006), itself adapted from
that used by Pang and Lee (2004). The
framework incorporates both the preference
of an individual classifier to classify a
document as either negative or positive, as
well as a measure of preference that two
linked reviews receive the same
classification. For our purpose, we consider
a pair of reviews as linked if the reviewers
share an edge in Yelp's social graph (i.e., the
users are friends), and the reviews address
the same business. This latter preference
incorporates homophily into the sentiment
classification task.
4.1 Classification Framework
We denote the preference of the individual bag-of-words model to classify a review r as a member of a class c by ind(r, c). Similarly, where there exists a link ℓ between two reviews as defined above, we use str(ℓ) to denote the preference that those two reviews receive the same classification, which we will refer to as the strength of the link between these reviews. For a classification of reviews C = (c(r_1), c(r_2), ..., c(r_n)), we define the cost of that classification as follows:

J(C) = Σ_r ind(r, c̄(r)) + Σ_{ℓ = (r, r′): c(r) ≠ c(r′)} str(ℓ)

where c̄(r) denotes the class opposite to the class c(r) assigned by C.

The first sum in the above equation gives the individual classifier's preference to label each review with the opposite classification of that given to r by C. Adding this sum to the second sum, which accumulates the strength of all links between pairs of reviews r and r′ that C does not assign the same label, the cost J(C) then represents the total dispreference for the classification C, combining the individual review classifier and the strength of links.
The minimization of J(C) over all
classifications C yields an optimal
classification that abides by highly preferred
classifications of the individual review
classifier but also avoids placing reviews
with strong linkages in different classes. As
Pang and Lee discuss, minimum cuts in
graphs can efficiently solve the above
optimization problem. Without going into
unnecessary detail in describing the
minimum cut procedure, we provide a brief
description of the formulation of the graph
problem. The graph constructed to facilitate
this cut contains a vertex for each review in
the set and a vertex for each class, negative
and positive. An edge connects each review r with each class vertex c with weight equal to ind(r, c). Edges with weight str(ℓ) likewise connect linked reviews. Using the negative vertex as source and the positive vertex as sink, we partition the graph using a minimum cut and classify as negative each review whose vertex belongs to the same set as the source vertex, and likewise for positive classification.
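The sketch below illustrates this graph construction using networkx rather than the graph-tool library named in the references, so the tooling and the `ind`/`links` inputs are assumptions on our part; it is meant only to show how the minimum s-t cut recovers a classification.

```python
import networkx as nx

def classify_by_mincut(reviews, ind, links):
    """Assign each review to "neg" or "pos" via a minimum s-t cut.

    reviews : iterable of review ids
    ind     : dict mapping (review_id, class_label) -> individual preference
    links   : dict mapping (review_id, review_id) -> link strength str(l)
    All three inputs are hypothetical for this sketch.
    """
    G = nx.DiGraph()
    for r in reviews:
        # Cutting the edge to a class vertex costs the preference for that
        # class, so each review pays ind(r, c) for the class it is NOT given.
        G.add_edge("neg", r, capacity=ind[(r, "neg")])
        G.add_edge(r, "pos", capacity=ind[(r, "pos")])
    for (r1, r2), strength in links.items():
        # Separating two linked reviews costs str(l); add both directions.
        G.add_edge(r1, r2, capacity=strength)
        G.add_edge(r2, r1, capacity=strength)

    _, (source_side, _) = nx.minimum_cut(G, "neg", "pos")
    return {r: ("neg" if r in source_side else "pos") for r in reviews}
```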
4.2 Individual Review Classification
We use a Multinomial Naïve Bayes classifier with unigram bag-of-words features to classify individual reviews as having either negative or positive sentiment. Given the size of our training data (i.e., on the order of one hundred thousand reviews), Multinomial Naïve Bayes achieves performance comparable to that of Logistic Regression and SVM models. We select Multinomial Naïve Bayes from these options due to the speed and ease of model training and in order to express the preference of the individual classifier to place a review r in a class c, ind(r, c), in probabilistic terms.
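A minimal sketch of such a classifier follows, using scikit-learn (which appears in the reference list, though the exact pipeline shown is our assumption); `predict_proba` supplies the probabilistic preferences ind(r, c).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: raw review texts and binary sentiment labels.
train_texts = ["burger was great, fries were good", "rude owner, never again"]
train_labels = ["pos", "neg"]

# Unigram bag-of-words counts feeding a Multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Class posteriors serve as the individual preferences ind(r, c).
probs = model.predict_proba(["the burger was pretty good"])[0]
ind = dict(zip(model.classes_, probs))
```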
In addition to furnishing the preference figures ind(r, c), the individual
review classifier serves as the baseline
model against which we compare our own
model in the sentiment classification of
reviews. This facilitates the evaluation of
our hypothesis that incorporating features
capturing social context into a model based
on language features will improve sentiment
classification over a model using language
features alone.
4.3 Link Strength
To determine the strength of the link between two reviews, where two reviews share a link if the reviewers are friends and the reviews discuss the same business, we consider three approaches to modeling the preference that linked reviews should receive the same label. Our first approach considers simultaneously all reviews r′ in the training set with which a review r shares a link. Using the set of reviews r′, we determine the majority classification of those reviews and assign this classification to the review r. Where the number of reviews r′ classified as negative equals the number classified as positive, we classify r as positive, since positive reviews make up the vast majority of our data. The intuition for this approach follows the idea that a reviewer will adopt the sentiment of a group of friends in a sort of group homophily. To frame this approach in the context of the model discussed in the previous section, for each link ℓ between two reviews r and r′, if c(r) equals the majority classification over all r′, then str(ℓ) receives positive infinite weight, and if c(r) equals the minority classification, then str(ℓ) equals zero. This effectively restricts each review r to the majority classification of the reviews with which it shares links. We will refer to this approach as the "Naïve Influence" approach.
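The following sketch shows one way the Naïve Influence link strength could be computed; the helper name and the use of a large finite constant in place of the infinite weight are our own assumptions.

```python
from collections import Counter

def naive_influence_strength(linked_labels, candidate_label, inf=1e9):
    """str(l) under the Naive Influence approach (hypothetical helper).

    linked_labels   : labels of the training reviews linked to review r
    candidate_label : the class c(r) being considered for r
    Returns an effectively infinite weight when candidate_label matches the
    majority label of the linked reviews (ties broken toward "pos", the
    majority class in the data) and zero otherwise. A large finite constant
    stands in for +infinity so min-cut capacities stay numeric.
    """
    counts = Counter(linked_labels)
    majority = "pos" if counts["pos"] >= counts["neg"] else "neg"
    return inf if candidate_label == majority else 0.0
```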
The second approach uses a measure of similarity between the texts of two linked reviews to determine the strength of the link, str(ℓ). Intuitively, two reviews that share a link in the manner we have defined should share the same sentiment classification if the reviews use similar language. We use cosine similarity over raw term-frequency vectors as the measure of similarity of two review texts. The remainder of this paper will refer to this as the "Similarity" approach.
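A minimal sketch of the Similarity link strength follows, assuming scikit-learn for vectorization and cosine similarity (an assumption about tooling on our part).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity_strength(text_a, text_b):
    """str(l) under the Similarity approach: cosine similarity of the raw
    term-frequency vectors of two linked review texts (hypothetical helper)."""
    counts = CountVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(counts[0], counts[1])[0, 0])
```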
Our third approach uses a Linear-chain Conditional Random Field (CRF) to model the probability that c(r′) = c(r) for a given pair of linked reviews r and r′ (where r precedes r′ chronologically). We forgo a discussion of the mechanics of CRFs here, since our work leverages these models without significant adaptation. However, the feature functions we define within the CRF model warrant further discussion. Aside from consideration of the label of the previous review, we considered the average similarity of the review text to reviews within each class from the training corpus, the number of friends of each reviewer, and the date of each review to account for the chronological distance between reviews. We will refer to this approach as the "Linear-chain CRF" approach to modeling link strength.
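A hedged sketch of how such a pairwise CRF might be set up follows. The reference list names CRFsuite (Okazaki, 2007); the sklearn-crfsuite wrapper and the feature names shown here are stand-ins of our own, not the actual feature set used.

```python
import sklearn_crfsuite

# Each linked pair (r, r') is treated as a length-2 chain ordered by date.
# The feature names are hypothetical stand-ins for the features described
# above (similarity of the text to each class, friend counts, review dates).
X_train = [
    [
        {"sim_to_pos": 0.31, "sim_to_neg": 0.12, "n_friends": 4.0},
        {"sim_to_pos": 0.22, "sim_to_neg": 0.35, "n_friends": 9.0,
         "days_since_prev": 3.0},
    ],
    # ... one two-review sequence per linked pair in the training set
]
y_train = [["pos", "neg"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X_train, y_train)

# Marginal label probabilities at each position of a chain; these could be
# combined into an estimate of P(c(r') = c(r)) to weight the link.
marginals = crf.predict_marginals(X_train[:1])
```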

5 Results
We combine each of our approaches to calculating str(ℓ) with the baseline Multinomial Naïve Bayes model of preference ind(r, c) in the framework discussed above. We train these models on roughly ninety percent of the reviews in our dataset, after pruning all reviews for which no friend of the reviewer had also reviewed the same business. The table below gives the results of each combination of individual review classifier (ind) and link strength model (str), evaluated on macro-averaged precision (P), recall (R), and F1 scores:
ind             str               P     R     F1
Random          -                 0.60  0.51  0.54
Random          Naïve Influence   0.67  0.69  0.68
Random          Similarity        0.60  0.51  0.53
Random          Linear-chain CRF  0.60  0.51  0.53
Multinomial NB  -                 0.82  0.82  0.82
Multinomial NB  Naïve Influence   0.71  0.73  0.72
Multinomial NB  Similarity        0.82  0.82  0.82
Multinomial NB  Linear-chain CRF  0.82  0.82  0.82
These results illustrate that, of the models we consider for link strength, only the Naïve Influence model produced any change from the performance of the individual review classifier. In the case of a random classification of individual reviews, the Naïve Influence model of link strength improved performance across all measures, but the reverse holds where we use Multinomial Naïve Bayes as the individual review classifier. The other two models of link strength, Similarity and Linear-chain CRF, neither helped nor hindered the performance of the individual review classifier for either random or Multinomial Naïve Bayes classification. In the ensuing discussion, we examine potential sources of error that may help to explain these results.

6 Analysis
The results above certainly do not
match our expectations for the performance
of the model in that the inclusion of
preference for linked reviews to receive the
same label does not increase classifier
performance, but also, and more
importantly, in that it does not seem to affect
it at all. We discuss here a number of
factors that may contribute to this result.
6.1 Model Error
A significant source of error in our
investigation may lie in the manner in which
we split the data between training and
testing sets. In particular, by splitting the
data randomly (or even chronologically), a
number of links between reviews span the
partition between training and testing
data. This essentially excises these links
from the data, with potentially huge
impact. During the training of the link
strength model, we do not capture links
between reviews in the training set and reviews in the test set, and so do not incorporate such
information in the model of link
strength. More impactful still, during the
construction of the graph for calculation of
the final classification, we do not model
edges between reviews in the test set and
any linked reviews in the training set. This
increases the sparsity of an already sparse
graph, and may explain why the inclusion of
edges between reviews has so little effect
on the performance of our sentiment
classification model compared with the
baseline model.
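One remedy, sketched below under our own assumptions about data layout, is to split on connected components of the review-link graph rather than on individual reviews, so that no link spans the train-test partition.

```python
import random
import networkx as nx

def link_preserving_split(review_ids, links, test_fraction=0.1, seed=0):
    """Split reviews into train/test so that no link spans the partition.

    Hypothetical inputs: review_ids is a list of ids, links a list of id
    pairs. Linked reviews are grouped into connected components, and whole
    components are assigned to the test set until roughly test_fraction of
    the reviews has been reached.
    """
    G = nx.Graph()
    G.add_nodes_from(review_ids)
    G.add_edges_from(links)

    components = list(nx.connected_components(G))
    random.Random(seed).shuffle(components)

    test, target = set(), test_fraction * len(review_ids)
    for component in components:
        if len(test) >= target:
            break
        test.update(component)
    return set(review_ids) - test, test
```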
Beyond the loss of edges between
reviews due to a somewhat reckless splitting
of the data, examination of the output of our
Linear-chain CRF model suggests another
significant source of error. We had hoped
that this model might provide the most
robust, and indeed most effective, model of
strength between pairs of reviews of the
same business generated by
friends. However, the output of the Linear-
chain CRF assigns a higher probability to classifying pairs of reviews with the same sentiment label than with opposite labels for almost every pair of linked
reviews. Furthermore, these probabilities do
not vary significantly, which results in
nearly every edge connecting two review
vertices in the graph having equal
weight. Such a phenomenon would limit the
effects of these edges within our framework
since one review-review edge would have
the same capacity in a maximum flow
calculation as almost any other, leading to a
cut that considers only the edges connecting
reviews to classes. We believe that this
behavior from the Linear-chain CRF results
from a poor construction of features on our
part.
6.2 Data Error
Another explanation for the lack of
significant performance boost when
incorporating social context may lie in the
manner in which we measured social links
between users. Mark Granovetter introduced the concept of link strength, otherwise known as tie strength, in his landmark 1973 paper "The Strength of Weak Ties" (Granovetter, 1973), wherein he proposed four tie strength dimensions: duration, intimacy, intensity, and reciprocal services (Gilbert and Karahalios, 2009). In their literature review and research on the subject of tie strength in social media, Gilbert and Karahalios add three additional dimensions: structural, emotional support, and social distance.
Because of the nature of the Yelp data set, we are able to infer tie strength based on only two of these seven dimensions. Intimacy, intensity, and emotional support are dimensions all relating to the content and frequency of inter-user communication, aspects that are part of neither the Yelp data set nor the core service; similarly, there is no explicit declaration of social distance (a measure of relatedness in educational attainment, political beliefs, and the like) in the Yelp data set. While this social distance could plausibly be inferred to some degree from user review corpora, the inter-user communication features simply are not elements of the Yelp network.
The two dimensions we have used to infer tie strength are structural (namely, cosine similarity between users' reviews and mutual friends) and shared services (neighborhood overlap/Jaccard index of businesses reviewed).
Unfortunately, according to Gilbert and Karahalios's research conducted on Facebook data, structural and shared services are among the dimensions that carry the lowest relative predictive weight, while the dimensions based upon the inter-user communication aspects that are not elements of the Yelp network carry the highest weights. Consider the following table, which shows the predictive weight of each dimension, calculated as its portion of the sum of the coefficients of a linear regression model predicting tie strength.
Dimension           Weight
Intimacy            32.8%
Intensity           19.7%
Duration            16.5%
Social Distance     13.8%
Shared Services      7.9%
Emotional Support    4.8%
Structural           4.5%
(Results from Gilbert and Karahalios, 2009)
Indeed, the inherent weakness of ties in the Yelp network seems borne out in our data when one recalls the data set's low friendship degree (1/33 that of Facebook) and clustering coefficient (~1/7 that of Facebook).
7 Conclusion
In this paper we implemented a
model for sentiment analysis incorporating
both individual review classification by a
classifier using language features and the
strength of social links between reviews
based on a social graph. We were ultimately
unable to confirm our hypothesis that the
inclusion of these social links would
improve performance over individual,
language-based review models in binary
sentiment classification, as results indicate
that our models for link strength did not
have any effect, positive or negative, on the
individual classifier.
Despite the inconclusive nature of
our results, we believe our approach to
including social context in sentiment
classification could bear fruitful results in
future work. For instance, even more
careful handling of the sparse social links
within our chosen dataset, that is, ensuring
that links between reviews do not span the
train-test split, might yield different
results. We also believe that a better set of
features for the Linear-chain CRF model of
link strength will be fruitful in future work
using this approach on the Yelp Academic
Dataset, or one with more robust social
linking.
Additionally, evaluating this model
and approach on a dataset with more
numerous or more meaningful social
connections would likely produce more
compelling results as to the effectiveness (or
ineffectiveness) of the model. Namely,
because of the sparsity of the Yelp ego
network, expanding social context beyond a
friendship dyad could be useful. For
instance, future work could limit the review
data set to reviews for which there are
multiple friends who have also authored
reviews on the target business and
incorporate such reviews into a Linear CRF
model with longer chains. However, as
mentioned, there are aspects of the Yelp
network, namely its lack of focus on inter-user dynamics, that suggest a more fruitful path could be to move beyond the friendship network by looking at community clusters based upon friends of friends, or not based around explicit social features (friendship edges) at all but rather upon latent communities of shared business interest or language similarity.

Acknowledgements and Team
Contributions
All team members contributed
equally to all aspects of the project.
We consulted lecture notes on Social
Networks: Models, Algorithms, and
Applications from Geoffrey Fairchild and
Jason Fries at the University of Iowa to
inform some of our discussion in this
paper. We would also like to thank
Christopher Potts and Bill MacCartney for
valuable feedback and guidance.

References
Blum, A., & Chawla, S. (2001). Learning from
labeled and unlabeled data using graph
mincuts.
de Paula Peixoto, T. (2011). graph-tool
documentation.
Gilbert, E., & Karahalios, K. (2009, April).
Predicting tie strength with social
media. In Proceedings of the SIGCHI
Conference on Human Factors in
Computing Systems (pp. 211-220).
ACM.
Leskovec, J., & Horvitz, E. (2008, April).
Planetary-scale views on a large instant-
messaging network. In Proceedings of
the 17th international conference on
World Wide Web (pp. 915-924). ACM.
Pang, B., & Lee, L. (2004, July). A sentimental
education: Sentiment analysis using
subjectivity summarization based on
minimum cuts. In Proceedings of the
42nd annual meeting on Association for
Computational Linguistics (p. 271).
Association for Computational
Linguistics.
Pedregosa, F., Varoquaux, G., Gramfort, A.,
Michel, V., Thirion, B., Grisel, O., ... &
Duchesnay, E. (2011). Scikit-learn:
Machine learning in Python. The
Journal of Machine Learning Research,
12, 2825-2830.
Okazaki, N. (2007). CRFsuite: a fast
implementation of conditional random
fields (CRFs). http://www.chokkan.org/software/crfsuite/.
Tan, C., Lee, L., Tang, J., Jiang, L., Zhou, M., &
Li, P. (2011, August). User-level
sentiment analysis incorporating social
networks. In Proceedings of the 17th
ACM SIGKDD international conference
on Knowledge discovery and data
mining (pp. 1397-1405). ACM.
Thomas, M., Pang, B., & Lee, L. (2006, July).
Get out the vote: Determining support or
opposition from Congressional floor-debate
transcripts. In Proceedings of the 2006
conference on empirical methods in natural
language processing (pp. 327-335). Association
for Computational Linguistics.
Ugander, J., Karrer, B., Backstrom, L., &
Marlow, C. (2011). The anatomy of the Facebook
social graph. arXiv preprint arXiv:1111.4503.
