SENTIMENT ANALYSIS
A PROJECT REPORT
SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE AWARD OF THE DEGREE OF
BACHELOR OF TECHNOLOGY
IN
COMPUTER ENGINEERING
Azzam Jafri
(17BCS092)
The matter embodied in this project work has not been submitted earlier
for the award of any degree or diploma to the best of my knowledge.
DECLARATION
(17BCS062) (17BCS092)
NEW DELHI-110025
ACKNOWLEDGEMENT
(17BCS062) (17BCS092)
NEW DELHI-110025
ABSTRACT
TABLE OF CONTENTS
3.1 Approach 22
3.2 Dataset 23
3.2.1 Characteristic Features of Tweets 23
3.2.2 Stanford Twitter Corpus 24
3.3 Pre-Processing 25
3.3.1 Reduction in Feature Space 28
3.4 Stemming Algorithms 29
3.4.1 Porter Stemmer 30
3.4.2 Lemmatization 30
3.5 Features 31
3.5.1 Unigrams 31
3.5.2 N-grams 32
3.5.3 Negation Handling 33
3.6 Experimentation 35
3.6.1 Naïve Bayes Classifier 36
3.6.2 Maximum Entropy Classifier 38
4. FUTURE WORK 39
CONCLUSION 40
REFERENCES 41
LIST OF FIGURES
6. Scope of Negation 35
LIST OF TABLES
5. List of Emoticons 27
6. List of Punctuations 28
ABBREVIATIONS/ NOTATIONS/
NOMENCLATURE
CHAPTER 1
INTRODUCTION
1.1. Motivation
The internet is where consumers talk about brands, products, services,
share their experiences and recommendations. Social platforms, product
reviews, blog posts and discussion forums are boiling with opinions and
comments which, if collected and analyzed, are a source of business
information. We have chosen to work with Twitter since we feel it is a better
approximation of public sentiment as compared to conventional internet
articles and web blogs. The amount of relevant data is much larger for Twitter
than for traditional blogging sites. Moreover, the response on Twitter is
more prompt and also more general (since the number of users who tweet is
substantially greater than the number who write web blogs on a daily basis). Sentiment
analysis of public reviews is highly critical in macro-scale socioeconomic
phenomena like predicting the stock market rate of a particular firm. This
could be done by analysing overall public sentiment towards that firm with
respect to time and using economics tools for finding the correlation between
public sentiment and the firm’s stock market value. Firms can also estimate
how well their product is being received in the market, and in which areas of the
market it is getting a favourable response and in which a negative one (since
Twitter allows us to download a stream of geo-tagged tweets for particular
locations). With this information, firms can analyze the reasons
behind geographically differentiated responses, and so they can market their
product in a more optimized manner by looking for appropriate solutions, such as
creating suitable market segments. Predicting the results of popular political
elections and polls is also an emerging application of sentiment analysis. One
such study was conducted by Tumasjan et al. in Germany for predicting the
outcome of federal elections; it concluded that Twitter is a good
reflection of offline sentiment [6].
A. Online Commerce
hotels, travel destinations. They contain 75 million opinions and reviews
worldwide. Sentiment analysis helps such websites convert dissatisfied
customers into promoters by analyzing this huge volume of opinions.
Conversations are taking place online at a high rate. That creates opportunities
for organizations to manage and strengthen brand reputation. Brand
perception is now determined not only by advertising, public relations and
corporate messaging; brands are now a sum of the conversations about them.
Sentiment analysis helps in determining how a company's brand, product or
service is being perceived by the online community.
E. Government
2. Domain dependence [24]: The same sentence or phrase can have
different meanings in different domains.
For example, the word "unpredictable" is positive in the domain of
movies, dramas, etc., but if the same word is used in the context of a
vehicle's steering, then it carries a negative opinion.
Example: "I hate Microsoft, but I like Linux". A simple bag-of-words
approach will label it as neutral; however, it carries a specific sentiment
for each of the two entities present in the statement.
The features used can be broadly divided into two classes: formal
language based and informal blogging based. Language based features are
those that deal with formal linguistics and include the prior sentiment polarity of
individual words and phrases, and part-of-speech tagging of the sentence.
Prior sentiment polarity means that some words and phrases have a natural
innate tendency to express particular sentiments.
For example, the word “excellent” has a strong positive connotation while the
word “evil” possesses a strong negative connotation. So, whenever a word
with positive connotation is used in a sentence, chances are that the entire
sentence would be expressing a positive sentiment. Parts of Speech tagging,
on the other hand, is a syntactical approach to the problem. It means to
automatically identify which part of speech each individual word of a sentence
belongs to: noun, pronoun, adverb, adjective, verb, interjection, etc. Patterns
can be extracted from analyzing the frequency distribution of these parts of
speech (either individually or collectively with some other part of speech) in a
particular class of labelled tweets. Twitter based features are more informal
and relate to how people express themselves on online social platforms and
compress their sentiments into the limited space of 140 characters offered by
Twitter. They include hashtags, retweets, word capitalization, word
lengthening [10], question marks, the presence of URLs in tweets, exclamation
marks, internet emoticons and other punctuation marks.
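As an illustration, several of the Twitter-specific features above can be extracted with a few regular expressions. The sketch below is our own illustration; the function and feature names are not taken from any particular library.

```python
import re

def twitter_features(tweet):
    """Illustrative Twitter-based feature extractor (names are our own)."""
    return {
        "has_hashtag": bool(re.search(r"#\w+", tweet)),
        "has_url": "http" in tweet,
        "has_exclamation": "!" in tweet,
        "has_question": "?" in tweet,
        # Word lengthening: a character repeated 3 or more times, e.g. "soooo"
        "has_lengthening": bool(re.search(r"(\w)\1{2,}", tweet)),
        # Word capitalization: any fully upper-case word of length >= 2
        "has_allcaps": bool(re.search(r"\b[A-Z]{2,}\b", tweet)),
    }

print(twitter_features("I LOVE this soooo much! #happy"))
```

Each of these binary indicators can then be fed to the classifier alongside the language based features.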
There are several metrics proposed for computing and comparing the
results of our experiments. Some of the most popular metrics include:
precision, recall, accuracy, F1-measure, true rate and false alarm rate
(each of these metrics is calculated individually for each class and then
averaged for the overall classifier performance). A typical confusion table for
our problem is given below along with an illustration of how to compute the
required metrics.
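These metrics follow directly from the entries of such a confusion table. The sketch below uses illustrative counts, not results from our experiments.

```python
def metrics(tp, fp, fn, tn):
    """Per-class metrics from confusion-table counts:
    tp/fp/fn/tn = true/false positives and negatives for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                 # also called the true rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    false_alarm = fp / (fp + tn)
    return precision, recall, accuracy, f1, false_alarm

# Illustrative counts only
p, r, a, f1, fa = metrics(tp=40, fp=10, fn=20, tn=30)
print(f"precision={p:.2f} recall={r:.2f} accuracy={a:.2f} f1={f1:.2f}")
```

For the overall classifier performance, these values are computed for each class and averaged.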
CHAPTER 2
RELATED WORK
Happy emoticons are different versions of smiling face, like ":)", ":-)", ": )",
":D", "=)" etc.
Sad emoticons include frowns, like ":(", ":-(", ": (" etc.
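Such emoticon lists can be matched with regular expressions; the patterns in the sketch below are our own and cover only the variants listed above.

```python
import re

# Matches the listed happy variants: :) :-) : ) :D =)
HAPPY = re.compile(r"[:=]-?\s?[)D]")
# Matches the listed sad variants: :( :-( : (
SAD = re.compile(r":-?\s?\(")

def emoticon_label(tweet):
    """Return a coarse emoticon-based label, or None if no emoticon found."""
    if HAPPY.search(tweet):
        return "happy"
    if SAD.search(tweet):
        return "sad"
    return None
```

In practice each matched emoticon would be replaced with a sentiment token before further processing.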
features with lexicon features, while using Part-of-Speech features causes a
drop in accuracy.
The scope of negation of a cue can be taken to extend from that word to the
next following punctuation mark. Councill, McDonald and Velikovich (2010) discuss a
technique to identify negation cues and their scope in a sentence. They identify
explicit negation cues in the text, and for each word in the scope, they
find its distance from the nearest negation cue on the left and on the right.
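This distance computation can be sketched as two linear sweeps over the tokens of a sentence. The cue list below is a small illustrative subset, not the full lexicon used by Councill et al.

```python
# Illustrative subset of explicit negation cues
NEGATION_CUES = {"not", "no", "never", "n't"}

def cue_distances(tokens):
    """For each token, distance to the nearest negation cue on the
    left and on the right (None if there is no cue on that side)."""
    n = len(tokens)
    left, right = [None] * n, [None] * n
    last_cue = None
    for i, tok in enumerate(tokens):            # left-to-right sweep
        if tok.lower() in NEGATION_CUES:
            last_cue = i
        elif last_cue is not None:
            left[i] = i - last_cue
    last_cue = None
    for i in range(n - 1, -1, -1):              # right-to-left sweep
        if tokens[i].lower() in NEGATION_CUES:
            last_cue = i
        elif last_cue is not None:
            right[i] = last_cue - i
    return left, right

left, right = cue_distances("this is not a good movie".split())
print(left)   # [None, None, None, 1, 2, 3]
```

Words closer to a cue then receive a stronger negation weight than words farther away.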
CHAPTER 3
ABOUT THE PROJECT
3.1. Approach
3.2. Dataset
Language used: Twitter is used via a variety of media, including SMS and
mobile phone apps. Because of this and the 140-character limit, the language used
in tweets tends to be more colloquial and filled with slang and misspellings. The use
of hashtags has also gained popularity on Twitter and is a primary feature of any
given tweet.
Domain of topics: People often post about their likes and dislikes on social
media. These are not all concentrated around one topic. This makes Twitter a
unique place to model a generic classifier, as opposed to domain-specific
classifiers that could be built using datasets such as movie reviews.
3.2.2 Stanford Twitter Corpus
3.3. Pre-Processing
1. Hashtags:
2. Handles:
3. URLs:
Users often share hyperlinks in their tweets. Twitter shortens them using
its in-house URL shortening service, like http://t.co/FCWXoUd8 - such links
also enable Twitter to alert users if the link leads out of its domain. From the
point of view of text classification, a particular URL is not important. However,
the presence of a URL can be an important feature. A regular expression for detecting
a URL is fairly complex because of the different types of URLs that can occur,
but because of Twitter's shortening service, we can use a relatively simple
regular expression.
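A minimal sketch of such a simplified pattern (the exact expression and the URL placeholder token are our own illustration):

```python
import re

# Because Twitter rewrites links through its shortener, this simple
# pattern covers most URLs seen in tweets; a fully general URL regex
# would be far more involved.
URL_RE = re.compile(r"https?://\S+")

def replace_urls(tweet):
    """Keep the *presence* of a URL as a feature token; drop the URL itself."""
    return URL_RE.sub("URL", tweet)

print(replace_urls("check this http://t.co/FCWXoUd8 out"))
# check this URL out
```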
4. Emoticons:
5. Punctuations:
Although not all punctuation marks are important from the point of view of
classification, some of them, like the question mark and the exclamation mark, can
provide information about the sentiment of the text. We replaced every word
boundary by a list of the relevant punctuation marks present at that point. Table 6 lists
the punctuation marks currently identified. We also removed any single quotes that
might exist in the text.
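A sketch of this punctuation handling; the punctuation set and token names below are illustrative, not the exact list of Table 6.

```python
# Illustrative subset of the relevant punctuation marks
RELEVANT = {"!": "EXCL", "?": "QUES"}

def handle_punctuation(text):
    """Strip single quotes and record which relevant punctuation
    marks occur in the text (as feature tokens)."""
    text = text.replace("'", "")
    marks = {name for ch, name in RELEVANT.items() if ch in text}
    return text, sorted(marks)

print(handle_punctuation("can't wait!"))
```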
Table 6: List of Punctuations
6. Repeating Characters:
Table 7: Number of words before and after pre-processing
3.4.1 Porter Stemmer
Martin Porter wrote a stemmer that was published in July 1980. This
stemmer became very widely used and remains the de facto standard
algorithm for English stemming. It offers an excellent trade-off between speed,
readability, and accuracy. It uses a set of around 60 rules applied in 6 successive
steps [9]. An important feature to note is that it does not involve recursion. The
steps in the algorithm are described in Table 8.
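To give a flavour of rule-based suffix stripping, the toy sketch below applies a handful of Porter-style rules. The real algorithm uses around 60 rules in 6 ordered steps with measure conditions that we omit here, so its output differs (e.g. it eventually reduces "running" to "run").

```python
# A tiny illustrative subset of Porter-style suffix rules,
# applied once each (no recursion), first match wins.
RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    for suffix, repl in RULES:
        # Only strip if a reasonable stem remains
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)] + repl
    return word

print([toy_stem(w) for w in ["caresses", "ponies", "played", "cats"]])
# ['caress', 'poni', 'play', 'cat']
```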
3.4.2 Lemmatization
3.5. Features
A wide variety of features can be used to build a classifier for tweets. The
most widely used and basic feature set is word n-grams. However, there's a lot
of domain specific information present in tweets that can also be used for
classifying them. We have experimented with two sets of features:
3.5.1 Unigrams
Unigrams are the simplest features that can be used for text classification.
A Tweet can be represented by a multiset of words present in it. We, however,
have used the presence of unigrams in a tweet as a feature set. Presence of a word
is more important than how many times it is repeated. Pang et al. found that
presence of unigrams yields better results than repetition [3]. This also helps us
to avoid having to scale the data, which can considerably decrease training time
[2]. Figure 3 illustrates the cumulative distribution of words in our dataset.
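Presence-based unigram features can be sketched as a binary vector over a vocabulary, ignoring repetition counts:

```python
def unigram_presence(tweet, vocabulary):
    """Binary presence features: True if the vocabulary word occurs
    in the tweet at all, regardless of how often it is repeated."""
    tokens = set(tweet.lower().split())
    return {word: (word in tokens) for word in vocabulary}

vocab = ["good", "bad", "movie"]
print(unigram_presence("Good good movie", vocab))
# {'good': True, 'bad': False, 'movie': True}
```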
3.5.2 N-grams
As the order of the n-grams increases, they tend to become more and more
sparse. Based on our experiments, we find that the number of bigrams and trigrams
increases much more rapidly with the number of tweets than the number of
unigrams does. Figure 4 shows the number of n-grams versus the number of tweets. We
can observe that bigrams and trigrams increase almost linearly, whereas unigrams
increase logarithmically.
Because higher order n-grams are sparsely populated, we decided to trim off the
n-grams that are not seen more than once in the training corpus, since chances
are that these n-grams are not good indicators of sentiment. After filtering out
non-repeating n-grams, we see that the number of n-grams decreases considerably
and becomes comparable to the number of unigrams, as shown in Figure 5.
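This trimming can be sketched as a frequency filter over the n-gram counts of the corpus:

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-token sequences of a tweet
    return list(zip(*(tokens[i:] for i in range(n))))

def frequent_ngrams(corpus, n, min_count=2):
    """Count every n-gram across the corpus and keep only those
    seen at least min_count times (i.e. more than once by default)."""
    counts = Counter(g for tweet in corpus for g in ngrams(tweet.split(), n))
    return {g for g, c in counts.items() if c >= min_count}

corpus = ["not good at all", "not good really", "so bad"]
print(frequent_ngrams(corpus, 2))
# {('not', 'good')}
```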
Councill et al. examine whether negation detection is useful for sentiment
analysis and also to what extent it is possible to determine the exact scope of a
negation in the text [9]. They describe a method for negation detection based on
the left and right distances of a token to the nearest explicit negation cue.
Scope of Negation
Words immediately preceding and following negation cues are the most
strongly negated, while words that come farther away do not lie in the scope of
negation of such cues. We define the left and right negativity of a word as the
chance that the meaning of that word is actually the opposite. Left negativity
depends on the closest negation cue on the left, and similarly for right negativity.
Figure 6 illustrates the left and right negativity of words in a tweet.
3.6. Experimentation
We train on 90% of our data using different combinations of features and test
them on the remaining 10%. We take the features in the following combinations:
The task of classifying a tweet can be done in two steps - first, classifying
"neutral" (or "objective") vs. "subjective" tweets and second, classifying
subjective tweets into "positive" vs. "negative" tweets. We also trained 2-step
classifiers. The accuracies for each of these configurations are shown in Figure
7.
Figure 7: Accuracy for Naive Bayes Classifier
The Naive Bayes classifier is the simplest and the fastest classifier. Many
researchers [2], [3] claim to have obtained their best results using this classifier.
To find the label for a given tweet, we compute the probability of each label
given the tweet's features and then select the label with the maximum
probability.
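This prediction rule can be sketched with toy log-probabilities; the numbers below are made up for illustration, not learned from our data.

```python
import math

# Toy, hand-picked model parameters (illustrative only)
log_prior = {"positive": math.log(0.5), "negative": math.log(0.5)}
log_likelihood = {
    ("positive", "good"): math.log(0.8), ("positive", "bad"): math.log(0.2),
    ("negative", "good"): math.log(0.3), ("negative", "bad"): math.log(0.7),
}

def predict(features):
    """Score each label as log-prior plus summed per-feature
    log-likelihoods, then pick the label with maximum score."""
    scores = {
        label: log_prior[label]
        + sum(log_likelihood[(label, f)] for f in features)
        for label in log_prior
    }
    return max(scores, key=scores.get)

print(predict(["good"]))   # positive
```

A trained classifier estimates these priors and likelihoods from the labelled corpus instead of fixing them by hand.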
The results from training the Naive Bayes classifier are shown below in
Figure 7. The accuracy of Unigrams is the lowest at 79.67%. The accuracy
increases if we also use Negation detection (81.66%) or higher order n-grams
(86.68%). We see that if we use both Negation detection and higher order
n-grams, the accuracy (85.92%) is marginally less than with just higher order
n-grams. We can also note that the accuracies for the double step classifier are
lower than those for the corresponding single step classifier.
We have also shown precision versus recall values for the Naive Bayes classifier
for the different classes - Negative, Neutral and Positive - in Figure 8.
The solid markers show the P-R values for the single step classifier and the hollow
markers show the effect of using the double step classifier. Different points
correspond to different feature sets. We can see that both precision and recall
are higher for the single step classifier than for the double step classifier.
3.6.2 Maximum Entropy Classifier
CHAPTER 4
FUTURE WORK
CONCLUSION
We created a sentiment classifier for Twitter using labelled data sets. We also
investigated the relevance of negation detection for the purpose of
sentiment analysis. Our baseline classifier, which uses just unigrams, achieves
an accuracy of around 80.00%. The accuracy of the classifier increases if we use
negation detection or introduce bigrams and trigrams. Thus, we can conclude
that both negation detection and higher order n-grams are useful for the purpose
of text classification. However, if we use both n-grams and negation detection,
the accuracy falls marginally. In general, the Naive Bayes classifier performs better
than the Maximum Entropy classifier. We achieved the best accuracy of 86.68% in
the case of Unigrams + Bigrams + Trigrams, trained on the Naive Bayes classifier.
REFERENCES
[1] Vishal A. Kharde, S.S. Sonawane. Sentiment Analysis of Twitter Data: A
Survey of Techniques. In International Journal of Computer Applications
(0975 – 8887) Volume 139 – No.11, April 2016
[2] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification
using distant supervision. Processing, pages 1-6, 2009
[3] Alexander Pak and Patrick Paroubek. Twitter as a corpus for sentiment
analysis and opinion mining. Volume 2010, pages 1320-1326, 2010
[5] Hassan Saif, Yulan He, and Harith Alani. Semantic sentiment analysis of
twitter. In The Semantic Web-ISWC 2012, pages 508-524. Springer, 2012.
[9] Isaac G Councill, Ryan McDonald, and Leonid Velikovich. What's great and
what's not: learning to classify the scope of negation for improved sentiment
analysis. In Proceedings of the workshop on negation and speculation in
natural language processing, pages 51-59. Association for Computational
Linguistics, 2010.
[11] R. Xia, C. Zong, and S. Li, “Ensemble of feature sets and classification
algorithms for sentiment classification,” Information Sciences: an
International Journal, vol. 181, no. 6, pp. 1138–1152, 2011.
[12] V. M. K. Peddinti and P. Chintalapoodi, “Domain adaptation in sentiment
analysis of twitter,” in Analyzing Microtext Workshop, AAAI, 2011.