Sample Proposal 4
DEDICATION
This work is dedicated to my entire family and my fellow students who have always supported me
and guided me throughout my education. I thank you for the mental and emotional support I have
received from you. May God bless you all.
ACKNOWLEDGEMENT
ABSTRACT
With the advancement and growth of web technology, a huge volume of data is available on
the web for internet users, and a lot more data is generated every day. The Internet has become a
Social networking sites like Twitter, Facebook, Google+ are rapidly gaining popularity as they
allow people to share and express their views about topics, have discussion with different
communities, or post messages across the world. Twitter is one of the most commonly used
platforms for sharing opinions and expressing views. There has been a lot of work in the field of
sentiment analysis of Twitter data. This survey focuses mainly on sentiment analysis of Twitter
data, which helps to analyze the information in tweets, where opinions are highly
unstructured, heterogeneous and either positive, negative, or in some cases neutral.
Organizations can use sentiment analysis to get an idea of the customer reviews of their
products, and subsequently try to improve their services based on the reviews.
Key Words: Twitter, Sentiment analysis (SA), Opinion mining, Machine learning, Naïve Bayes
(NB).
TABLE OF CONTENTS
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
1 CHAPTER 1
INTRODUCTION
1.1 BACKGROUND OF STUDY
1.2 PROBLEM STATEMENT
1.3 MOTIVATION
1.4 AIM OF THE STUDY
1.4.1 RESEARCH OBJECTIVES
1.5 SIGNIFICANCE OF THE STUDY
1.6 SCOPE
1.6.1 Data collection
1.6.2 Preprocessing
1.6.3 Training Data
1.6.4 Classification
1.6.5 Results
1.7 ASSUMPTIONS
1.8 LIMITATIONS
2 LITERATURE REVIEW
2.1 APPLICATIONS OF SENTIMENT ANALYSIS
2.2 Machine learning methods for sentimental analysis
2.3 RELATED STUDIES
2.3.1 Lexical analysis
2.3.1.1 Limitations of lexical analysis
2.3.2 Machine learning approach
2.3.2.1 Advantages of machine learning approach
2.3.2.2 Disadvantage
2.3.3 Hybrid approach
2.3.4 Summary
2.4 RELATED SYSTEMS
2.4.1 Trump’s Tweets Sentiment analyzer example
2.5 LIMITATIONS OF RELATED SYSTEMS
2.6 HOW THE PROPOSED SYSTEM SEEKS TO HANDLE THESE CHALLENGES
1 CHAPTER 1
INTRODUCTION
Nowadays, the internet age has changed the way people express their views and opinions. This is
now mainly done through blog posts, online forums, product review websites, social media, etc.
Millions of people use social network sites like Facebook, Twitter, Google Plus,
etc. to express their emotions and opinions and to share views about their lives.
With advances in technology, especially the explosion of social media, brands and companies are
using social media as a marketing tool to reach the masses with their products.
Social media is useful in that it contains a vast number of reviews, opinions, sentiments,
appraisals, attitudes and emotions towards a particular product, service, organization, issues,
The analysis of sentiments may be document based where the sentiment in the entire document
sentences, bearing sentiments, in the text are classified. SA can be phrase based where the
Sentiment Analysis identifies the phrases in a text that bear some sentiment. The author may
speak about some objective facts or subjective opinions. It is necessary to distinguish between
the two. SA finds the subject towards whom the sentiment is directed. A text may contain many
entities but it is necessary to find the entity towards which the sentiment is directed. It identifies
the polarity and degree of the sentiment. Sentiments are classified as objective (facts), positive
(denotes a state of happiness, bliss or satisfaction on part of the writer) or negative (denotes a
state of sorrow, dejection or disappointment on part of the writer). The sentiments can further
Through the online communities, we get an interactive media where consumers inform and
influence others through forums. Social media is generating a large volume of sentiment rich
data in the form of tweets, status updates, blog posts, comments, reviews, etc. Moreover, social
media provides an opportunity for business by giving a platform to connect with their customers
for advertising. People depend to a great extent on user-generated online content
for decision making. For example, if someone wants to buy a product or use a service,
they first look up its reviews online and discuss it on social networks, but the data
generated by users is too vast for a normal user to analyze. There is therefore a need to automate
this, and various sentiment analysis techniques are widely used. Sentiment analysis (SA) tells a user
whether the information about the product is satisfactory or not before they buy it. Marketers and
firms use this analysis data to understand products or services in such a way that it can be offered
Twitter data is interesting because tweets happen at the “speed of thought” and are available for
consumption in real time, and you can obtain data from anywhere in the world.
Twitter is particularly well suited to data mining because of three key features.
o Twitter’s terms of use for the data are relatively liberal compared to those of
other APIs.
1.2 Twitter’s API
for accessing a web-based software application. Twitter bases its API on the Representational
State Transfer (REST) Architecture. REST architecture refers to a collection of network design
principles that define resources and ways to address and access data.
The power of sentiment: Sentiment analysis is the process of computationally identifying
and categorizing opinions conveyed in a piece of text, specifically in order to find
out whether the writer’s attitude to a specific subject, product, etc. is positive, negative, or
neutral (RENTOUMI, 2012). For example, “I am so happy today, good day everybody” is a
generally positive text, while “Black Panther is such an excellent movie, highly
acclaims 10/10” expresses positive sentiment towards the movie, which is the topic
of the text. Sometimes ascertaining the exact sentiment is difficult even for humans.
Take this example: “I am so surprised many people put Black Panther in their favorite movies ever
list, I felt it was a good watch but definitely not that good”. The sentiment expressed here is
possibly positive, but less strongly so than in the previous text. In many other cases, determining
the sentiment of a text is hard for an algorithm even if it appears easy from the human
perspective, for example: “if you haven’t watched Black Panther, you are not worth my time. If
Imagine launching a new product and having a real-time picture of how it is perceived by real
consumers, based on what they said about the product, without the company having to beg for
customers’ opinions, usually through annoying and highly unverifiable product surveys. Do they
like it or not? What should be changed about it? Do they recommend it to other people? These are
products, or to analyze client fulfilment and satisfaction. Organizations may also use Sentimental
Apart from helping companies know how they are doing with their customers, sentiment analysis
also gives them a clear picture of how they fare against their competitors. Knowing the
sentiments associated with competitors helps corporations gauge their own performance and
opinions conveyed in a piece of script or text, specifically in order to find out whether the
Opinions are central to humans and are the main influencers of our behavior: when we want to make
decisions, we look at what other people chose. That is the essential character of the
human story, learning from others or simply copying what others do. In the real world, businesses
would like to know people’s opinions of their services or products. Furthermore, people
would like to know how others feel about a particular product before choosing whether to
buy it or, in other cases, about a political candidate before deciding whom
to vote for.
Traditionally, companies conducted public surveys, opinion polls and focus groups to get people’s
opinions and feelings on an issue. This approach is limited to a few people, is time consuming,
and may give a biased overview of opinion on the issue itself. With the rise of social
media, a product hashtag is more than enough to ascertain the sentiment toward it since it contains
The purpose of this project is to provide a real-time sentiment analysis tool that can be used to
determine the consumer perception of a particular brand, political topic, issue or event, or even
to track a public relations crisis caused by a particular occurrence affecting a product. This tool
utilizes data mining to crawl through thousands or even millions of social media feeds relating to
1.3 MOTIVATION
The emergence in the last decade of social media platforms such as Twitter, Facebook, and
Instagram has enabled people to engage in social activities and to express their opinions, thoughts, and
emotions on a variety of topics. On such platforms, large amounts of data are produced, which
represents an opportunity for companies to assess their social influence and people’s opinions
opinion mining and sentiment analysis which can adapt to the activity domain of the user.
a) To provide real-time sentiment towards a particular product, political issue, event etc.
b) To track changes in sentiment over a period of time, for example to monitor brand
marketing campaigns
c) To provide a graphical user interface detailing and visualizing the most used words and
This study aims to develop a natural language processing-based tool to analyze sentiment in
real time as the social media feeds stream in. Using a machine learning algorithm, this tool
will be able to work with large datasets of valuable user data, especially on Twitter, where
messages are limited to a small number of characters, which means a lot of opinion is crammed into
short messages.
Furthermore, this tool can give a business valuable insight into how a product it launched is
faring, and it can help identify and avert an emerging public crisis relating to a product. Take, for example,
the case in which an independent user found out that Apple had been knowingly slowing its iPhones down
after just one year of usage in order for users to buy new phones. This thread quickly became a
major topic on Twitter, and utter customer dissatisfaction was expressed through a ton of tweets
against the brand. With this tool, such emerging customer dissatisfaction can be quickly averted and
This tool can also be used to identify the right influencers (brand advocates) to push your product
or service to the mainstream by determining the sentiment they carry towards the social-media
users.
Last but not least, this tool can be deployed in an email application and used to determine the
tone of an email you are writing in real time, like a spellchecker for emotions.
1.6 SCOPE
To develop the system, I will use supervised machine learning, which comprises the following stages: Data
The project is built using Python as the programming language, which has a wide range of
● NumPy
● TensorFlow
● TextBlob
The data used in this application is sourced from Twitter, since Twitter data is well suited for these
reasons:
c) When someone is tweeting, most probably they will express a pressing opinion that they
feel others should know about it, therefore the probability of finding a sentiment in a tweet
is high.
A crawler functionality within the application will automatically pull tweets for a specified
1.6.2 Preprocessing:
Online data is usually full of noise such as HTML tags, scripts, ads and
non-English text. Keeping this noise puts a strain on the classifier and slows its
performance. Properly cleansed data improves the performance
of the classifier and speeds up the classification process, thereby enabling real-time
sentiment analysis.
In this phase the extracted data is cleansed and prepared for feeding into the classifier. This phase
mostly involves mining keywords and symbols, getting rid of unnecessary whitespaces, tabs,
removing non-English texts and converting all uppercase and lowercase text to a common case.
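The cleansing steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual preprocessing code: the function name `clean_tweet` and the regular expressions are assumptions, and dropping all non-ASCII characters is only a crude stand-in for real non-English filtering.

```python
import html
import re

# Hypothetical patterns chosen for illustration.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
MENTION_RE = re.compile(r"@\w+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]+")
SYMBOL_RE = re.compile(r"[^a-z0-9\s']")


def clean_tweet(text: str) -> str:
    """Return a lowercased, whitespace-normalised version of a tweet."""
    text = html.unescape(text)          # "&amp;" becomes "&"
    text = TAG_RE.sub(" ", text)        # strip HTML tags
    text = URL_RE.sub(" ", text)        # strip links
    text = MENTION_RE.sub(" ", text)    # strip @user mentions
    text = text.replace("#", " ")       # keep the hashtag word, drop '#'
    text = NON_ASCII_RE.sub(" ", text)  # crude stand-in for non-English removal
    text = text.lower()                 # convert to a common case
    text = SYMBOL_RE.sub(" ", text)     # drop remaining symbols
    return " ".join(text.split())       # collapse spaces, tabs and newlines


if __name__ == "__main__":
    print(clean_tweet("<b>LOVED</b>   the new #BlackPanther movie!! http://t.co/abc"))
```

Keeping the apostrophe preserves contractions such as "don't", which can carry negation that matters for sentiment.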
A dataset, usually crowdsourced (e.g. the IMDb movie review dataset), is fed into the classifier for
training. This dataset is like jet fuel for the classifier. In this project I am going to use the Twitter
Sentiment Analysis dataset, which has 1,578,627 classified tweets that are labelled as
1.6.4 Classification
This is the most crucial part of the whole process. A Naïve Bayes classifier is used for the analysis.
Due to its simplicity, Naïve Bayes can outperform more sophisticated classification methods. It is a
collection of algorithms that all share a common principle: every feature being classified is
independent of the value of any other feature (Harry Zhang, 2016). For example, a fruit may be considered to
be an apple if it is red, round, and about 3 inches in diameter. A Naive Bayes classifier considers each of
these features (red, round, 3 inches in diameter) to contribute independently to the probability that
the fruit is an apple, irrespective of any correlations between features. Features, however, aren’t
always independent, which is often seen as a limitation of the Naive Bayes algorithm and this is
P(features | label) is the likelihood of the feature set given the label. P(features) is
the prior probability that a given feature set occurs. Given the naive assumption, which states
that all features are independent of each other, the equation could be rewritten as follows:
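Under the independence assumption, P(label | features) is proportional to P(label) multiplied by the product of P(feature | label) over all features, since P(features) is the same for every label. The sketch below illustrates that computation with word counts and add-one smoothing. It is an assumption-laden illustration rather than the project's classifier: the class name `NaiveBayes`, its methods and the toy training data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, tokens):
        """Return log P(label) + sum of log P(token | label) for each label."""
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.label_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                # Add-one smoothing avoids log(0) for unseen pairs.
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, tokens):
        # P(features) is constant across labels, so it is omitted.
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    docs = [["good", "day"], ["excellent", "movie"],
            ["bad", "movie"], ["terrible", "day"]]
    model = NaiveBayes().fit(docs, ["pos", "pos", "neg", "neg"])
    print(model.predict(["excellent", "good"]))
```

Working with log probabilities rather than raw products avoids numerical underflow when a tweet contains many tokens.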
Once a classifier for sentiment analysis is selected, the trained model must be validated
using k-fold cross-validation. The performance of the model can be determined using the following
measures:
1. Accuracy: It is measured as the number of correct predictions over the total number
of predictions. The accepted accuracy is usually in the range 70% to 90%. If a model is
100% accurate, this usually indicates that the model overfits the data.
2. Precision: This measure shows how accurately the model makes predictions w.r.t. each
class. It is measured as the number of correct predictions over the total number of true positives
and false positive examples.
3. Recall: This measure shows the completeness of the model w.r.t. each class. It is measured
as the number of correct predictions over the total number of true positives and false negative
examples.
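The three measures above can be computed directly from the true and predicted labels. The following is a small sketch; the function name `evaluate` and the example labels are made up for illustration.

```python
def evaluate(y_true, y_pred, positive):
    """Return (accuracy, precision, recall) for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0  # TP / (TP + FP)
    recall = tp / (tp + fn) if tp + fn else 0.0     # TP / (TP + FN)
    return accuracy, precision, recall


if __name__ == "__main__":
    truth = ["pos", "pos", "neg", "neg", "pos"]
    predicted = ["pos", "neg", "neg", "pos", "pos"]
    print(evaluate(truth, predicted, positive="pos"))
```

Precision and recall are reported per class, so the function would be called once for each label of interest.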
sentimental analysis.
1.6.5 Results
Results of the sentimental analysis are represented in a graphical user interface with charts, graphs
1.7 ASSUMPTIONS
While conducting this study, the researcher assumed that all companies and brands rely mostly on
social media to monitor and advertise their products, brands and also to communicate with their
customers. Furthermore, it is assumed that customers express their concerns pertaining to a certain
brand through social media: this includes their reviews, dissatisfactions and complaints about a
product or brand. Finally, it is assumed that the opinion expressed by customers concerning a product
or brand is honest and is not in any way biased or pushed by a political or personal agenda.
1.8 LIMITATIONS
Human language is too complex for a machine to read and understand. Opinions are sometimes
expressed as sarcasm, and the order of words adds further confusion. For example, “I
currently use the Blackberry priv and love it, but not as much as the Samsung galaxy s5 I chose
the priv for the camera lens. My blunder.” In this example the sentiment is not clear as to which
Understanding sarcasm: “Oh, yeah, Fast Food Eatery. I just LOVE the 40-minute wait for food.”
CHAPTER TWO
2 LITERATURE REVIEW
This chapter examines in depth the large pool of existing literature relevant to my topic and its
objectives. It gives an insight into the literature by other scholars and researchers in the field of
sentiment analysis. It covers past studies and discusses literature related to the specific
objectives of the study. It also presents a critical review of the major issues, a summary, the
gaps to be filled and the conceptual framework. Several sentiment analyzers have previously been
developed to classify data as either positive or negative; some of the techniques used to
analyze sentiment are described here.
This section summarizes some of the scholarly and research work in the fields of machine
learning and data mining on analyzing sentiment on Twitter and preparing prediction models
for various applications. In recent years a lot of work has been done in the field of “Sentiment
Analysis on Twitter” by a number of researchers. In its early stages it was intended for binary
classification, which assigns opinions or reviews to bipolar classes such as positive or negative
only. With the recent rapid growth in the number of social media users, researchers have been attracted to
the use of social media data for sentiment analysis of individuals or of a particular product,
person or event. Twitter is one of the most widely used social media platforms for expressing
thoughts.
Sentiment analysis, also referred to as opinion mining, is a technique developed in the fields of
artificial intelligence and natural language processing. It is an information retrieval tool that can
classify text into subjective categories (negative, positive, or neutral) or measure sentiment
strength (Pang and Lee, 2008; Thelwall et al.,2010). There are two major steps in sentiment
analysis: opinion extraction and sentiment classification (Pang and Lee, 2008). Opinion
extraction is to differentiate subjective texts from factual ones, while sentiment classification
focuses on assigning opinion words into different sentiment categories (Chiu et al., 2015).
Opinion words are words that express desirable (e.g., fantastic, amazing, etc.) and undesirable
There are two common methods for determining an opinion word’s semantic orientation or
subjective category: the corpus-based and the dictionary-based approach (Chiu et al., 2015).
Corpus-based approaches involve using the syntactic and co-occurrence patterns of a large corpus
(texts that are most representative of a document’s content) in identifying sentiment category
(Capriello et al., 2011; Chiu et al., 2015; Thelwall et al., 2011). For instance, Turney and Littman
(2003) determined a word’s semantic orientation by calculating the strength of its association
with a set of positive words minus its association with a set of negative words. The associations
were estimated by issuing a search engine query, and then noting the query’s co-occurrence
words to determine the text’s semantic orientation (Chiu et al., 2015; Thelwall et al., 2011). For
example, Hu and Liu (2004) generated a set of adjective synonyms and antonyms (opinion
words) through a bootstrapping process using the WordNet dictionary. They then used this
collection of opinion words to predict the sentiment orientation of electronic product reviews at a
sentence level. The dictionary-based approach typically counts the numbers of positive and
negative opinion words in a sentence. If positive opinion words prevail, the orientation of the
sentence is positive; otherwise it is negative. Sentiment analysis can be conducted not only at a
sentence level, but also at other levels: document-, paragraph-, or attribute-level. As the level of
granularity increases, so does its complexity. Attribute-level sentiment analysis aims to associate
Sentiment analysis has been successfully applied in various contexts, such as detecting influenza
outbreaks (Culotta, 2010), determining overall trends in the level of happiness (Dodds et
al., 2011), predicting movie box office revenues (Asur & Huberman, 2010), and understanding
some consumer opinions on tourism and hospitality related products (Chiu et al., 2015; Claster et
al., 2013; Duan et al., 2013). Chiu et al. (2015) developed a Chinese sentiment analysis method
The rising popularity of sentiment analysis in research can be attributed to its unique advantages.
First, as compared to manual coding, computer-aided sentiment analyses are not only more
efficient, but also produce comparable results. Capriello et al. (2013) compared the efficiency of
manual content coding and two computer-aided sentiment analysis techniques in analyzing 800
travel reviews of former farm stay guests. They found that all three analyses produced similarly
reliable results. Second, as compared to traditional methods (e.g., surveys or focus groups),
sentiment analysis can effectively reduce cost, time, and manual labor by using automatic
Despite its advantages, sentiment analysis also has many drawbacks. As Pang and Lee (2008)
pointed out, sentiment analysis is domain and event dependent. Words considered positive in one
domain might not be so in another area. Sarcasm is an obvious example of this limitation.
Despite its inherent disadvantages, the technique still appears as a promising tool for researchers
and industry practitioners. Wang et al. (2013) reviewed previous studies, and reported that the
analysis technique yields a rather high accuracy rate (roughly 70% to 80% accuracy rate in
training-test data matching tasks). The objective of sentiment analysis is to obtain useful insight
from a large quantity of aggregated data, rather than perfect classification of all data points.
Pak and Paroubek (2010) proposed a model to classify the tweets as objective, positive and
negative. They created a Twitter corpus by collecting tweets using the Twitter API and automatically
annotating those tweets using emoticons. Using that corpus, they developed a sentiment classifier
based on the multinomial Naïve Bayes method that uses features like N-gram and POS-tags. The
training set they used was less efficient since it contains only tweets having emoticons.
Parikh and Movassate (2009) implemented two models, a Naïve Bayes bigram model and a
Maximum Entropy model to classify tweets. They found that the Naïve Bayes classifiers worked
Barbosa et al. (2010) designed a two-phase automatic sentiment analysis method for classifying
tweets. They first classified tweets as objective or subjective and then, in the second phase, the subjective
tweets were classified as positive or negative. The feature space used included retweets,
hashtags, links, punctuation and exclamation marks in conjunction with features like prior polarity
Po-Wei Liang et al. (2014) used the Twitter API to collect Twitter data. Their training data falls into
three different categories (camera, movie, mobile). The data is labeled as positive, negative and
non-opinions. Tweets containing opinions were filtered. A unigram Naive Bayes model was
implemented and the Naive Bayes simplifying independence assumption was employed. They
also eliminated useless features by using the Mutual Information and Chi-square feature
extraction methods. Finally, the orientation of a tweet is predicted, i.e. positive or negative.
Bakhtawar Seerat et al. (2012) proposed a method of opinion extraction from an online web
page and discussed the limitations of sentiment analysis. Meena Rambocas (2013) summarized the
challenges marketers can face when using sentiment analysis as an alternative technique capable
of triangulating qualitative and quantitative methods through innovative real-time data collection
and analysis.
Selvam et al. (2013) presented different approaches to sentiment classification and the existing
methods with the framework. Rudy Prabowo (2009) formed a new approach by combining
rule-based classification, supervised learning and machine learning and tested it on movie
reviews, product reviews and Myspace comments. He also proposed a semi-automatic approach
Word of mouth (WOM) is the process of conveying information from person to person and
plays a major role in customer buying decisions. In commercial situations, WOM involves
consumers sharing attitudes, opinions, or reactions about businesses, products, or services with
other people. WOM communication functions based on social networking and trust. People rely
on families, friends, and others in their social network. Research also indicates that people
appear to trust seemingly disinterested opinions from people outside their immediate social
network, such as online reviews. This is where Sentiment Analysis comes into play. The growing
availability of opinion-rich resources like online review sites, blogs and social networking sites
has made this “decision-making process” easier for us. With the explosion of Web 2.0 platforms,
consumers have a soapbox of unprecedented reach and power by which they can share
opinions. Major companies have realized that these consumer voices shape the voices of other
consumers.
Sentiment Analysis thus finds its use in the consumer market for product reviews, in marketing for
knowing consumer attitudes and trends, in social media for finding general opinion about recent
hot topics, and in the movie industry for finding whether a recently released movie is a hit.
Pang-Lee et al. (2002) broadly classify the applications into the following categories.
Detecting antagonistic, heated language in mails, spam detection, context sensitive information
detection etc.
Knowing public opinions for political leaders or their notions about rules and regulations in
place etc.
Sentiment analysis approaches use different machine learning classifiers and feature extractors.
In this context, the goal of machine learning is to study algorithms that are capable, in fully
automated settings, of predicting an output from an input. There are many ways to do this, e.g. the
use of Naive Bayes, support vector machines (SVM), maximum entropy, etc. Several
applications have been developed using these algorithms, for example Microsoft Kinect, with its
inbuilt random forest for predicting body parts given what is on the sensor of the camera. Many
prototypes and models for sentiment classification treat classifiers and feature extractors as two
Pang and Lee (2008) describe the naive Bayes classifier as a supervised machine learning algorithm:
a simple probabilistic classifier based on Bayes’ theorem with strong independence
assumptions. The classifier assumes that the presence (or absence) of a particular feature of a class is
unrelated to the presence (or absence) of any other feature. It can learn the pattern by examining a
set of documents that has been categorized. It compares the contents with the list of words to
classify the documents into their right category or class (Vishal & Sonawane, 2016). Let d be the
In the above equation, f is a feature, and the count of feature fi in d, which represents a tweet,
is denoted by ni(d). Here, m denotes the number of features. The parameters P(c) and
P(f|c) are computed through maximum likelihood estimates, and smoothing is used for unseen
features. The Python NLTK library can be used to train and classify text using Naïve Bayes (Vishal
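The equation that this passage refers to is not reproduced in this copy. For reference, a standard multinomial Naïve Bayes form that is consistent with the symbols defined in the text (class c, tweet d, features f_i, counts n_i(d), and m features) is shown below. It is offered as an assumption about the intended formula, not as a quotation of the cited source; the add-one estimate on the second line is one common smoothing choice, with V denoting the set of all features.

```latex
P_{NB}(c \mid d) = \frac{P(c)\,\prod_{i=1}^{m} P(f_i \mid c)^{\,n_i(d)}}{P(d)}
\qquad
\hat{P}(f_i \mid c) = \frac{\mathrm{count}(f_i, c) + 1}{\sum_{f \in V} \mathrm{count}(f, c) + |V|}
```

Because P(d) does not depend on c, it can be dropped when only the most probable class is needed.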
Maximum entropy
Maximum entropy is a technique for estimating probability distributions from data. In text
classification, maximum entropy estimates the conditional distribution of the class label given a
document. A document is represented by a set of word count features. The labeled training data
is used to estimate the expected value of these word counts on a class-by-class basis.
The principle of maximum entropy is that when nothing is known, the distribution should
be as uniform as possible, that is, have maximal entropy. Labeled training data is used to
derive a set of constraints for the model that is characterized by the class-specific
expectations for the distribution. In a maximum entropy model, the prediction of the outcome is
based on the features extracted from the dataset. The classifier tries to maximize the
entropy of the system while estimating the conditional distribution of the class label.
Maximum entropy also handles overlapping features and is equivalent to the logistic regression
method, which finds the distribution over classes (Vishal & Sonawane, 2016). Unlike Naive Bayes,
maximum entropy makes no independence assumptions for its features. The model is
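Since the text equates maximum entropy with logistic regression, the idea can be sketched as a binary logistic regression trained by stochastic gradient ascent on the log-likelihood. This is an illustrative simplification under stated assumptions, not the formulation of the cited works: the function names `train_maxent` and `predict_proba`, the hyperparameters and the toy data are all hypothetical, and a practical model would add regularisation and support more than two classes.

```python
import math


def train_maxent(samples, labels, epochs=200, learning_rate=0.5):
    """Fit weights for a binary maximum entropy (logistic regression) model.

    samples: list of dicts mapping feature name -> count
    labels:  list of 0/1 class labels
    """
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            z = bias + sum(weights.get(f, 0.0) * v for f, v in features.items())
            p = 1.0 / (1.0 + math.exp(-z))  # P(label = 1 | features)
            error = label - p               # gradient of the log-likelihood
            bias += learning_rate * error
            for f, v in features.items():
                weights[f] = weights.get(f, 0.0) + learning_rate * error * v
    return weights, bias


def predict_proba(weights, bias, features):
    """Return P(label = 1 | features) under the fitted model."""
    z = bias + sum(weights.get(f, 0.0) * v for f, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    samples = [{"good": 1, "movie": 1}, {"great": 1, "day": 1},
               {"bad": 1, "movie": 1}, {"awful": 1, "day": 1}]
    weights, bias = train_maxent(samples, [1, 1, 0, 0])
    print(round(predict_proba(weights, bias, {"good": 1, "movie": 1}), 3))
```

Words shared by both classes ("movie", "day") end up with weights near zero, because the weights are fitted jointly rather than estimated independently per feature.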
SVM is a supervised learning model which analyzes the data and identifies patterns for
classification. The concept of the SVM algorithm is based on a decision plane that defines decision
boundaries. A decision plane separates groups of instances having different class memberships.
For example, consider an instance which belongs to either class Circle or class Diamond. There is a
separating line (figure 2.2) which defines a boundary. On the right side of the boundary all instances
are Circle and on the left side all instances are Diamond (Pravesh & Mohd, 2014).
Principle of SVM
In text classification the data are sometimes linearly separable, particularly for very
high dimensional problems. Generally
(in most cases), the opinion mining solution is one that classifies most of the data and
ignores outliers and noisy data. If a training set D cannot be separated cleanly, then
the solution is to use a fat decision margin and allow some mistakes (Pravesh & Mohd,
2014). The SVM can be used to extract terrorist entities from a collection of untagged news
documents in the terrorist domain. This method segments each document into sentences,
parses the latter into parse trees and delivers features for the entities within the documents.
Lexicon-Based Approaches
Lexicon-based methods use a sentiment dictionary of opinion words and match them against
the data to determine polarity (Vishal & Sonawane, 2016). They assign sentiment scores to
the opinion words describing how positive, negative and objective the words contained in
collection of known and precompiled sentiment terms, phrases and even idioms, developed
for traditional genres of communication, such as the opinion finder lexicon (Vishal &
Sonawane, 2016).
Dictionary-based
It is based on the usage of terms (seeds) that are usually collected and annotated manually.
This set grows by searching the synonyms and antonyms of a dictionary (Vishal &
The corpus-based approach provides the dictionaries related to a specific domain. These
dictionaries are generated from a set of seed opinion terms that grows through the search
Sonawane,2016).
[Figure: flowchart of lexicon-based scoring. Word matching: on a match (YES) the score is incremented; otherwise (NO) the score is decremented.]
words to determine the text’s semantic orientation (Chiu et al., 2015; Thelwall et al., 2011). This
technique is governed by the use of a dictionary consisting of pre-tagged lexicons. The input text is
converted to tokens by the tokenizer. Every new token encountered is then matched against the
lexicon in the dictionary. If there is a positive match, the score is added to the total pool of score
For instance, if “dramatic” is a positive match in the dictionary then the total score of the text is
incremented. Otherwise, the score is decremented or the word is tagged as negative. Though this
The classification of a text depends on the total score it achieves. A considerable amount of work
has been devoted to determining which lexical information works best.
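The scoring procedure described above can be sketched in a few lines. The two tiny word sets below are made up for illustration and stand in for a real pre-tagged lexicon. As a further assumption, the sketch only decrements the score for words found in the negative set and leaves the score unchanged for words in neither set, which is one possible reading of the procedure.

```python
# Hypothetical mini-lexicons; a real system would load a pre-tagged dictionary.
POSITIVE = {"good", "excellent", "happy", "love", "dramatic"}
NEGATIVE = {"bad", "terrible", "sad", "hate", "blunder"}


def lexicon_score(text):
    """Sum +1 for each positive token and -1 for each negative token."""
    score = 0
    for token in text.lower().split():
        token = token.strip(".,!?;:\"'")  # trim surrounding punctuation
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return score


def classify(text):
    """Map the total score to a sentiment label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(classify("I am so happy today, good day everybody"))
```

A plain word count like this cannot handle negation or sarcasm, which is part of why the machine learning and hybrid approaches are considered.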
An accuracy of about 80% on single phrases can be achieved by the use of hand tagged lexicons
comprised of only adjectives, which are crucial for deciding the subjectivity of an evaluative text.
The dictionary can be grown by searching the synonyms and antonyms of words in the well-known
➢ Lexical analysis has a limitation: its performance (in terms of time complexity and
accuracy) degrades drastically with the exponential growth of the size of the dictionary
(number of words)
task
Machine learning is one of the most prominent techniques gaining the interest of researchers due to its
adaptability and accuracy. In sentiment analysis, mostly the supervised learning variants of this
technique are employed. It comprises five stages: data collection, pre-processing, training
data, classification and plotting results. In the training data, a collection of tagged corpora is
provided. The classifier is presented with a series of feature vectors from the previous data. A model is
created based on the training data set, which is then applied to new/unseen text for classification
purposes. In the machine learning technique, the key to the accuracy of a classifier is the selection of
appropriate features. Generally, unigrams (single words), bigrams (two consecutive
words) and trigrams (three consecutive words) are selected as feature vectors. There are a variety
of proposed features, namely number of positive words, number of negative words, length of the
document,
Support Vector Machines (SVM), and Naïve Bayes (NB) algorithm. Accuracy is reported to vary
from 63% to 80% depending upon the combination of various features selected.
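To make the feature-selection and training steps concrete, the sketch below extracts unigram and bigram features and trains a small multinomial Naïve Bayes classifier with add-one smoothing. It is an illustrative sketch only: the `ngrams` helper, the `NaiveBayes` class and the four training sentences are assumptions made for this example, not part of any cited study.

```python
import math
import re
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Return all n-grams of the text from unigrams up to n_max-grams."""
    tokens = re.findall(r"[a-z']+", text.lower())
    features = []
    for n in range(1, n_max + 1):
        features += [" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1)]
    return features


class NaiveBayes:
    """Multinomial Naive Bayes over n-gram counts (add-one smoothing)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(ngrams(text))
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            logp = math.log(n_docs / total_docs)  # class prior
            for feature in ngrams(text):
                if feature in self.vocab:  # skip features never seen in training
                    logp += math.log((counts[feature] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Toy training data, invented purely for illustration.
train_texts = ["I love this phone", "great service and great food",
               "I hate the delay", "terrible and boring film"]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_texts, train_labels)
```

With this toy data, `model.predict("I love the great food")` returns `"positive"`. A real experiment would need a far larger tagged corpus, and the reported accuracy range above says nothing about this sketch.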
➢ It works well even when the dictionary size grows exponentially, owing to its ability to learn
and adapt.
2.3.2.2 Disadvantage
The advances in sentiment analysis lured researchers to explore the possibility of a hybrid approach
that could combine the accuracy of a machine learning approach with the speed of a
lexical approach. In the hybrid approach, the authors use two-word lexicons and unlabeled data,
dividing these two-word lexicons into two discrete classes, negative and positive. Pseudo documents
encompassing all the words from the set of chosen lexicons are created.
The cosine similarity between the pseudo documents and the unlabeled documents is then computed.
Depending upon the measure of similarity, the documents are assigned either a positive or a
negative sentiment. This training dataset is then fed to a Naïve Bayes classifier for training
purposes.
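The pseudo-labeling step can be illustrated with a short sketch. It assumes that each class lexicon is joined into one pseudo document and that documents are compared as bag-of-words count vectors; the word lists and function names below are hypothetical and chosen only for the example.

```python
import math
import re
from collections import Counter

# Hypothetical class lexicons (illustrative entries only).
POSITIVE_WORDS = ["good", "great", "love", "excellent"]
NEGATIVE_WORDS = ["bad", "awful", "hate", "terrible"]


def bag_of_words(text):
    """Represent a text as a vector of lowercase token counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# One pseudo document per class, holding every word of that class lexicon.
POSITIVE_DOC = bag_of_words(" ".join(POSITIVE_WORDS))
NEGATIVE_DOC = bag_of_words(" ".join(NEGATIVE_WORDS))


def pseudo_label(document):
    """Assign the class whose pseudo document is most similar, or None on a tie."""
    vector = bag_of_words(document)
    pos = cosine(vector, POSITIVE_DOC)
    neg = cosine(vector, NEGATIVE_DOC)
    if pos == neg:
        return None
    return "positive" if pos > neg else "negative"
```

Documents that receive a label in this way could then serve as training data for a classifier, while tied documents (returning `None`) would be left out.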
Another approach derived a ‘unified framework’ using background lexical
information as word-class associations. The authors refined this information for particular areas using the
available datasets or training examples and proposed a classifier called the Polling Multinomial
Classifier (PMC), also known as multinomial Naïve Bayes. Manually labeled data was
They claimed that making use of lexical knowledge improved performance. Another variant of this
approach has also been presented, but so far its authors have only been able to claim good results
2.3.4 Summary
A comparison of all approaches has shown that the best results are obtained from machine
learning approaches, and the worst from lexical approaches. However, without proper training of the
classifier, the results of a machine learning approach may deteriorate drastically. Work is being carried
out on hybrid approaches. The techniques were tested on a movie review & recommendation and
Their results seem to be promising for further research. Open social networks are best examples of
sociological trust. The exchange of messages, followers and friends and varying sentiments of
users provide a crude platform to study behavioral trust in sentiment analysis domain.
Machine learning approaches have so far been good at delivering accurate results. Depending upon
the application, the success of any approach will vary. The lexical approach is ready to go and doesn’t
require any prior information or training. Machine learning, on the other hand, requires a well
designed classifier, a huge amount of training data and performance tuning prior to deployment.
The hybrid approach has so far shown positive results as far as performance is concerned.
Though they have been deployed using unigrams and bigrams, their performance is worse on
A study was conducted by (Face, Chris, 2016) of Macquarie University on the sentiment of
Donald Trump’s tweets. During the election campaign of 2016, much discussion revolved around
who was sending out Donald Trump’s Tweets. A number of articles described how the tone of
Trump’s tweets is more positive when they come from an iPhone device, than when they come
from an Android. The hypothesis is that Trump tweets from an Android device, and that he
employs social media assistants who tweet from an iPhone. But how do you work that out?
In a data set containing 1,512 tweets from @realDonaldTrump sent during the primaries, there is
a small but positive average sentiment score of 0.3, with scores ranging from -5 to 6. This means
that the average tweet has slightly more positive language than negative. The magnitude of the
The power of sentiment arises when considering other variables in the data. Think of the
now-famous example of the Trump sentiment gap between Android and iPhone. The mean
sentiment score of tweets from Android, 0.1, is significantly lower than the overall average of 0.3.
Engagement
The data from Twitter includes the number of times each tweet has been favorited. This is used as
a proxy for engagement. For this data set, the average is around 19,000. By considering how the
average number of favorites varies with the sentiment, the study discovered another interesting
pattern.
FIGURE 2.4 ENGAGEMENT
Those tweets which have a negative sentiment (scoring -2 or fewer) garner a significantly higher
number of favorites on average. It would seem that Trump’s followers are noticeably more engaged
by negative content.
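The two comparisons described in this study (mean sentiment per device and mean favorites per sentiment band) are both group averages. The sketch below shows one way to compute them. The records are toy values invented for illustration, not the study's data, and the field names and the `group_mean` helper are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Toy records, invented purely for illustration (not the study's data).
tweets = [
    {"source": "Android", "sentiment": -3, "favorites": 30000},
    {"source": "Android", "sentiment": 1, "favorites": 15000},
    {"source": "iPhone", "sentiment": 2, "favorites": 12000},
    {"source": "iPhone", "sentiment": 0, "favorites": 18000},
]


def group_mean(records, key, value):
    """Average the `value` field of the records within each group given by `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[value])
    return {name: mean(values) for name, values in groups.items()}


# Mean sentiment score per sending device.
sentiment_by_device = group_mean(tweets, lambda r: r["source"], "sentiment")

# Mean favorites for negative tweets (score of -2 or lower) versus the rest.
favorites_by_tone = group_mean(
    tweets,
    lambda r: "negative" if r["sentiment"] <= -2 else "other",
    "favorites",
)
```

Whether a difference between two group means is significant would still have to be checked with a statistical test, which this sketch does not attempt.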
A little sentiment analysis can reveal patterns in the data which would be difficult to gain by
● Related systems can identify and analyze many pieces of text automatically and quickly.
● Computer programs have problems recognizing things like sarcasm and irony, negations,
jokes, and exaggerations, the sorts of things a person would have little trouble identifying.
● 'Disappointed' may be classified as a negative word for the purposes of sentiment analysis,
● We would find it easy to recognize as sarcasm the statement "I'm really loving the
● With short sentences and pieces of text, such as those you find on Twitter
especially, and sometimes on Facebook, there might not be enough context for a reliable
sentiment analysis. However, in general, Twitter has a reputation for being a good source
of information for sentiment analysis, and with the new increased character count for tweets
The proposed system uses social media feeds of a particular brand or trend to analyze the sentiment
and polarity of the feeds and thereby determine the general feeling of the masses towards a particular
To ensure good quality of the data used to analyze sentiment, this system removes unnecessary
data or noise and retains only good-quality, plausible data from social media feeds, i.e., data
that contains ‘unknown language’ or too many links will be filtered out.
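A possible shape for this noise filter is sketched below. It assumes that each feed item is a dictionary with a `text` field and a `lang` language code supplied by the data source, and that "too many links" means more than a configurable `max_links`. These field names, the threshold and the function names are assumptions, not a fixed design.

```python
import re

# Matches http and https links inside a feed's text.
URL_PATTERN = re.compile(r"https?://\S+")


def is_clean(feed, max_links=2, allowed_langs=("en",)):
    """Keep a feed only if its language is known and it is not link-heavy."""
    if feed.get("lang") not in allowed_langs:
        return False  # unknown or unsupported language
    return len(URL_PATTERN.findall(feed["text"])) <= max_links


def filter_feeds(feeds, **options):
    """Return the feeds that pass the quality checks, in original order."""
    return [feed for feed in feeds if is_clean(feed, **options)]
```

For instance, a feed whose text contains three links is dropped under the default `max_links=2`, as is any feed without a recognised `lang` value.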
The overall sentiment will require a huge collection of datasets i.e., a whole week of social media
Bakhtawar Seerat, Farouque Azam, “Opinion Mining: Issues and Challenges (A Survey)”,
51.
Capriello, A., Mason, P.R., Davis, B., Crotts, J.C., 2011. Farm tourism experiences in
Chiu, C., Chiu, N.H., Sung, R.J., Hsieh, P.Y., 2015. Opinion mining of hotel customer-generated
Claster, W., Pardo, P., Cooper, M., Tajeddini, K., 2013. Tourism, travel and tweets:
1(1), 81–99.
Ding, X., Liu, B., Yu, P.S., 2008. A holistic lexicon-based approach to opinion
Dodds, P.S., Harris, K.D., Kloumann, I.M., Bliss, C.A., Danforth, C.M., 2011.
Temporal patterns of happiness and information in a global social network:
Duan, W., Cao, Q., Yu, Y., Levy, S., 2013. Mining online user-generated content:
G. Vinodhini et al, “Sentiment Analysis and Opinion Mining: A Survey”, International Journal
Govindarajan & Romina (2013), a survey of classification methods and applications for
Hu, M., Liu, B., 2004. Mining and summarizing customer reviews. In: Proceedings of
http://dx.doi.org/10.1145/1014052.1014073.
Liang, P. W., Liao, C. Y., Chueh, C. C., Zuo, F., Williams, S. T., Xin, X. K., ... & Jen,
Pak, A., & Paroubek, P. (2010, May). Twitter as a corpus for sentiment analysis and opinion
Parikh, R., & Movassate, M. (2009). Sentiment analysis of user-generated twitter updates
Pravesh & Mohd (2014), methodological study of opinion mining and sentiment analysis
Rambocas, M., & Gama, J. (2013). Marketing research: The role of sentiment analysis
Thelwall, M., Buckley, K., Paltoglou, G., 2011. Sentiment in Twitter events. J.
Turney, P.D., Littman, M.L., 2003. Measuring praise and criticism: inference of semantic
(4), 315–346.
Vishal & Sonawane (2016), sentiment analysis of twitter data: a survey of techniques,
Wang, J., Gu, Q., Wang, G., 2013. Potential power and problems in sentiment mining of