Sample Proposal 4

Jomo Kenyatta University of Agriculture and Technology (JKUAT)
Project Proposal and Literature Review Report

TITLE: MACHINE LEARNING BASED SENTIMENT ANALYSIS TO
DETERMINE BRAND POPULARITY USING SOCIAL MEDIA FEEDS.
DEDICATION
This work is dedicated to my entire family and to my fellow students, who have always supported
and guided me throughout my education. I thank you for the mental and emotional support I have
received from you. May God bless you all.
ACKNOWLEDGEMENT
ABSTRACT
With the advancement and growth of web technology, a huge volume of data is available on the
web to internet users, and a great deal more is generated every day. The internet has become a
platform for online learning, exchanging ideas and sharing opinions.
Social networking sites such as Twitter, Facebook and Google+ are rapidly gaining popularity
because they allow people to share and express their views about topics, hold discussions with
different communities, or post messages across the world. Twitter is one of the most commonly
used platforms for sharing opinions and expressing views, and there has been a lot of work in
the field of sentiment analysis of Twitter data. This survey focuses mainly on sentiment
analysis of Twitter data, which is helpful for analyzing the information in tweets, where
opinions are highly unstructured, heterogeneous and either positive or negative, or neutral in
some cases. Organizations can use sentiment analysis to get an idea of customer reviews of their
products, and subsequently try to improve their services based on those reviews.
Key Words: Twitter, Sentiment analysis (SA), Opinion mining, Machine learning, Naïve Bayes
(NB).
TABLE OF CONTENTS
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
1 CHAPTER 1
INTRODUCTION
1.1 BACKGROUND OF STUDY
1.2 PROBLEM STATEMENT
1.3 MOTIVATION
1.4 AIM OF THE STUDY
1.4.1 RESEARCH OBJECTIVES
1.5 SIGNIFICANCE OF THE STUDY
1.6 SCOPE
1.6.1 Data collection:
1.6.2 Preprocessing:
1.6.3 Training Data:
1.6.4 Classification
1.6.5 Results
1.7 ASSUMPTIONS
1.8 LIMITATIONS
2 LITERATURE REVIEW
2.1 APPLICATIONS OF SENTIMENT ANALYSIS
2.2 Machine learning methods for sentimental analysis
2.3 RELATED STUDIES
2.3.1 Lexical analysis
2.3.1.1 Limitations of lexical analysis
2.3.2 Machine learning approach
2.3.2.1 Advantages of machine learning approach
2.3.2.2 Disadvantage
2.3.3 Hybrid approach
2.3.4 Summary
2.4 RELATED SYSTEMS
2.4.1 Trump’s Tweets Sentiment analyzer example
2.5 LIMITATIONS OF RELATED SYSTEMS
2.6 HOW THE PROPOSED SYSTEM SEEKS TO HANDLE THESE CHALLENGES
1 CHAPTER 1
INTRODUCTION
1.1 BACKGROUND OF STUDY
The age of the internet has changed the way people express their views and opinions. This is
now mainly done through blog posts, online forums, product review websites and social media.
Millions of people use social networking sites such as Facebook, Twitter and Google+
to express their emotions and opinions and to share views about their lives.
With advances in technology, especially the explosion of social media, brands and companies are
adopting social media as a marketing tool to reach the masses with their products.
Social media is useful in that it contains a vast number of reviews, opinions, sentiments,
appraisals, attitudes and emotions towards particular products, services, organizations, issues,
individuals, topics and events.
The analysis of sentiments may be document based where the sentiment in the entire document
is summarized as positive, negative or objective. It can be sentence based where individual
sentences, bearing sentiments, in the text are classified. SA can be phrase based where the
phrases in a sentence are classified according to polarity.
Sentiment analysis identifies the phrases in a text that bear some sentiment. The author may
speak about objective facts or subjective opinions, and it is necessary to distinguish between
the two. SA also finds the subject towards which the sentiment is directed: a text may contain
many entities, but it is necessary to find the entity that the sentiment targets. Finally, it
identifies the polarity and degree of the sentiment. Sentiments are classified as objective
(facts), positive (denoting a state of happiness, bliss or satisfaction on the part of the
writer) or negative (denoting a state of sorrow, dejection or disappointment on the part of the
writer). The sentiments can further be given a score based on their degree of positivity,
negativity or objectivity.
Online communities provide an interactive medium where consumers inform and
influence others through forums. Social media generates a large volume of sentiment-rich
data in the form of tweets, status updates, blog posts, comments, reviews, etc. Moreover, social
media gives businesses a platform to connect with their customers
for advertising. People depend to a great extent on user-generated online content
for decision making. For example, someone who wants to buy a product or use a service
first looks up its reviews online and discusses it on social networks, but the data
generated by users is too vast for an ordinary user to analyze. There is therefore a need to
automate this analysis, and various sentiment analysis techniques are widely used to do so.
Sentiment analysis (SA) tells users whether the information about a product is satisfactory
before they buy it. Marketers and firms use this analysis to understand how their products or
services are received, so that they can be offered as per the users’ requirements.
1.1.1 Twitter’s simplicity
Twitter data is interesting because tweets happen at the “speed of thought” and are available for
consumption in real time, and data can be obtained from anywhere in the world.
Twitter is particularly well suited to data mining because of three key features:
o Twitter’s API is well designed and easy to access.
o Twitter data is in a convenient format for analysis.
o Twitter’s terms of use for the data are relatively liberal compared to
other APIs.
1.1.2 Twitter’s API
An Application Programming Interface (API) is a set of programming instructions and standards
for accessing a web-based software application. Twitter bases its API on the Representational
State Transfer (REST) architecture. REST architecture refers to a collection of network design
principles that define resources and ways to address and access data.
The power of sentiment: Sentiment analysis is the process of computationally identifying
and categorizing opinions expressed in a piece of text, specifically in order to find
out whether the writer’s attitude to a specific subject, product, etc. is positive, negative, or
neutral (RENTOUMI, 2012). For example, “I am so happy today, good day everybody” is a
generally positive text, while “Black Panther is such an excellent movie, highly
acclaims 10/10” expresses positive sentiment towards the movie, which is the topic
of the text. Sometimes ascertaining the exact sentiment is complex even for humans.
Take, for example, “I am so surprised many people put Black Panther in their favorite movies ever
list, I felt it was a good watch but definitely not that good”: the sentiment expressed here is
possibly positive, but less so than in the previous text. In many other cases, determining the
sentiment of a text is harder for an algorithm even when it appears easy from a human
perspective, for example: “If you haven’t watched Black Panther, you are not worth my time. If
you plan to see it, there’s hope for you yet.”
Imagine launching a new product and having a real-time picture of how real
consumers perceive it, based on what they say about the product, without the company having to
solicit customers’ opinions through annoying and largely unverifiable product surveys. Do they
like it or not? What should be changed about it? Do they recommend it to other people? These are
the questions sentiment analysis seeks to answer.
Brand marketers can use sentiment analysis to study public attitudes towards their firm and
products, or to analyze customer satisfaction. Organizations may also use sentiment
analysis to gather critical feedback about problems in newly released products.
Apart from helping companies know how they are doing with their customers, sentiment analysis
also gives them a clear picture of how they fare against their competitors. Knowing the
sentiments associated with competitors helps corporations gauge their own performance and
explore other ways to improve.
1.2 PROBLEM STATEMENT
Sentiment analysis is the process of computationally identifying and categorizing
opinions expressed in a piece of text, specifically in order to find out whether the
writer’s attitude to a specific subject, product, etc. is positive, negative, or neutral.
Opinions are central to humans: they are the main influencers of our behavior, and when we make
decisions we look at what other people chose. Learning from others, or simply copying what
others do, is an essential part of the human story. In the real world, businesses
would like to know people’s opinions of their services or products. Likewise, people
would like to know how others feel about a particular product before deciding whether to
buy it, or about a political candidate before deciding who to vote for.
Traditionally, companies conducted public surveys, opinion polls and focus groups to gauge
people’s opinions and feelings towards an issue. These methods reach only a few people, are time
consuming and may give a biased view of opinion on the issue. With the rise of social
media, a product hashtag can be enough to ascertain the sentiment toward the product, since it
gathers abundant candid consumer information, impressions and reviews.
The purpose of this project is to provide a real-time sentiment analysis tool for determining
consumer perception of a particular brand, political topic, issue or event, or even for
tracking a public relations crisis caused by an occurrence affecting a particular product. The
tool uses data mining to crawl through thousands or even millions of social media feeds relating
to a particular issue or product and identify the sentiment in them.
1.3 MOTIVATION
The emergence in the last decade of social media platforms such as Twitter, Facebook and
Instagram has enabled people to express their opinions, thoughts and
emotions on a variety of topics. Large amounts of data are produced on such platforms,
representing an opportunity for companies to assess their social influence and people’s opinions
of their products. Consequently, a computational framework for opinion mining and sentiment
analysis that can adapt to the user’s domain of activity is desirable.
1.4 AIM OF THE STUDY
1.4.1 RESEARCH OBJECTIVES
The main objectives of the study are:
a) To provide real-time sentiment towards a particular product, political issue, event, etc.
b) To track changes in sentiment over a period of time, for example to monitor brand
marketing campaigns.
c) To provide a graphical user interface visualizing the most used words and
emotions relating to the issue.
1.5 SIGNIFICANCE OF THE STUDY
This study aims to develop a natural language processing-based tool to analyze sentiment in
real time as social media feeds stream in. Using a machine learning algorithm, the tool
will be able to work with large datasets of valuable user data, especially on Twitter, where
messages are limited to a small number of characters, so that a lot of opinion is crammed into
short messages.
Furthermore, the tool can give a business valuable insight into how a product it has launched is
faring, and can help identify and avert an emerging public crisis relating to a product. For
example, when an independent user reported that Apple had been knowingly slowing down its iPhones
after just one year of usage so that users would buy new phones, the thread quickly became a
major topic on Twitter and strong customer dissatisfaction was expressed through a flood of
tweets against the brand. With this tool, such emerging dissatisfaction can be detected quickly
and corrective action taken.
The tool can also be used to identify the right influencers (brand advocates) to push a product
or service into the mainstream, by determining the sentiment they carry among social media
users.
Last but not least, the tool can be deployed in an email application and used to determine the
tone of an email in real time as it is written, like a spellchecker for emotions.
1.6 SCOPE
To develop the system, I will use supervised machine learning. The work comprises five stages:
data collection, preprocessing, training, classification and plotting of results.
The project is built using Python as the programming language, which has a wide range of
libraries. Only the following are used in this project:
● NumPy
● TensorFlow
● TextBlob
1.6.1 Data collection:
The data used in this application is sourced from Twitter, since Twitter data is well suited for
the following reasons:
a) Tweets are short.
b) Emotions are crammed into a few characters.
c) Someone who is tweeting is most probably expressing a pressing opinion that they
feel others should know about, so the probability of finding sentiment in a tweet
is high.
A crawler within the application will automatically pull tweets for a specified
hashtag, issue or event.
1.6.2 Preprocessing:
Online data is usually full of noise such as HTML tags, scripts, ads and
non-English text. Keeping this noise strains the classifier and slows it down.
Properly cleansed data improves the performance of the classifier and speeds up
classification, which in turn enables real-time sentiment analysis.
In this phase the extracted data is cleansed and prepared for feeding into the classifier. This
mostly involves extracting keywords and symbols, removing unnecessary whitespace and tabs,
removing non-English text and converting all text to a common case.
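The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual preprocessing code: the function name `clean_tweet`, the regular expressions, and the use of "non-ASCII characters" as a crude stand-in for non-English text are all assumptions made for the example.

```python
import html
import re

# All patterns below are illustrative assumptions, not the project's real rules.
TAG_RE = re.compile(r"<[^>]+>")              # HTML tags left over from scraping
URL_RE = re.compile(r"https?://\S+")         # links carry no sentiment words
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]+")  # crude proxy for non-English text
SPACE_RE = re.compile(r"\s+")                # runs of spaces, tabs and newlines


def clean_tweet(text: str) -> str:
    """Return a lower-cased, whitespace-normalised copy of a raw tweet."""
    text = TAG_RE.sub(" ", text)             # drop markup first
    text = html.unescape(text)               # then decode entities such as &amp;
    text = URL_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    text = SPACE_RE.sub(" ", text).strip()   # collapse whitespace and tabs
    return text.lower()                      # convert to a common case


print(clean_tweet("<b>LOVE</b>   this\tphone &amp; camera http://t.co/x"))
```

Hashtags and mentions are deliberately left in place here, since the text above treats keywords and symbols as useful signal; a real pipeline would likely need a proper language detector rather than the ASCII filter.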
1.6.3 Training Data:
A labelled dataset, usually crowdsourced (e.g. the IMDb movie review dataset), is fed into the
classifier for learning. This dataset is the fuel for the classifier. In this project I am going
to use the Twitter Sentiment Analysis dataset, which has 1,578,627 classified tweets labelled
1 for positive sentiment and 0 for negative sentiment.
1.6.4 Classification
This is the most crucial part of the whole process. A Naïve Bayes classifier is used for
analysis. Despite its simplicity, Naïve Bayes can outperform more sophisticated classification
methods. It is a collection of algorithms that share a common principle: every feature being
classified is assumed to be independent of the value of any other feature (Harry Zhang, 2016).
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. A Naïve Bayes classifier considers each of these features (red, round, 3 inches in
diameter) to contribute independently to the probability that the fruit is an apple,
irrespective of any correlations between features. Features, however, are not
always independent, which is often seen as a limitation of the Naïve Bayes algorithm and is
why it is labelled “naïve”.
$$P(\text{label} \mid \text{features}) = \frac{P(\text{features} \mid \text{label})\, P(\text{label})}{P(\text{features})}$$

P(label) is the prior probability of a label, that is, how likely the label is to be observed
before any features are seen. P(features | label) is the likelihood: the probability of
observing the feature set given the label. P(features) is the prior probability that the given
feature set occurs. Given the naïve assumption that all features are independent of each other,
the equation can be rewritten as follows:
$$P(\text{label} \mid \text{features}) = \frac{P(\text{label})\, P(f_1 \mid \text{label}) \cdots P(f_n \mid \text{label})}{P(\text{features})}$$
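As a worked illustration of the rewritten equation, the sketch below multiplies a prior by per-feature likelihoods and divides by the evidence. The function name `posterior` and every probability in the example are made up for illustration; they are not estimates from any real dataset.

```python
from math import prod


def posterior(priors, likelihoods, features):
    """Return P(label | features) for every label under the naive assumption."""
    # Unnormalised score: P(label) * product of P(f_i | label).
    scores = {
        label: priors[label] * prod(likelihoods[label][f] for f in features)
        for label in priors
    }
    evidence = sum(scores.values())  # P(features), the shared denominator
    return {label: score / evidence for label, score in scores.items()}


# Hypothetical probabilities, chosen only to make the arithmetic easy to follow.
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {"love": 0.20, "wait": 0.05},
    "neg": {"love": 0.05, "wait": 0.20},
}
result = posterior(priors, likelihoods, ["love", "love", "wait"])
print(result)
```

With these numbers the positive score is 0.5 × 0.2 × 0.2 × 0.05 = 0.001 and the negative score is 0.5 × 0.05 × 0.05 × 0.2 = 0.00025, so the positive label receives 0.001 / 0.00125 = 0.8 of the probability mass.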
Once a classifier for sentiment analysis is selected, the trained model must be validated
using k-fold cross-validation. The performance of the model can be determined using the following
measures:
1. Accuracy: The fraction of correct predictions over the total number
of predictions. Acceptable accuracy is usually in the range 70% to 90%. If a model is
100% accurate, this usually indicates that the model overfits the data.
2. Precision: This measure shows how exact the model’s predictions are with respect to each
class. It is the number of true positives over the total number of true positives
and false positives.
3. Recall: This measure shows the completeness of the model with respect to each class. It is
the number of true positives over the total number of true positives and false
negatives.
4. F-score: The harmonic mean of precision and recall,
$$F\text{-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
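The four measures can be computed from a list of true labels and a list of predictions. The following is a small sketch for the binary case with 1 as the positive class, matching the 1/0 labelling of the training data; the function name `binary_metrics` and the sample labels are assumptions for illustration only.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this example six of eight predictions are correct, and precision and recall are both 3/4, so all four measures come out at 0.75.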

After training, the classifier is ready to be deployed to analyze tweets in real time.
1.6.5 Results
Results of the sentiment analysis are presented in a graphical user interface with charts, graphs
and other visualizations. Performance tuning is done prior to deployment.
1.7 ASSUMPTIONS
While conducting this study, the researcher assumed that companies and brands rely mostly on
social media to monitor and advertise their products and brands and to communicate with their
customers. Furthermore, it is assumed that customers express their concerns about a
brand through social media, including their reviews, dissatisfactions and complaints about a
product or brand. Finally, it is assumed that the opinions customers express about a product
or brand are honest and not biased or driven by a political or personal agenda.
1.8 LIMITATIONS
Human language is complex for a machine to read and understand. Opinions are sometimes
expressed through sarcasm, and the order of words adds further confusion. For example: “I
currently use the Blackberry priv and love it, but not as much as the Samsung galaxy s5 I chose
the priv for the camera lens. My blunder.” In this example it is not clear which
phone the user likes or dislikes.
Sarcasm is a further challenge. Consider: “Oh, yeah, Fast Food Eatery. I just LOVE the
40-minute wait for food.” As humans we readily understand the sarcasm and recognize that the
sentiment of this comment is negative. Yet a machine could flag it as positive, possibly even
very positive, because of the capitalized LOVE.
CHAPTER TWO
2 LITERATURE REVIEW
This chapter examines the existing literature relevant to my topic and its
objectives. It gives an insight into the work of other scholars and researchers in the field of
sentiment analysis, covering past studies related to the specific
objectives of the study. It also presents a critical review of the major issues, a summary, the
gaps to be filled and the conceptual framework. Several sentiment analyzers have previously been
developed to classify data as either positive or negative; some of the techniques used to
analyze sentiment are described here.
This section summarizes some of the scholarly and research work in the field of machine
learning and data mining on analyzing sentiment on Twitter and preparing prediction models
for various applications. In recent years a lot of work has been done on sentiment
analysis on Twitter by a number of researchers. In its early stages it was intended for binary
classification, which assigns opinions or reviews to bipolar classes such as positive or negative
only. With the recent rapid growth in the number of social media users, researchers have been
attracted to the use of social media data for sentiment analysis of individuals, products,
persons or events. Twitter is one of the most widely used social media platforms for expressing
opinions.
Sentiment analysis, also referred to as opinion mining, is a technique developed in the fields of
artificial intelligence and natural language processing. It is an information retrieval tool that can
classify text into subjective categories (negative, positive, or neutral) or measure sentiment
strength (Pang and Lee, 2008; Thelwall et al.,2010). There are two major steps in sentiment
analysis: opinion extraction and sentiment classification (Pang and Lee, 2008). Opinion
extraction is to differentiate subjective texts from factual ones, while sentiment classification
focuses on assigning opinion words into different sentiment categories (Chiu et al., 2015).
Opinion words are words that express desirable (e.g., fantastic, amazing, etc.) and undesirable
(terrible, disgusting, etc.) states (Ding et al., 2008).
There are two common methods for determining an opinion word’s semantic orientation or
subjective category: corpus-based and dictionary-based approaches (Chiu et al., 2015).
Corpus-based approaches use the syntactic and co-occurrence patterns of a large corpus
(texts that are most representative of a document’s content) to identify the sentiment category
(Capriello et al., 2011; Chiu et al., 2015; Thelwall et al., 2011). For instance, Turney and Littman
(2003) determined a word’s semantic orientation by calculating the strength of its association
with a set of positive words minus its association with a set of negative words. The associations
were estimated by issuing a search engine query, and then noting the query’s co-occurrence
probability with the positive and negative seed words.
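The association-difference idea just described can be written compactly. In the notation below, which is introduced here only for illustration, $w$ is the word being scored, $P$ and $N$ are the sets of positive and negative seed words, and $A(\cdot,\cdot)$ is whichever association measure is estimated from the co-occurrence counts:

$$\mathrm{SO}(w) = \sum_{p \in P} A(w, p) \; - \sum_{n \in N} A(w, n)$$

A word is then treated as positive when $\mathrm{SO}(w) > 0$ and negative when $\mathrm{SO}(w) < 0$, with the magnitude indicating the strength of the orientation.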
A dictionary-based approach, also called a lexicon-based method, uses a bank of pre-coded
words to determine the text’s semantic orientation (Chiu et al., 2015; Thelwall et al., 2011). For
example, Hu and Liu (2004) generated a set of adjective synonyms and antonyms (opinion
words) through bootstrapping process using the WordNet dictionary. They then used this
collection of opinion words to predict the sentiment orientation of electronic product reviews at a
sentence level. The dictionary-based approach typically counts the numbers of positive and
negative opinion words in a sentence. If positive opinion words prevail, the orientation of the
sentence is positive; otherwise it is negative. Sentiment analysis can be conducted not only at
the sentence level, but also at other levels: document, paragraph or attribute. As the analysis
becomes more fine-grained, its complexity increases. Attribute-level sentiment analysis aims to
associate opinions with specific features (Chiu et al., 2015).
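The counting rule used by the dictionary-based approach can be sketched in a few lines. The two word lists below are tiny hypothetical lexicons built from the example opinion words in this chapter, not a real resource such as the one Hu and Liu derived from WordNet, and returning "neutral" on a tie is an assumption added for the example.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"fantastic", "amazing", "good", "love"}
NEGATIVE = {"terrible", "disgusting", "bad", "hate"}


def lexicon_polarity(sentence: str) -> str:
    """Label a sentence by counting positive and negative opinion words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # assumption: ties and sentences with no opinion words


print(lexicon_polarity("Fantastic screen, amazing sound"))
print(lexicon_polarity("The camera is amazing but the battery is terrible and bad"))
```

Because the rule only counts words, it has no notion of negation or sarcasm, which is one reason the machine learning methods discussed later are attractive.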
Sentiment analysis has been successfully applied in various contexts, such as detecting influenza
outbreak (Culotta, 2010), determining overall trends in the level of happiness (Dodds et
al.,2011), predicting movie box office revenues (Asur & Huberman,2010), and understanding
some consumer opinions on tourism and hospitality related products (Chiu et al., 2015; Claster et
al., 2013; Duan et al., 2013). Chiu et al. (2015) developed a Chinese sentiment analysis method
to extract opinions toward various hotel attributes in Chinese blogs.
The rising popularity of sentiment analysis in research can be attributed to its unique advantages.
First, compared to manual coding, computer-aided sentiment analysis is not only more
efficient, but also produces comparable results. Capriello et al. (2013) compared the efficiency
of manual content coding and two computer-aided sentiment analysis techniques in analyzing 800
travel reviews by former farm stay guests. They found that all three analyses produced similarly
reliable results. Second, as compared to traditional methods (e.g., surveys or focus groups),
sentiment analysis can effectively reduce cost, time, and manual labor by using automatic
algorithms to sort through text (Chiu et al., 2015).
Despite its advantages, sentiment analysis also has drawbacks. As Pang and Lee (2008)
pointed out, sentiment analysis is domain and event dependent. Words considered positive in one
domain might not be so in another area. Sarcasm is an obvious example of this limitation.
Despite its inherent disadvantages, the technique still appears as a promising tool for researchers
and industry practitioners. Wang et al. (2013) reviewed previous studies, and reported that the
analysis technique yields a rather high accuracy rate (roughly 70% to 80% accuracy rate in
training-test data matching tasks). The objective of sentiment analysis is to obtain useful insight
from a large quantity of aggregated data, rather than perfect classification of all data points.
Pak and Paroubek (2010) proposed a model to classify the tweets as objective, positive and
negative. They created a twitter corpus by collecting tweets using Twitter API and automatically
annotating those tweets using emoticons. Using that corpus, they developed a sentiment classifier
based on the multinomial Naïve Bayes method that uses features like N-gram and POS-tags. The
training set they used was less efficient since it contains only tweets having emoticons.
Parikh and Movassate (2009) implemented two models, a Naïve Bayes bigram model and a
Maximum Entropy model to classify tweets. They found that the Naïve Bayes classifiers worked
much better than the Maximum Entropy model.
Barbosa et al. (2010) designed a two-phase automatic sentiment analysis method for classifying
tweets. They classified tweets as objective or subjective and then in second phase, the subjective
tweets were classified as positive or negative. The feature space used included retweets,
hashtags, link, punctuation and exclamation marks in conjunction with features like prior polarity
of words and POS.
Po-Wei Liang et al. (2014) used the Twitter API to collect Twitter data. Their training data
falls into three categories (camera, movie, mobile) and is labeled as positive, negative or
non-opinion. Tweets containing opinions were filtered. A unigram Naive Bayes model was
implemented, employing the Naive Bayes simplifying independence assumption. They
also eliminated useless features using the mutual information and chi-square feature
extraction methods. Finally, the orientation of a tweet is predicted, i.e. positive or negative.
Bakhtawar Seerat et al. (2012) proposed a method of opinion extraction from online web
pages and discussed the limitations of sentiment analysis. Meena Rambocas (2013) summarized the
challenges marketers can face when using sentiment analysis as an alternative technique capable
of triangulating qualitative and quantitative methods through innovative real-time data
collection and analysis.
G. Vinodhini et al. (2012) presented an overview of different opinion mining techniques. Blessy
Selvam et al. (2013) described different approaches to sentiment classification and the existing
methods within a framework. Rudy Prabowo (2009) formed a new approach by combining
rule-based classification, supervised learning and machine learning, and tested it on movie
reviews, product reviews and Myspace comments. The same work also proposed a semi-automatic
approach to improve effectiveness.
2.1 APPLICATIONS OF SENTIMENT ANALYSIS
Word of mouth (WOM) is the process of conveying information from person to person and
plays a major role in customer buying decisions. In commercial situations, WOM involves
consumers sharing attitudes, opinions, or reactions about businesses, products, or services with
other people. WOM communication functions based on social networking and trust. People rely
on families, friends, and others in their social network. Research also indicates that people
appear to trust seemingly disinterested opinions from people outside their immediate social
network, such as online reviews. This is where sentiment analysis comes into play. The growing
availability of opinion-rich resources such as online review sites, blogs and social networking
sites has made this “decision-making process” easier for us. With the explosion of Web 2.0
platforms, consumers have a soapbox of unprecedented reach and power from which they can share
opinions. Major companies have realized that these consumer voices shape the opinions of other
consumers.
Sentiment analysis thus finds use in consumer markets for product reviews, in marketing for
understanding consumer attitudes and trends, in social media for finding the general opinion
about current hot topics, and in film for determining whether a recently released movie is a hit.
Pang-Lee et al. (2002) broadly classify the applications into the following categories.
a. Applications to Review-Related Websites
Movie Reviews, Product Reviews etc.
b. Applications as a Sub-Component Technology
Detecting antagonistic, heated language in mails, spam detection, context sensitive information
detection etc.
c. Applications in Business and Government Intelligence
Knowing Consumer attitudes and trends
d. Applications across Different Domains
Knowing public opinions for political leaders or their notions about rules and regulations in
place etc.
2.2 Machine learning methods for sentimental analysis
Sentiment analysis approaches use different machine learning classifiers and feature extractors.
In this context, the goal of machine learning is to study algorithms that can, in fully
automated settings, predict an output from a given input. There are many ways to do this, e.g.
the use of Naive Bayes, support vector machines (SVM) or maximum entropy. Several
applications have been developed using such algorithms; for example, Microsoft Kinect uses a
built-in random forest to predict body parts from what is on the camera sensor. Many
prototypes and models for sentiment classification treat classifiers and feature extractors as
two distinct components (Pang & Lee, 2008).
2.2.1 Naïve Bayes classifier
Pang and Lee (2008) describe the Naïve Bayes classifier as a supervised machine learning
algorithm: a simple probabilistic classifier based on Bayes’ theorem with strong independence
assumptions. The classifier assumes the presence (or absence) of a particular feature of a class
is unrelated to the presence (or absence) of any other feature. It learns patterns by examining a
set of documents that have been categorized, and compares their contents with the list of words
to assign documents to the right category or class (Vishal & Sonawane, 2016). Let d be the
tweet and c* be the class that is assigned to d, where
$$c^{*} = \arg\max_{c} P_{NB}(c \mid d), \qquad P_{NB}(c \mid d) = \frac{P(c) \prod_{i=1}^{m} P(f_i \mid c)^{n_i(d)}}{P(d)}$$
In the above equation, $f_i$ is a feature, $n_i(d)$ is the count of feature $f_i$ in the tweet
d, and m is the number of features. The parameters P(c) and
P(f|c) are computed through maximum likelihood estimates, and smoothing is used for unseen
features. The Python NLTK library can be used to train a Naïve Bayes classifier and classify text
(Vishal & Sonawane, 2016).
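To make the roles of the feature counts, the maximum likelihood estimates and the smoothing concrete, here is a from-scratch sketch of a multinomial Naïve Bayes classifier. It deliberately avoids NLTK so that it runs with the standard library alone; the function names, the add-one (Laplace) smoothing choice and the four toy training tweets are all assumptions made for this example, not the project's data or code.

```python
from collections import Counter, defaultdict
from math import log


def train_nb(docs):
    """docs: list of (tokens, label). Returns (log priors, log likelihoods, vocab)."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}

    log_prior = {c: log(n / len(docs)) for c, n in label_counts.items()}
    log_like = {}
    for c in label_counts:
        # Add-one smoothing so words unseen in class c keep a non-zero probability.
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like, vocab


def classify(tokens, model):
    """Return the class c* that maximises log P(c) + sum_i n_i(d) * log P(f_i | c)."""
    log_prior, log_like, vocab = model
    counts = Counter(t for t in tokens if t in vocab)  # n_i(d); unknown words skipped
    scores = {
        c: log_prior[c] + sum(n * log_like[c][f] for f, n in counts.items())
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy training set, invented for illustration.
train = [
    ("love this phone".split(), "pos"),
    ("great camera love it".split(), "pos"),
    ("hate the long wait".split(), "neg"),
    ("terrible battery hate it".split(), "neg"),
]
model = train_nb(train)
print(classify("love the camera".split(), model))
print(classify("hate the wait".split(), model))
```

Working in log space replaces the product in the equation with a sum, which avoids numerical underflow on longer texts; the denominator P(d) is the same for every class and can be dropped when only the arg max is needed.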
2.2.2 Maximum entropy
Maximum entropy is a technique for estimating probability distributions from data. In text
classification, maximum entropy estimates the conditional distribution of the class label given a
document. A document is represented by a set of word count features. The labeled training data
is used to estimate the expected value of these word counts on a class-by-class basis
(Govindarajan & Romina, 2013).
The principle in maximum entropy is that when nothing is known, the distribution should
be as uniform as possible, that is, have maximal entropy. Labeled training data is used to
derive a set of constraints for the model that is characterized by the class-specific
expectations for the distribution. Using maximum entropy model, prediction of outcome is
based on everything that is known and assumes nothing about unknown.
Maximum entropy for text classification
In the Maximum Entropy classifier, no assumptions are made regarding the relationships
between the features extracted from the dataset. The classifier tries to maximize the
entropy of the system by estimating the conditional distribution of the class label.
Maximum entropy also handles overlapping features and is equivalent to the logistic regression
method, which finds a distribution over classes (Vishal & Sonawane, 2016). Unlike Naive
Bayes, maximum entropy makes no independence assumptions about its features. The model is
represented by the following:

\[ P_{ME}(c \mid d, \lambda) = \frac{\exp\left[\sum_{i} \lambda_i f_i(c, d)\right]}{\sum_{c'} \exp\left[\sum_{i} \lambda_i f_i(c', d)\right]} \]

where c is the class, d is the tweet, f_i(c, d) is the i-th feature function and λ is the weight
vector. The weights λi decide the importance of a feature in classification (Vishal & Sonawane, 2016).
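To make the role of the weights concrete, the sketch below evaluates a maximum entropy model for one tweet. It assumes binary indicator features (a feature fires when a word occurs in the tweet and is paired with a class), and the weight values are hypothetical numbers chosen for illustration, not weights learned from data.

```python
import math

def maxent_probs(active_features, weights, classes):
    """Return P(c | d) for each class as a normalized exponential of weighted features.

    `weights` maps (feature, class) pairs to lambda values; missing pairs count as 0.
    """
    scores = {
        c: sum(weights.get((f, c), 0.0) for f in active_features)
        for c in classes
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())      # normalizing constant over all classes
    return {c: e / z for c, e in exps.items()}

# Hypothetical weights, for illustration only.
WEIGHTS = {("love", "pos"): 1.2, ("love", "neg"): -0.8, ("hate", "neg"): 1.5}
PROBS = maxent_probs(["love"], WEIGHTS, ["pos", "neg"])
```

Here `PROBS["pos"]` is larger than `PROBS["neg"]` because the only active feature carries a positive weight for the positive class and a negative weight for the negative class. In practice the weights would be fitted on labeled data, for example by an iterative optimizer.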
2.4.3 Support Vector Machine (SVM)
SVM is a supervised learning model that analyzes the data and identifies patterns for
classification. The SVM algorithm is based on the concept of a decision plane that defines decision
boundaries. A decision plane separates groups of instances that have different class memberships.
For example, consider an instance that belongs to either class Circle or class Diamond. There is a
separating line (figure 2.2) which defines a boundary. On the right side of the boundary all
instances are Circle and on the left side all instances are Diamond (Pravesh & Mohd, 2014).
Principle of SVM
In text classification the data are sometimes linearly separable, and this also holds for
many very high-dimensional problems. In most cases, the preferred opinion mining solution is
one that classifies most of the data correctly and ignores outliers and noisy data. If a
training set D cannot be separated cleanly, the solution is to use a fat-margin decision
boundary and accept some misclassifications (Pravesh & Mohd, 2014). SVM can also be used to
extract terrorist entities from a collection of untagged news documents in the terrorist
domain. This method segments each document into sentences, parses the sentences into parse
trees and derives features for the entities within the documents.
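The idea of a wide margin that tolerates some mistakes is usually formalized as a soft-margin linear SVM trained on the hinge loss. The sketch below is a minimal illustration under stated assumptions: it uses plain sub-gradient descent rather than a production solver, and the two-dimensional feature vectors (hypothetically, counts of positive and negative words per tweet) and all hyperparameter values are invented for the example.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on lam/2 * ||w||^2 + hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Regularization step: shrink the weights toward zero (widens the margin).
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # The point is inside the margin or misclassified: move toward it.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on which side of the decision plane x falls."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical features: [number of positive words, number of negative words].
X = [[2, 0], [3, 1], [0, 2], [1, 3]]
y = [1, 1, -1, -1]
W, B = train_linear_svm(X, y)
```

Points whose margin is already at least 1 trigger only the shrink step, so outliers far from the boundary have no influence on the final plane. For real text data a library implementation with sparse features would normally be used instead.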
Lexicon-Based Approaches
The lexicon-based method uses a sentiment dictionary of opinion words and matches them against
the data to determine polarity (Vishal & Sonawane, 2016). Sentiment scores are assigned to
the opinion words, describing how positive, negative or objective the words contained in
the dictionary are. Lexicon-based approaches mainly rely on a sentiment lexicon, i.e., a
collection of known and precompiled sentiment terms, phrases and even idioms, developed
for traditional genres of communication, such as the OpinionFinder lexicon (Vishal &
Sonawane, 2016).
Dictionary-based
It is based on the usage of terms (seeds) that are usually collected and annotated manually.
This set grows by searching the synonyms and antonyms of a dictionary (Vishal &
Sonawane, 2016). An example of that dictionary is WordNet, which is used to develop a
thesaurus called SentiWordNet. However, Dictionary-based approach can’t deal with
domain and context specific orientations (Vishal & Sonawane, 2016).

Corpus-Based
The corpus-based approach provides the dictionaries related to a specific domain. These
dictionaries are generated from a set of seed opinion terms that grows through the search
of related words by means of either statistical or semantic techniques (Vishal &
Sonawane, 2016).
2.3 RELATED STUDIES
2.3.1 Lexical analysis
[Flowchart: Input Text -> Tokenizer -> Dictionary Word Matching -> Match? YES: increment score; NO: decrement score]
FIGURE 2.1 LEXICAL ANALYSIS
A dictionary-based approach, also called a lexicon-based method, uses a bank of pre-coded
words to determine the text’s semantic orientation (Chiu et al., 2015; Thelwall et al., 2011). This
technique is governed by the use of a dictionary consisting of pre-tagged lexicons. The input text
is converted to tokens by the tokenizer. Every new token encountered is then looked up in the
dictionary. If there is a positive match, the score is added to the total pool of score
for the input text.
For instance, if “dramatic” is a positive match in the dictionary then the total score of the text is
incremented. Otherwise, the score is decremented or the word is tagged as negative. Though this
technique appears simplistic, its variants have proved to be worthwhile.
The classification of a text depends on the total score it achieves. A considerable amount of work
has been devoted to determining which lexical information works best.
An accuracy of about 80% on single phrases can be achieved by the use of hand-tagged lexicons
comprised of only adjectives, which are crucial for deciding the subjectivity of an evaluative text.
The dictionary can be grown by searching for the synonyms and antonyms of its words in a well-known
thesaurus or in WordNet dictionaries.
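The scoring loop described above can be sketched as follows. This is a common variant rather than an exact transcription of Figure 2.1: it assumes two separate word lists, increments the score on a positive match, decrements it on a negative match, and ignores tokens found in neither list. The word lists and the regular-expression tokenizer are hypothetical placeholders.

```python
import re

# Hypothetical miniature lexicons; a real system would load a full dictionary.
POSITIVE = {"good", "great", "love", "dramatic"}
NEGATIVE = {"bad", "terrible", "hate"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def lexicon_score(text):
    """Add 1 for each positive token and subtract 1 for each negative token."""
    score = 0
    for token in tokenize(text):
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return score

def polarity(text):
    """Map the total score of a text to a polarity label."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `polarity("I hate this terrible phone")` yields a negative label because two tokens match the negative list and none match the positive list. Set membership keeps each lookup constant-time regardless of dictionary size.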
2.3.1.1 Limitations of lexical analysis
➢ The performance of lexical analysis (in terms of time complexity and accuracy) degrades
drastically with the exponential growth of the size of the dictionary (number of words).
➢ The manual construction of a dictionary is a difficult and time-consuming task.
2.3.2 Machine learning approach
Machine learning is one of the most prominent techniques and attracts researchers because of its
adaptability and accuracy. In sentiment analysis, mostly the supervised learning variants of this
technique are employed. The approach comprises five stages: data collection, pre-processing,
training data, classification and plotting of results. In the training data stage, a collection of
tagged corpora is provided. The classifier is presented with a series of feature vectors derived
from this data. A model is created from the training data set and is then applied to new/unseen
text for classification. In the machine learning technique, the key to the accuracy of a classifier
is the selection of appropriate features. Generally, unigrams (single words), bigrams (two
consecutive words) and trigrams (three consecutive words) are selected as feature vectors. A variety
of other features have been proposed, namely the number of positive words, the number of negative
words and the length of the document. Commonly used classifiers include Support Vector Machines
(SVM) and the Naïve Bayes (NB) algorithm. Accuracy is reported to vary from 63% to 80% depending
on the combination of features selected.
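As an illustration of the feature types mentioned above, the sketch below extracts unigrams, bigrams and trigrams from a tokenized tweet and adds the document length as one further feature. The function names and the `__length__` feature key are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(tokens, max_n=3):
    """Count every n-gram up to max_n and record the document length."""
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngrams(tokens, n))
    features[("__length__",)] = len(tokens)  # hypothetical key for the length feature
    return features

TOKENS = ["i", "love", "this", "phone"]
FEATURES = extract_features(TOKENS)
```

A four-token tweet produces four unigrams, three bigrams and two trigrams, so higher-order n-grams become sparse quickly on short texts such as tweets.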
2.3.2.1 Advantages of machine learning approach
➢ It overcomes the performance degradation that limits the lexical approach.
➢ It works well even when the dictionary size grows exponentially, due to its ability to learn
and adapt.
2.3.2.2 Disadvantages
➢ The machine learning technique faces challenges in designing a classifier.
➢ It depends on the availability of training data and on the correct interpretation of an
unforeseen phrase.
2.3.3 Hybrid approach
Advances in sentiment analysis led researchers to explore the possibility of a hybrid approach
that could combine the accuracy of a machine learning approach with the speed of the
lexical approach. In the hybrid approach, the authors use two-word lexicons and unlabeled data,
dividing these two-word lexicons into two discrete classes, negative and positive. Pseudo documents
encompassing all the words from the set of chosen lexicons are created.
The cosine similarity between the pseudo documents and the unlabeled documents is then computed.
Depending on the measure of similarity, each document is assigned either a positive or a
negative sentiment. This training dataset is then fed to a naïve Bayes classifier for training
purposes.
Another approach derived a ‘unified framework’ using background lexical information as word-class
associations. The authors refined this information for particular domains using the
available datasets or training examples and proposed a classifier called the Polling Multinomial
Classifier (PMC) (also known as the multinomial naïve Bayes). Manually labeled data was
incorporated for training purposes.
They claimed that making use of lexical knowledge improved performance. Another variant of this
approach has been presented, but so far its authors have only been able to claim good results.
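The pseudo-document labeling step can be sketched with bag-of-words vectors and cosine similarity. The two pseudo documents below are hypothetical word lists, and returning `None` for ties is an assumption of this example rather than something stated in the surveyed work.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical pseudo documents built from a positive and a negative lexicon.
POS_PSEUDO = Counter(["good", "great", "love", "excellent"])
NEG_PSEUDO = Counter(["bad", "terrible", "hate", "awful"])

def pseudo_label(tokens):
    """Label an unlabeled document by its closer pseudo document (None on a tie)."""
    doc = Counter(tokens)
    pos_sim = cosine(doc, POS_PSEUDO)
    neg_sim = cosine(doc, NEG_PSEUDO)
    if pos_sim == neg_sim:
        return None
    return "positive" if pos_sim > neg_sim else "negative"
```

Documents labeled this way could then serve as training data for a classifier such as naïve Bayes, while tied documents are left out of the training set.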
2.3.4 Summary
A comparison of all approaches shows that the best results have been observed from machine
learning approaches, and the weakest from lexical approaches. However, without proper training of
the classifier, the results of the machine learning approach may deteriorate drastically. Work is
being carried out on hybrid approaches. The techniques were tested on movie review and
recommendation and news review domains based on user tweets.
Their results seem promising for further research. Open social networks are the best examples of
sociological trust. The exchange of messages, followers and friends and the varying sentiments of
users provide a crude platform to study behavioral trust in the sentiment analysis domain.
Machine learning approaches have so far been good at delivering accurate results. Depending on
the application, the success of any approach will vary. The lexical approach is ready to go and
does not require any prior information or training. Machine learning, on the other hand, requires
a well-designed classifier, a huge amount of training data and performance tuning prior to
deployment.
Hybrid approaches have so far shown promising performance.
Though they have been deployed using unigrams and bigrams, their performance is worse on
trigrams. This leaves room for researchers to explore further.
2.4 RELATED SYSTEMS
2.4.1 Trump’s Tweets Sentiment analyzer example
FIGURE 2.2 TRUMP WORDMAP
Research conducted by (Face, Chris, 2016) of Macquarie University examined the sentiment of
Donald Trump's tweets. During the election campaign of 2016, much discussion revolved around
who was sending out Donald Trump’s tweets. A number of articles described how the tone of
Trump’s tweets is more positive when they come from an iPhone device than when they come
from an Android. The hypothesis is that Trump tweets from an Android device, and that he
employs social media assistants who tweet from an iPhone. But how do you work that out?
In a data set containing 1,512 tweets from @realDonaldTrump sent during the primaries, there is
a small but positive average sentiment score of 0.3, with scores ranging from -5 to 6. This means
that the average tweet has slightly more positive language than negative. The magnitude of the
scores is small because the length of a tweet is restricted.
The power of sentiment arises when considering other variables in the data. Think of the
now-famous example of the Trump sentiment gap between Android and iPhone. The mean
sentiment score of tweets from Android, 0.1, is significantly lower than the overall average of 0.3:
FIGURE 2.3 ANDROID VS IPHONE TRUMP
Engagement
The data from Twitter includes the number of times each tweet has been favorited. This is used as
a proxy for engagement. For this data set, the average is around 19,000. By considering how the
average number of favorites varies with sentiment, the study discovered another interesting
pattern.
FIGURE 2.4 ENGAGEMENT
Tweets with a negative sentiment (scoring -2 or lower) garner a significantly higher
number of favorites on average. It would seem that Trump’s followers are noticeably more engaged
by negative content.
A little sentiment analysis can reveal patterns in the data which would be difficult to gain by
reading through the sea of content.
2.5 LIMITATIONS OF RELATED SYSTEMS
● Related systems can identify and analyze many pieces of text automatically and quickly.
However, computer programs have problems recognizing things like sarcasm and irony, negations,
jokes, and exaggerations - the sorts of things a person would have little trouble identifying.
Failing to recognize these can skew the results.
● 'Disappointed' may be classified as a negative word for the purposes of sentiment analysis,
but within the phrase "I wasn't disappointed", it should be classified as positive.
● We would find it easy to recognize as sarcasm the statement "I'm really loving the
enormous pool at my hotel!", if this statement is accompanied by a photo of a tiny
swimming pool; whereas an automated sentiment analysis tool probably would not and
would most likely classify it as an example of positive sentiment.
● With short sentences and pieces of text, such as those found on Twitter especially, and
sometimes on Facebook, there might not be enough context for a reliable
sentiment analysis. However, in general, Twitter has a reputation for being a good source
of information for sentiment analysis, and with the increased character limit for tweets
it is likely to become even more useful.
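The negation problem in the list above ("I wasn't disappointed") is often approached with a simple heuristic: flip the polarity of an opinion word when the token immediately before it is a negator. The sketch below illustrates that heuristic only; the lexicon, the negator list and the one-token window are assumptions for the example, and the heuristic does nothing for sarcasm or irony.

```python
import re

# Hypothetical miniature lexicon and negator list, for illustration only.
LEXICON = {"disappointed": -1, "bad": -1, "good": 1, "loving": 1}
NEGATORS = {"not", "no", "never", "wasn't", "isn't", "don't", "didn't"}

def tokenize(text):
    """Lowercase, normalize curly apostrophes and split into word tokens."""
    return re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))

def score_with_negation(text):
    """Sum word polarities, flipping a word's sign when a negator precedes it."""
    tokens = tokenize(text)
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if value and i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    return score
```

With this rule, "I wasn't disappointed" scores positive while "I was disappointed" scores negative. A wider negation window or a dependency parse would be needed for phrases where the negator is further away from the opinion word.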
2.6 HOW THE PROPOSED SYSTEM SEEKS TO HANDLE THESE CHALLENGES.
The proposed system uses the social media feeds about a particular brand or trend to analyze the
sentiment and polarity of those feeds, and thereby determines the general feeling of the masses
towards a particular product, service or issue being raised.
To ensure that good-quality data is used to analyze sentiment, the system removes unnecessary
data or noise and retains only good-quality, plausible data from social media feeds, i.e., data
that contains ‘unknown language’ or too many links will be filtered out.
Determining the overall sentiment will require a large collection of data, i.e., a whole week of
social media feeds, so that a more accurate sentiment can be generated.
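The noise-filtering step can be sketched as a simple predicate over feed items. The sketch assumes each item is a dictionary with hypothetical `text` and `lang` keys, that an undetermined language is marked with the code `und`, and that more than two links counts as "too many"; all three are assumptions for illustration, not specifications of the proposed system.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

def is_clean(item, max_links=2):
    """Keep an item only if its language is known and it has few enough links."""
    if item.get("lang", "und") == "und":  # missing or undetermined language
        return False
    return len(URL_PATTERN.findall(item["text"])) <= max_links

def filter_feed(items, max_links=2):
    """Return the items that pass the noise filter, in their original order."""
    return [item for item in items if is_clean(item, max_links)]

# Hypothetical feed items, for illustration only.
FEED = [
    {"text": "Love the new phone!", "lang": "en"},
    {"text": "http://a.example http://b.example http://c.example", "lang": "en"},
    {"text": "???", "lang": "und"},
]
CLEAN = filter_feed(FEED)
```

Only the first item survives: the second exceeds the link threshold and the third has no identifiable language. The threshold is exposed as a parameter so it can be tuned against real feeds.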

REFERENCES
Bakhtawar Seerat, Farouque Azam, “Opinion Mining: Issues and Challenges (A Survey)”,
International Journal of Computer Applications, Vol. 49, No. 9, July 2012, pp. 42-51.
Blessy Selvam, A. Abirami, “A Survey on Opinion Mining Framework”, International
Journal of Advanced Research in Computer and Communication Engineering, Vol. 2,
Issue 9, Sep 2013, pp. 3544-3549.
Capriello, A., Mason, P.R., Davis, B., Crotts, J.C., 2011. Farm tourism experiences in
travel reviews: a cross-comparison of three alternative methods for data analysis.
J. Bus. Res. 66, 778–785.
Chiu, C., Chiu, N.H., Sung, R.J., Hsieh, P.Y., 2015. Opinion mining of hotel customer-generated
contents in Chinese weblogs. Curr. Issues Tourism 18 (5), 477–495.
Claster, W., Pardo, P., Cooper, M., Tajeddini, K., 2013. Tourism, travel and tweets:
algorithmic text analysis methodologies in tourism. Middle East J. Manage.
1(1), 81–99.
Culotta, A., 2010. Detecting Influenza Outbreaks by Analyzing Twitter Messages,
Retrieved from Cornell University Library _http://arxiv.org/abs/1007.4748_.
Ding, X., Liu, B., Yu, P.S., 2008. A holistic lexicon-based approach to opinion
mining. In: Proceedings of the 2008 International Conference on Web Search
and Data Mining, ACM, pp. 231–240.
Dodds, P.S., Harris, K.D., Kloumann, I.M., Bliss, C.A., Danforth, C.M., 2011.
Temporal patterns of happiness and information in a global social network:
Hedonometrics and Twitter. PLoS ONE 6 (12), 1–26.
Duan, W., Cao, Q., Yu, Y., Levy, S., 2013. Mining online user-generated content:
using sentiment analysis technique to study hotel service quality. In:
System Sciences (HICSS), 2013 46th Hawaii International Conference on,
IEEE, January 2013, pp. 3119–3128.
G. Vinodhini et al, “Sentiment Analysis and Opinion Mining: A Survey”, International Journal
of Advanced Research in Computer Science and Software Engineering (IJARCSSE),
Vol 2, Issue 6, June 2012.
Govindarajan & Romina (2013), a survey of classification methods and applications for
sentiment analysis, the international journal of engineering and science (IJES).
Hu, M., Liu, B., 2004. Mining and summarizing customer reviews. In: Proceedings of
the Tenth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, ACM, August 2004, pp. 168–177,
http://dx.doi.org/10.1145/1014052.1014073.
Liang, P. W., Liao, C. Y., Chueh, C. C., Zuo, F., Williams, S. T., Xin, X. K., ... & Jen,
A. K. Y. (2014). Additive enhanced crystallization of solution‐processed
perovskite for highly efficient planar‐heterojunction solar cells. Advanced
materials, 26(22), 3748-3754.
Pak, A., & Paroubek, P. (2010, May). Twitter as a corpus for sentiment analysis and opinion
mining. In LREC (Vol. 10, No. 2010, pp. 1320-1326).
Pang, B., Lee, L., 2008. Opinion mining and sentiment analysis. Found. Trends
Inf. Retrieval 2 (1–2), 1–135.
Parikh, R., & Movassate, M. (2009). Sentiment analysis of user-generated twitter updates
using various classification techniques. CS224N Final Report, 118.
Prabowo, R., & Thelwall, M. (2009). Sentiment analysis: A combined approach.
Journal of Informetrics, 3(2), 143-157.
Pravesh & Mohd (2014), methodological study of opinion mining and sentiment analysis
techniques, international journal on soft computing (IJSC) vol. 5.
Rambocas, M., & Gama, J. (2013). Marketing research: The role of sentiment analysis
(No. 489). Universidade do Porto, Faculdade de Economia do Porto.
Thelwall, M., Buckley, K., Paltoglou, G., 2011. Sentiment in Twitter events. J.
Am. Soc. Inf. Sci. Technol. 62 (2), 406–418.
Turney, P.D., Littman, M.L., 2003. Measuring praise and criticism: inference of semantic
orientation from association. ACM Trans. Inf. Syst. (TOIS) 21 (4), 315–346.
Vishal & Sonawane (2016), sentiment analysis of twitter data: a survey of techniques,
international journal of computer applications, volume 139.
Wang, J., Gu, Q., Wang, G., 2013. Potential power and problems in sentiment mining of
social media. Int. J. Strateg. Decis. Sci. (IJSDS) 4 (2), 16–26.
