
Jomo Kenyatta University of Agriculture and Technology (JKUAT)

Project Proposal and Literature Review Report


TITLE: MACHINE LEARNING BASED SENTIMENT ANALYSIS TO
DETERMINE BRAND POPULARITY USING SOCIAL MEDIA FEEDS.

DEDICATION

This work is dedicated to my entire family and my fellow students, who have always supported and
guided me throughout my education. I thank you for the mental and emotional support I have
received from you. May God bless you all.
ACKNOWLEDGEMENT
ABSTRACT
With the advancement and growth of web technology, a huge volume of data is available on the
web and more is generated every day. The internet has become a platform for online learning,
exchanging ideas and sharing opinions.

Social networking sites such as Twitter, Facebook and Google+ are rapidly gaining popularity
because they allow people to share and express their views on topics, hold discussions with
different communities, and post messages across the world. Twitter is one of the most commonly
used platforms for sharing opinions and expressing views, and a lot of work has been done on
sentiment analysis of Twitter data. This survey focuses mainly on sentiment analysis of Twitter
data, which helps to analyze the information in tweets, where opinions are highly unstructured,
heterogeneous and either positive, negative or, in some cases, neutral. Organizations can use
sentiment analysis to gauge customer reviews of their products and subsequently improve their
services based on those reviews.

Key Words: Twitter, Sentiment analysis (SA), Opinion mining, Machine learning, Naïve Bayes (NB).

TABLE OF CONTENTS
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
1 CHAPTER 1
INTRODUCTION
1.1 BACKGROUND OF STUDY
1.2 PROBLEM STATEMENT
1.3 MOTIVATION
1.4 AIM OF THE STUDY
1.4.1 RESEARCH OBJECTIVES
1.5 SIGNIFICANCE OF THE STUDY
1.6 SCOPE
1.6.1 Data collection:
1.6.2 Preprocessing:
1.6.3 Training Data:
1.6.4 Classification
1.6.5 Results
1.7 ASSUMPTIONS
1.8 LIMITATIONS
2 LITERATURE REVIEW
2.1 APPLICATIONS OF SENTIMENT ANALYSIS
2.2 Machine learning methods for sentimental analysis
2.3 RELATED STUDIES
2.3.1 Lexical analysis
2.3.1.1 Limitations of lexical analysis
2.3.2 Machine learning approach
2.3.2.1 Advantages of machine learning approach
2.3.2.2 Disadvantage
2.3.3 Hybrid approach
2.3.4 Summary
2.4 RELATED SYSTEMS
2.4.1 Trump’s Tweets Sentiment analyzer example
2.5 LIMITATIONS OF RELATED SYSTEMS
2.6 HOW THE PROPOSED SYSTEM SEEKS TO HANDLE THESE CHALLENGES
1 CHAPTER 1

INTRODUCTION

1.1 BACKGROUND OF STUDY

The age of the internet has changed the way people express their views and opinions. This is now
done mainly through blog posts, online forums, product review websites and social media.
Millions of people use social networking sites such as Facebook, Twitter and Google+ to express
their emotions and opinions and to share views about their lives.

With advances in technology, especially the explosion of social media, brands and companies are
using social media as a marketing tool to reach the masses with their products. Social media is
useful because it contains a vast number of reviews, opinions, sentiments, appraisals, attitudes
and emotions towards particular products, services, organizations, issues, individuals, topics
and events.

Sentiment analysis may be document based, where the sentiment of the entire document is
summarized as positive, negative or objective. It can be sentence based, where individual
sentiment-bearing sentences in the text are classified. It can also be phrase based, where the
phrases in a sentence are classified according to polarity.

Sentiment analysis identifies the phrases in a text that bear some sentiment. The author may
state objective facts or subjective opinions, and it is necessary to distinguish between the
two. SA also finds the subject towards which the sentiment is directed: a text may contain many
entities, but it is necessary to find the entity that the sentiment targets. Finally, it
identifies the polarity and degree of the sentiment. Sentiments are classified as objective
(facts), positive (denoting happiness, bliss or satisfaction on the part of the writer) or
negative (denoting sorrow, dejection or disappointment on the part of the writer). Sentiments
can further be given a score based on their degree of positivity, negativity or objectivity.

Online communities provide an interactive medium where consumers inform and influence others
through forums. Social media generates a large volume of sentiment-rich data in the form of
tweets, status updates, blog posts, comments, reviews, etc. Moreover, social media gives
businesses a platform to connect with their customers for advertising. People depend to a great
extent on user-generated online content for decision making. For example, someone who wants to
buy a product or use a service will first look up its reviews online and discuss it on social
networks. However, the data generated by users is too vast for an ordinary user to analyze, so
there is a need to automate the task, and various sentiment analysis techniques are widely used
for this purpose. Sentiment analysis (SA) tells users whether the information about a product is
satisfactory before they buy it. Marketers and firms use this analysis to understand how their
products or services are perceived, so that they can be offered according to users’
requirements.

1.1 Twitter’s simplicity

Twitter data is interesting because tweets happen at the “speed of thought” and are available for

consumption in real time, and you can obtain data from anywhere in the world.

Twitter is particularly well suited to data mining because of three key features:

o Twitter’s API is well designed and easy to access.

o Twitter data is in a convenient format for analysis.

o Twitter’s terms of use for the data are relatively liberal compared to those of other APIs.
1.2 Twitter’s API

An Application Programming Interface (API) is a set of programming instructions and standards
for accessing a web-based software application. Twitter bases its API on the Representational
State Transfer (REST) architecture. REST refers to a collection of network design principles
that define resources and ways to address and access data.

The power of sentiment: Sentiment analysis is the process of computationally identifying and
categorizing the opinions expressed in a piece of text, specifically in order to determine
whether the writer’s attitude towards a specific subject, product, etc. is positive, negative or
neutral (RENTOUMI, 2012). For example, “I am so happy today, good day everybody” is a generally
positive text, while “Black Panther is such an excellent movie, highly acclaims 10/10” expresses
positive sentiment towards the movie, which is the topic of the text. Sometimes ascertaining the
exact sentiment is difficult even for humans. Take for example: “I am so surprised many people
put Black Panther in their favorite movies ever list, I felt it was a good watch but definitely
not that good”. The sentiment expressed here is possibly positive, but less so than in the
previous text. In other cases, determining the sentiment is hard for an algorithm even when it
appears easy to a human, for example: “If you haven’t watched Black Panther, you are not worth
my time. If you plan to see it, there’s hope for you yet.”

Imagine launching a new product and having a real-time picture of how real consumers perceive
it, based on what they say about the product, without the company having to ask for customers’
opinions through annoying and largely unverifiable product surveys. Do they like it or not? What
should be changed about it? Do they recommend it to other people? These are the questions
sentiment analysis seeks to answer.


Brand marketers can use sentiment analysis to study public attitudes towards their firm and
products, or to analyze customer satisfaction. Organizations may also use sentiment analysis to
gather critical feedback about problems in newly released products.

Apart from helping companies know how they are doing with their customers, sentiment analysis
also gives them a clear picture of how they fare against their competitors. Knowing the
sentiment associated with competitors helps corporations gauge their own performance and explore
ways to improve.

1.2 PROBLEM STATEMENT

Sentiment analysis is the process of computationally identifying and categorizing the opinions
expressed in a piece of text, specifically in order to determine whether the writer’s attitude
towards a specific subject, product, etc. is positive, negative or neutral.

Opinions are central to human life and are among the main influences on our behavior: when we
want to make a decision, we look at what other people chose. Learning from others, or simply
copying what others do, is an essential part of the human story. In the real world, businesses
want to know people’s opinions of their services or products. Likewise, people want to know how
others feel about a particular product before deciding whether to buy it, or about a political
candidate before deciding whom to vote for.

Traditionally, companies conducted public surveys, opinion polls and focus groups to gather
people’s opinions and feelings on an issue. This approach is limited to a few people, is time
consuming and may give a biased view of opinion on the issue. With the rise of social media, a
product hashtag can be enough to ascertain the sentiment towards a product, since it gathers
abundant, honest consumer information, impressions and reviews of the product.

The purpose of this project is to provide a real-time sentiment analysis tool that determines
consumer perception of a particular brand, political topic, issue or event, or even tracks a
public relations crisis caused by an occurrence affecting a particular product. The tool uses
data mining to crawl through thousands or even millions of social media feeds relating to a
particular issue or product and identifies the sentiment in them.

1.3 MOTIVATION

The emergence over the last decade of social media platforms such as Twitter, Facebook and
Instagram has enabled people to engage in social activities and express their opinions, thoughts
and emotions on a variety of topics. Such platforms produce large amounts of data, which
represent an opportunity for companies to assess their social influence and people’s opinions of
their products. Consequently, a computational framework for opinion mining and sentiment
analysis that can adapt to the user’s domain of activity is desirable.

1.4 AIM OF THE STUDY

1.4.1 RESEARCH OBJECTIVES

The main objectives of the study are:

a) To provide real-time sentiment towards a particular product, political issue, event, etc.

b) To track changes in sentiment over a period of time, so as to monitor brand marketing
campaigns.

c) To provide a graphical user interface that visualizes the most used words and emotions
relating to the issue.

1.5 SIGNIFICANCE OF THE STUDY

This study aims to develop a natural language processing-based tool that analyzes sentiment in
real time as social media feeds stream in. Using a machine learning algorithm, the tool will be
able to work with large datasets of valuable user data, especially on Twitter, where messages
are limited to a small number of characters, so that a lot of opinion is crammed into short
messages.

Furthermore, this tool can give a business valuable insight into how a launched product is
faring, and it can help identify and avert an emerging public crisis relating to a product. For
example, when an independent user reported that Apple had been knowingly slowing down its
iPhones after just one year of use so that users would buy new phones, the thread quickly became
a major topic on Twitter, and customer dissatisfaction was expressed in a flood of tweets
against the brand. With this tool, such emerging dissatisfaction can be detected quickly and
corrective measures implemented.

This tool can also be used to identify the right influencers (brand advocates) to push a product
or service to the mainstream, by determining the sentiment they carry among social media users.

Last but not least, this tool can be deployed in an email application to determine the tone of
an email as it is being written, like a spellchecker for emotions.
1.6 SCOPE

To develop the system, I will use supervised machine learning. The work comprises five stages:
data collection, preprocessing, training, classification and plotting of results.

The project is built in Python, which has a wide range of libraries. The following are used in
this project:

● NumPy

● TensorFlow

● TextBlob

1.6.1 Data collection:

The data used in this application is sourced from Twitter, which is well suited for these
reasons:

a) Tweets are short.

b) Emotions are crammed into a few characters.

c) Someone who tweets is most likely expressing a pressing opinion that they feel others should
know about, so the probability of finding sentiment in a tweet is high.

A crawler functionality within the application will automatically pull tweets for a specified

hashtag, issue or event.

1.6.2 Preprocessing:

Online data is usually full of noise such as HTML tags, scripts, advertisements and non-English
text. Keeping this material strains the classifier and slows it down. Properly cleansed data
improves the performance of the classifier and speeds up classification, which helps to enable
real-time sentiment analysis.

In this phase the extracted data is cleansed and prepared for feeding into the classifier. This
mostly involves extracting keywords and symbols, removing unnecessary whitespace and tabs,
removing non-English text and converting all text to a common case.
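
The cleaning steps described above can be sketched roughly as follows, assuming each tweet
arrives as a plain string. The function name clean_tweet and the regular expressions are
illustrative only and are not part of the proposed system; in particular, dropping non-ASCII
characters is a crude stand-in for removing non-English text.

```python
import html
import re

# Illustrative patterns (assumptions, not taken from the proposal).
TAG_RE = re.compile(r"<[^>]+>")          # leftover HTML tags
URL_RE = re.compile(r"https?://\S+")     # links
MENTION_RE = re.compile(r"@\w+")         # @user mentions
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")  # rough proxy for non-English text


def clean_tweet(text: str) -> str:
    """Return a lower-cased tweet with markup, links and extra whitespace removed."""
    text = html.unescape(text)           # turn &amp; into & and so on
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    text = text.lower()                  # convert to a common case
    return " ".join(text.split())        # collapse spaces, tabs and newlines


print(clean_tweet("<b>LOVE</b>  this\tphone!! @acme https://t.co/xyz  &amp; more"))
```

A real pipeline would probably use a language detector rather than an ASCII filter, since the
filter also removes emoji and accented English words.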

1.6.3 Training Data:

A dataset, often crowdsourced (e.g. the IMDb movie review dataset), is fed into the classifier
for learning. This dataset is the fuel for the classifier. In this project I will use the
Twitter Sentiment Analysis dataset, which contains 1,578,627 classified tweets labelled 1 for
positive sentiment and 0 for negative sentiment.

1.6.4 Classification

This is the most crucial part of the whole process. A Naïve Bayes classifier, based on Bayes’
theorem, is used for analysis. Despite its simplicity, Naïve Bayes can sometimes outperform more
sophisticated classification methods. It is a family of algorithms that share a common
principle: every feature being classified is assumed to be independent of the value of any other
feature (Harry Zhang, 2016). For example, a fruit may be considered an apple if it is red, round
and about 3 inches in diameter. A Naïve Bayes classifier treats each of these features (red,
round, 3 inches in diameter) as contributing independently to the probability that the fruit is
an apple, irrespective of any correlations between the features. Features, however, are not
always independent, which is often seen as a limitation of the Naïve Bayes algorithm and is why
it is labelled “naive”.

$$P(\text{label} \mid \text{features}) = \frac{P(\text{features} \mid \text{label})\, P(\text{label})}{P(\text{features})}$$

P(label) is the prior probability of a label, that is, how likely the label is to be observed
before any features are seen. P(features | label) is the likelihood of observing the feature set
given that label. P(features) is the probability that the given feature set occurs. Under the
naive assumption that all features are independent of each other given the label, the equation
can be rewritten as follows:

$$P(\text{label} \mid \text{features}) = \frac{P(\text{label})\, P(f_1 \mid \text{label}) \cdots P(f_n \mid \text{label})}{P(\text{features})}$$
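
As a small worked illustration of the factorized rule, the sketch below computes the posterior
for the apple example. All probabilities are made-up numbers chosen for illustration, and the
function name naive_bayes_posteriors is hypothetical. P(features) is obtained by summing the
numerators over all labels.

```python
from math import prod


def naive_bayes_posteriors(priors, likelihoods, observed):
    """Return P(label | features) for every label under the naive assumption."""
    # Numerator: P(label) * P(f1 | label) * ... * P(fn | label)
    scores = {
        label: priors[label] * prod(likelihoods[label][f] for f in observed)
        for label in priors
    }
    evidence = sum(scores.values())  # P(features), summed over all labels
    return {label: score / evidence for label, score in scores.items()}


# Assumed, illustrative probabilities (not estimated from any data).
priors = {"apple": 0.5, "other": 0.5}
likelihoods = {
    "apple": {"red": 0.8, "round": 0.9, "3in": 0.7},
    "other": {"red": 0.3, "round": 0.4, "3in": 0.2},
}

posteriors = naive_bayes_posteriors(priors, likelihoods, ["red", "round", "3in"])
print(posteriors)
```

With these numbers the numerators are 0.252 for "apple" and 0.012 for "other", so the fruit is
classified as an apple with a posterior of about 0.95.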

Once a classifier for sentiment analysis is selected, the trained model must be validated using
k-fold cross-validation. The performance of the model can be determined using the following
measures:

1. Accuracy: the fraction of correct predictions over the total number of predictions. Accepted
accuracy is usually in the range 70% to 90%. A model that is 100% accurate usually indicates
that it overfits the data.

2. Precision: shows how accurately the model makes predictions with respect to each class. It is
the number of true positives over the total number of true positives and false positives.

3. Recall: shows the completeness of the model with respect to each class. It is the number of
true positives over the total number of true positives and false negatives.

4. F-score: the harmonic mean of precision and recall, measured as

$$F\text{-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
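
The four measures can be computed directly from the true and predicted labels. The sketch below
is a minimal illustration that treats label 1 (positive sentiment) as the class of interest; the
function name classification_metrics and the sample labels are assumptions made for the example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F-score for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Made-up labels: 1 = positive sentiment, 0 = negative sentiment.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here there are 3 true positives, 1 false positive and 1 false negative, and 6 of the 8
predictions are correct, so all four measures come out at 0.75.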


The classifier after training is now ready to be deployed to analyze tweets in real-time for

sentimental analysis.

1.6.5 Results

Results of the sentiment analysis are presented in a graphical user interface with charts,
graphs and other types of visualization. Performance tuning is done prior to deployment.

1.7 ASSUMPTIONS

While conducting this study, the researcher assumed that all companies and brands rely mostly on
social media to monitor and advertise their products and brands and to communicate with their
customers. Furthermore, it is assumed that customers express their concerns about a brand
through social media, including their reviews, dissatisfaction and complaints about a product or
brand. Finally, it is assumed that opinions expressed by customers about a product or brand are
honest and are not biased or driven by a political or personal agenda.

1.8 LIMITATIONS

Human language is complex and difficult for a machine to read and understand. Opinions are
sometimes expressed as sarcasm, and the order of words adds further confusion. For example: “I
currently use the Blackberry priv and love it, but not as much as the Samsung galaxy s5 I chose
the priv for the camera lens. My blunder.” In this example it is not clear which phone the user
likes or dislikes.

Consider also sarcasm: “Oh, yeah, Fast Food Eatery. I just LOVE the 40-minute wait for food.” As
humans we readily understand sarcasm and recognize that the sentiment of this comment is
negative. Yet a machine could flag it as positive, possibly even very positive because of the
capitalized “LOVE”, which would be badly wrong.

CHAPTER TWO

2 LITERATURE REVIEW

This chapter examines the existing literature relevant to my topic and its objectives. It gives
an insight into the work of other scholars and researchers in the field of sentiment analysis,
covers past studies related to the specific objectives of this study, and presents a critical
review of the major issues, a summary, the gaps to be filled and the conceptual framework.
Several sentiment analyzers have previously been developed to classify data as either positive
or negative; some of the techniques they use to analyze sentiment are discussed here.

This section summarizes some of the scholarly and research work in machine learning and data
mining on analyzing sentiment on Twitter and building prediction models for various
applications. In recent years a number of researchers have done a lot of work on sentiment
analysis on Twitter. In its early stage the field was concerned with binary classification,
which assigns opinions or reviews to one of two classes, positive or negative. With the rapid
recent growth in social media users, researchers have been attracted to using social media data
for sentiment analysis of individuals, products, persons or events. Twitter is one of the most
widely used social media platforms for expressing opinions.

Sentiment analysis, also referred to as opinion mining, is a technique developed in the fields
of artificial intelligence and natural language processing. It is an information retrieval tool
that can classify text into subjective categories (negative, positive, or neutral) or measure
sentiment strength (Pang and Lee, 2008; Thelwall et al., 2010). There are two major steps in
sentiment analysis: opinion extraction and sentiment classification (Pang and Lee, 2008).
Opinion extraction differentiates subjective texts from factual ones, while sentiment
classification focuses on assigning opinion words to different sentiment categories (Chiu et
al., 2015). Opinion words are words that express desirable (e.g., fantastic, amazing) and
undesirable (e.g., terrible, disgusting) states (Ding et al., 2008).

There are two common methods for determining an opinion word’s semantic orientation or
subjective category: the corpus-based and the dictionary-based approach (Chiu et al., 2015).
Corpus-based approaches use the syntactic and co-occurrence patterns of a large corpus (texts
that are most representative of a document’s content) to identify the sentiment category
(Capriello et al., 2011; Chiu et al., 2015; Thelwall et al., 2011). For instance, Turney and
Littman (2003) determined a word’s semantic orientation by calculating the strength of its
association with a set of positive words minus the strength of its association with a set of
negative words. The associations were estimated by issuing search engine queries and noting each
query’s co-occurrence probability with the positive and negative seed words.
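
The idea of "association with positive seeds minus association with negative seeds" can be
sketched with pointwise mutual information (PMI) as the association measure. The example below
is only an illustration: it counts co-occurrences in a tiny made-up corpus instead of using
search engine hit counts, and the seed words, the smoothing constant and the function names are
assumptions.

```python
from math import log2


def pmi(word, seed, docs, smoothing=0.01):
    """Pointwise mutual information of two words, estimated from document counts."""
    n = len(docs)
    both = sum(1 for d in docs if word in d and seed in d)
    word_count = sum(1 for d in docs if word in d)
    seed_count = sum(1 for d in docs if seed in d)
    # log2( P(word, seed) / (P(word) * P(seed)) ), smoothed to avoid log(0)
    return log2(
        ((both + smoothing) * n) / ((word_count + smoothing) * (seed_count + smoothing))
    )


def semantic_orientation(word, docs, pos_seeds, neg_seeds):
    """Association with the positive seeds minus association with the negative seeds."""
    positive = sum(pmi(word, seed, docs) for seed in pos_seeds)
    negative = sum(pmi(word, seed, docs) for seed in neg_seeds)
    return positive - negative


# Tiny made-up corpus; each document is a set of tokens.
docs = [
    {"battery", "excellent", "good"},
    {"screen", "excellent", "superb"},
    {"battery", "poor", "slow"},
    {"service", "poor", "terrible"},
    {"superb", "good", "camera"},
    {"terrible", "bad", "slow"},
]
POS_SEEDS = ("good", "excellent")
NEG_SEEDS = ("bad", "poor")

for w in ("superb", "terrible"):
    print(w, round(semantic_orientation(w, docs, POS_SEEDS, NEG_SEEDS), 2))
```

A positive score suggests a positive orientation and a negative score a negative one. With a
corpus this small the magnitudes are dominated by the smoothing constant, so only the sign is
meaningful.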

A dictionary-based approach, also called a lexicon-based method, uses a bank of pre-coded words
to determine the text’s semantic orientation (Chiu et al., 2015; Thelwall et al., 2011). For
example, Hu and Liu (2004) generated a set of adjective synonyms and antonyms (opinion words)
through a bootstrapping process using the WordNet dictionary. They then used this collection of
opinion words to predict the sentiment orientation of electronic product reviews at the sentence
level. The dictionary-based approach typically counts the numbers of positive and negative
opinion words in a sentence. If positive opinion words prevail, the orientation of the sentence
is positive; otherwise it is negative. Sentiment analysis can be conducted not only at the
sentence level but also at other levels: document, paragraph or attribute level. As the level of
granularity increases, so does the complexity. Attribute-level sentiment analysis aims to
associate opinions with specific features (Chiu et al., 2015).
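
The counting rule described above can be sketched in a few lines. The two word lists are tiny
illustrative lexicons, not the Hu and Liu opinion lexicon, and the function name
sentence_orientation is hypothetical. Following the text, a tie is treated as negative.

```python
import re

# Tiny illustrative lexicons (assumptions for the example).
POSITIVE_WORDS = {"fantastic", "amazing", "good", "excellent", "love"}
NEGATIVE_WORDS = {"terrible", "disgusting", "bad", "poor", "hate"}


def sentence_orientation(sentence, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Label a sentence by counting positive and negative opinion words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    pos_count = sum(w in positive for w in words)
    neg_count = sum(w in negative for w in words)
    # Positive only if positive words prevail; ties fall through to negative.
    return "positive" if pos_count > neg_count else "negative"


print(sentence_orientation("The room was fantastic and the staff were amazing"))
print(sentence_orientation("The food was terrible"))
print(sentence_orientation("I just LOVE the 40-minute wait for food"))
```

The last call returns "positive" for a sarcastic complaint, which illustrates why pure word
counting struggles with sarcasm.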

Sentiment analysis has been successfully applied in various contexts, such as detecting
influenza outbreaks (Culotta, 2010), determining overall trends in the level of happiness (Dodds
et al., 2011), predicting movie box office revenues (Asur & Huberman, 2010), and understanding
consumer opinions on tourism and hospitality related products (Chiu et al., 2015; Claster et
al., 2013; Duan et al., 2013). Chiu et al. (2015) developed a Chinese sentiment analysis method
to extract opinions toward various hotel attributes in Chinese blogs.

The rising popularity of sentiment analysis in research can be attributed to its unique
advantages. First, compared to manual coding, computer-aided sentiment analysis is not only more
efficient but also produces comparable results. Capriello et al. (2013) compared the efficiency
of manual content coding and two computer-aided sentiment analysis techniques in analyzing 800
travel reviews by former farm stay guests. They found that all three analyses produced similarly
reliable results. Second, compared to traditional methods (e.g., surveys or focus groups),
sentiment analysis can effectively reduce cost, time and manual labor by using automatic
algorithms to sort through text (Chiu et al., 2015).

Despite its advantages, sentiment analysis also has many drawbacks. As Pang and Lee (2008)
pointed out, sentiment analysis is domain and event dependent: words considered positive in one
domain might not be so in another. Sarcasm is an obvious example of this limitation. Despite
these inherent disadvantages, the technique still appears to be a promising tool for researchers
and industry practitioners. Wang et al. (2013) reviewed previous studies and reported that the
technique yields a rather high accuracy rate (roughly 70% to 80% in training-test data matching
tasks). The objective of sentiment analysis is to obtain useful insight from a large quantity of
aggregated data, rather than perfect classification of every data point.

Pak and Paroubek (2010) proposed a model to classify tweets as objective, positive or negative.
They created a Twitter corpus by collecting tweets using the Twitter API and automatically
annotating those tweets using emoticons. Using that corpus, they developed a sentiment
classifier based on the multinomial Naïve Bayes method that uses features such as N-grams and
POS tags. Their training set was less effective because it contained only tweets with emoticons.

Parikh and Movassate (2009) implemented two models, a Naïve Bayes bigram model and a

Maximum Entropy model to classify tweets. They found that the Naïve Bayes classifiers worked

much better than the Maximum Entropy model.

Barbosa et al. (2010) designed a two-phase automatic sentiment analysis method for classifying
tweets. In the first phase they classified tweets as objective or subjective, and in the second
phase the subjective tweets were classified as positive or negative. The feature space included
retweets, hashtags, links, punctuation and exclamation marks, in conjunction with features such
as the prior polarity of words and POS tags.

Po-Wei Liang et al. (2014) used the Twitter API to collect Twitter data. Their training data
falls into three categories (camera, movie, mobile) and is labelled as positive, negative or
non-opinion. Tweets containing opinions were filtered. A unigram Naive Bayes model was
implemented, employing the simplifying Naive Bayes independence assumption. They also eliminated
useless features using the Mutual Information and Chi-square feature extraction methods.
Finally, the orientation of a tweet is predicted, i.e. positive or negative.

Bakhtawar Seerat et al. (2012) proposed a method for extracting opinions from online web pages
and discussed the limitations of sentiment analysis. Meena Rambocas (2013) summarized the
challenges marketers can face when using sentiment analysis as an alternative technique capable
of triangulating qualitative and quantitative methods through innovative real-time data
collection and analysis.

G. Vinodhini et al. (2012) presented an overview of different opinion mining techniques. Blessy
Selvam et al. (2013) described different approaches to sentiment classification and the existing
methods within a framework. Rudy Prabowo (2009) formed a new approach by combining rule-based
classification, supervised learning and machine learning, tested it on movie reviews, product
reviews and Myspace comments, and also proposed a semi-automatic approach to improve
effectiveness.

2.1 APPLICATIONS OF SENTIMENT ANALYSIS

Word of mouth (WOM) is the process of conveying information from person to person, and it plays
a major role in customer buying decisions. In commercial situations, WOM involves consumers
sharing attitudes, opinions or reactions about businesses, products or services with other
people. WOM communication functions on the basis of social networking and trust: people rely on
family, friends and others in their social network. Research also indicates that people appear
to trust seemingly disinterested opinions from people outside their immediate social network,
such as online reviews. This is where sentiment analysis comes into play. The growing
availability of opinion-rich resources such as online review sites, blogs and social networking
sites has made this decision-making process easier. With the explosion of Web 2.0 platforms,
consumers have a soapbox of unprecedented reach and power from which to share opinions. Major
companies have realized that these consumer voices shape the opinions of other consumers.

Sentiment analysis thus finds use in the consumer market for product reviews, in marketing for
understanding consumer attitudes and trends, in social media for finding general opinion on
current hot topics, and in film for finding out whether a recently released movie is a hit.

Pang-Lee et al. (2002) broadly classify the applications into the following categories.

a. Applications to Review-Related Websites

Movie Reviews, Product Reviews etc.

b. Applications as a Sub-Component Technology

Detecting antagonistic, heated language in mails, spam detection, context sensitive information

detection etc.

c. Applications in Business and Government Intelligence

Knowing Consumer attitudes and trends

d. Applications across Different Domains

Knowing public opinions for political leaders or their notions about rules and regulations in

place etc.

2.2 Machine learning methods for sentimental analysis

Sentiment analysis approaches use different machine learning classifiers and feature extractors.
In this context, the goal of machine learning is to study algorithms that can, in fully
automated settings, predict an output from a given input. There are many ways to do this, e.g.
Naive Bayes, support vector machines (SVM) and maximum entropy. Several applications have been
developed using such algorithms, for example Microsoft Kinect, which uses a random forest to
predict body parts from what the camera sensor sees. Many prototypes and models for sentiment
classification treat classifiers and feature extractors as two distinct components (Pang & Lee,
2008).

2.2.1 Naïve Bayes classifier

Pang and Lee (2008) describe the Naive Bayes classifier as a supervised machine learning
algorithm: a simple probabilistic classifier based on Bayes’ theorem with strong independence
assumptions. The classifier assumes that the presence (or absence) of a particular feature of a
class is unrelated to the presence (or absence) of any other feature. It learns the pattern by
examining a set of documents that have been categorized, and it compares the contents of a
document with the list of words to assign the document to its correct category or class (Vishal
& Sonawane, 2016). Let d be the tweet and c* be the class assigned to d, where

$$c^{*} = \arg\max_{c} P_{NB}(c \mid d), \qquad P_{NB}(c \mid d) = \frac{P(c) \prod_{i=1}^{m} P(f_i \mid c)^{n_i(d)}}{P(d)}$$

In the above equation, f_i is a feature, n_i(d) is the count of feature f_i in the tweet d, and
m is the number of features. The parameters P(c) and P(f|c) are computed through maximum
likelihood estimates, and smoothing is applied for unseen features. The Python NLTK library can
be used to train a Naive Bayes model and classify text (Vishal & Sonawane, 2016).
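
To make the estimation concrete, here is a minimal pure-Python sketch of a multinomial Naive
Bayes classifier over word counts. It is not the NLTK implementation: the class name
MultinomialNB, the add-one (Laplace) smoothing and the four training tweets are assumptions made
for illustration. Scores are computed in log space, and the constant P(d) is omitted because it
does not affect which class scores highest.

```python
from collections import Counter, defaultdict
from math import log


class MultinomialNB:
    """Word-count Naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # used for P(c)
        self.word_counts = defaultdict(Counter)    # used for P(f | c)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_docs = len(labels)
        return self

    def log_score(self, tokens, label):
        # log P(c) + sum_i n_i(d) * log P(f_i | c)
        score = log(self.class_counts[label] / self.total_docs)
        total_words = sum(self.word_counts[label].values())
        for word, n in Counter(tokens).items():
            if word not in self.vocab:
                continue  # skip words never seen during training
            p = (self.word_counts[label][word] + 1) / (total_words + len(self.vocab))
            score += n * log(p)
        return score

    def predict(self, tokens):
        return max(self.class_counts, key=lambda c: self.log_score(tokens, c))


# Made-up training tweets: 1 = positive sentiment, 0 = negative sentiment.
train_docs = [
    "love this phone great camera".split(),
    "great battery love it".split(),
    "terrible screen hate it".split(),
    "battery is terrible poor phone".split(),
]
train_labels = [1, 1, 0, 0]

model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict("love the great camera".split()))
print(model.predict("terrible poor screen".split()))
```

The labels 1 and 0 follow the convention of the training dataset described in section 1.6.3.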

2.2.2 Maximum entropy

Maximum entropy is a technique for estimating probability distributions from data. In text

classification, maximum entropy estimates the conditional distribution of the class label given a

document. A document is represented by a set of word count features. The labeled training data

is used to estimate the expected value of these word counts on a class-by-class basis

(Govindarajan & Romina, 2013).

The principle of maximum entropy is that when nothing is known, the distribution should be as
uniform as possible, that is, have maximal entropy. Labeled training data is used to derive a set
of constraints for the model, which are characterized by the class-specific expectations for the
distribution. With a maximum entropy model, the predicted outcome is based on everything that is
known and assumes nothing about what is unknown.

Maximum entropy for text classification

In the maximum entropy classifier, no assumptions are made about the relationships between the
features extracted from the dataset. The classifier tries to maximize the entropy of the system
by estimating the conditional distribution of the class label. Maximum entropy can even handle
overlapping features, and it is equivalent to the logistic regression method, which finds a
distribution over classes (Vishal & Sonawane, 2016). Unlike Naive Bayes, maximum entropy makes
no independence assumptions about its features. The model is represented by the following:

$$P_{ME}(c \mid d, \lambda) = \frac{\exp\left[\sum_{i} \lambda_i f_i(c, d)\right]}{\sum_{c'} \exp\left[\sum_{i} \lambda_i f_i(c', d)\right]}$$

where c is the class, d is the tweet, f_i(c, d) is the i-th feature function and λ is the weight
vector with components λi. The weights determine the importance of each feature in
classification (Vishal & Sonawane, 2016).
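
As a rough illustration of how such a model turns weighted features into class probabilities, the sketch below evaluates the normalized exponential form for one tweet. The feature functions are assumed to be simple (word, class) counts, and the weights are made-up values rather than trained parameters; in practice the weights would be learned from labeled data.

```python
import math


def joint_features(tokens, classes):
    """Hypothetical feature functions f_i(c, d): count of each word, paired with class c."""
    return {c: {(t, c): float(tokens.count(t)) for t in set(tokens)} for c in classes}


def maxent_probabilities(feature_values, weights):
    """P(c|d) = exp(sum_i w_i * f_i(c, d)) / sum_c' exp(sum_i w_i * f_i(c', d))."""
    scores = {
        c: sum(weights.get(name, 0.0) * value for name, value in feats.items())
        for c, feats in feature_values.items()
    }
    max_score = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {c: math.exp(s - max_score) for c, s in scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}


# Illustrative weights only; they are not estimated from any dataset.
weights = {
    ("good", "positive"): 1.2,
    ("good", "negative"): -0.5,
    ("slow", "negative"): 0.7,
}
features = joint_features("good but slow".split(), ("positive", "negative"))
probs = maxent_probabilities(features, weights)
print(probs)
```

Because the same word may carry a different weight for each class, overlapping or correlated features simply contribute to the score without any independence assumption being required.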

2.4.3 Support Vector Machine (SVM)

SVM is a supervised learning model that analyzes data and identifies patterns for classification.
The SVM algorithm is based on the concept of a decision plane that defines decision boundaries. A
decision plane separates groups of instances that have different class memberships. For example,
consider an instance which belongs to either class Circle or class Diamond. There is a separating
line (figure 2.2) which defines a boundary. On the right side of the boundary all instances are
Circle and on the left side all instances are Diamond (Pravesh & Mohd, 2014).

Principle of SVM
In text classification the data are sometimes linearly separable; this is often the case for very
high-dimensional problems. In most cases, however, the preferred opinion mining solution is one
that classifies most of the data correctly and ignores outliers and noisy data. If a training set
D cannot be separated cleanly, the solution is to use a wide ("fat") decision margin and tolerate
some mistakes (Pravesh & Mohd, 2014). SVM can also be used to extract terrorist entities from a
collection of untagged news documents in the terrorism domain. This method segments each document
into sentences, parses the sentences into parse trees and derives features for the entities
within the documents.
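
The idea of a wide margin that tolerates some mistakes can be sketched with a small soft-margin linear SVM trained by subgradient descent on the hinge loss. This is only an illustrative sketch, not the method of the cited authors: the one-dimensional feature values, the labels (+1 for Circle, -1 for Diamond), the penalty c and the learning rate are all assumptions chosen to keep the example small.

```python
def train_linear_svm(points, labels, c=1.0, learning_rate=0.01, epochs=200):
    """Minimise 0.5*||w||^2 + c * sum(max(0, 1 - y*(w.x + b))) by subgradient descent."""
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5*||w||^2 regulariser
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point lies inside the margin or is misclassified
                for j in range(dim):
                    grad_w[j] -= c * y * x[j]
                grad_b -= c * y
        w = [wi - learning_rate * g for wi, g in zip(w, grad_w)]
        b -= learning_rate * grad_b
    return w, b


def predict(x, w, b):
    """Return +1 (Circle) or -1 (Diamond) depending on the side of the boundary."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical data: Circles (+1) on the right, Diamonds (-1) on the left,
# plus one noisy Circle placed deep inside the Diamond region.
points = [[2.0], [3.0], [4.0], [-2.0], [-3.0], [-4.0], [-3.0]]
labels = [1, 1, 1, -1, -1, -1, 1]
w, b = train_linear_svm(points, labels)
print(w, b, [predict(x, w, b) for x in points])
```

In this toy setting the learned boundary still separates the six clean points, and the noisy Circle is simply accepted as a training mistake instead of distorting the boundary.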

Lexicon-Based Approaches

Lexicon-based methods use a sentiment dictionary of opinion words and match them against the
data to determine polarity (Vishal & Sonawane, 2016). They assign sentiment scores to the
opinion words, describing how positive, negative or objective the words contained in the
dictionary are. Lexicon-based approaches mainly rely on a sentiment lexicon, i.e., a collection
of known and precompiled sentiment terms, phrases and even idioms, developed for traditional
genres of communication, such as the OpinionFinder lexicon (Vishal & Sonawane, 2016).

Dictionary-based

This approach is based on a set of terms (seeds) that are usually collected and annotated
manually. The set grows by searching a dictionary for the synonyms and antonyms of its terms
(Vishal & Sonawane, 2016). An example of such a dictionary is WordNet, which is used to develop
a thesaurus called SentiWordNet. However, the dictionary-based approach cannot deal with
domain-specific and context-specific orientations (Vishal & Sonawane, 2016).
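
The way a seed set grows through synonyms and antonyms can be sketched as follows. The tiny synonym and antonym tables below are hypothetical stand-ins for a real resource such as WordNet, and the function name is invented for this example; synonyms inherit the polarity of the word they come from, while antonyms receive the opposite polarity.

```python
def expand_lexicon(seeds, synonyms, antonyms, rounds=2):
    """Grow a {word: polarity} lexicon from manually annotated seed words."""
    lexicon = dict(seeds)
    for _ in range(rounds):
        additions = {}
        for word, polarity in lexicon.items():
            for syn in synonyms.get(word, []):
                additions.setdefault(syn, polarity)    # same orientation
            for ant in antonyms.get(word, []):
                additions.setdefault(ant, -polarity)   # opposite orientation
        for word, polarity in additions.items():
            lexicon.setdefault(word, polarity)         # never overwrite earlier entries
    return lexicon


# Hypothetical thesaurus fragments, for illustration only.
seeds = {"good": 1, "bad": -1}
synonyms = {"good": ["great", "fine"], "great": ["excellent"],
            "bad": ["poor"], "poor": ["awful"]}
antonyms = {"good": ["bad"], "excellent": ["terrible"]}

lexicon = expand_lexicon(seeds, synonyms, antonyms, rounds=2)
print(sorted(lexicon.items()))
```

Each extra round reaches words one more step away from the seeds, which is also why errors in the thesaurus can propagate and why domain-specific meanings are missed.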


Corpus-Based

The corpus-based approach provides the dictionaries related to a specific domain. These

dictionaries are generated from a set of seed opinion terms that grows through the search

of related words by means of either statistical or semantic techniques (Vishal &

Sonawane, 2016).

2.3 RELATED STUDIES

2.3.1 Lexical analysis

FIGURE 2.1 LEXICAL ANALYSIS (flowchart: Input Text -> Tokenizer -> Word Matching against Dictionary -> Match? YES: increment score; NO: decrement score)

A dictionary-based approach, also called a lexicon-based method, uses a bank of pre-coded words
to determine the text’s semantic orientation (Chiu et al., 2015; Thelwall et al., 2011). This
technique is governed by the use of a dictionary consisting of pre-tagged lexicons. The input
text is converted to tokens by the Tokenizer. Every new token encountered is then matched against
the lexicon in the dictionary. If there is a positive match, the score is added to the total pool
of score for the input text.

For instance, if “dramatic” is a positive match in the dictionary then the total score of the
text is incremented. Otherwise, the score is decremented or the word is tagged as negative.
Although this technique appears simplistic, its variants have proved worthwhile.

The classification of a text depends on the total score it achieves. A considerable amount of
work has been devoted to measuring which lexical information works best.

An accuracy of about 80% on single phrases can be achieved by the use of hand-tagged lexicons
comprised of only adjectives, which are crucial for deciding the subjectivity of an evaluative
text. The dictionary can be grown by searching for the synonyms and antonyms of words in a
well-known thesaurus or in WordNet dictionaries.
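
The scoring procedure described in this section can be sketched in a few lines. The sketch assumes two small hand-made word lists (hypothetical, not a published lexicon) and, unlike the flowchart, it leaves the score unchanged for tokens that appear in neither list, decrementing only on an explicit negative match.

```python
import re

# Hypothetical pre-tagged lexicon entries, for illustration only.
POSITIVE = {"good", "great", "dramatic", "love"}
NEGATIVE = {"bad", "poor", "hate", "slow"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text):
    """Add 1 for every positive match and subtract 1 for every negative match."""
    score = 0
    for token in tokenize(text):
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return score


def classify(text):
    """Classify the text by the sign of its total score."""
    score = lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_score("A dramatic and great finale"))
print(classify("The service was slow and the food was bad"))
```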

2.3.1.1 Limitations of lexical analysis

➢ The performance of lexical analysis (in terms of time complexity and accuracy) degrades
drastically as the size of the dictionary (number of words) grows exponentially.

➢ The manual construction of a dictionary is a difficult and time-consuming task.

2.3.2 Machine learning approach

Machine learning is one of the most prominent techniques and is gaining the interest of
researchers due to its adaptability and accuracy. In sentiment analysis, it is mostly the
supervised learning variants of this technique that are employed. The process comprises five
stages: data collection, pre-processing, training, classification and plotting of results. In
the training stage, a collection of tagged corpora is provided. The classifier is presented with
a series of feature vectors derived from this data. A model is created from the training data set
and is then applied to new, unseen text for classification. In machine learning, the key to the
accuracy of a classifier is the selection of appropriate features. Generally, unigrams (single
words), bigrams (two consecutive words) and trigrams (three consecutive words) are selected as
feature vectors. Other proposed features include the number of positive words, the number of
negative words and the length of the document. Classifiers such as Support Vector Machines (SVM)
and the Naïve Bayes (NB) algorithm are then trained on these features. Accuracy is reported to
vary from 63% to 80% depending upon the combination of features selected.
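
A minimal sketch of such a feature extractor is given below. It builds unigram, bigram and trigram counts together with the three summary features named above. The word lists passed in and the double-underscore feature names are assumptions made for this example only.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text, positive_words, negative_words):
    """Build a feature dictionary of n-gram counts plus simple summary counts."""
    tokens = text.lower().split()
    features = {}
    for n in (1, 2, 3):  # unigrams, bigrams, trigrams
        for gram in ngrams(tokens, n):
            key = " ".join(gram)
            features[key] = features.get(key, 0) + 1
    # Hypothetical names for the summary features described in the text.
    features["__positive_words__"] = sum(t in positive_words for t in tokens)
    features["__negative_words__"] = sum(t in negative_words for t in tokens)
    features["__length__"] = len(tokens)
    return features


features = extract_features("The camera is not good", {"good"}, {"bad"})
print(features)
```

The bigram "not good" shows why higher-order n-grams matter: a unigram model sees only the positive word "good", while the bigram keeps the negation attached to it.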

2.3.2.1 Advantages of machine learning approach

➢ It overcomes the performance degradation that limits the lexical approach.

➢ It works well even when the dictionary size grows exponentially, due to its ability to learn
and adapt.

2.3.2.2 Disadvantage

➢ Designing a suitable classifier is challenging.

➢ It depends on the availability of training data and may not interpret an unforeseen phrase
correctly.

2.3.3 Hybrid approach

Advances in sentiment analysis have led researchers to explore the possibility of a hybrid
approach that could combine the accuracy of a machine learning approach with the speed of a
lexical approach. In one hybrid approach, the authors use word lexicons divided into two discrete
classes, negative and positive, together with unlabeled data. Pseudo documents encompassing all
the words from the set of chosen lexicons are created.

The cosine similarity between the pseudo documents and the unlabeled documents is then computed.
Depending upon the measure of similarity, each document is assigned either a positive or a
negative sentiment. This training dataset is then fed to a naïve Bayes classifier for training.
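
The pseudo-document labeling step can be sketched as follows. Each lexicon is turned into a bag-of-words pseudo document, and an unlabeled text receives the label of the pseudo document it is most similar to under cosine similarity. The two word lists and the function names are hypothetical; texts that match neither lexicon are left unlabeled in this sketch.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pseudo_label(text, positive_doc, negative_doc):
    """Label a text by its closest pseudo document, or None if there is a tie."""
    vec = Counter(text.lower().split())
    pos = cosine_similarity(vec, positive_doc)
    neg = cosine_similarity(vec, negative_doc)
    if pos == neg:
        return None
    return "positive" if pos > neg else "negative"


# Hypothetical lexicons turned into pseudo documents.
positive_doc = Counter("good great love excellent".split())
negative_doc = Counter("bad poor hate terrible".split())

unlabeled = ["great phone love it", "i hate this poor screen", "it arrived today"]
labeled = [(t, pseudo_label(t, positive_doc, negative_doc)) for t in unlabeled]
print(labeled)
```

The pairs that receive a label could then serve as automatically generated training data for a classifier such as Naïve Bayes.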

Another approach derived a ‘unified framework’ using background lexical information in the form
of word-class associations. The authors refined this information for particular domains using
the available datasets or training examples and proposed a classifier called the Pooling
Multinomial Classifier (PMC), also known as multinomial naïve Bayes. Manually labeled data was
incorporated for training.

They claimed that making use of lexical knowledge improved performance. Another variant of this
approach has also been presented, but so far its authors have only been able to claim good
results.

2.3.4 Summary

A comparison of all approaches shows that the best results have been observed from machine
learning approaches and the weakest from lexical approaches. However, without proper training of
the classifier, the results of a machine learning approach may deteriorate drastically. Work is
still being carried out on hybrid approaches. The techniques were tested on movie review and
recommendation and on news review domains based on user tweets.

Their results seem promising for further research. Open social networks are among the best
examples of sociological trust. The exchange of messages, followers and friends, and the varying
sentiments of users, provide a rough platform for studying behavioral trust in the sentiment
analysis domain.

Machine learning approaches have so far been good at delivering accurate results. The success of
any approach will vary depending upon the application. The lexical approach is ready to use and
does not require any prior information or training. Machine learning, on the other hand, requires
a well-designed classifier, a large amount of training data and performance tuning prior to
deployment.

The hybrid approach has so far shown promising performance. Although hybrid systems have been
deployed using unigrams and bigrams, their performance is worse on trigrams. This leaves
researchers with further ground to explore.

2.4 RELATED SYSTEMS

2.4.1 Trump’s Tweets Sentiment analyzer example

FIGURE 2.2 TRUMP WORDMAP

A study of the sentiment of Donald Trump’s tweets was conducted by (Face, Chris, 2016) of
Macquarie University. During the election campaign of 2016, much discussion revolved around who
was sending out Donald Trump’s tweets. A number of articles described how the tone of Trump’s
tweets is more positive when they come from an iPhone device than when they come from an Android
device. The hypothesis is that Trump tweets from an Android device, and that he employs social
media assistants who tweet from an iPhone. But how do you work that out?

In a data set containing 1,512 tweets from @realDonaldTrump sent during the primaries, there is
a small but positive average sentiment score of 0.3, with scores ranging from -5 to 6. This means
that the average tweet has slightly more positive language than negative. The magnitude of the
scores is small because the length of a tweet is restricted.

The power of sentiment arises when considering other variables in the data. Think of the
now-famous example of the Trump sentiment gap between Android and iPhone. The mean sentiment
score of tweets from Android, 0.1, is significantly lower than the overall average of 0.3:

FIGURE 2.3 ANDROID VS IPHONE TRUMP

Engagement

The data from Twitter includes the number of times each tweet has been Favorited. This is used as

a proxy for engagement. For this data set, the average is around 19,000. By considering how the

average number of favorites varies with the sentiment, the study discovered another interesting

pattern.
FIGURE 2.4 ENGAGEMENT

Those tweets which have a negative sentiment (scoring -2 or lower) garner a significantly higher
number of favorites on average. It would seem that Trump’s followers are noticeably more engaged
by negative content.

A little sentiment analysis can reveal patterns in the data which would be difficult to gain by

reading through the sea of content.
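
The kind of comparison described in this section amounts to grouping tweets by another variable and averaging within each group. The sketch below shows the arithmetic on four made-up records; the numbers are purely illustrative and are not taken from the study, and the field names are assumptions.

```python
from collections import defaultdict
from statistics import mean


def mean_by(records, key, value):
    """Average the `value` field within each group defined by the `key` field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {k: mean(v) for k, v in groups.items()}


# Made-up records for illustration only (not the data set discussed above).
toy_tweets = [
    {"source": "Android", "sentiment": -2, "favorites": 300},
    {"source": "Android", "sentiment": 0, "favorites": 100},
    {"source": "iPhone", "sentiment": 1, "favorites": 120},
    {"source": "iPhone", "sentiment": 1, "favorites": 80},
]

# Mean sentiment per device.
sentiment_by_source = mean_by(toy_tweets, "source", "sentiment")

# Mean favorites for strongly negative tweets (score of -2 or lower) versus the rest.
for tweet in toy_tweets:
    tweet["strongly_negative"] = tweet["sentiment"] <= -2
favorites_by_negativity = mean_by(toy_tweets, "strongly_negative", "favorites")

print(sentiment_by_source)
print(favorites_by_negativity)
```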

2.5 LIMITATIONS OF RELATED SYSTEMS

● Related systems can identify and analyze many pieces of text automatically and quickly.
However, computer programs have problems recognizing things like sarcasm and irony, negations,
jokes and exaggerations, the sorts of things a person would have little trouble identifying.
Failing to recognize these can skew the results.

● 'Disappointed' may be classified as a negative word for the purposes of sentiment analysis,

but within the phrase “I wasn't disappointed", it should be classified as positive.

● We would find it easy to recognize as sarcasm the statement "I'm really loving the

enormous pool at my hotel!", if this statement is accompanied by a photo of a tiny


swimming pool; whereas an automated sentiment analysis tool probably would not and

would most likely classify it as an example of positive sentiment.

● With short sentences and pieces of text, such as those found on Twitter especially and
sometimes on Facebook, there might not be enough context for a reliable sentiment analysis.
However, in general, Twitter has a reputation for being a good source of information for
sentiment analysis, and with the increased character limit for tweets it is likely to become
even more useful.
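
The negation problem in the list above can be made concrete with a short sketch. It compares a plain word-matching scorer with a variant that flips the polarity of the next opinion word after a negator. The word lists and the flipping rule are simplifying assumptions; real negation scope is more subtle, and sarcasm is not addressed at all.

```python
import re

# Hypothetical word lists, for illustration only.
POSITIVE = {"loving", "good"}
NEGATIVE = {"disappointed", "bad"}
NEGATORS = {"not", "never", "no", "wasn't", "isn't", "don't"}


def score(text, handle_negation):
    """Sum word polarities, optionally flipping the opinion word that follows a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0
    negate = False
    for token in tokens:
        if handle_negation and token in NEGATORS:
            negate = True
            continue
        polarity = 1 if token in POSITIVE else -1 if token in NEGATIVE else 0
        if polarity and negate:
            polarity = -polarity
            negate = False
        total += polarity
    return total


print(score("I wasn't disappointed", handle_negation=False))
print(score("I wasn't disappointed", handle_negation=True))
print(score("I'm really loving the enormous pool at my hotel!", handle_negation=True))
```

The sarcastic hotel statement still scores as positive in both variants, since nothing in the text itself signals the sarcasm.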

2.6 HOW THE PROPOSED SYSTEM SEEKS TO HANDLE THESE CHALLENGES.

The proposed system uses social media feeds about a particular brand or trend to analyze the
sentiment and polarity of the feeds, and thereby determines the general feeling of the masses
towards a particular product, service or issue being raised.

To ensure that good-quality data is used to analyze sentiment, the system removes unnecessary
data or noise and keeps only good-quality, plausible data from social media feeds; for example,
data that contains ‘unknown language’ or too many links will be filtered out.

Estimating the overall sentiment requires a large collection of data, e.g. a whole week of
social media feeds, so that a more accurate sentiment is generated.
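
One possible shape for the noise filter is sketched below. It assumes that each post is a dictionary with a `text` field and a `lang` field holding a language code, and that a threshold of two links counts as "too many"; the field names, the threshold and the function names are assumptions for illustration, not part of the proposed design.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")


def is_usable(post, allowed_languages=("en",), max_links=2):
    """Keep a post only if its language is known and it has few enough links."""
    if post.get("lang") not in allowed_languages:
        return False
    return len(URL_PATTERN.findall(post["text"])) <= max_links


def clean_text(text):
    """Strip links and collapse whitespace before sentiment analysis."""
    return " ".join(URL_PATTERN.sub("", text).split())


# Hypothetical posts, for illustration only.
posts = [
    {"text": "Loving the new phone http://example.com/a", "lang": "en"},
    {"text": "http://a.example http://b.example http://c.example buy now", "lang": "en"},
    {"text": "asdf qwer zxcv", "lang": "und"},
]
usable = [clean_text(p["text"]) for p in posts if is_usable(p)]
print(usable)
```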


REFERENCES

Bakhtawar Seerat, Farouque Azam, “Opinion Mining: Issues and Challenges (A Survey)”,

International Journal of Computer Applications, Vol 49, No 9, July 2012, Pg No 42-51.

Bluesy Selvam, A. Abirami, “A Survey on Opinion Mining Framework”, International

Journal of Advanced Research in Computer and Communication Engineering, Vol

2, Issue 9, Sep 2013, Pg No 3544-3549.

Capriello, A., Mason, P.R., Davis, B., Crotts, J.C., 2011. Farm tourism experiences in

travel reviews: a cross-comparison of three alternative methods for data analysis.

J. Bus. Res. 66, 778–785.

Chiu, C., Chiu, N.H., Sung, R.J., Hsieh, P.Y., 2015. Opinion mining of hotel customer-generated

contents in Chinese weblogs. Curr. Issues Tourism 18 (5), 477–495.

Claster, W., Pardo, P., Cooper, M., Tajeddini, K., 2013. Tourism, travel and tweets:

algorithmic text analysis methodologies in tourism. Middle East J. Manage.

1(1), 81–99.

Culotta, A., 2010. Detecting Influenza Outbreaks by Analyzing Twitter Messages,

Retrieved from Cornell University Library http://arxiv.org/abs/1007.4748.

Ding, X., Liu, B., Yu, P.S., 2008. A holistic lexicon-based approach to opinion

mining. In: Proceedings of the 2008 International Conference on Web Search

and Data Mining, ACM, pp. 231–240.

Dodds, P.S., Harris, K.D., Kloumann, I.M., Bliss, C.A., Danforth, C.M., 2011.
Temporal patterns of happiness and information in a global social network:

Hedonometrics and Twitter. PLoS ONE 6 (12), 1–26.

Duan, W., Cao, Q., Yu, Y., Levy, S., 2013. Mining online user-generated content:

using sentiment analysis technique to study hotel service quality. In:

System Sciences (HICSS), 2013 46th Hawaii International Conference on,

IEEE, January 2013, pp. 3119–3128.

G. Vinodhini et al, “Sentiment Analysis and Opinion Mining: A Survey”, International Journal

of Advanced Research in Computer Science and Software Engineering (IJARCSSE),

Vol 2, Issue 6, June 2012.

Govindarajan & Romina (2013), a survey of classification methods and applications for

sentiment analysis, the international journal of engineering and science (IJES).

Hu, M., Liu, B., 2004. Mining and summarizing customer reviews. In: Proceedings of

the Tenth ACM SIGKDD International Conference on Knowledge Discovery and

Data Mining, ACM, August 2004, pp. 168–177,

http://dx.doi.org/10.1145/1014052.1014073.

Liang, P. W., Liao, C. Y., Chueh, C. C., Zuo, F., Williams, S. T., Xin, X. K., ... & Jen,

A. K. Y. (2014). Additive enhanced crystallization of solution‐processed

perovskite for highly efficient planar‐heterojunction solar cells. Advanced

materials, 26(22), 3748-3754.

Pak, A., & Paroubek, P. (2010, May). Twitter as a corpus for sentiment analysis and opinion

mining. In LREC (Vol. 10, No. 2010, pp. 1320-1326).


Pang, B., Lee, L., 2008. Opinion mining and sentiment analysis. Found. Trends

Inf. Retrieval 2 (1–2), 1–135.

Parikh, R., & Movassate, M. (2009). Sentiment analysis of user-generated twitter updates

using various classification techniques. CS224N Final Report, 118.

Prabowo, R., & Thelwall, M. (2009). Sentiment analysis: A combined approach.

Journal of Informetrics, 3(2), 143-157.

Pravesh & Mohd (2014), methodological study of opinion mining and sentiment analysis

techniques, international journal on soft computing (IJSC) vol. 5.

Rambocas, M., & Gama, J. (2013). Marketing research: The role of sentiment analysis

(No. 489). Universidade do Porto, Faculdade de Economia do Porto.

Thelwall, M., Buckley, K., Paltoglou, G., 2011. Sentiment in Twitter events. J.

Am. Soc. Inf. Sci. Technol. 62 (2), 406–418.

Turney, P.D., Littman, M.L., 2003. Measuring praise and criticism: inference of semantic

orientation from association. ACM Trans. Inf. Syst. (TOIS) 21 (4), 315–346.

Vishal & Sonawane (2016), sentiment analysis of twitter data: a survey of techniques,

international journal of computer applications, volume 139.

Wang, J., Gu, Q., Wang, G., 2013. Potential power and problems in sentiment mining of

social media. Int. J. Strateg. Decis. Sci. (IJSDS) 4 (2), 16–26.
