
TWITTER SENTIMENT ANALYSIS

Project work submitted to the Dr. N.G.P. Arts and Science College in partial fulfillment for the
award of the degree of

MASTER OF SCIENCE IN COMPUTER SCIENCE

BY
Ms. LAVANYA S
(Reg. No: 202CS024)

Under the Guidance of

Mrs. P. USHA, M.Sc., M.Phil., (Ph.D.), Professor

DEPARTMENT OF COMPUTER SCIENCE

Dr. N.G.P. ARTS AND SCIENCE COLLEGE

(An Autonomous Institution, Affiliated to Bharathiar University, Coimbatore)


Approved by Government of Tamil Nadu & Accredited by NAAC with ‘A’ Grade (2nd Cycle)
Dr. N.G.P.-Kalapatti Road, Coimbatore - 641 048, Tamil Nadu, India.
Website: www.drngpasc.ac.in | Email: info@drngpasc.ac.in | Phone: +91-422-2369100
CERTIFICATE

This is to certify that the thesis entitled TWITTER SENTIMENT ANALYSIS submitted to

Dr. N.G.P. Arts and Science College in partial fulfillment for the award of the degree of Master of
Science in Computer Science is a record of original project work done by Ms. Lavanya S (Reg.
No: 202CS024) during the period of her study (2020-2022) in the Department of Computer
Science, Dr. N.G.P. Arts and Science College, Coimbatore-48, under my supervision and
guidance, and that the thesis has not formed the basis for the award of any Degree/ Diploma/
Associateship/ Fellowship or other similar title to any candidate of any autonomous college.

(Mrs. P. USHA) (Dr. B. ROSILINE JEETHA)


Guide HoD

Submitted to the viva-voce examination held on _____________ at the Department of Computer


Science, Dr. N.G.P. Arts and Science College, Coimbatore - 48.

Internal Examiner External Examiner


DECLARATION

I, Ms. Lavanya S (Reg. No: 202CS024), hereby declare that the thesis entitled TWITTER
SENTIMENT ANALYSIS submitted to Dr. N.G.P. Arts and Science College, in partial
fulfillment for the award of the degree of Master of Science in Computer Science, is a record of
original project work done by me during the period 2020-2022 under the supervision and guidance
of Mrs. P. Usha, M.Sc., M.Phil., (Ph.D.), Professor, Department of Computer Science,

Dr. N.G.P. Arts and Science College, Coimbatore-48, and that it has not formed the basis for the
award of any Degree/ Diploma/ Associateship/ Fellowship or other similar title to any candidate
of any autonomous college.

(Ms. Lavanya S)
Student
ACKNOWLEDGMENT

I would like to express my profound gratitude to Dr. Nalla G. Palaniswami, MD., AB (USA).,
Chairman, Kovai Medical Centre and Hospital (KMCH), Coimbatore and Dr. Thavamani D.
Palaniswami, MD., AB (USA), Secretary, Dr. N.G.P. Arts and Science College, Coimbatore
for their encouragement.

I wish to express my sincere thanks to Prof. Dr. V. Rajendran, M.Sc., M.Phil., B.Ed., M.Tech.
(Nanotech)., Ph.D., D.Sc., FlnstP (UK)., FASCh., Principal, Dr. N.G.P. Arts and Science
College, Coimbatore for his encouragement and support during the course period.

I would like to express my heartfelt thanks to Dr. B. Rosiline Jeetha, M.C.A., M.Phil., Ph.D.,
Head of the Department, and all the Teaching and Non-teaching Staff members, Department of
Computer Science, for their inspiration for the successful completion of this work.

I wish to express my deep sense of gratitude and thanks to my guide Mrs. P. Usha,
M.Sc., M.Phil., (Ph.D.), Professor, Department of Computer Science, Dr. N.G.P. Arts and
Science College, Coimbatore-48, for her excellent guidance and strong support throughout the
course of my study.

Next, I would like to thank my friends, relatives and family members for their love and support.
My wholehearted thanks to all those who have directly or indirectly helped me to complete this
project.
ABSTRACT
In today's world, social networking websites like Twitter and Facebook play a significant
role. Twitter is a micro-blogging platform which provides a tremendous amount of data that can
be used for various applications of sentiment analysis such as predictions, reviews, elections and
marketing. Sentiment analysis is the process of extracting information from a large amount of
data and classifying it into different classes called sentiments. It deals with identifying and
classifying opinions or sentiments expressed in source text. The goal is classification: analysing
information in the form of a large number of tweets, where opinions are highly unstructured and
sentiments are either positive or negative. To do this, we first pre-process the dataset and then
extract features from it.
S.NO TABLE CONTENT
SYNOPSIS
1 INTRODUCTION
1.1 OVERVIEW OF THE PROJECT
2 SYSTEM ANALYSIS
2.1 EXISTING SYSTEM
2.1.1 DRAWBACKS OF EXISTING SYSTEM
2.2 PROPOSED SYSTEM
2.2.1 ADVANTAGES OF PROPOSED SYSTEM
3 SYSTEM SPECIFICATION
3.1 HARDWARE SPECIFICATION
3.2 SOFTWARE SPECIFICATION
3.3 SOFTWARE FEATURES
4 SYSTEM DESIGN AND DEVELOPMENT
4.1 SYSTEM DEVELOPMENT
4.2 FLOW DIAGRAM
4.3 ALGORITHM FLOW DIAGRAM
4.4 SYSTEM DESIGN
5 SYSTEM TESTING AND IMPLEMENTATION
5.1 SYSTEM TESTING
5.2 SYSTEM IMPLEMENTATION
6 CONCLUSION
7 SCOPE FOR FUTURE ENHANCEMENT
8 BIBLIOGRAPHY
9 APPENDIX
A. SAMPLE CODING
1 INTRODUCTION
We have chosen to work with Twitter since we feel it is a better approximation of
public sentiment than conventional internet articles and web blogs. The reason is that the
amount of relevant data is much larger for Twitter than for traditional blogging sites.
Moreover, the response on Twitter is more prompt and also more general (since the
number of users who tweet is substantially larger than the number who write web blogs
on a daily basis). Sentiment analysis of the public is highly valuable in macro-scale
socioeconomic phenomena such as predicting the stock market rate of a particular firm.
This could be done by analysing overall public sentiment towards that firm with respect
to time and using economic tools to find the correlation between public sentiment and
the firm's stock market value. Firms can also estimate how well their product is doing in
the market: in which areas of the market it is having a favourable response and in which
a negative one (since Twitter allows us to download a stream of geo-tagged tweets for
particular locations). If firms can get this information they can analyse the reasons behind
the geographically differentiated response, and so market their product in a more
optimized manner by looking for appropriate solutions such as creating suitable market
segments.

1.1 OVERVIEW OF THE PROJECT

This project of analyzing sentiments of tweets comes under the domain of
“Pattern Classification” and “Data Mining”. Both of these terms are very closely
related and intertwined, and they can be formally defined as the process of
discovering “useful” patterns in a large set of data, either automatically (unsupervised)
or semi-automatically (supervised). The project relies heavily on techniques of
“Natural Language Processing” for extracting significant patterns and features from
the large data set of tweets, and on “Machine Learning” techniques for accurately
classifying individual unlabelled data samples (tweets) according to whichever
pattern model best describes them.

Twitter-based features are more informal and relate to how people express
themselves on online social platforms and compress their sentiments into the limited
space of 140 characters offered by Twitter. They include Twitter hashtags, re-tweets,
word capitalization, word lengthening, question marks, presence of URLs in tweets,
exclamation marks, internet emoticons and internet shorthand/slang.
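These surface features can be computed directly from the raw text. The following sketch is illustrative only; the feature names and the sample tweet are our own, not taken from the project code:

```python
import re

def twitter_features(tweet):
    """Extract simple Twitter-specific surface features from a raw tweet.

    A minimal sketch; feature names are illustrative, not from the thesis code.
    """
    words = tweet.split()
    return {
        "num_hashtags": sum(1 for w in words if w.startswith("#")),
        "is_retweet": "RT" in words,
        "num_capitalized_words": sum(1 for w in words if w.isupper() and len(w) > 1),
        "has_url": "http" in tweet,
        "num_question_marks": tweet.count("?"),
        "num_exclamation_marks": tweet.count("!"),
        # word lengthening, e.g. "sooo": any character repeated 3+ times
        "has_elongated_word": bool(re.search(r"(.)\1{2,}", tweet)),
    }

features = twitter_features("RT @user sooo happy!! check http://t.co/x #win")
```

Such a dictionary of surface features can then be fed to any classifier alongside the unigram features discussed later.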

Classification techniques can also be divided into two categories: supervised


vs. unsupervised, and non-adaptive vs. adaptive/reinforcement techniques. The
supervised approach is used when we have pre-labeled data samples available and
use them to train our classifier. Training the classifier means using the pre-labeled
samples to extract features that best model the patterns and differences between each
of the individual classes, and then classifying an unlabeled data sample according to
whichever pattern best describes it.
2 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM


The growth of social media provoked progress in the field of natural language
processing in general and sentiment analysis in particular. Information overload and the
growing volume of reviews and messages created the need for high-performance
automatic processing methods. The Multinomial Naive Bayes classification algorithm
tends to be the baseline solution for sentiment analysis tasks. The basic idea of the Naive
Bayes technique is to find the probabilities of classes assigned to a text by using the joint
probabilities of words and classes. The result, however, can be a slow process.
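As a rough illustration of this baseline, a toy Multinomial Naive Bayes classifier can be written from scratch. The training data and tokenization below are made up for demonstration and are not the thesis's actual code:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_tweets):
    """Collect per-class word frequencies and document counts.

    `labeled_tweets` is a list of (tokens, label) pairs.
    """
    word_counts = defaultdict(Counter)   # class -> word frequencies
    class_counts = Counter()             # class -> number of documents
    for tokens, label in labeled_tweets:
        class_counts[label] += 1
        word_counts[label].update(tokens)
    return word_counts, class_counts

def predict(word_counts, class_counts, tokens):
    """Pick the class maximizing log P(class) + sum of log P(word | class)."""
    total_docs = sum(class_counts.values())
    vocab = {w for c in word_counts.values() for w in c}
    best, best_score = None, float("-inf")
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for w in tokens:
            # add-one smoothing so unseen words do not zero the product
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

data = [(["love", "this", "phone"], "positive"),
        (["great", "battery"], "positive"),
        (["hate", "this", "phone"], "negative"),
        (["awful", "screen"], "negative")]
wc, cc = train_multinomial_nb(data)
label = predict(wc, cc, ["love", "great"])
```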

2.1.1 DRAWBACKS OF EXISTING SYSTEM

 Naive Bayes assumes that all predictors are independent, which rarely happens in real
life. This limits the applicability of the algorithm in real-world use cases.

 The algorithm faces the ‘zero-frequency problem’, where it assigns zero probability
to a categorical variable whose category in the test data set was not present in the
training dataset.

 Its estimations can be wrong in some cases, so its probability outputs should not be
taken too seriously.
2.2 PROPOSED SYSTEM
SenticNet performs tasks such as polarity detection and emotion recognition by
leveraging the denotative and connotative information associated with words and multi-
word expressions, instead of relying solely on word co-occurrence frequencies. Sentiment
analysis is a predictive modeling task where the model is trained to predict the polarity of
textual data, i.e. a sentiment such as positive, neutral or negative. Sentiment analysis
lets businesses understand customer behaviour towards their products. We are already
overloaded with so much unstructured data that it becomes very tough to analyze such a
large volume of text manually.
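A heavily simplified sketch of lexicon-based polarity detection follows. The mini-lexicon below is entirely hypothetical; the real SenticNet resource associates far richer semantic and affective information with each concept:

```python
# Hypothetical mini-lexicon: word -> polarity score in [-1, 1].
# These entries are illustrative only, not actual SenticNet values.
LEXICON = {"good": 0.7, "love": 0.9, "boring": -0.6, "hate": -0.8}

def lexicon_polarity(tokens):
    """Average the polarity of known words; the sign gives the predicted class."""
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    if not scores:
        return "neutral"            # no lexicon hits: nothing to decide on
    avg = sum(scores) / len(scores)
    if avg > 0.05:
        return "positive"
    if avg < -0.05:
        return "negative"
    return "neutral"

result = lexicon_polarity(["i", "love", "this"])
```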

2.2.1 ADVANTAGES OF PROPOSED SYSTEM

 User friendly.

 It supports many languages.

 Easy to maintain.
3 SYSTEM SPECIFICATION
This section specifies the minimum development environment and tools needed to
develop and implement “Twitter sentiment analysis using Machine Learning”. It is
divided into two specifications.
1. Hardware requirements.
2. Software requirements.

3.1 HARDWARE SPECIFICATION

 Processor : INTEL & AMD

 RAM : 2GB

 Hard Disk : 80 GB

 Keyboard : Standard 101 Keys

 Mother Board : Intel

3.2 SOFTWARE SPECIFICATION

 Operating System : Windows 10

 IDE : Google Colab

 Back End : Python

3.3 SOFTWARE FEATURES

PYTHON

Python is an interpreted, object-oriented, high-level programming language with


dynamic semantics. Its high-level built in data structures, combined with dynamic typing
and dynamic binding, make it very attractive for Rapid Application Development, as well
as for use as a scripting or glue language to connect existing components together.
Python's simple, easy to learn syntax emphasizes readability and therefore reduces
the cost of program maintenance. Python supports modules and packages, which
encourages program modularity and code reuse. The Python interpreter and the extensive
standard library are available in source or binary form without charge for all major
platforms, and can be freely distributed.

Python Features

Python has few keywords, simple structure, and a clearly defined syntax. Python
code is more clearly defined and visible to the eyes. Python's source code is fairly easy-to-
maintain. Python's bulk of the library is very portable and cross-platform compatible on
UNIX, Windows, and Macintosh. Python has support for an interactive mode which allows
interactive testing and debugging of snippets of code.

GOOGLE COLAB

Google Colaboratory / Google Colab is a free environment that requires no setup


and runs entirely in the cloud. With Colab, one can write and execute code, save and share
analyses, and access powerful computing resources, all for free from the browser. It is a
powerful way to iterate on and write Python code for data analysis. Rather than writing
and rewriting an entire program, one can write lines of code and run them one at a time.
It is built on IPython, which is an interactive way of running Python code. It supports
multiple languages, as well as storing code and writing one's own markdown.
4 SYSTEM DESIGN AND DEVELOPMENT
4.1 SYSTEM DEVELOPMENT
MODULES

In this proposed methodology, we explain Twitter sentiment analysis using machine
learning with the SenticNet algorithm.

 Creation of processed Tweets and feature vector.

 Use extract features to get features.

 Apply Features to get training set.

 Use training set to train classifier

 Trained classifier is stored in pickle.

 User input sentence to classify.

 User input topic to fetch tweet and classify.

 Calculating confidence

Creation of processed Tweets and feature vector

The required data is collected from Twitter: tweets already posted by users. The
tweets are cleaned and stored as feature vectors.

Use extract features to get features

Features are extracted from each tweet in the feature vector by splitting the tweet
into its components.

Apply Features to get training set

The stored feature vectors are used to build the training set. The training set data is
labelled as positive, negative or neutral.

Use training set to train classifier

The training set is used to train the classifier. The trained classifier can then extract
and classify tweets, and is stored in a pickle.

Trained classifier is stored in pickle

The trained classifier is stored in a pickle file so it can be reloaded later without
retraining.

User input sentence to classify

The user gives a new sentence for the system to analyse. The sentence is classified
using the classifier stored in the pickle.

User input topic to fetch tweet and classify

The user gives a topic; tweets on that topic are fetched and analysed against the
already stored dataset. The fetched data is then classified.

Calculating confidence

Finally, the new tweet's classification is checked against the training data. The
analysed result is displayed along with a confidence value.
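The module flow above can be sketched end to end in a few lines of Python. The keyword lists and the trivial classifier below are illustrative placeholders, not the project's actual model; the point is the feature-vector / train / pickle / classify sequence:

```python
import pickle

# Illustrative keyword lists standing in for a learned model
POSITIVE_WORDS = {"good", "great", "love", "happy"}
NEGATIVE_WORDS = {"bad", "hate", "boring", "sad"}

def extract_features(sentence):
    """Turn a sentence into a small feature vector (keyword hit counts)."""
    tokens = sentence.lower().split()
    return {"pos_hits": sum(t in POSITIVE_WORDS for t in tokens),
            "neg_hits": sum(t in NEGATIVE_WORDS for t in tokens)}

class KeywordClassifier:
    """Trivial stand-in classifier: whichever keyword count wins decides."""
    def classify(self, features):
        if features["pos_hits"] > features["neg_hits"]:
            return "positive"
        if features["neg_hits"] > features["pos_hits"]:
            return "negative"
        return "neutral"

# Store the trained classifier with pickle (here, an in-memory byte string)
blob = pickle.dumps(KeywordClassifier())

# Later: load the stored classifier and classify new user input
clf = pickle.loads(blob)
result = clf.classify(extract_features("I love this great phone"))
```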
4.2 FLOW DIAGRAM
The flow diagram of Twitter sentiment analysis is given below:

4.3 ALGORITHM FLOW DIAGRAM


The algorithm flow diagram of Twitter sentiment analysis is given below:
4.4 SYSTEM DESIGN

The degree of interest in each concept has varied over the years, yet each has stood the
test of time. Each provides the software designer with a foundation from which more
sophisticated design methods can be applied. Fundamental design concepts provide the
necessary framework for “getting it right”.

During the design process the software requirements model is transformed into
design models that describe the details of the data structures, system architecture, interface,
and components. Each design product is reviewed for quality before moving to the next
phase of software development.

4.4.1 INPUT DESIGN

The design of input focuses on controlling the amount of dataset input required,
avoiding delay and keeping the process simple. The input is designed in such a way as to
provide security. Input design considers the following steps:

 The dataset should be given as input.


 The dataset should be arranged.
 Methods for preparing input validations.
4.4.2 OUTPUT DESIGN

A quality output is one, which meets the requirement of the user and presents the
information clearly. In output design, it is determined how the information is to be
displayed for immediate need.

Designing computer output should proceed in an organized, well-thought-out manner; the
right output must be developed while ensuring that each output element is designed so that
users will find the system easy to use and effective.
5 SYSTEM TESTING AND IMPLEMENTATION
5.1 SYSTEM TESTING
UNIT TESTING:
Unit testing focuses verification effort on the smallest unit of software design, the
software component or module. Using the component-level design description as a guide,
important control paths are tested to uncover errors within the boundary of the module.
The relative complexity of the tests, and of the errors they uncover, is limited by the
constrained scope established for unit testing. The unit test focuses on the internal
processing logic and data structures within the boundaries of a component. This is
normally considered an adjunct to the coding step. The design of unit tests can be
performed before coding begins.

Validation Testing

The objective of the validation test is to tell the user about the validity and
reliability of the system. It verifies whether the system operates as specified and whether
the integrity of important data is maintained. User motivation is very important for the
successful performance of the system. All the modules were tested individually using both
test data and live data. After each module was ascertained to be working correctly, it was
"integrated" with the system, and the system was again tested as a whole. We had the
system tested with different types of users. The System Design, Data Flow Diagrams,
procedures etc. were well documented so that the system can be easily maintained and
upgraded by any computer professional at a later date.

5.2 SYSTEM IMPLEMENTATION


Implementation is the most crucial stage in achieving a successful system and
giving the user’s confidence that the new system is effective and workable. Implementation
of this project refers to the installation of the package in its real environment to the full
satisfaction of the users and operations of the system.
Testing is done individually at the time of development using test data, and
verification is done in the way specified in the program specification. In short,
implementation constitutes all activities that are required to put an already tested and
completed package into operation. The success of any information system lies in its
successful implementation.

System Implementation is the stage in the project where the theoretical design is
turned into a working system. It is the most critical stage in achieving a successful system
and in giving users confidence that the new system will work efficiently and effectively.
The existing system was a slow, time-consuming process. The project execution was
checked in a live environment and the user requirements were satisfied. The process of
designing a functional classifier for sentiment analysis can be broken down into five basic
categories. They are as follows:

 Data Acquisition

 Human Labelling

 Feature Extraction

 Classification

 Tweet Mood Web Application

Data Acquisition:

Data in the form of raw tweets is acquired by using the Python library
“tweestream”, which provides a package for the simple Twitter streaming API [26]. This
API allows two modes of accessing tweets: SampleStream and FilterStream.
SampleStream simply delivers a small, random sample of all the tweets streaming in
real time. FilterStream delivers tweets which match certain criteria. It can filter the
delivered tweets according to three criteria:
 Specific keyword(s) to track/search for in the tweets

 Specific Twitter user(s) according to their user-ids

 Tweets originating from specific location(s) (only for geo-tagged tweets).

A programmer can specify any single one of these filtering criteria or a


combination of them. But for our purpose we have no such restriction and
will thus stick to the SampleStream mode. A tweet acquired by this method has a lot
of raw information in it which we may or may not find useful for our particular
application. It comes in the form of the Python “dictionary” data type with various
key-value pairs. A list of some key-value pairs is given below:

 Whether a tweet has been favourited

 User ID

 Screen name of the user

 Original text of the tweet

 Presence of hashtags

 Whether it is a re-tweet

 Language under which the Twitter user has registered their account

 Geo-tag location of the tweet

 Date and time when the tweet was created

Since this is a lot of information, we filter out only the information that we
need and discard the rest. For our particular application we iterate through all the
tweets in our sample and save the actual text content of the tweets in a separate
file, given that the language of the Twitter user's account is specified to be English.
The original text content of the tweet is given under the dictionary key “text”
and the language of the user's account is given under “lang”.
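A minimal sketch of this filtering step follows, using made-up tweet dictionaries in place of a live stream (real Streaming API tweets carry many more key-value pairs):

```python
# Illustrative stand-in for a stream of raw tweet dictionaries
sample_stream = [
    {"text": "Loving the new update!", "lang": "en", "user_id": 1},
    {"text": "Bonjour tout le monde", "lang": "fr", "user_id": 2},
    {"text": "Great match today", "lang": "en", "user_id": 3},
]

# Keep only the "text" field of tweets whose account language is English
english_texts = [t["text"] for t in sample_stream if t.get("lang") == "en"]
```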

Since human labelling is an expensive process, we further filter the tweets
to be labelled so that we have the greatest amount of variation in tweets without
loss of generality. The filtering criteria applied are stated below:

 Remove retweets (any tweet which contains the string “RT”)

 Remove very short tweets (tweets with length less than 20 characters)

 Remove non-English tweets (by comparing the words of the tweets with a list
of 2,000 common English words; tweets with less than 15% of content
matching the threshold are discarded)

 Remove similar tweets (by comparing every tweet with every other
tweet; tweets with more than 90% of content matching some other
tweet are discarded)

After this filtering, roughly 30% of tweets remain for human labelling on
average per sample, which made a total of 10,173 tweets to be labelled.
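The four filtering rules can be sketched as below. The common-words list is truncated for illustration (the project compares against a list of 2,000 words), and the similarity check is a crude word-overlap ratio:

```python
# Truncated illustrative stand-in for the 2,000 common English words
COMMON_ENGLISH = {"the", "a", "is", "this", "very", "and", "great",
                  "today", "i", "love", "it", "day"}

def keep_for_labelling(tweet, seen):
    """Apply the four filtering rules; `seen` holds token lists of kept tweets."""
    words = tweet.lower().split()
    if "rt" in words:                      # rule 1: drop retweets
        return False
    if len(tweet) < 20:                    # rule 2: drop very short tweets
        return False
    english = sum(w in COMMON_ENGLISH for w in words)
    if english / len(words) < 0.15:        # rule 3: drop non-English tweets
        return False
    for other in seen:                     # rule 4: drop near-duplicates
        overlap = len(set(words) & set(other)) / max(len(set(words)), 1)
        if overlap > 0.9:
            return False
    return True

kept = []
for t in ["RT this is a retweet message",
          "hi",
          "this is a very great day today i love it",
          "this is a very great day today i love it",
          "xyzzy qwerty asdf zxcv lorem ipsum dolor"]:
    if keep_for_labelling(t, kept):
        kept.append(t.lower().split())
```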

Human Labelling

For the purpose of human labelling we made three copies of the tweets so that
they could be labelled by three individual sources. This was done so that we
could take the average opinion of people on the sentiment of the tweet; in this
way the noise and inaccuracies in labelling can be minimized. Generally
speaking, the more copies of labels we can get the better, but we have to keep
the cost of labelling in mind, hence we arrived at the reasonable figure of three.

We labelled the tweets in four classes according to the sentiments


expressed/observed in the tweets: positive, negative, neutral/objective and
ambiguous.
We gave the following guidelines to our labellers to help them in the labelling
process:

 Positive

If the entire tweet has a positive/happy/excited/joyful attitude or if


something is mentioned with positive connotations. Also if more than one
sentiment is expressed in the tweet but the positive sentiment is more
dominant. Example: “4 more years of being in shithole Australia then I move
to the USA!:D”.
 Negative
If the entire tweet has a negative/sad/displeased attitude or if something
is mentioned with negative connotations. Also if more than one sentiment is
expressed in the tweet but the negative sentiment is more dominant. Example:
“I want an android now this iPhone is boring:S”.
 Neutral/Objective
If the creator of tweet expresses no personal sentiment/opinion in the
tweet and merely transmits information. Advertisements of different products
would be labelled under this category. Example: “US House Speaker vows to
stop Obama contraceptive rule... http://t.co/cyEWqKlE”.
 Ambiguous
If more than one sentiment is expressed in the tweet which are equally
potent with no one particular sentiment standing out and becoming more
obvious. Also if it is obvious that some personal opinion is being expressed
here but due to lack of reference to context it is difficult/impossible to
accurately decipher the sentiment expressed. Example: “I kind of like heroes
and don’t like it at the same time...”. Finally if the context of the tweet is not
apparent from the information available. Example: “That’s exactly how I feel
about avengers haha”.
 <Blank>
Leave the tweet unlabelled if it belongs to some language other than
English, so that it is ignored in the training data.

Besides this, labellers were instructed to keep personal biases out of labelling and
make no assumptions, i.e. judge the tweet not from any past extra-personal
information but only from the information provided in the current individual
tweet.
The “strict” measure of agreement is one where the labels assigned by both
human beings must match exactly in all cases, while under the “lenient”
measure, if one person marked the tweet as “ambiguous” and the other
marked it as something else, this does not count as a disagreement. So in the
case of the “lenient” measure, the ambiguous class can map to any other class.
Since the human-human agreement lies in the range of 60-70%, this shows us
that sentiment classification is inherently a difficult task even for human
beings. The table below shows human-human agreement in the case of
labelling individual adjectives and verbs.

          Adjectives          Verbs
          Human 1 : Human 2   Human 1 : Human 3
Strict    76.19%              62.35%
Lenient   88.96%              85.06%

Human-Human Agreement in Verbs / Adjectives Labelling


Here the strict measure is when classification is between the three categories
of positive, negative and neutral, while the lenient measure merges the positive and
negative classes into one class, so that humans are only classifying between neutral
and subjective classes. These results reiterate our initial claim that sentiment analysis
is an inherently difficult task. These results are higher than our agreement results
because in this case humans are being asked to label individual words, which is an
easier task than labelling entire tweets.

Feature Extraction:

Now that we have arrived at our training set we need to extract useful features
from it which can be used in the process of classification. But first we will
discuss some text formatting techniques which will aid us in feature extraction:

Tokenization:

It is the process of breaking a stream of text up into words, symbols and other
meaningful elements called “tokens”. Tokens can be separated by whitespace
characters and/or punctuation characters. It is done so that we can look at tokens
as individual components that make up a tweet.

 URLs and user references (identified by tokens “http” and “@”) are removed if
we are interested in only analyzing the text of the tweet.
 Punctuation marks and digits/numerals may be removed if, for example, we
wish to compare the tweet to a list of English words.
 Lowercase Conversion: A tweet may be normalized by converting it to
lowercase, which makes its comparison with an English dictionary easier.
 Stemming: It is the text normalizing process of reducing a derived word to its
root or stem [28]. For example, a stemmer would reduce the words
“stemmer”, “stemmed”, “stemming” to the root word “stem”.
The advantage of stemming is that it makes comparison between words
simpler, as we do not need to deal with complex grammatical transformations
of the word. In our case we employed the “Porter stemming” algorithm on
both the tweets and the dictionary whenever there was a need for comparison.

 Stop-words removal: Stop words are a class of extremely common words
which hold no additional information when used in a text and are thus claimed
to be useless [19]. Examples include “a”, “an”, “the”, “he”, “she”, “by”, “on”,
etc. It is sometimes convenient to remove these words because they hold no
additional information, since they are used almost equally in all classes of text;
for example, when computing the prior sentiment polarity of words in a tweet
according to their frequency of occurrence in different classes and using this
polarity to calculate the average sentiment of the tweet over the set of words
used in that tweet.

 Parts-of-Speech Tagging: POS tagging is the process of assigning a tag to


each word in the sentence indicating which grammatical part of speech that
word belongs to, i.e. noun, verb, adjective, adverb, coordinating conjunction, etc.
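The formatting steps above can be combined into one small pre-processing function. The suffix-stripping stemmer below is a crude stand-in for the Porter stemmer, and the stop-word list is truncated for illustration:

```python
import re

# Truncated illustrative stop-word list
STOP_WORDS = {"a", "an", "the", "he", "she", "by", "on", "is", "this"}

def crude_stem(word):
    """Very crude suffix stripping as a stand-in for the Porter stemmer."""
    for suffix in ("ming", "med", "mer", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tweet):
    """Tokenize, lowercase, drop punctuation/digits and stop words, then stem."""
    tokens = re.findall(r"[a-zA-Z]+", tweet.lower())     # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [crude_stem(t) for t in tokens]               # stemming

tokens = preprocess("She is stemming the tweets!")
```

Note that this stand-in reduces "stemmer", "stemmed" and "stemming" to "stem" like the example in the text, but it is far less careful than the real Porter algorithm.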

Now that we have discussed some of the text formatting techniques employed
by us, we will move to the list of features that we have explored. As we will see
below a feature is any variable which can help our classifier in differentiating
between the different classes. There are two kinds of classification in our system
(as will be discussed in detail in the next section), the objectivity / subjectivity
classification and the positivity / negativity classification. As the name suggests
the former is for differentiating between objective and subjective classes while
the latter is for differentiating between positive and negative classes.
The list of features explored for objective / subjective classification is as below:

 Number of exclamation marks in a tweet

 Number of question marks in a tweet

 Presence of exclamation marks in a tweet

 Presence of question marks in a tweet

 Presence of URLs in a tweet

 Presence of emoticons in a tweet

 Unigram word models calculated using Naive Bayes

 Prior polarity of words through the online lexicon MPQA

 Number of digits in a tweet

 Number of capitalized words in a tweet

 Number of capitalized characters in a tweet

 Number of punctuation marks / symbols in a tweet

 Ratio of non-dictionary words to the total number of words in the tweet

 Length of the tweet

 Number of adjectives in a tweet

 Number of comparative adjectives in a tweet

 Number of superlative adjectives in a tweet

 Number of base-form verbs in a tweet

 Number of past tense verbs in a tweet

 Number of present participle verbs in a tweet


 Number of past participle verbs in a tweet

 Number of 3rd person singular present verbs in a tweet

 Number of non-3rd person singular present verbs in a tweet

 Number of possessive pronouns in a tweet

We will start by calculating the probability of a word in our training data
belonging to a particular class:

P(word | class) = (number of occurrences of the word in tweets of that class) / (total number of words in that class)

We now state Bayes' rule [19]:

P(obj | tweet) = P(tweet | obj) P(obj) / P(tweet)

According to this rule, if we need to find the probability of whether a tweet is
objective, we need to calculate the probability of the tweet given the objective class
and the prior probability of the objective class. The term P(tweet) can be substituted
with P(tweet | obj) + P(tweet | subj).

Now if we assume independence of the unigrams inside the tweet (i.e. the
occurrence of a word in a tweet will not affect the probability of occurrence of any
other word in the tweet), we can approximate the probability of the tweet given the
objective class by a mere product of the probabilities of all the words in the tweet
belonging to the objective class. Moreover, if we assume equal class sizes for both
the objective and subjective classes, we can ignore the prior probability of the
objective class. We are then left with the following formula, in which there are two
distinct terms, both easily calculated through the formula mentioned above:

P(obj | tweet) = Π P(word | obj) / (Π P(word | obj) + Π P(word | subj))
Now that we have the probability of objectivity given a particular tweet, we can
easily calculate the probability of subjectivity given that same tweet by simply subtracting
the earlier term from 1. This is because probabilities must always add to 1. So if we have
information of P(obj | tweet) we automatically know P(subj |tweet).
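A small numeric sketch of this two-class computation follows. The word probabilities are made-up illustrative values, and unseen words are treated as uninformative (probability 0.5 in both classes):

```python
# Made-up unigram probabilities P(word | class) for illustration only
P_WORD = {
    "obj":  {"news": 0.6, "report": 0.5, "love": 0.1},
    "subj": {"news": 0.2, "report": 0.1, "love": 0.7},
}

def p_objective(tokens):
    """P(obj | tweet) = prod P(w|obj) / (prod P(w|obj) + prod P(w|subj)),
    assuming equal class priors as in the text above."""
    p_obj = p_subj = 1.0
    for w in tokens:
        p_obj *= P_WORD["obj"].get(w, 0.5)    # unseen words are uninformative
        p_subj *= P_WORD["subj"].get(w, 0.5)
    return p_obj / (p_obj + p_subj)

p = p_objective(["news", "report"])
# P(subj | tweet) follows immediately as 1 - p
```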

Finally we calculate P(obj | tweet) for every tweet and use this term as a single feature
in our objectivity / subjectivity classification. There are two main potential problems with
this approach. First being that if we include every unique word used in the data set then
the list of words will be too large making the computation too expensive and time-
consuming. To solve this we only include words which have been used at least 5 times in
our data. This reduces the size of our dictionary for objective / subjective classification
from 11,216 to 2,320 words. For positive / negative classification, the unigram
dictionary size is reduced from 6,502 to 1,235 words.

The second potential problem arises if, in our training set, a particular word
appears in one class only and not at all in the other (for example, if the word
is a one-off misspelling). In such a scenario our classifier will always assign a
tweet containing that word to that class (regardless of any other features present
in the tweet) just because of the presence of that single word. This is a very harsh
approach and results in over-fitting. To avoid it we use the technique known as
“Laplace Smoothing”, replacing the formula for calculating the probability of a
word belonging to a class with a smoothed version.
In the smoothed formula, “x” is a constant called the smoothing factor, which we
have arbitrarily set to 1. This works because even if the count of a word
in a particular class is zero, the numerator still has a small value, so the probability
of a word belonging to some class is never exactly zero. Where the earlier
formula would have given zero, the smoothed formula instead gives a very small
non-zero probability.
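The smoothing described here follows the standard add-x (Laplace) formulation; the sketch below is our illustration of that idea, not the authors' exact code, and the helper name and toy counts are invented.

```python
def word_prob(word, class_counts, class_total, vocab_size, x=1):
    """P(word | class) with Laplace smoothing: even a word never seen in the
    class gets a small non-zero probability instead of zero."""
    return (class_counts.get(word, 0) + x) / (class_total + x * vocab_size)

counts_obj = {"news": 3, "report": 2}   # word counts in the objective class
total_obj = 5                            # total word tokens in that class
V = 4                                    # vocabulary size

p_seen = word_prob("news", counts_obj, total_obj, V)    # (3+1)/(5+4) ≈ 0.444
p_unseen = word_prob("lol", counts_obj, total_obj, V)   # (0+1)/(5+4) ≈ 0.111
```

With x = 0 the unseen word would get probability 0 and dominate the whole product of word probabilities, which is exactly the over-fitting problem described above.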

Classification:

Pattern classification is the process through which data is divided into
different classes according to common patterns found in one class which
differ, to some degree, from the patterns found in the other classes. The ultimate
aim of our project is to design a classifier which accurately classifies tweets into the
following four sentiment classes: positive, negative, neutral and ambiguous.

There can be two kinds of sentiment classification in this area: contextual
sentiment analysis and general sentiment analysis. Contextual sentiment analysis
deals with classifying specific parts of a tweet according to the context provided; for
example, for the tweet “4 more years of being in shithole Australia then I move to the
USA :D” a contextual sentiment classifier would associate Australia with negative
sentiment and the USA with positive sentiment. General sentiment analysis, on the
other hand, deals with the general sentiment of the entire text (the tweet in this case) as a
whole. Thus, since the tweet mentioned above carries an overall positive attitude,
an accurate general sentiment classifier would label it positive. In our
particular project we deal only with the latter case, i.e. general (overall)
sentiment analysis of the tweet as a whole.

The classification approach generally followed in this domain is a two-step
approach. First, Objectivity Classification is performed, which classifies a
tweet or a phrase as either objective or subjective.
After this we perform Polarity Classification (only on tweets classified as
subjective in the first step) to determine whether the tweet is
positive, negative or both (some researchers include the “both” category and some
do not). This two-step approach reports better accuracy than a simple one-step approach.

The former classifies a tweet between the objective and subjective
classes, while the latter does so between the positive and negative classes. We use
the short-listed features for these classifications together with the Naive Bayes
algorithm, so that after the first step each tweet is represented by two numbers
between 0 and 1: the probability of the tweet belonging to the objective class and the
probability of it belonging to the positive class. Since the two remaining
probabilities (subjective and negative) can be obtained by simple subtraction
from 1, we do not need to store them.

So in the second step we treat these two numbers as separate
features for another classification, in which the feature size is just 2. We use
WEKA and apply the following Machine Learning algorithms for this second
classification to arrive at the best result:

 K-Means Clustering

 Support Vector Machine

 Naive Bayes

 Rule-Based Classifiers

 SenticNet

 LSTM
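The second-stage classification itself is run in WEKA. Purely as an illustration, the same idea can be sketched in Python with scikit-learn's Gaussian Naive Bayes as a stand-in for one of the algorithms above; the toy data below is invented, with each tweet already reduced to its two first-stage probabilities.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Each tweet is reduced to two features from the first step:
# [P(objective | tweet), P(positive | tweet)].
X = np.array([
    [0.9, 0.5],   # very objective -> neutral
    [0.8, 0.4],
    [0.2, 0.9],   # subjective and positive -> positive
    [0.1, 0.8],
    [0.2, 0.1],   # subjective and negative -> negative
    [0.1, 0.2],
])
y = ["neutral", "neutral", "positive", "positive", "negative", "negative"]

clf = GaussianNB().fit(X, y)
print(clf.predict([[0.15, 0.85]]))  # a subjective, positive-leaning tweet
```

The appeal of the two-step design is visible here: the second classifier works in a tiny two-dimensional space instead of the thousands-of-words feature space of the first step.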
TweetMood Web Application:

We designed a web application which performs real-time sentiment analysis
on tweets that match particular keywords provided by the user. For
example, if a user is interested in performing sentiment analysis on tweets which
contain the word “Obama”, he / she enters that keyword and the web application
performs the appropriate sentiment analysis and displays the results for the user.

The URL of the website is www.tweet-mood-check.appspot.com and its logo is given
below:

TweetMood web-application logo

The web application has been implemented using the Google App Engine
service because it can be used as a free web hosting service and it provides a layer
of abstraction over low-level web operations, which makes it easier to
learn. We implemented our algorithm in Python and integrated it with the GUI of our
website using HTML and JavaScript templates.

We have three ways of performing sentiment analysis on our website, and we
will discuss each of them one by one:

 TweetScore

 TweetCompare

 TweetStats
TweetScore:

This feature calculates the popularity score of the keyword, a number
from -100 to +100. A strongly positive popularity score suggests that the keyword is
highly positively popular on Twitter, while a strongly negative popularity score
suggests that the keyword is highly negatively popular on Twitter. A popularity score
close to 0 suggests that the keyword has either mixed opinions or is not a popular
topic on Twitter. The popularity score depends on two ratios:

 Number of positive classified tweets / Number of negative classified tweets

 Number of tweets acquired / Time in past needed to explore the REST API

The first ratio says that if the number of positive tweets on a particular keyword
is larger than the number of negative tweets, the keyword has an overall favourable
opinion, and vice versa. The second ratio captures popularity: the less far back in time
we need to explore the REST API to collect 1,000 tweets, the more people are talking
about the keyword on Twitter, hence the more popular the keyword is.
This ratio, however, carries no information about positivity or negativity, so the
higher it is, the further the score from the first ratio is pushed toward the extremes
(away from zero), in the positive or negative direction depending on whether
positive or negative tweets dominate.
Finally, a maximum of 10 tweets is displayed for each class (positive, negative and
neutral) so that the user develops confidence in our classifier.
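The exact way the two ratios are combined is not stated here, so the following is only a hypothetical reconstruction that respects the described behaviour: the sign and magnitude come from the positive/negative balance, and faster tweet collection pushes the score toward the extremes. Every name and constant below, apart from the 1,000-tweet target, is our assumption.

```python
def popularity_score(n_pos, n_neg, hours_needed, max_hours=24):
    """Hypothetical reconstruction of the popularity score in [-100, 100].
    NOT the authors' exact formula; see the hedging note above."""
    if n_pos + n_neg == 0:
        return 0.0
    # First ratio: positive/negative balance, mapped into [-100, 100].
    base = 100.0 * (n_pos - n_neg) / (n_pos + n_neg)
    # Second ratio: how quickly 1,000 tweets were collected, in [0, 1].
    speed = max(0.0, 1.0 - hours_needed / max_hours)
    # Faster collection shifts the score toward the extremes.
    return base * (0.5 + 0.5 * speed)

print(popularity_score(600, 200, hours_needed=2))  # ≈ 47.92
```

Note how the combination preserves the two properties in the text: the speed term never changes the sign of the score, and with zero tweets the score stays at 0.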

TweetCompare:

This feature compares the popularity scores of two or three different keywords
and reports which keyword is currently most popular on Twitter. This can have
many interesting applications, for example having our web application recommend
movies, songs and products/brands to users.
TweetStats:

This feature is for long-term sentiment analysis. We input a number of popular
Twitter keywords on which a backend job runs every hour, calculates
the popularity score for the tweets generated on each keyword within that hour,
and stores the result against that hour in a database. We can have a
maximum of about 300 such keywords as per Google's bandwidth limits. So
once we have a reasonable amount of data, we can plot graphs of popularity
score against time and visualize how the popularity score changes in response
to certain events.
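The hourly storage step can be sketched roughly as below. The schema and function names are ours, and the real application runs on Google App Engine's storage rather than SQLite; this only illustrates the one-row-per-keyword-per-hour idea.

```python
import sqlite3
import time

# One row per (keyword, hour) holding that hour's popularity score.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE hourly_score (
                    keyword TEXT, hour INTEGER, score REAL,
                    PRIMARY KEY (keyword, hour))""")

def record_score(keyword, score, hour=None):
    """Store a keyword's score for the given hour (default: current hour)."""
    hour = int(time.time() // 3600) if hour is None else hour
    conn.execute("INSERT OR REPLACE INTO hourly_score VALUES (?, ?, ?)",
                 (keyword, hour, score))

record_score("Obama", 42.5, hour=1)
record_score("Obama", 37.0, hour=2)
rows = conn.execute("SELECT hour, score FROM hourly_score "
                    "WHERE keyword = ? ORDER BY hour", ("Obama",)).fetchall()
print(rows)  # [(1, 42.5), (2, 37.0)]
```

The `SELECT ... ORDER BY hour` query is exactly what would feed the popularity-over-time graphs mentioned above.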
6 CONCLUSION
Sentiment analysis is used to identify people's opinions, attitudes and emotional
states. The views of the people can be positive or negative. Commonly, parts of speech
are used as features to extract the sentiment of the text, and adjectives play a crucial
role in identifying sentiment. When adjectives and adverbs occur together, it is more
difficult to identify the sentiment and opinion. Twitter is a large source of data, which
makes it attractive for performing sentiment analysis. We performed analysis on around
15,000 tweets for each party, so that we could analyze the results, understand the
patterns and review people's opinions. We saw that different parties have different
sentiment results according to their progress and working procedures. We also saw
that social events, speeches and rallies cause fluctuations in people's sentiment.
7 SCOPE FOR FUTURE ENHANCEMENT
Some of the future scopes that can be included in our research work are:

 A parser can be embedded into the system to improve results.

 A web-based application can be made for our work in future.

 We can improve our system to deal with sentences that have multiple meanings.

 We can also increase the number of classification categories to get better results.
8 BIBLIOGRAPHY
Textual Reference

 Albert Bifet and Eibe Frank. Sentiment Knowledge Discovery in Twitter
Streaming Data. Discovery Science, Lecture Notes in Computer Science, 2010.

 Alec Go, Richa Bhayani and Lei Huang. Twitter Sentiment Classification
using Distant Supervision. Project Technical Report, Stanford
University, 2009.

 Alexander Pak and Patrick Paroubek. Twitter as a Corpus for Sentiment
Analysis and Opinion Mining. In Proceedings of the International Conference on
Language Resources and Evaluation (LREC), 2010.

 Efthymios Kouloumpis, Theresa Wilson and Johanna Moore. Twitter
Sentiment Analysis: The Good the Bad and the OMG! In Proceedings of the
AAAI Conference on Weblogs and Social Media (ICWSM), 2011.
Online Reference

 Steven Bird, Ewan Klein & Edward Loper. Natural Language Processing with
Python.

 Ben Parr. Twitter Has 100 Million Monthly Active Users; 50% Log In Every Day.
<http://mashable.com/2011/10/17/twitter-costolo-stats/>

 Google App Engine <https://developers.google.com/appengine/>

 Google Chart API <https://developers.google.com/chart/>

 Tweet Stream: Simple Twitter Streaming API Access
<http://pypi.python.org/pypi/tweetstream>
9 APPENDIX
9.1 Screenshots

Home page
Upload page
Package Install
Result page
Result page in matrix
Output page
9.2 SAMPLE CODING
!pip install senticnet
!pip3 install sentic
import re
import urllib.request

import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sentic import SenticPhrase

nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

# Negation words that flip the predicted polarity.
b = ["not", "can't", "don't", "won't"]
x = 0  # count of positive words
y = 0  # count of negative words
z = 0  # count of negation words

comment = input("Enter Your feeling : ")

# Strip @mentions, URLs and non-alphanumeric characters.
senti = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)",
                        " ", comment).split())

# Remove English stop words.
stop_words = set(stopwords.words('english'))
senti = senti.lower().split(" ")
filtered_sentence = [w for w in senti if w not in stop_words]
print(filtered_sentence)
# print(nltk.pos_tag(filtered_sentence))

# Count words that SenticNet labels as positive or negative.
for i in filtered_sentence:
    sp = SenticPhrase(i)
    info = sp.info(i)
    print(info)
    if "positive" in info['sentiment']:
        x = x + 1
    elif "negative" in info['sentiment']:
        y = y + 1
print(x, y)

# Count negation words in the raw comment.
h = comment.split()
for j in h:
    if j in b:
        z = z + 1

# A negation word flips the majority polarity.
if z >= 1:
    if x > y:
        a = "Negative"
    elif y > x:
        a = "Positive"
    else:
        a = "Neutral"
else:
    if x > y:
        a = "Positive"
    elif y > x:
        a = "Negative"
    else:
        a = "Neutral"
print(a)

print("DOCUMENT MATRIX : ")
docs = h
vec = CountVectorizer()
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names())
print(df)

# Replace spaces with underscores before sending the comment in the URL.
f = ""
for c in comment:
    if c not in ' ':
        f = f + c
    else:
        f = f + '_'

# Upload the comment and its predicted sentiment to ThingSpeak.
url = ("https://api.thingspeak.com/update?api_key=C4Y9L5WM6V3ZXB7U"
       "&field5=" + str(f) + "&field6=" + str(a))
conn = urllib.request.urlopen(url)
