
International Conference on Advances in Computing, Communication Control and Networking (ICACCCN2018)

Machine Learning Based Approach To Sentiment Analysis

Sandeep Nigam, Ajit Kumar Das, Rakesh Chandra Balabantaray
Computer Science Engineering, IIIT Bhubaneswar, Odisha, India
a116018@iiit-bh.ac.in, ajit@iiit-bh.ac.in, rakesh@iiit-bh.ac.in

Abstract—Sentiment analysis is one of the fastest growing research areas in natural language processing (NLP), making it challenging to keep track of all the activities in the field. In the present-day scenario, the internet is a source of learning, opinions, and surveys about products and services, and over time a large number of reviews are created on the web about an item, an individual, or a place. Sentiment analysis is a research area which comprehends and extracts the opinion from a given review; the analysis process incorporates natural language processing (NLP), computational linguistics, text analytics, and classification of the polarity of the opinion. Numerous algorithms exist in the field of sentiment analysis. This paper presents a taxonomy of various sentiment analysis methods, and our experiments show that logistic regression gives high accuracy compared to the other techniques.

Keywords: Sentiment analysis, Social media, Twitter, NLP, Logistic regression.

I. INTRODUCTION

Microblogging has nowadays become an extremely popular communication instrument among online users. An enormous number of mails, messages, and tweets appear daily on widely used sites such as Twitter and Facebook. Interestingly, users hold most of the power when it comes to what they want to see and how they respond; an organization's accomplishments and failures are openly expressed and spread by word of mouth. Moreover, social networking sites can change the behaviour and decision making of customers: [1] states that 87 percent of online users change their behaviour and opinions according to client reviews. Because of this, an organization can quickly observe client behaviour, which helps it provide an effective solution in order to compete with its competitors. Numerous works have effectively used an ontology to understand text [2]. At the phrase level, a sentiment analysis method should be capable of identifying the polarity of the phrase, as explained in [3].

This paper is organized as follows: Section II presents related work. Section III describes the dataset preparation. Section IV describes the proposed work and methodology. Section V deals with the experimental results, and Section VI concludes the paper.

II. RELATED WORK

In recent years, sentiment analysis has grown to be a popular topic in the NLP research community, and several papers have been published on sentiment analysis for the domain of blogs and product reviews. Researchers have contributed work on detecting sentiment in text: [4] provides a simple algorithm, called semantic orientation, and [5] analyzes the utility of linguistic features for detecting the sentiment of Twitter text. In that hierarchical method, text is classified first as containing sentiment and then categorised as positive or negative. With the growth of blogs and social networking sites, opinion mining and sentiment analysis became an area of interest for a lot of researchers. In [6] the authors used a Twitter database and trained models to perform sentiment search, developing corpora by using emoticons to gather positive as well as negative samples and then experimenting with several classifiers. The most suitable outcome was attained by a Naive Bayes classifier with a traditional information measure for feature selection; in [7] they succeeded in achieving around 81 percent accuracy on their test set. The techniques mentioned above were, however, not up to the mark for classifying emoticons (negative, positive, and neutral). Sentiment analysis details the fundamental procedure to extract polarity and subjectivity from the semantic orientation, which illustrates the strength of words and the polarity of text or phrases. As per the existing state of the art, there are two primary practices for extracting sentiment automatically: the lexicon-based technique and the machine-learning-based technique [8]. The novelty in this paper is the modeling of sentiment analysis by extracting a huge volume of tweets; prototyping is commonly employed for this purpose. In [9], customer tweets are classified into positives and negatives and represented in a pie chart and an HTML page. In [10], the authors explored location-based sentiment analysis on Twitter for recognizing trends towards the Indian general elections 2014 and also presented observations on how "social events" persuade the sentiments of Twitter members on the social network.

III. DATASET PREPARATION

This section describes the detection of the sentiment of tweets with the use of supervised learning approaches. First, the raw data has to be preprocessed, which includes cleaning and munging, so that it can be used for further processing by machine learning algorithms to achieve better accuracy.




The pre-processing of the data includes the following steps.

A. Corpus collection

In this paper we have used the Sentiment140 dataset for training, which originated from Stanford University. More information on the dataset is available at http://help.sentiment140.com/for-students/. We stored a corpus of text posts and created a dataset of two classes: positive sentiment and negative sentiment. The detailed description at that link explains each field of the dataset. The dataset provides 1.6 million entries without null entries; the important field is the sentiment column. Even though the dataset description mentions a neutral class, the training set does not contain any neutral examples: 50 percent of the data carries the positive label and the other 50 percent carries the negative label.

TABLE I
DATASET DESCRIPTION (the table of dataset fields is not legible in this copy)
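As a concrete illustration of this step, a minimal loading sketch with pandas is shown below; the file name, encoding, and column names are assumptions based on the public Sentiment140 distribution, which ships as a CSV without a header row.

```python
import pandas as pd

# A minimal sketch of loading Sentiment140 (file name and encoding are
# assumptions; the distributed CSV has no header row).
cols = ["sentiment", "id", "date", "query", "user", "text"]
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", header=None, names=cols)

# Keep only the label and the tweet text; in this dump 0 = negative, 4 = positive.
df = df[["sentiment", "text"]]
df["sentiment"] = df["sentiment"].map({0: 0, 4: 1})
print(df["sentiment"].value_counts())  # expect the 50/50 split described above
```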
B. Data Preparation And Cleaning

1) HTML decoding: HTML encoding has not been converted to text and ends up in the text field as entities such as &amp;, &quot;, etc. Decoding HTML to normal text is therefore the first step of data preparation. We have used BeautifulSoup (a Python library) to do this.

2) @mention: The next important part of the preparation is managing @mentions. Even though an @mention carries certain information (which other user the tweet mentions), it does not add value for building a sentiment analysis model, so it is removed.

3) URL links: The next important part of the cleaning is managing URL links. As with @mentions, even though a URL holds some information, it can be discarded for sentiment analysis purposes.

4) UTF-8 BOM (Byte Order Mark): Character sequences such as \xef\xbf\xbd arise from the UTF-8 BOM. The UTF-8 BOM is a sequence of bytes (EF BB BF) which allows a reader to identify a file as being encoded in UTF-8. By decoding the text with 'utf-8-sig', the BOM is handled; where an unrecognizable special character remains, '?' is used in this paper.

5) Hashtags / numbers: In some cases the text used with a hashtag provides valuable information about the tweet, so it may be risky to get rid of all the text along with the hashtag. We can remove only the '#' character without losing the context of the text, so we decided to leave the text intact and just remove the '#'. We do this during the general cleaning of the data (removing all non-letter characters as well as numbers). Subsequent steps such as tokenization, stemming/lemmatization, lower-casing, and stop word handling are performed in a later phase while creating the matrices with both the countvectorizer and the Tf-Idf vectorizer.
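Putting steps 1) to 5) together, a cleaning routine might look like the following sketch; the regular expressions and the helper name clean_tweet are our own illustration, not code from the paper.

```python
import re
from bs4 import BeautifulSoup

def clean_tweet(text: str) -> str:
    """A sketch of the cleaning steps: HTML decoding, BOM repair,
    @mention and URL removal, and stripping of non-letter characters."""
    text = BeautifulSoup(text, "html.parser").get_text()        # 1) decode HTML entities
    text = text.encode("latin-1", "ignore").decode("utf-8-sig", "replace")  # 4) BOM handling
    text = re.sub(r"@[A-Za-z0-9_]+", "", text)                  # 2) drop @mentions
    text = re.sub(r"https?://\S+|www\.\S+", "", text)           # 3) drop URLs
    text = re.sub(r"[^a-zA-Z]", " ", text)                      # 5) keep letters; drops '#' and digits
    return text.lower().strip()

print(clean_tweet("@user I LOVE this!!! &amp; see https://t.co/xyz #happy"))
# -> "i love this see happy"
```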
6) Stemming / Lemmatization: Stemming is an ordinary rule-based procedure for stripping suffixes (ing, ly, es, s, etc.) from a word. Lemmatization, on the other hand, is a structured step-by-step procedure for obtaining the root form of a word; it makes use of vocabulary (the dictionary meaning of words) and morphological analysis (word structure and grammatical relations).
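The difference between the two can be demonstrated with NLTK's standard tools, used here purely as an illustrative sketch (the paper does not name a specific stemmer or lemmatizer):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# import nltk; nltk.download("wordnet")  # one-time data download, if needed

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "knitting"]:
    # Stemming strips suffixes by rule; lemmatization consults the vocabulary.
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
```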

IV. PROPOSED WORK & METHODOLOGY

In order to apply some data visualization, we need the term frequencies of the words used in the tweets, i.e. the number of times each token is used in the entire corpus. We have used the countvectorizer to analyze the term frequencies; even though the countvectorizer is also fit for training and prediction, at this point we only extract the term frequencies for visualization. We ran it with stop words included and without limiting the maximum number of terms, and, as expected, the most frequent words are all stop words like 'to', 'the', etc. The indexes in Table II are the tokens from the tweets dataset (Sentiment140), and the numbers in the Negative and Positive columns indicate how many times the token appeared in negative and positive tweets respectively. Stop words such as 'to' and 'the' have an impact in both the negative and the positive sense across the whole dataset. Consider the first row: of all occurrences of 'to' in the dataset, about 313160 are in negative tweets and 252566 in positive tweets.

TABLE II
OCCURRENCE OF WORDS IN POSITIVE AND NEGATIVE TWEETS

Token   Negative   Positive   Total
to      313160     252566     565726
the     257836     265998     523834
my      190774     125955     316729
it      157448     147786     305234
and     153958     149642     303600
you     103844     198245     302089
not     194724     86865      281589
is      133435     111191     244626
in      115541     101160     216701
for     98999      117369     216368

The tokens above are spread through the whole corpus; the next step is to divide the token counts into the two classes (positive and negative).
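Continuing the loading sketch above, a Table II style term-frequency summary can be assembled with scikit-learn roughly as follows (variable names are ours; df is the dataframe loaded earlier):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# A sketch of building the Table II summary; df comes from the earlier snippet.
cvec = CountVectorizer()                 # stop words included, no max_features limit
doc_term = cvec.fit_transform(df["text"])

neg = np.asarray(doc_term[(df["sentiment"] == 0).to_numpy()].sum(axis=0)).ravel()
pos = np.asarray(doc_term[(df["sentiment"] == 1).to_numpy()].sum(axis=0)).ravel()

freq = pd.DataFrame({"negative": neg, "positive": pos},
                    index=cvec.get_feature_names_out())
freq["total"] = freq["negative"] + freq["positive"]
print(freq.sort_values("total", ascending=False).head(10))
```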

Stop words are not going to help much here: high-frequency words such as 'the' and 'to' are almost equally frequent in both classes, and if such stop words dominate both classes we will not get remarkable results. So we decided to remove the stop words and to limit the maximum number of features to 10000. After removing the stop words (the most frequent words) like 'to' and 'the', words like 'just' and 'good' appear most frequently. For these words we measure the impact of negativity and positivity in the total dataset by counting them.

TABLE III
OCCURRENCE OF WORDS IN POSITIVE AND NEGATIVE TWEETS AFTER REMOVING STOP WORDS

Token   Negative   Positive   Total
just    64004      62944      126948
good    29209      62118      91327
day     41374      48186      89560
like    41050      37520      78570
today   38116      30100      68216
work    45420      19529      64949
love    16990      47694      64684
going   33689      30939      64628
got     33408      28033      61445
Next we calculate the Positive Rate (PR) of each word. If a word occurs much more frequently in one class than in the other, this is an excellent measure of how strongly the word characterizes that class:

PR = Positive frequency / (Positive frequency + Negative frequency)

The words with the highest positive rates are shown in Table IV.
TABLE IV
CALCULATION OF POSITIVE RATES

Token            Negative   Positive   Total   Positive Rate
mileymonday      0          160        160     1
devidends        0          84         84      1
emailunlimited   0          110        110     1
shareholder      3          80         83      0.963
fuzzball         4          99         103     0.961
delongeday       9          170        179     0.949
recommends       8          112        120     0.933
atcha            5          81         88      0.920
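Continuing the same sketch, the positive rate is a single extra column on the frequency table:

```python
# Positive Rate per token, as defined in the text:
# PR = positive / (positive + negative).  freq is the table built earlier.
freq["pos_rate"] = freq["positive"] / (freq["positive"] + freq["negative"])
print(freq.sort_values("pos_rate", ascending=False).head(8))
```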

A. Feature extraction

The feature extraction has been done using the countvectorizer and Tf-Idf, and their performance has been monitored. We have used machine learning approaches such as Logistic Regression, Naive Bayes, and Support Vector Machines to find which of the countvectorizer and Tf-Idf representations gives better performance.

Logistic regression is a supervised learning algorithm for binary classification; this classifier is a statistical method for analyzing a dataset. Logistic regression finds the best fitting line and predicts a binary outcome:

ln(p / (1 − p)) = b0 + b1·x (1)

Here p is obtained from the sigmoid function and x is a random variable:

sigmoid(z) = 1 / (1 + e^(−z)) (2)

The Naive Bayes classifier is a simple probabilistic classifier based on Bayes' theorem and is a basic text classification technique. Its variations, Multinomial Naive Bayes and Bernoulli Naive Bayes, are the most used: Multinomial Naive Bayes is suitable whenever the number of occurrences of words matters in the classification problem, while Bernoulli Naive Bayes works with the presence or absence of a particular word.

p(A|B) = p(B|A)·p(A) / p(B) (3)

The SVM approach is initially characterized over a vector space where the problem is to find a decision surface that 'best' partitions the data into two classes. For a linearly separable space, the decision surface is a hyperplane, expressed mathematically as

w·x + c = 0 (4)

where w is the weight vector, x is an arbitrary object to be categorized, and the constant c is learnt from a training set of linearly separable objects. In SVM we need to find the maximum margin, i.e. the distance between the hyperplane and the closest elements on either side of it. The total margin is computed as

1/||w|| + 1/||w|| = 2/||w|| (5)

As per the SVM algorithm, finding the best line, i.e. the best decision boundary, separates our space into the two classes.
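A minimal sketch of training and comparing these three families on the count features is shown below; the train/validation split size is our assumption, not a figure reported in the paper.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# df comes from the loading sketch; the 2% validation split is an assumption.
x_train, x_val, y_train, y_val = train_test_split(
    df["text"], df["sentiment"], test_size=0.02, random_state=42)

vec = CountVectorizer(max_features=10000)
xtr, xva = vec.fit_transform(x_train), vec.transform(x_val)

for model in [LogisticRegression(max_iter=1000), MultinomialNB(),
              BernoulliNB(), LinearSVC()]:
    model.fit(xtr, y_train)
    print(type(model).__name__, accuracy_score(y_val, model.predict(xva)))
```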
1) Countvectorizer: With the countvectorizer we simply count the number of times each word appears in each text. As an example, say the corpus has 3 records: 'I love dogs', 'I hate dogs and knitting', 'Knitting is my hobby and my passion'. If we build the vocabulary from these three sentences and represent each record as a count vector, the result looks like Table V.

TABLE V
COUNTING THE OCCURRENCES OF WORDS USING THE COUNTVECTORIZER

       I   love   dogs   hate   and   knitting   is   my   hobby   passion
rec1   1   1      1      0      0     0          0    0    0       0
rec2   1   0      1      1      1     1          0    0    0       0
rec3   0   0      0      0      1     1          1    2    1       1

But if the corpus gets large, the vocabulary becomes too large to process, so we decided to restrict the vocabulary. Stop words are words which do not carry useful information, such as 'the', 'of', etc. We ran the same test with and without stop words on N-grams (unigram, bigram, trigram): unigrams were tested both with and without stop words, while bigrams and trigrams were tested with stop words only, and the results were then compared. In addition, we also specified a custom stop word list holding the 10 most common words in the corpora: 'to', 'the', 'my', 'it', 'and', 'you', 'not', 'is', 'in', 'for'.
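Table V can be reproduced directly with scikit-learn; a small sketch (note that we override the default token pattern so that the one-letter token 'I' is kept, and that the printed columns come out in alphabetical rather than sentence order):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

records = ["I love dogs", "I hate dogs and knitting",
           "Knitting is my hobby and my passion"]

# The token_pattern override keeps one-letter tokens such as "I",
# which CountVectorizer drops by default.
cvec = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = cvec.fit_transform(records)
print(pd.DataFrame(counts.toarray(), columns=cvec.get_feature_names_out(),
                   index=["rec1", "rec2", "rec3"]))
```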


An N-gram is a contiguous sequence of N items from a given sample of text or speech. The number of N-grams in a sentence is given by

N_k = n − (N − 1) (6)

where n is the number of words in the sentence and N_k is the number of N-grams it contains for a given N. We illustrate with the example sentence 'The dog jumps over the moon':

• If N = 1 we call it a unigram. For the above example we get 6 unigrams: the, dog, jumps, over, the, moon.
• If N = 2 we call it a bigram. For the example stated above we get 5 bigrams: the dog, dog jumps, jumps over, over the, the moon.
• If N = 3 we call it a trigram. For the above example we get 4 trigrams: the dog jumps, dog jumps over, jumps over the, over the moon.
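The same unigrams, bigrams, and trigrams can be produced with scikit-learn's analyzer; a short sketch using the example sentence from the text:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "The dog jumps over the moon"
for n in (1, 2, 3):
    # build_analyzer() returns the tokenize-and-n-gram function the vectorizer uses.
    analyzer = CountVectorizer(ngram_range=(n, n)).build_analyzer()
    grams = analyzer(sentence)
    print(f"{n}-grams ({len(grams)}):", grams)   # counts match formula (6)
```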

Logistic regression is a linear model and is computationally scalable to large data, so we examine its accuracy on the validation set. Table VI reports training on different numbers of features and the resulting validation accuracy of logistic regression. The unigram results are demonstrated under three different conditions: with all stop words removed, with only the custom stop word list removed, and with stop words retained.

2) Tf-Idf: Tf-Idf, short for Term Frequency-Inverse Document Frequency, is another method to convert textual data to numeric form. The Tf-Idf weight of a term is

weight_{j,k} = tf_{j,k} · log(N / df_j) (7)

where weight_{j,k} is the weight of term j in document k, N is the number of documents in the collection, tf_{j,k} is the term frequency of term j in document k, and df_j is the document frequency of term j in the collection. The vector value it yields is the product of the two factors TF and IDF. We initialized the Tf-Idf vectorizer, fit the Tf-Idf-transformed data to logistic regression, and evaluated the validation accuracy for different numbers of features; a comparison between the countvectorizer and Tf-Idf is also shown in a graph below.
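A minimal sketch of this evaluation loop, continuing the earlier snippets (x_train/x_val is the assumed split from before, and we read 'trigram with stop words' as ngram_range=(1, 3) with no stop word removal):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sweep the feature budget as in Tables VI/VII.
for n_features in (10000, 50000, 100000):
    tvec = TfidfVectorizer(max_features=n_features, ngram_range=(1, 3))
    xtr, xva = tvec.fit_transform(x_train), tvec.transform(x_val)
    clf = LogisticRegression(max_iter=1000).fit(xtr, y_train)
    print(n_features, accuracy_score(y_val, clf.predict(xva)))
```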

V. RESULTS & DISCUSSION

In this section we discuss the performance comparisons carried out with the different settings. Table VI below reports the performance of the countvectoriser 'with stop words', 'without stop words', and 'with custom stop words'. From these observations we can conclude that, for unigrams, the maximum accuracy is obtained when stop words are kept, compared to the countvectoriser 'without stop words' and 'with custom stop words'.

TABLE VI
PERFORMANCE OF THE COUNTVECTORIZER

Features   Unigram without SW   Unigram with SW   Unigram without custom SW   Bigram with SW   Trigram with SW
10000      77.19%               79.47%            78.57%                      80.92%           80.80%
20000      77.43%               79.80%            78.68%                      81.12%           81.30%
30000      77.58%               79.89%            78.86%                      81.49%           81.38%
40000      77.70%               79.88%            78.85%                      81.64%           81.77%
50000      77.78%               79.89%            78.95%                      81.80%           82.01%
60000      77.73%               79.88%            78.95%                      81.72%           82.00%
70000      77.73%               79.92%            78.93%                      81.64%           82.05%
80000      77.74%               79.93%            78.93%                      81.73%           82.12%
90000      77.73%               79.95%            78.99%                      81.76%           82.05%
100000     77.69%               79.96%            78.97%                      81.82%           82.16%

Fig. 1. Unigram accuracy with and without stop words (figure not reproduced here).
Fig. 2. N-gram(1-3) test result accuracy using the countvectorizer method (figure not reproduced here).

Fig. 2 plots the performance of the countvectoriser on unigrams, bigrams, and trigrams with stop words; from the graph we can see that the countvectoriser with trigrams gives the highest accuracy of the three.


Table VII reports, for the Tf-Idf representation, training on different numbers of features and the accuracy of logistic regression on the validation set, comparing unigrams, bigrams, and trigrams with stop words.
TABLE VII
PERFORMANCE OF TF-IDF WITH UNIGRAM, BIGRAM, TRIGRAM

Features   Unigram with SW   Bigram with SW   Trigram with SW
10000      79.77%            81.12%           81.05%
20000      79.96%            81.36%           81.57%
30000      80.03%            81.96%           81.79%
40000      80.11%            81.97%           82.13%
50000      80.19%            82.06%           82.17%
60000      80.17%            82.09%           82.34%
70000      80.16%            82.12%           82.42%
80000      80.14%            82.19%           82.52%
90000      80.16%            82.21%           82.49%
100000     80.17%            82.22%           82.59%
Fig. 3. N-gram(1-3) test result accuracy of the countvectoriser and Tf-Idf (figure not reproduced here).

Fig. 3 compares the performances of Tf-Idf and the countvectoriser with unigrams, bigrams, and trigrams. From the graph we can see that Tf-Idf with trigrams gives the highest accuracy; on comparison, Tf-Idf with stop words gives a more accurate result than the countvectoriser with stop words.

We applied different machine learning algorithms for the analysis: logistic regression, SVM, L1-based SVM, Multinomial Naive Bayes, and Bernoulli Naive Bayes. We compared the performances of the different algorithms and tabulated them in Table VIII; logistic regression gives high accuracy compared to the other techniques.
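A sketch of how this final comparison can be run on the Tf-Idf trigram features; reading 'L1-based SVM' as L1-driven feature selection followed by a linear SVM is our interpretation, and the data variables are the assumed ones from the earlier sketches:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC, SVC
from sklearn.metrics import accuracy_score

tvec = TfidfVectorizer(max_features=100000, ngram_range=(1, 3))
xtr, xva = tvec.fit_transform(x_train), tvec.transform(x_val)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(),
    "Linear SVM + L1 feature selection": make_pipeline(
        SelectFromModel(LinearSVC(penalty="l1", dual=False)), LinearSVC()),
    "rbf SVM": SVC(kernel="rbf"),   # very slow on 1.6M tweets; subsample in practice
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Multinomial Naive Bayes": MultinomialNB(),
}
for name, model in models.items():
    model.fit(xtr, y_train)
    print(name, accuracy_score(y_val, model.predict(xva)))
```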
TABLE VIII
PERFORMANCE EVALUATION OF VARIOUS TECHNIQUES

Model                                        Validation set accuracy
Logistic regression                          82.59%
Linear SVM                                   82.04%
Linear SVM with L1-based feature selection   82.01%
rbf SVM                                      76.51%
Bernoulli Naive Bayes                        80.14%
Multinomial Naive Bayes                      78.98%

VI. CONCLUSION

In this paper, sentiment analysis using machine learning has been proposed, with all the analysis carried out for the English language only. A set of experiments was carried out to validate our model for aspect identification and aspect-level sentiment classification. In particular, unigrams were evaluated with stop words, without stop words, and with custom stop words, whereas bigrams and trigrams were evaluated with stop words only. The results depicted in this paper clearly show that the best accuracy, 82.59 percent, is obtained with logistic regression on the Tf-Idf vectorizer with stop words on trigrams. In our future research we will apply the models to different Indian languages.

REFERENCES

[1] A. K. Jose, N. Bhatia, and S. Krishna, "Twitter Sentiment Analysis," National Institute of Technology Calicut, 2010.
[2] I. Boguslavsky, "Semantic Descriptions for a Text Understanding System," in Computational Linguistics and Intellectual Technologies, Papers from the Annual International Conference Dialogue 2017, pp. 14-28.
[3] T. Wilson, J. Wiebe, and P. Hoffmann, "Recognizing contextual polarity in phrase-level sentiment analysis," Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 347-354, October 2005.
[4] P. D. Turney, "Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews," Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), July 2002, pp. 417-424.
[5] E. Kouloumpis, T. Wilson, and J. Moore, "Twitter Sentiment Analysis: The Good the Bad and the OMG!," Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, 2011.
[6] B. Pang and L. Lee, "Opinion mining and sentiment analysis," Foundations and Trends in Information Retrieval, vol. 2, 2008, pp. 1-135.
[7] A. Go, L. Huang, and R. Bhayani, "Twitter sentiment analysis," Final Projects from CS224N, Spring 2008/2009, The Stanford Natural Language Processing Group, 2009.
[8] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede, "Lexicon-Based Methods for Sentiment Analysis," Computational Linguistics, 2011.
[9] A. Sarlan, C. Nadam, and S. Basri, "Twitter sentiment analysis," Proceedings of the 6th International Conference on Information Technology and Multimedia, 2014.
[10] O. Almatrafi, S. Parack, and B. Chavan, "Application of Location-Based Sentiment Analysis Using Twitter for Identifying the Trend towards Indian General Election 2014," Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, January 2015.