
The Naive Bayes Classifier in Opinion Mining:

In Search of the Best Feature Set

Liviu P. Dinu and Iulia Iuga

University of Bucharest, Faculty of Mathematics and Computer Science,


Center for Computational Linguistics,
14 Academiei, RO-010014, Bucharest, Romania
ldinu@fmi.unibuc.ro, iuliaiuga1211@gmail.com

Abstract. This paper focuses on how naive Bayes classifiers work in opinion mining applications. The first question asked is which feature sets to choose when training such a classifier in order to obtain the best results in the classification of objects (in this case, texts). The second question is whether combining the results of naive Bayes classifiers trained on different feature sets has a positive effect on the final results. Two data sets consisting of negative and positive movie reviews were used for training and testing the classifiers.

1 Introduction
During the last decade, text mining [15] has received a lot of attention, due
to the explosion of available data (over 80% of information is stored as text).
Typical text mining tasks include text categorization and text clustering [4],
humor characterization [9], text coherence investigation [3], and opinion mining
and sentiment analysis [12], [8].
This paper focuses on how naive Bayes classifiers work in the opinion mining
field, which has received a boost as on-line social media (blogs [2], social
networks [10], etc.) has risen and the interest in quickly determining the
general opinion on certain topics has increased.
Given a set of subjective texts that express opinions about a certain object,
the purpose is to extract those attributes (features) of the object that have
been commented on in the given texts and to determine whether these texts are
positive, negative or neutral.
A couple of interesting applications are sentiment analysis
tools for Twitter status updates (http://www.tweetfeel.com/ and
http://twittersentiment.appspot.com/ are relevant examples, but not
the only ones) and the analysis of short comments on film reviews [16], [11].

1.1 Preliminaries
In the "bag-of-words" model [7], we start from the simplifying assumption
that a text can be represented as a collection of words in which grammar rules
are negligible and even word order is unimportant.
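As a rough illustration of this representation (a sketch in our own words, not code taken from the paper), each text is reduced to a dictionary that records which words occur in it, discarding order and grammar:

    # Minimal sketch of the bag-of-words representation: a text becomes a
    # word-presence dictionary; word order and grammar are discarded.
    def bag_of_words(tokens):
        return dict((word, True) for word in tokens)

    bag_of_words("the plot is astounding".split())
    # {'the': True, 'plot': True, 'is': True, 'astounding': True}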


Bayes classifiers [6] assign the most likely class to a given example described
by its feature vector. Training such classifiers can be significantly simplified by
assuming that the features are conditionally independent given the class, that is:

P(X | C) = \prod_{i=1}^{n} P(X_i | C)                                        (1)

where X = (X_1, X_2, ..., X_n) is a feature vector and C is a class. Despite
the unrealistic assumption that the features are independent of each other,
the resulting classifier, called the naive Bayes classifier, is very successful in
practice: it has the great advantages of being much faster than more sophisticated
techniques and much more approachable, while generating competitive results.
Naive Bayes has proven effective and is often used in many practical applications
besides the text classification discussed here, such as medical diagnosis [5] and
system performance management [14]. In order to determine the probability of a
word appearing in a positive versus a negative context, we first train a classifier,
in this case a naive Bayes classifier, on a set of annotated features (here the
labels are "positive" and "negative").
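As a rough illustration (not the authors' exact code), training such a classifier in NLTK, the toolkit used for the experiments in Section 2, amounts to passing it a list of (feature dictionary, label) pairs:

    import nltk

    # Hypothetical toy training data: word-presence features paired with labels.
    train_set = [
        ({'outstanding': True, 'seamless': True}, 'pos'),
        ({'ludicrous': True, 'insulting': True}, 'neg'),
    ]

    # Train a naive Bayes classifier on the annotated feature sets.
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Classify a new, unlabeled feature set.
    print(classifier.classify({'outstanding': True}))  # expected: 'pos'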
In the field of statistics, the accuracy of a measurement system is the degree
of closeness of measurements of a quantity to that quantity’s actual (true) value.
The precision is the fraction of retrieved instances that are relevant, while recall
is the fraction of relevant instances that are retrieved.
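In terms of true/false positives and negatives (a standard reformulation, added here for clarity rather than taken from the original text), for a given class these measures read:

    precision = TP / (TP + FP),    recall = TP / (TP + FN),
    accuracy  = (TP + TN) / (TP + TN + FP + FN),

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.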

1.2 Motivation and Personal Contributions


The purpose of our study was to determine the best way of selecting features
from the texts to be analysed in opinion mining applications that use the naive
Bayes classifier as the decision maker. We were interested in performance both
in terms of accuracy and in terms of time efficiency, which is important in
practical applications.
We first ran a series of tests on features of the input texts that we considered
relevant to classifying the texts from the opinion mining point of view, and then
proposed two techniques for combining individual classifiers trained on different
feature sets.
At the end of this paper we briefly present an application we have developed,
which implements an algorithm for determining the percentage of positive,
negative and neutral NewsGroup reviews and user comments for films listed on
the Internet Movie Database, disregarding their star scores and using only naive
Bayes classification as the analysis tool.

2 Tests
In this section we will present the results obtained when training and testing
naive Bayes classifiers on ten different feature sets. Each feature set will be
discussed in the following.

The data set used for training the naive Bayes classifiers was Polarity Dataset
v2.0 [12]. It consists of 1000 positive movie reviews and 1000 negative ones (we
can assume that the classifiers receive equal numbers of positive and negative
features when trained). This corpus is included in NLTK (Natural Language
ToolKit, a tool that we used when programming these tests) under the name
movie_reviews.
For testing data we used Polarity Dataset v1.0 [12], which consists of 700
positive movie reviews and 700 negative ones.
Both data sets can be found and downloaded at the following link:
http://www.cs.cornell.edu/people/pabo/movie-review-data/.
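For reference, the training corpus can be loaded directly through NLTK (a minimal sketch; the test corpus, Polarity Dataset v1.0, is not bundled with NLTK and has to be downloaded separately from the URL above):

    from nltk.corpus import movie_reviews

    # The corpus ships with two categories, 'pos' and 'neg'.
    print(movie_reviews.categories())          # ['neg', 'pos']
    print(len(movie_reviews.fileids('pos')))   # 1000 positive reviews

    # Tokenized words of one review.
    words = movie_reviews.words(movie_reviews.fileids('pos')[0])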

2.1 Results Obtained for Testing the Classification Given by the Naive Bayes Classifier When Different Features Were Taken into Consideration

The features which were considered for testing the classification given by the
naive Bayes classifier are listed below. The results of all the tests are summarized
in Figure 1; a Python sketch of the corresponding feature extractors is given
after the enumeration of the feature sets:

1. Test no. 1: The first test was run considering all the words in the texts as features.
For this test, the most informative features were:
avoids = True pos : neg = 13.0 : 1.0
astounding = True pos : neg = 12.3 : 1.0
slip = True pos : neg = 11.7 : 1.0
outstanding = True pos : neg = 11.5 : 1.0
ludicrous = True neg : pos = 11.0 : 1.0
fascination = True pos : neg = 11.0 : 1.0
3000 = True neg : pos = 11.0 : 1.0
insulting = True neg : pos = 11.0 : 1.0
sucks = True neg : pos = 10.6 : 1.0
hudson = True neg : pos = 10.3 : 1.0
2. Test no. 2: For the second test, we eliminated the stopwords from the texts,
expecting that they do not weigh much in terms of subjectivity. It turned
out that filtering the stopwords made no difference.
Same as before, the most informative features were:
avoids = True pos : neg = 13.0 : 1.0
astounding = True pos : neg = 12.3 : 1.0
slip = True pos : neg = 11.7 : 1.0
outstanding = True pos : neg = 11.5 : 1.0
ludicrous = True neg : pos = 11.0 : 1.0
fascination = True pos : neg = 11.0 : 1.0
3000 = True neg : pos = 11.0 : 1.0
insulting = True neg : pos = 11.0 : 1.0
sucks = True neg : pos = 10.6 : 1.0
hudson = True neg : pos = 10.3 : 1.0

3. Test no. 3: For this test, we applied a stemmer to the words, trying to find
out whether the roots of the words alone would be sufficient to obtain the
information we are looking for. According to this test, the different inflected
forms of the words are relevant when expressing opinions.
For this test, the most informative features were:

plod = True neg : pos = 13.7 : 1.0


misfir = True neg : pos = 11.7 : 1.0
outstand = True neg : pos = 11.5 : 1.0
incoher = True neg : pos = 11.0 : 1.0
3000 = True neg : pos = 11.0 : 1.0
predat = True neg : pos = 10.3 : 1.0
hudson = True neg : pos = 10.3 : 1.0
seamless = True pos : neg = 10.3 : 1.0
hatr = True pos : neg = 10.3 : 1.0
ideolog = True pos : neg = 10.3 : 1.0

4. Test no. 4: For the next test, we took into consideration the bigrams (pairs
of words) from the texts, plus all the words. It appears that collocations of
words from the text help determine polarity.
The most informative features were:

(’give’, ’us’) = True pos : neg = 14.3 : 1.0


avoids = True pos : neg = 13.0 : 1.0
(’quite’, ’frankly’) = True pos : neg = 12.3 : 1.0
astounding = True pos : neg = 12.3 : 1.0
(’does’, ’so’) = True neg : pos = 12.3 : 1.0
slip = True pos : neg = 11.7 : 1.0
(’&’, ’robin’) = True neg : pos = 11.7 : 1.0
(’fairy’, ’tale’) = True neg : pos = 11.7 : 1.0
outstanding = True neg : pos = 11.5 : 1.0
ludicrous = True neg : pos = 11.0 : 1.0

5. Test no. 5: For this test we considered as the feature set for training and
testing the 10,000 most frequent words. The words with the highest
frequencies are indeed among the most relevant, but there is still room for
improvement and, according to the previous test, the collocations provide
slightly more information. That leads to the next idea: to combine the two,
that is, to train and test a classifier on all the bigrams from the texts together
with the 10,000 most frequent words.
For this test, the most informative features were:

avoids = True pos : neg = 13.0 : 1.0


astounding = True pos : neg = 12.3 : 1.0
slip = True pos : neg = 11.7 : 1.0
outstanding = True pos : neg = 11.5 : 1.0
ludicrous = True neg : pos = 11.0 : 1.0
fascination = True pos : neg = 11.0 : 1.0
3000 = True neg : pos = 11.0 : 1.0
insulting = True neg : pos = 11.0 : 1.0
sucks = True neg : pos = 10.6 : 1.0
thematic = True neg : pos = 10.3 : 1.0

6. Test no. 6: All bigrams and most frequent words. When using the most
frequent words together with the bigrams as the feature set for training and
testing, there is no improvement compared to the test run on bigrams alone.
The most informative features were:
(’give’, ’us’) = True pos : neg = 14.3 : 1.0
avoids = True pos : neg = 13.0 : 1.0
(’quite’, ’frankly’) = True pos : neg = 12.3 : 1.0
astounding = True pos : neg = 12.3 : 1.0
(’does’, ’so’) = True neg : pos = 12.3 : 1.0
slip = True pos : neg = 11.7 : 1.0
(’&’, ’robin’) = True neg : pos = 11.7 : 1.0
(’fairy’, ’tale’) = True neg : pos = 11.7 : 1.0
outstanding = True neg : pos = 11.5 : 1.0
ludicrous = True neg : pos = 11.0 : 1.0

7. Test no. 7: A different approach was to use as features those parts of speech
that seem to express the most subjectivity, namely the adjectives and the
adverbs. Test number 7 was done on adjectives only.
In order to extract the adjectives from the text we used the WordNet
thesaurus (also included as a package in the Natural Language ToolKit) and
kept the words of the movie reviews that appear at least once in WordNet
as adjectives. We did not use a part-of-speech tagger, but that is a technique
worth investigating. The same tactic was used in the next test for extracting
the adverbs.
8. Test no. 8: This test was done on both the adjectives and the adverbs from
the texts.
9. Test no. 9: Going in this direction, another idea was that we might benefit
from adding to the adjectives, extracted from the texts in the same manner
as before, their WordNet synonyms.
This did not improve our results; on the contrary, and the reason could
be that we did not determine the meaning of those adjectives in their
contexts, so we added to the training feature sets all the possible synonyms
of those words, disregarding their actual meaning in context.

NaiveBayes Accuracy Neg precision Neg recall Pos precision Pos recall
Test 1     86.07    97.37         74.14      79.12         98.00
Test 2     86.07    97.37         74.14      79.12         98.00
Test 3     83.79    97.02         69.71      76.37         97.86
Test 4     91.00    95.13         86.43      87.57         95.57
Test 5     89.71    92.64         86.29      87.17         93.14
Test 6     91.00    95.13         86.43      87.57         95.57
Test 7     81.50    96.03         65.71      73.94         97.29
Test 8     82.36    95.94         67.57      74.97         97.14
Test 9     70.43    94.97         43.14      63.22         97.71
Test 10    53.50    54.56         41.86      52.84         65.14

Fig. 1. Precision of feature sets

An interesting direction to pursue from this point would be to use a
disambiguation algorithm to establish the meaning of the adjectives in their
contexts, to add to the training feature sets only the synonyms of those
meanings, and then to apply the classifier. This might lead to better results.
10. Test no. 10: Another question raised was how much information is provided
by the parts of speech in the texts studied, that is, how much one could find
out about the subjectivity of a text when looking only at the parts of speech
(nouns, adjectives, adverbs and verbs) that the text consists of. So, for this
test, the feature sets were the parts of speech of the reviews.
As it turns out, this test is not very relevant for our purposes and the results
are about as good as flipping a coin. However, another idea to pursue is
whether the way parts of speech are sequenced is relevant in this type of
classification; maybe a certain ordering is more likely in a positive text than
in a negative one.
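The following Python sketch illustrates the kinds of feature extractors described in the tests above. It is our own illustration built on NLTK and WordNet; the function names and parameter choices are ours and do not reproduce the authors' exact code:

    import nltk
    from nltk.corpus import stopwords, wordnet
    from nltk.stem import PorterStemmer
    from nltk.probability import FreqDist

    STOPWORDS = set(stopwords.words('english'))
    STEMMER = PorterStemmer()

    def all_words(words):
        # Test 1: every word is a (binary presence) feature.
        return dict((w, True) for w in words)

    def without_stopwords(words):
        # Test 2: the same, with stopwords filtered out.
        return dict((w, True) for w in words if w not in STOPWORDS)

    def stemmed(words):
        # Test 3: use word roots instead of surface forms.
        return dict((STEMMER.stem(w), True) for w in words)

    def words_and_bigrams(words):
        # Tests 4 and 6: all words plus all bigrams (adjacent word pairs).
        bigrams = list(nltk.bigrams(words))
        return dict([(w, True) for w in words] + [(b, True) for b in bigrams])

    def build_vocabulary(all_training_words, k=10000):
        # For Test 5: the k most frequent words of the training corpus.
        return set(w for w, _ in FreqDist(all_training_words).most_common(k))

    def frequent_words(words, vocabulary):
        # Test 5: keep only words that belong to the frequent-word vocabulary.
        return dict((w, True) for w in words if w in vocabulary)

    def wordnet_adjectives(words):
        # Tests 7-9 start from this filter: keep words that occur at least
        # once in WordNet as adjectives.
        return dict((w, True) for w in words if wordnet.synsets(w, pos=wordnet.ADJ))

    def pos_tags(words):
        # Test 10: only the part-of-speech tags of the words (nltk.pos_tag is
        # one possible way to obtain them; the paper does not name a tagger).
        return dict((tag, True) for _, tag in nltk.pos_tag(words))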

For every feature set we computed the accuracy, negative precision, negative
recall, positive precision and positive recall. The results obtained for the
classification given by the naive Bayes classifier when the previous 10 feature
sets were taken into consideration are summarized in Figure 1.
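A hedged sketch of how such an evaluation can be run with NLTK follows; the helper name extract_features is our own placeholder for one of the extractors sketched above:

    import collections
    import nltk
    from nltk.metrics import precision, recall

    def evaluate(train_docs, test_docs, extract_features):
        # train_docs / test_docs: lists of (word_list, label) pairs.
        train_set = [(extract_features(w), label) for w, label in train_docs]
        test_set = [(extract_features(w), label) for w, label in test_docs]
        classifier = nltk.NaiveBayesClassifier.train(train_set)

        # Collect reference and predicted document indices per class.
        refsets = collections.defaultdict(set)
        testsets = collections.defaultdict(set)
        for i, (feats, label) in enumerate(test_set):
            refsets[label].add(i)
            testsets[classifier.classify(feats)].add(i)

        print('accuracy:', nltk.classify.accuracy(classifier, test_set))
        print('pos precision:', precision(refsets['pos'], testsets['pos']))
        print('pos recall:', recall(refsets['pos'], testsets['pos']))
        print('neg precision:', precision(refsets['neg'], testsets['neg']))
        print('neg recall:', recall(refsets['neg'], testsets['neg']))
        classifier.show_most_informative_features(10)
        return classifier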
Example 1. We show in Figure 2 examples of classification of the documents
on which the testing was done into the "positive" and "negative" categories.

Fig. 2. Example of classification

On the left side you can see the list of documents classified by naive Bayes
as positive: for example, the document cv002 tok-12931.txt has a probability of
0.0229 of being negative and a probability of 0.9771 of being positive, and it is
therefore included in the positive documents list.
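These per-document probabilities can be read off directly from the trained classifier; a minimal sketch, where classifier is a trained nltk.NaiveBayesClassifier and feats the feature dictionary of one test document:

    # Per-class probabilities for one document's feature dictionary.
    dist = classifier.prob_classify(feats)
    p_pos = dist.prob('pos')
    p_neg = dist.prob('neg')
    label = 'pos' if p_pos > p_neg else 'neg'
    print(p_neg, p_pos, label)   # e.g. 0.0229 0.9771 pos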

Remark 1. We split the texts into thirds and ran the same tests described before
on these parts. The accuracy decreased slightly for each of the thirds. This
indicates that there is no rule about having more information about sentiment
polarity at the beginning as opposed to the end or the middle.

2.2 Combining Classifiers

As discussed in the previous subsection, for each feature set we built a naive
Bayes classifier that we trained on the first data set and tested on the second
one. The results for each of the feature sets listed before can be read in Figure 1
in the previous subsection. In the following we provide two methods for
combining classifiers, which we applied to the previous feature sets.
Each classifier calculates a certain probability for a document to be positive
or negative. If the probability of a text being positive is bigger than that of it
being negative, the classifier assigns it the positive class, and the other way
around. We will therefore have a resulting list that looks like this:
{(neg > pos), (pos > neg), ...}.

The Majority Rule. For each document, we assign the class that appears the
majority of times in the list generated by the classifiers.
Probability Aggregation. We calculate the sum of the positive/negative
probabilities given by each classifier. Then, if the sum of the positive
probabilities is bigger than that of the negative ones, we assign the document
the positive class, and vice versa.
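A short sketch of the two combination rules, assuming a list of trained NLTK classifiers and, for one document, the feature dictionaries each classifier expects (our own illustration, not the authors' code):

    def combine_majority(classifiers, featuresets):
        # Majority rule: the label predicted by most individual classifiers.
        # Each classifier was trained on a different feature set, so each one
        # receives its own feature dictionary for the same document.
        votes = [c.classify(f) for c, f in zip(classifiers, featuresets)]
        return max(set(votes), key=votes.count)

    def combine_probability(classifiers, featuresets):
        # Probability aggregation: sum the per-class probabilities and compare.
        pos = sum(c.prob_classify(f).prob('pos')
                  for c, f in zip(classifiers, featuresets))
        neg = sum(c.prob_classify(f).prob('neg')
                  for c, f in zip(classifiers, featuresets))
        return 'pos' if pos > neg else 'neg'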

Table 1. The majority rule

Method 1 Accuracy Neg precision Neg recall Pos precision Pos recall
c1 - c5 86.57 97.23 75.29 79.84 97.86
c1 - c7 87.79 97.32 77.71 81.45 97.86
c1 - c9 86.79 97.42 75.57 80.05 98.00

Table 2. Probability aggregation

Method 2 Accuracy Neg precision Neg recall Pos precision Pos recall
c1 - c5 87.29 97.11 76.86 80.85 97.71
c1 - c7 87.79 97.32 77.71 81.45 97.86
c1 - c9 87.14 97.27 76.43 80.59 97.86

As seen in Tables 1 and 2, we did not find a combining method that increases
the performance of the combined classifiers compared to the individual ones. An
idea would be to take advantage of the differences in the recall and precision
measurements obtained on different feature sets. That is, suppose we have a
number of classifiers trained on feature sets for which they generate high positive
precision values, and another series of classifiers that have a high positive recall
(and the other way around for the negative values). By combining them, they
might balance each other out and lead to better results. In our case, we did not
have good examples of independent feature sets with which to implement this
idea, but we consider it worth investigating.

3 An Application: Opinion Mining for IMDb User Comments and NewsGroup Reviews
The conclusion drawn after running the tests in Section 2 of this paper is that
the best results are obtained for the feature set consisting of the bigram
collocations found in the texts. However, there is not such a big difference in
performance between this method and the one using the most-frequent-words
feature sets – in this particular case, the difference is approximately 1%. But
there is a significant difference in terms of time consumption, the latter being a
much faster method (it runs 2-3 times faster than the former). This aspect is
very important in practical applications meant for non-specialists, which is why
we chose to extract the most frequent words from the analyzed texts. The
application is basically an in-browser application consisting of two buttons
placed in the toolbar of the browser. When one of them is pressed and the user
is visiting the page corresponding to a movie or TV series listed on the imdb.com
website, the application calculates the percentage of positive, negative and
neutral user comments, respectively reviews posted in the NewsGroup. (When
the user is on a different page, they receive an error message; also, if there are
no comments/reviews, a message informing them of this is displayed when the
buttons are pressed.)

Fig. 3. The toolbar of a browser

Fig. 4. The dialog boxes

This categorization is made exclusively on the basis of the algorithm discussed
previously, without taking into consideration the scores the commentators might
have given for that particular title. The algorithm assigns each tested comment
a score in the form of positive/negative probabilities; if the probability of a
certain review being positive is somewhere between 0.45 and 0.55, the review is
considered neutral (the same holds for the negative probability).
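A minimal sketch of this decision rule (the 0.45-0.55 neutral band is the one mentioned above; the helper name is ours):

    def label_review(classifier, feats):
        # Return 'pos', 'neg' or 'neutral' for one review's feature dictionary.
        p_pos = classifier.prob_classify(feats).prob('pos')
        if 0.45 <= p_pos <= 0.55:
            return 'neutral'
        return 'pos' if p_pos > 0.55 else 'neg'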
We show three screenshots: the first one (Figure 3) shows the toolbar of a
browser, where the two buttons are to be found, while the second and third ones
(Figure 4) show the dialog boxes displayed after pressing the two buttons when
the user is visiting the page of a particular title listed on the IMDb website.
In order to minimize the time spent on calculations, we start with an already
trained classifier. For this training, we used the same data set used in the
training step of the tests presented in Section 2 (1000 positive and 1000 negative
movie reviews).
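One straightforward way to ship an already trained classifier with such an application is to serialize it once and load it at start-up; a sketch under that assumption (the file name is arbitrary):

    import pickle

    # Done once, offline, after training on the 2000 labelled reviews.
    with open('nb_movie_reviews.pickle', 'wb') as f:
        pickle.dump(classifier, f)

    # Done at application start-up, so no training time is spent per request.
    with open('nb_movie_reviews.pickle', 'rb') as f:
        classifier = pickle.load(f)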

Remark 2. When the number of user comments assigned to a movie increases,
the percentage of positive/negative results generated with this algorithm gets
very close to the star score of that title on IMDb, which makes sense (bear in
mind that any logged-in user is allowed to rate a title on IMDb and does not
necessarily have to leave a comment in order to do so). The same does not
happen for the reviews listed in the NewsGroup; this is not so surprising, as
these reviews are generally posted by critics, who might not share the opinion
of the general public.

By developing this application, we were able to observe how classification
with naive Bayes works in real-case scenarios. The fact that, given a large
number of user comments, the results the algorithm gave were increasingly close
to the star rating is proof enough that this is a good path, worth investigating
and improving in the future.

4 Conclusions
Nowadays, due to the explosion of the Internet, we deal with an unprecedented
amount of data published by people all over the world who express their opinions
on different topics. In most cases, when we need to access and determine the
general opinion on certain topics, we do not have a rating system available (such
as the one provided by IMDb), so developing opinion mining methods that are
fast and as efficient as possible is a current issue.
This study was focused on developing such a method that uses a time-efficient
classification algorithm. We decided to use the naive Bayes classifier. After per-
forming a series of tests to determine its performance when running on different
feature sets extracted from the analysed texts, we came to the conclusion that
the best option for selecting the feature set for real time applications is extract-
ing a relevant number of the most frequent words from the texts used for training
the classifier. For such a simple and apparently overly-simplifying technique, it
performs very well and extremely fast.
While trying to find the feature sets most likely to be relevant in opinion
mining, we made a series of assumptions, such as that groups of words might
give more information (correct) and that the most frequent words weigh more
(also correct), but we also found new leads that we think are worth further
investigation: some parts of speech (adjectives, adverbs) provide more
information than others; this seems like a sensible assumption, but information
is lost when extracting the adjectives from the texts (we used the WordNet
thesaurus to determine whether each word has at least one adjective/adverb
meaning; a better solution might be to use a part-of-speech tagger). Then, we
tried to add to the training feature set the synonyms of the adjectives selected
in that way; this method also led to decreased performance, and a reason for
this might be that we did not use a disambiguation technique to determine the
sense of the words in their contexts before adding the synonyms, but added all
synonyms of the words instead. That would be the second most important idea
to investigate in future work.
Of course, this method, even if it gives good classification accuracy, will be
extremely slow: applying a part-of-speech tagger and a disambiguation algorithm
are both time consuming. In the end it might turn out not to have been worth
the effort, if it does not gain many percentage points in the final classification
results.
Secondly, we tried to find a method for combining classifiers trained on
different feature sets in order to increase the final accuracy level. We did not
manage to find such a method, but a disadvantage we had was not having a
balanced set of classifiers, that is, all the classifiers were wrong in the same way,
with similar precision and recall values. We think that with classifiers that even
each other out, we would have a much better chance with the combining methods
we proposed.

Acknowledgments. Research supported by the CNCS, IDEI - PCE project
311/2011, "The Structure and Interpretation of the Romanian Nominal Phrase
in Discourse Representation Theory: the Determiners."

References
1. Chaovalit, P., Zhou, L.: Movie Review Mining: a Comparison between Supervised
and Unsupervised Classification Approaches. In: 38th Hawaii International Con-
ference on System Sciences, HICSS 2005 (2005)
2. Conrad, J.G., Schilder, F.: Opinion mining in legal blogs. In: Proceedings of the
11th International Conference on Artificial Intelligence and Law, ICAIL 2007, pp.
231–236 (2007)
3. Dinu, A.: Short Text Categorization via Coherence Constraints. In: Proc. 13th In-
ternational Symposium on Symbolic and Numeric Algorithms for Scientific Com-
puting, SYNASC 2011, Timisoara, Romania, September 26-29, pp. 247–251 (2011)
4. Feldman, R., Sanger, J.: The Text Mining Handbook - Advanced Approaches in
Analyzing Unstructured Data. Cambridge University Press (2007)
5. Kononenko, I.: Machine learning for medical diagnosis: history, state of the art and
perspective. Artificial Intelligence in Medicine 23(1), 89–109 (2001)
6. Langley, P., Iba, W., Thompson, K.: An Analysis of Bayesian Classifiers. In: Proc.
AAAI 1992, pp. 223–228 (1992)
7. Lewis, D.D.: Naive (Bayes) at Forty: The Independence Assumption in Information
Retrieval. In: Proc. Machine Learning: ECML-1998, 10th European Conference on
Machine Learning, Chemnitz, Germany, April 21-23, pp. 4–15 (1998)
8. Mihalcea, R., Banea, C., Wiebe, J.: Learning Multilingual Subjective Language
via Cross-Lingual Projections. In: Proceedings of the 45th Annual Meeting of the
Association for Computational Linguistics, ACL 2007, Prague, Czech Republic,
June 23-30 (2007)
9. Mihalcea, R., Pulman, S.: Characterizing Humour: An Exploration of Features
in Humorous Texts. In: Gelbukh, A. (ed.) CICLing 2007. LNCS, vol. 4394, pp.
337–347. Springer, Heidelberg (2007)

10. Pak, A., Paroubek, P.: Twitter as a Corpus for Sentiment Analysis and Opinion
Mining. In: Proceedings of the International Conference on Language Resources
and Evaluation, LREC 2010, Valletta, Malta, May 17-23 (2010)
11. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment classification using
machine learning techniques. In: Proceedings of the 2002 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pp. 79–86 (2002)
12. Pang, B., Lee, L.: Opinion Mining and Sentiment Analysis. Foundations and Trends
in Information Retrieval (FTIR) 2(1-2), 1–135 (2007)
13. Perkins, J.: Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing
(2010)
14. Rish, I.: An empirical study of the naive Bayes classifier. IBM T.J. Watson
Research Center (2001),
http://domino.research.ibm.com/comm/research_people.nsf/pages/rish.pubs.html
15. Witten, I.H., Frank, E.: Data mining: practical machine learning tools and tech-
niques. Elsevier (2005)
16. Yessenov, K., Misailovic, S.: Sentiment Analysis of Movie Review Comments, Re-
port on Spring 2009 final project (2009),
http://people.csail.mit.edu/kuat/courses/6.863/
17. Beautiful Soup - HTML/XML parser for Python,
http://www.crummy.com/software/BeautifulSoup/
18. IMDbPY - package for manipulating IMDb data for Python,
http://imdbpy.sourceforge.net/
19. NLTK - Natural Language ToolKit, http://www.nltk.org/
20. PyGTK - library for implementing graphic user interfaces in Python,
http://www.pygtk.org/
21. WebKit - web browser engine, http://www.webkit.org/
