
Implementation of Multinomial Naïve Bayes Algorithm for Sentiment Analysis of Applications’ User Feedback

Gabriella Putri Wiratama
Department of Informatics, Universitas Multimedia Nusantara, Tangerang, Indonesia
gabriella.wiratama@student.umn.ac.id

Andre Rusli
Department of Informatics, Universitas Multimedia Nusantara, Tangerang, Indonesia
andre.rusli@umn.ac.id

Abstract—User feedback can be used as a tool for application developers to find out and understand users’ needs, preferences, and complaints. It is important for developers to identify the problems that arise from user-given feedback, but this is difficult given the amount of feedback received every day: reading and classifying every item manually takes a long time and is very ineffective. To overcome this problem, a sentiment analysis system based on the Multinomial Naïve Bayes classification algorithm was built to determine whether a feedback item has a positive or negative sentiment. The Naïve Bayes algorithm is commonly used for classification because it is simple and effective, and previous research reports that the Multinomial Naïve Bayes algorithm gives the best performance compared to other traditional machine learning algorithms. This study aims to implement the Multinomial Naïve Bayes classification algorithm in a web application and to calculate the accuracy of the class predictions made by the system. Based on the results of several tests, evaluated with a confusion matrix, the model with a 70:30 training-testing split, balanced datasets, and 30% over-sampling of each dataset produces the best performance: 71.6% accuracy, 76.92% precision, 61.73% recall and 68.49% F1 score.

Keywords—Bahasa Indonesia; Multinomial Naïve Bayes; NLP; sentiment analysis; user feedback

I. INTRODUCTION

System objectives, conceptual structures, requirements and assumptions that have been elicited, evaluated, specified and analyzed may need to be changed for a variety of reasons, including defects to be fixed, project fluctuations in terms of priorities and constraints, better customer understanding of the system’s actual features, and so on [1]. As a developer, it is important to anticipate changes that might occur. One way to identify the changes that have to be made is through feedback given by users. User feedback is direct opinion from users who have experienced the app and reflects the instant user experience [2]. Through user feedback, developers can understand users’ needs, preferences, and complaints. The emerging issues detected from user feedback can provide informative evidence for developers in maintaining their apps and scheduling app updates [3].

Identifying issues from user feedback based on its characteristics can be challenging. First, user feedback contains numerous noise words, such as misspelled words, repetitive words and, not least, slang, which is popular among Indonesian youngsters. This makes it difficult to categorize whether the user is satisfied or dissatisfied with the application. Second, for some popular apps, the volume of user feedback generated each day is simply too large to be manually read and analyzed. Third, not all feedback contains useful information for developers; only about a third of user feedback contains information that can directly help developers improve their apps [4].

As a first step towards a comprehensive support tool for analyzing user feedback, this research proposes a tool to automatically classify user feedback according to its overall sentiment. Machine learning classifiers have been widely used for sentiment analysis with good results [5]. Among the various machine learning algorithms, the Naïve Bayes (NB) algorithm is commonly used for classification due to its simplicity and effectiveness [6].

User feedback contains many words and sentences, expressed in various ways, so the feedback is preprocessed before sentiment analysis is conducted. Preprocessing of the user feedback is done with common Natural Language Processing (NLP) techniques. For text written in English, several NLP toolkits are available, such as the Stanford CoreNLP Toolkit [7], OpenNLP [8], or NLTK [9]. For Bahasa Indonesia, at the time our research was conducted, it was still hard to find an integrated NLP toolkit; the available NLP modules for Bahasa Indonesia are separate modules such as a stemmer, POS tagger and syntactic parser. We therefore conducted the text preprocessing activities step by step; details on these activities are explained in the following sections.

This research aims to study, analyze and implement the Multinomial Naïve Bayes algorithm for analyzing the sentiment of application user feedback in Bahasa Indonesia. The paper is organized as follows. In the following section we review works related to ours and present a brief overview of research in sentiment analysis. Section IV describes our experiment and findings in implementing Multinomial Naïve Bayes for sentiment analysis of application user feedback. Section V concludes our work and proposes several future works.
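As an illustration of the step-by-step preprocessing described above, the sketch below shows a minimal cleaning pipeline for one feedback string. It is our own simplification, not code from the system: the stopword list and slang dictionary are tiny examples (real resources for Bahasa Indonesia are far larger), and stemming, which our work performs with the Sastrawi library [19], is omitted for brevity.

```python
import re

# Toy resources for illustration only; real Bahasa Indonesia lists are far larger.
STOPWORDS = {"yang", "dan", "di", "ke", "ini", "itu"}
SLANG_MAP = {"gak": "tidak", "tdk": "tidak", "bgt": "banget"}

def preprocess(feedback):
    """Clean one feedback string into a list of unique tokens."""
    text = feedback.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # remove numbers, punctuation, special chars
    tokens = [SLANG_MAP.get(t, t) for t in text.split()]  # normalize common slang
    tokens = [t for t in tokens if t not in STOPWORDS]    # drop stop words
    seen, unique = set(), []
    for t in tokens:                            # drop duplicate words, keep first occurrence
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique
```

For example, preprocess("Aplikasinya gak bagus!!! http://contoh.example") yields ["aplikasinya", "tidak", "bagus"]; in practice the cleaned tokens would then be stemmed before classification.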
II. RELATED WORK

Sentiment analysis is the method used to enable computers to recognize and classify opinions in large unstructured text datasets through machine learning and computer programming [10]. Sentiment analysis involves classifying opinions into categories like “positive”, “negative” or “neutral”. Many approaches and algorithms have been used for sentiment analysis, most of them machine learning techniques. A comparative analysis of the performance of NB, Support Vector Machine and Maximum Entropy shows that NB classifiers outperform the others [11]. Another comparative study of machine learning classifiers for Twitter sentiment analysis shows that Multinomial NB outperforms the other classifiers [5]. Much research focuses on sentiment analysis of Twitter data because of the simple and easy access to the massive amount of data generated in real time [12, 13, 14].

There is not as much research on sentiment analysis of application user feedback using the NB algorithm, let alone application user feedback in Bahasa Indonesia.

III. METHODOLOGY

Several pieces of literature and existing tools are used in this research. A previous study [15] has pointed out the potential of using Naïve Bayes to classify feedback in Bahasa Indonesia. In our research, in order to classify feedback for sentiment analysis, we utilized a classification method based on the Naïve Bayes algorithm proposed in another study [6], with some adjustments made to accommodate natural language processing in Bahasa Indonesia. Research related to natural language processing in Bahasa Indonesia has also been conducted using other algorithms [16, 17], displaying the potential of natural language processing steps to process texts written in Bahasa Indonesia.

A. Natural Language Processing

Natural language processing (NLP) is a computer science field dealing with human language processing in either text or speech [18]. Preprocessing of feedback includes stemming, tokenization, and the removal of URLs, numbers, punctuation and special characters, stop words and duplicate words. In addition, common slang and spelling mistakes are replaced with the root word. Stemming is done using an open-source stemmer library [19] based on Enhanced Confix Stripping for Bahasa Indonesia.

B. Multinomial Naïve Bayes

The classification method used in our research exploits a novel attribute weighting and feature selection approach based on Multinomial NB. Fig. 1 illustrates the operation flow of the proposed approach. The first method (attribute weighting) calculates the weights of words based on the training set divided into positive and negative classes, while the second method (feature selection) modifies the weights using the average of the weight differences for automatic feature selection.

Fig. 1. The operation flow of the proposed approach [6]

Multinomial NB expands the use of the Naïve Bayes algorithm. It applies NB to multinomially distributed data and is a frequency-based model proposed for text classification, in which word counts are used to represent the data. Methods for enhancing Naïve Bayes performance fall into five categories: structure extension, feature selection, attribute weighting, local learning and data expansion [20]. Among these, Song et al. [6] used an attribute weighting method that assigns a different weight to each attribute, together with a feature selection method that selects a subset of attributes based on the weights. MNB with attribute weighting can be modeled using (1):

c(d) = \arg\max_{c \in C} \left[ \log P(c) + \sum_{i=1}^{m} WT_i \, f_i \log P(w_i \mid c) \right]   (1)

where WT_i is the weight of each word w_i (i = 1, 2, …, m) and f_i is the frequency of w_i in document d.

C. Attribute Weighting

Song et al. [6] proposed an attribute weighting approach similar to GRW. In the proposed approach, the training set is first divided into positive and negative sets, and then the numbers of positive words and negative words are counted separately from each set. The weight WT_{i,c} of each word w_{i,c} in D_c is defined as:

WT_{i,c} = \frac{IGR(c, w_{i,c}) \times m_c}{\sum_{i=1}^{m_c} IGR(c, w_{i,c})}   (2)

where c is the class label (c ∈ {positive, negative}) and m_c is the number of different words in D_c. IGR(c, w_{i,c}), the information gain ratio of w_{i,c}, can be obtained by (3):

IGR(c, w_{i,c}) = \frac{IG(c, w_{i,c})}{H(w_{i,c})}   (3)
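Once the weights WT_i from (2)-(3) and smoothed conditional probabilities P(w_i | c) are available, the decision rule (1) is straightforward to compute. The sketch below is our own illustration, not code from [6]; the dictionary arguments (priors, cond_prob, weights) and their names are assumptions, standing in for values precomputed from the training set.

```python
import math
from collections import Counter

def classify(doc_tokens, priors, cond_prob, weights, classes=("positive", "negative")):
    """Pick the class maximizing eq. (1):
    log P(c) + sum_i WT_i * f_i * log P(w_i | c)."""
    freqs = Counter(doc_tokens)                 # f_i: frequency of each word in the document
    best_class, best_score = None, -math.inf
    for c in classes:
        score = math.log(priors[c])             # log P(c)
        for word, f in freqs.items():
            wt = weights.get(word, 0.0)         # WT_i from eqs. (2)-(3); 0 for unknown words
            p = cond_prob[c].get(word)          # P(w_i | c), e.g. Laplace-smoothed
            if wt and p:
                score += wt * f * math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Words with a zero weight are simply skipped, which is also how the feature selection of Section III-D takes effect at prediction time.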
IG(c, w_{i,c}) and H(w_{i,c}) are the information gain and the entropy information of w_{i,c}, respectively, and can be obtained by (4) and (5):

IG(c, w_{i,c}) = H(c) - H(c \mid w_{i,c})   (4)

H(w_{i,c}) = - \sum_{v} \frac{|D_{v,c}|}{|D|} \log_2 \frac{|D_{v,c}|}{|D|}   (5)

where H(c) is the entropy of D_c, H(c | w_{i,c}) is the conditional entropy of D_c given the word w_{i,c}, and |D_{v,c}| is the size of D_c in terms of the number of w_{i,c} for which v (∈ {0, 0̄}). H(c) and H(c | w_{i,c}) are calculated as:

H(c) = - P(c) \log_2 P(c)   (6)

H(c \mid w_{i,c}) = - \sum_{v} \frac{|D_{v,c}|}{|D|} \sum_{c} P(c \mid v) \log_2 P(c \mid v)   (7)

D. Feature Selection

With the existing attribute weighting approaches, typically meaningless words end up with larger weight values than frequent or important words, because they usually appear infrequently in the training set. Song et al. [6] proposed a feature selection approach that uses the difference between the positive weight and the negative weight of each word to modify the weights of meaningless words. After the weights of all the words in the training set are obtained, the average of the weight differences is calculated, and based on it some weight values are changed to zero. This lets meaningless words be effectively excluded when the class of a test document is predicted. The weight difference WD_i of w_i (i = 1, 2, …, m) is defined as follows:

WD_i = \begin{cases} |WT_{i,p} - WT_{i,n}|, & \text{if } w_i \in D_p \text{ and } w_i \in D_n \\ WT_{i,c}, & \text{otherwise} \end{cases}   (8)

The average of the WD_i values, Avg_{WD}, is computed using (9):

Avg_{WD} = \frac{\sum_{i=1}^{m} WD_i}{m}   (9)

Avg_{WD} is then compared with the weight of every word in each training set D_c. If the weight is higher than the average, it is modified to zero; otherwise, it is kept:

WT_{i,c} = \begin{cases} 0, & \text{if } WT_{i,c} > Avg_{WD} \\ WT_{i,c}, & \text{otherwise} \end{cases}   (10)

Finally, the class label of the test document is predicted from the non-zero-weight words using (1).

E. Laplace Smoothing

The conditional probability P(w | c) in (11) can cause problems because it can produce a probability of zero for documents containing words that have never been encountered before. A common way to overcome this problem is Laplace smoothing. The technique assumes that the training dataset is large enough that adding 1 to each count makes only a negligible difference, while eliminating zero probability estimates. The smoothing is calculated as follows:

P(w \mid c) = \frac{count(w, c) + 1}{count(c) + |V|}   (11)

where count(w, c) is the number of occurrences of word w in class c, count(c) is the total number of words in class c, and |V| is the vocabulary size of the training documents.

IV. RESULT AND DISCUSSION

A. Dataset

In the work conducted in this paper, we use feedback from an existing web-based e-learning system at Universitas Multimedia Nusantara. A total of 627 feedback items were gathered on 8 June 2018, comprising 269 positive and 358 negative feedbacks.

B. Performance Evaluation

Several scenarios were conducted to determine the best configuration for classifying the feedback’s sentiment, with experiments run at two training-testing ratios, 70:30 and 80:20. First, we use the feedback as is, which is imbalanced (Scenario 1). We then try under-sampling (Scenario 2) and over-sampling the dataset (Scenario 3).

In the experiment following Scenario 1, with the imbalanced dataset, the model achieves 50.88% accuracy, 39.93% precision, 29.26% recall and 33.57% F1 score at the 70:30 training-testing ratio. This is probably caused by the large difference in size between the positive and negative classes, with the number of negative feedbacks much higher than the positive ones. In Scenario 2, we balance the dataset by under-sampling the negative feedbacks to match the number of positive feedbacks. The model performs better, with 66.09% accuracy, 66.1% precision, 66.3% recall and 66.06% F1 score at the 70:30 training-testing ratio. By balancing the dataset, the prior probability of each class is levelled. Moreover, reducing the negative feedbacks also reduces the word occurrences in the negative dataset, resulting in more balanced word weighting. However, removing feedback could cause some important information to be lost, especially in negative feedback, which can contain useful information such as bug reports or feature requests.

In Scenario 3, we balance the dataset by randomly resampling data from the positive-class training set to match the number of negative feedbacks, which was 250 items. The model performs better, though not by much, with 67.2% accuracy, 67.97% precision, 67.04% recall and 66.71% F1 score. We also try to resample data in both classes to raise the performance. The resampling is done by adding 10% to 50% of the initial amount of the negative training
dataset (250 items). The result is that the model with 30% over-sampling achieves the best performance, with 71.6% accuracy, 76.92% precision, 61.73% recall and 68.49% F1 score. Based on the experimentation, the dataset contains some ambiguous feedback, which might cause difficulties for the machine in determining the feedback’s overall sentiment. In addition, sentence structure in Bahasa Indonesia is lenient, whether in text or speech, application feedback included. As an example of ambiguous feedback, one user feedback reads “Absensi”, meaning “Attendance” in English. It is unclear whether the user’s sentiment about the application is negative or positive. Moreover, the feedback contains only one word, which is not enough to accurately predict its sentiment. Feedback such as this (containing only one word) appears often in the dataset, thus lowering the performance of the classification model.

TABLE I. MODEL PERFORMANCE (total testing data = 162)

                   Predicted: Positive   Predicted: Negative   Total
  Class: Positive  TP = 50               FN = 31                81
  Class: Negative  FP = 15               TN = 66                81
  Total            65                    97                    162

  Accuracy = 71.6%; Precision = 76.92%; Recall = 61.73%; F1 score = 68.49%

V. CONCLUSION AND FUTURE WORKS

In this research, we used an attribute weighting and feature selection approach based on Multinomial Naïve Bayes to obtain sentiment classifications of application user feedback. With this approach we used prior user feedback from an existing web-based university e-learning system, which was then classified into positive and negative classes.

The sentiment analysis model’s performance was tested and evaluated, with the best result being 71.6% accuracy, 76.92% precision, 61.73% recall and 68.49% F1 score. These results come from the model built by over-sampling both datasets up to 30% more than the initial amount of the negative training set, with a 70:30 training-testing ratio. The conducted research shows the potential of supporting the requirements engineering process in software development by analyzing the sentiment of application user feedback in Bahasa Indonesia using Multinomial NB.

Future works include conducting the experiment on a larger and more representative dataset and employing more advanced natural language processing methods and/or tools. Furthermore, to enhance the proposed system, various N-gram approaches such as bigrams and trigrams could be applied, together with more sophisticated attribute weighting and feature selection. Additionally, semantic understanding could be incorporated to produce more precise sentiment analysis: some positive terms can be used sarcastically to express negative sentiment, and vice versa.

REFERENCES

[1] A. van Lamsweerde, Requirements Engineering: From System Goals to UML Models to Software Specifications. Chichester: John Wiley & Sons Ltd, 2009.
[2] T.-S. Nguyen, H. W. Lauw and P. Tsaparas, "Review Synthesis for Micro-Review Summarization," in Eighth ACM International Conference on Web Search and Data Mining, Shanghai, 2015.
[3] C. Gao, J. Zeng, M. R. Lyu and I. King, "Online App Review Analysis for Identifying Emerging Issues," in International Conference on Software Engineering, Gothenburg, 2018.
[4] N. Chen, J. Lin, S. C. H. Hoi, X. Xiao and B. Zhang, "AR-Miner: Mining Informative Reviews for Developers from Mobile App Marketplace," in 36th International Conference on Software Engineering, Hyderabad, 2014.
[5] H. M. Ismail, S. Harous and B. Belkhouche, "A Comparative Analysis of Machine Learning Classifiers for Twitter Sentiment Analysis," Research in Computing Science, vol. 110, pp. 71-83, 2016.
[6] J. Song, K. T. Kim, B. Lee, S. Kim and H. Y. Youn, "A Novel Classification Approach Based on Naïve Bayes for Twitter Sentiment Analysis," KSII Transactions on Internet and Information Systems, vol. 11, no. 6, pp. 2996-3011, 2017.
[7] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard and D. McClosky, "The Stanford CoreNLP Natural Language Processing Toolkit," in 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, 2014.
[8] "OpenNLP," [Online]. Available: https://opennlp.apache.org/. [Accessed 11 July 2019].
[9] "NLTK," [Online]. Available: http://www.nltk.org/. [Accessed 11 July 2019].
[10] C. Fiarni, H. Maharani and R. Pratama, "Sentiment Analysis System for Indonesia Online Retail Shop Review Using Hierarchy Naive Bayes Technique," in Fourth International Conference on Information and Communication Technologies (ICoICT), Bandung, 2016.
[11] M. Tabra and A. Lawan, "A Comparative Analysis of the Performance of Three Machine Learning Algorithms for Tweets on Nigerian Dataset," The International Journal of E-Learning and Educational Technologies in the Digital Media (IJEETDM), vol. 3, no. 1, pp. 23-30, 2017.
[12] N. A. Vidya, M. I. Fanany and I. Budi, "Twitter Sentiment to Analyze Net Brand Reputation of Mobile Phone Providers," in The Third Information Systems International Conference, Surabaya, 2015.
[13] A. Alamsyah, W. Rahmah and H. Irawan, "Sentiment Analysis Based on Appraisal Theory for Marketing Intelligence in Indonesia's Mobile Phone Market," Journal of Theoretical and Applied Information Technology, vol. 82, no. 2, pp. 335-340, 2015.
[14] I. Zulfa and E. Winarko, "Sentimen Analisis Tweet Berbahasa Indonesia dengan Deep Belief Network" [Sentiment Analysis of Indonesian-Language Tweets with a Deep Belief Network], Indonesian Journal of Computing and Cybernetics Systems, vol. 11, no. 2, pp. 187-198, 2017.
[15] A. Rusli, "Ekstraksi Kebutuhan Aplikasi Berdasarkan Feedback Pengguna Menggunakan Naïve Bayes dan Gamifikasi" [Extracting Application Requirements from User Feedback Using Naïve Bayes and Gamification], Ultimatics: Jurnal Teknik Informatika, vol. 10, no. 1, pp. 34-40, 2018.
[16] S. Hansun and D. D. Sinaga, "Indonesia Text Document Similarity Detection System Using Rabin-Karp and Confix-Stripping Algorithms," International Journal of Innovative Computing, Information and Control, vol. 14, no. 5, pp. 1893-1903, 2018.
[17] S. Hansun and M. Widjaja, "Implementation of Porter's Modified Stemming Algorithm in an Indonesian Word Error Detection Plugin Application," International Journal of Technology, vol. 2, pp. 139-150, 2015.
[18] A. Purwarianti, A. Andhika, A. F. Wicaksono, I. Afif and F. Ferdian, "InaNLP: Indonesia Natural Language Processing Toolkit, Case Study: Complaint Tweet Classification," in International Conference on Advanced Informatics: Concepts, Theory and Application (ICAICTA), Penang, 2016.
[19] Sastrawi, "High Quality Stemmer Library for Indonesian Language (Bahasa)," GitHub repository, 2017.
[20] L. Jiang, H. Zhang and Z. Cai, "A Novel Bayes Model: Hidden Naive Bayes," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1361-1371, 2009.
