Opinion Mining in Albanian: Evaluation of the Performance of Machine Learning Algorithms for Opinions Classification
Abstract:- The volume of opinions given on social media is growing, so we need useful tools to analyze them and use the gained information in the future. Opinion mining is an important research field that analyzes the opinions given by someone about an entity. The most widely used methods in opinion mining are machine learning techniques. Our work aims to evaluate, through experiments, the performance of machine learning techniques in performing opinion classification tasks in the Albanian language. Our approach to opinion mining addresses the problem of classifying text document opinions as positive or negative. Since there is a lack of research on the Albanian language in this field, we conducted an experimental evaluation of fourteen techniques used for opinion mining. We tested the performance of different machine learning algorithms using Weka. We also present conclusions about the best feature combinations for traditional machine learning algorithms.

Keywords:- Opinion Mining, Sentiment Analysis, Machine Learning, Albanian Language.

I. INTRODUCTION

Nowadays, social media has influenced the way people communicate with each other and express their opinions. People spend a considerable part of their day on social media, and the number of opinions posted there about products, services, or various social aspects is increasing day by day. The 2018 "Local Consumer Review Survey" report by BrightLocal indicates that 86% of respondents read online reviews, 50% of 18-34-year-old respondents always read online reviews, and only 5% of them have never read online reviews. Opinion (or online review) analysis helps people in their decisions and helps companies develop strategies to improve and expand their products or services. So, both academia and business are interested in developing tools to analyze and extract information from the huge number of online opinions.

Opinion mining aims to identify information about the sentiment expressed in an online opinion by a person. One of the tasks in opinion mining is opinion classification at the document level based on a pre-defined category of emotional expression. For example, the category can contain two classes: the positive class, where opinions that express a positive opinion are classified, and the negative class, where opinions that express a negative opinion are classified. While extensive research work in opinion mining has concentrated on languages such as English, German, Italian, or Japanese, little work has been done for the Albanian language.

The Albanian language is an independent branch of the Indo-European language family, spoken by around 7 million native speakers. Albanian is the official language of Albania and Kosovo, a second official language in North Macedonia, and a regional language in Ulcinj in Montenegro. It is also spoken in some provinces of southern Italy, in Sicily, Greece, Romania, and Serbia, and by Albanian communities all over the world where Albanians have migrated and live. Studying the Albanian language is interesting not only in linguistics but also in other fields, since it is a unique and separate language. Albanian is a very complex language: its large alphabet contains 36 letters, and it has complex morphological and lexical features.

Here, we present an intensive study of opinion classification based on the opinion's polarity. The classification is done at the document level. An opinion is categorized as positive if it expresses a positive opinion and as negative if it expresses a negative one. Our work consists of evaluating, through experiments, the performance of machine learning approaches for the opinion classification task of opinion mining on in-domain and multi-domain corpora. The Albanian language is not well documented in natural language processing resources, and no annotated corpora are available for opinion mining tasks. For these reasons, we collected text document opinions from online media to create a dataset with opinions classified into two classes, positive and negative. We used different preprocessing tools for better performance. In the following sections, we discuss these in more detail.
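Because the experiments are run in Weka, a natural way to store such a two-class opinion corpus is Weka's ARFF format. The snippet below is only an illustrative sketch: the relation name, attribute names, and the two example opinions are invented for illustration and are not taken from the authors' dataset.

    % Illustrative two-class opinion corpus (invented entries).
    @relation albanian_opinions

    @attribute text string
    @attribute class {positive, negative}

    @data
    'Produkt shume i mire, ia vlen cmimi.',positive
    'Sherbimi ishte i dobet dhe shume i ngadalte.',negative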
In the second preprocessing case, we preprocessed the data using only the first step of the first case: the conversion of the words to lower case, the removal of stop-words, and the removal of special characters, numbers, and punctuation. By the end of this phase, the text documents of the opinions are bags of words without any language structures.

Analyzing the results, we can conclude that the algorithms perform best when they are used to classify opinions in in-domain opinion mining. The best performing algorithm in terms of the average value is Naïve Bayes, and its performance is more stable than that of the other algorithms. We decided to investigate more features that can improve the performance of the algorithms listed in Table 2. To evaluate the performance of the selected algorithms, we follow the steps described below.
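As a concrete illustration of the pipeline described so far, the sketch below shows how such a bag-of-words preprocessing and a 10-fold cross-validation of one of the evaluated algorithms could be set up with the Weka 3.8 Java API. It is a minimal sketch under our own assumptions: the file names (opinions_c1.arff, albanian_stopwords.txt) are hypothetical, and the paper does not specify the exact filter options used.

    import java.io.File;
    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.core.stopwords.WordsFromFile;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class OpinionClassificationSketch {
        public static void main(String[] args) throws Exception {
            // Load the opinion corpus; the class attribute is the last one.
            Instances raw = DataSource.read("opinions_c1.arff"); // hypothetical file
            raw.setClassIndex(raw.numAttributes() - 1);

            // Bag-of-words conversion: lower-case all tokens and remove
            // stop-words from a custom Albanian list; the default tokenizer
            // already splits tokens on whitespace and punctuation.
            StringToWordVector bow = new StringToWordVector();
            bow.setLowerCaseTokens(true);
            WordsFromFile stopwords = new WordsFromFile();
            stopwords.setStopwords(new File("albanian_stopwords.txt")); // hypothetical list
            bow.setStopwordsHandler(stopwords);
            bow.setInputFormat(raw);
            Instances data = Filter.useFilter(raw, bow);

            // 10-fold cross-validation of Naive Bayes Multinomial.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new NaiveBayesMultinomial(), data, 10, new Random(1));
            System.out.printf("Correctly classified: %.2f%%%n", eval.pctCorrect());
        }
    }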
Table 2 Experimental results from papers [13] and [14] in terms of percent of correctly classified instances

Algorithm                    | Average In-domain | Average Multi-domain | Average Total
Naïve Bayes Multinomial      | 83.40             | 79.35                | 80.62
Logistic                     | 81.00             | 77.61                | 78.67
Bayesian Logistic Regression | 80.60             | 77.38                | 78.39
SGD                          | 80.60             | 76.13                | 77.53
SMO                          | 79.40             | 74.70                | 76.17
Random Forest                | 75.40             | 75.03                | 75.15
HyperPipes                   | 83.20             | 70.11                | 74.20
VotedPerceptron              | 76.40             | 70.57                | 72.39
SimpleLogistic               | 71.60             | 65.32                | 67.29
RBFNetwork                   | 75.60             | 62.19                | 66.38
J48                          | 66.80             | 63.43                | 64.49
IBk                          | 61.80             | 61.04                | 61.28
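The feature combinations investigated in Table 3 vary how the text is turned into features. In Weka, such variations are configured on the StringToWordVector filter; the sketch below shows two common choices, TF-IDF weighting and word bigrams, as an assumed example of what such combinations can look like (the paper does not list the exact filter settings used).

    import weka.core.tokenizers.NGramTokenizer;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class FeatureVariants {
        // Variant A: TF-IDF-weighted unigrams.
        static StringToWordVector tfIdfUnigrams() {
            StringToWordVector f = new StringToWordVector();
            f.setLowerCaseTokens(true);
            f.setTFTransform(true);   // log-scaled term frequencies
            f.setIDFTransform(true);  // weighted by inverse document frequency
            return f;
        }

        // Variant B: word unigrams plus bigrams.
        static StringToWordVector uniAndBigrams() {
            StringToWordVector f = new StringToWordVector();
            NGramTokenizer tok = new NGramTokenizer();
            tok.setNGramMinSize(1);
            tok.setNGramMaxSize(2);
            f.setTokenizer(tok);
            return f;
        }
    }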
Table 3 Experimental results per algorithm, dataset, and feature combination, in percent of correctly classified instances (seven feature combinations per dataset; the feature-combination column labels were lost in extraction, and the columns are given in the original order)

Datasets C_1, C_2, C_3 (dataset labels inferred from the continuation header):

Algorithm                  | C_1                  | C_2                  | C_3
NaiveBayesMultinomial      | 87 92 92 93 93 93 93 | 89 95 95 94 97 94 95 | 77 82 82 84 87 86 89
SGD                        | 88 92 92 92 92 93 93 | 84 89 89 89 89 87 87 | 75 83 83 82 82 83 83
RBFNetwork                 | 49 93 93 51 51 95 51 | 88 50 92 97 97 97 97 | 66 90 90 92 92 86 86
BayesianLogisticRegression | 89 91 90 89 89 89 89 | 85 92 91 90 94 91 93 | 77 81 82 86 85 87 85
Logistic                   | 94 92 92 93 93 93 93 | 84 94 94 84 84 87 87 | 66 82 82 90 90 82 82
HyperPipes                 | 92 92 92 91 91 92 92 | 92 93 93 98 98 98 98 | 67 80 80 88 88 82 82
SMO                        | 88 90 90 92 92 91 91 | 82 87 87 82 82 81 81 | 74 82 82 81 81 84 84
Random Forest              | 86 83 81 88 90 90 88 | 76 80 83 67 78 79 77 | 76 79 77 78 79 79 82
VotedPerceptron            | 86 82 82 81 82 80 82 | 76 88 84 87 88 88 83 | 74 79 77 80 78 82 80
SimpleLogistic             | 84 80 80 82 81 80 82 | 74 68 65 70 68 67 65 | 66 69 68 71 68 68 68
J48                        | 87 72 72 70 70 70 70 | 67 57 57 57 57 59 59 | 58 76 76 65 65 65 65
IBk                        | 60 63 63 63 63 67 67 | 65 64 64 65 65 65 65 | 62 60 60 56 56 62 62

Table 3 (continued): datasets C_4, C_5, C_6

Algorithm                  | C_4                  | C_5                  | C_6
BayesianLogisticRegression | 74 72 72 70 72 71 71 | 78 76 80 78 82 77 81 | 77 76 76 76 77.2 75 76
SMO                        | 75 71 71 73 73 74 74 | 78 76 76 79 79 81 81 | 76 72 72 72 71.6 71 71

Table 3 (continued): the remaining rows of the continuation were extracted with a different column grouping (seven values, seven values, and a final column equal to the average of the second group); the dataset labels for these groups could not be recovered

Algorithm                  | Group 1                                 | Group 2                                   | Avg.
NaiveBayesMultinomial      | 74.11 74.22 75.8 72.9 73.9 72.9 73.8    | 81.39 84.15 84.31 85.07 87.04 85.7 86.51  | 84.88
SGD                        | 72.44 72.89 72.9 74.6 74.6 74.2 74.2    | 78.98 81.13 81.13 82.25 82.25 81.12 81.12 | 81.14
RBFNetwork                 | 70.44 71.89 71.9 69.2 69.2 69.8 69.8    | 74.55 74.3 86.3 81.2 81.2 86.31 80.03     | 80.56
BayesianLogisticRegression | 73.89 73.11 73.9 72.2 73.1 72.7 73.1    | 79.13 80.1 80.7 80.12 81.76 80.35 81.13   | 80.47
Logistic                   | 59.44 61.33 61.3 59.6 59.6 60.4 60.4    | 75.86 80.53 80.53 81.14 81.14 80.12 80.12 | 79.92
HyperPipes                 | 50.67 49.67 49.7 49.7 49.7 49.7 49.7    | 74.64 77.47 77.47 81.55 81.55 80.52 80.52 | 79.1
SMO                        | 71.67 72.33 72.3 72 72 71.8 71.8        | 77.81 78.56 78.56 78.66 78.66 79.08 79.08 | 78.63
Random Forest              | 75.56 77.1 75.4 74.7 75 75.1 75.6       | 75.25 77.73 74.43 76.01 77.54 78.5 77.91  | 76.77
VotedPerceptron            | 70.89 71.56 73.3 70.6 70 70.2 68.8      | 74.58 76.68 76.48 76.68 77.94 77.15 77.25 | 76.68
SimpleLogistic             | 68.22 69.78 69.8 68.4 68.4 69.1 69.1    | 70.37 69.08 68.51 70.23 69.66 69.44 69.87 | 69.6
J48                        | 60 62.11 62.1 61.3 61.3 61.3 61.3       | 65.03 64.42 64.42 63.02 63.02 64.88 64.88 | 64.24
IBk                        | 60.67 58.11 58.1 57.8 57.8 58.1 58.1    | 61.44 59.47 59.47 61.17 61.17 62.3 62.3   | 61.05
Table 4 Statistical experiment results per algorithm in terms of percent correct. Naïve Bayes Multinomial (second column) is the baseline of the significance tests: "*" marks a result significantly worse than the baseline and "v" a result significantly better; the last row gives the (v/ /*) counts of significantly better / not significantly different / significantly worse results over the seven datasets.

Dataset | Naïve Bayes Multinomial | SGD    | Bayesian Logistic Regression | HyperPipes | RBFNetwork | Logistic | SMO    | VotedPerceptron | Random Forest | SimpleLogistic | J48    | IBk
C_1     | 92.10                   | 91.4   | 87.6                         | 91.5       | 95.1       | 92.5     | 89.7   | 83.60*          | 85.3          | 77.60*         | 74.30* | 63.20*
C_2     | 96.20                   | 88.30* | 91.7                         | 97.9       | 97.4       | 90.5     | 83.60* | 86.70*          | 80.70*        | 66.60*         | 58.80* | 63.40*
C_3     | 85.30                   | 83.1   | 86.9                         | 87.4       | 90.9       | 84.5     | 80.5   | 78.3            | 79.4          | 68.90*         | 67.50* | 57.40*
C_4     | 88.80                   | 76.50* | 76.70*                       | 93         | 93.3       | 88.2     | 75.00* | 72.90*          | 71.50*        | 65.30*         | 58.90* | 66.60*
C_5     | 87.80                   | 81.1   | 82.5                         | 90         | 93.1       | 86.2     | 80.2   | 76.40*          | 78.8          | 66.10*         | 65.40* | 57.80*
C_6     | 79.06                   | 74.76* | 76.36                        | 61.72*     | 76.34      | 67.84*   | 71.44* | 73.02*          | 77.68         | 67.40*         | 64.28* | 60.12*
C_7     | 73.72                   | 73.44  | 72.33                        | 49.87*     | 68.86*     | 59.11*   | 71.28  | 71.44           | 75.19         | 67.60*         | 61.21* | 58.72*
(v/ /*) | -                       | (0/4/3)| (0/6/1)                      | (0/5/2)    | (0/6/1)    | (0/5/2)  | (0/4/3)| (0/2/5)         | (0/5/2)       | (0/0/7)        | (0/0/7)| (0/0/7)
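For completeness, the loop below sketches how the per-dataset percent-correct figures of Table 4 could be reproduced programmatically. It is a sketch under our own assumptions: it reuses the preprocessed data produced by the earlier sketch, assumes 10-fold cross-validation, and covers only the algorithms that ship with core Weka 3.8 (HyperPipes, RBFNetwork, and BayesianLogisticRegression are distributed as separate Weka packages in recent versions). Significance marks such as those in Table 4 are typically produced with the Weka Experimenter's corrected paired t-test against a chosen baseline learner.

    import java.util.Random;

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.classifiers.functions.Logistic;
    import weka.classifiers.functions.SGD;
    import weka.classifiers.functions.SMO;
    import weka.classifiers.functions.SimpleLogistic;
    import weka.classifiers.functions.VotedPerceptron;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.trees.J48;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;

    public class AlgorithmComparisonSketch {
        // Cross-validates each classifier on the given (already vectorized)
        // dataset and prints the percentage of correctly classified instances.
        static void compare(Instances data) throws Exception {
            Classifier[] algorithms = {
                new NaiveBayesMultinomial(), new SGD(), new Logistic(),
                new SimpleLogistic(), new SMO(), new VotedPerceptron(),
                new RandomForest(), new J48(), new IBk()
            };
            for (Classifier c : algorithms) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));
                System.out.printf("%-22s %6.2f%%%n",
                        c.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }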