ELBERGUI Initiation Recherche


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/338237919

Sentiment analysis for moroccan dialect

Presentation · September 2019


1 author:

Adam el Bergui
Paris Descartes, CPSC

All content following this page was uploaded by Adam el Bergui on 11 September 2020.



Sentiment Analysis for Moroccan Dialect
EL BERGUI ADAM
National School for Computer Science (ENSIAS)
adam.elbergui@um5s.net.ma

Abstract— Given the importance of sentiment analysis, many research works have been devoted to this area. However, most of these studies have focused on English and other Indo-European languages. The studies conducted on Arabic text mining and natural language processing are not commensurate with the wide use of Arabic in social media, owing to the complexity of the language and the absence of free Arabic lexical resources. Studies become even fewer when it comes to Dialectal Arabic, because of its many challenges. To overcome these challenges, we explore sentiment classification for the Moroccan dialect: we propose a supervised approach applied to two different datasets of Moroccan dialect reviews. This approach involves four consecutive tasks: data collection, text pre-processing, feature extraction, and sentiment classification.

Fig. 1. Arabic language types


I. INTRODUCTION
With the rapid development of network and information technology, rich-media applications have become involved in all aspects of our lives. Accordingly, interaction and communication between individuals have become more convenient and frequent: people share their views on different topics through social networks such as Twitter and Facebook, or leave comments and reviews about products on particular websites, which creates a need for mining such data. Sentiment analysis has become a key tool for making sense of that data. It has allowed companies to gain key insights and automate all kinds of processes. Sentiments are very important: whenever we need to make a decision, we want to know the sentiments of others. This is true not only for individuals but also for organizations and governments.
Sentiment analysis examines the problem of studying texts, such as posts and reviews uploaded by users on forums and e-commerce sites, regarding the opinions they express about a product, service, event, person or idea. Existing research has produced numerous techniques for the various tasks of sentiment analysis, including both supervised and unsupervised methods. In the supervised setting, early papers used all types of supervised machine learning methods (Support Vector Machines (SVM), Maximum Entropy, Naïve Bayes, etc.) and feature combinations. Unsupervised methods include various methods that exploit sentiment lexicons, grammatical analysis, and syntactic patterns.
Arabic is one of the six official languages of the United Nations. It is the official language of 27 countries and is spoken by more than 422 million people in the Arab world. On the web, Arabic is ranked the fourth most used language and was the fastest growing during the last five years, with a growth rate of 6091.9% in the number of Internet users [1]. Arabic belongs to the Semitic languages and can be classified into three general types [2]:
• Classical Arabic: CA is identified as the language of the holy Qur’an. It is accorded an elevated status throughout the Muslim world, as it is essential for understanding the Qur’an.
• Modern Standard Arabic: MSA is an artificial language in that it is no one's native language. It is commonly used for formal, literary, and educational purposes across the Arabic-speaking countries.
• Dialectal Arabic: DA is used in informal communication throughout the Arab world. These language varieties not only show considerable variation from one country to another, but also differ from one region to another within the same country. Arabic dialectal varieties are generally classified into five groups [3].

II. CHALLENGES OF ARABIC SENTIMENT ANALYSIS
Arabic is an important language for its historical, cultural and social aspects. However, Arabic did not receive much attention until recently, and it still lags behind because of its many challenges. Some of these challenges are shared with other languages, but many issues and problems are specific to Arabic.
• Data gathering is challenging for the Arabic language, due to the limited sources available to collect it.
• Social media sites often contain misspelled words and transliterated
• Arabizi is a new trend in social media in which Arabic words are represented by Latin letters. Additionally, some Arabic users tend to switch between Arabic and English, making it difficult to detect whether a word is written in Arabizi or English [5]. Dealing with this form of writing has been the subject solely of studies that aim at detecting and converting Arabizi into Arabic [5].
• Idioms are commonly used in Arabic text in social media. These idioms may vary from one dialect to another and may have a reversed meaning in different regions of the Arab world, which creates a challenge for sentiment analysis systems.
• Sarcasm detection is a general issue in sentiment analysis because sarcasm misleads the classification. Detecting sarcasm is a hard process and limited research has been conducted on this issue. Sarcasm has not yet been explored in the Arabic language.

Fig. 2. Geolinguistic classification of the Arabic dialects [2]

Dialectal Arabic shares many challenges with MSA, as DA inherits the same characteristics of being a Semitic language with a complex templatic derivational morphology. In addition to the shared challenges, DA has its own peculiarities, which can be summarized as follows:
• Lack of standard orthography. Many of the words in DA do not follow a standard orthographic system [3].
• Many words do not overlap with MSA as a result of borrowing from other languages; for example, a sentence in the Moroccan dialect may contain words derived from Standard Arabic, the Amazigh dialect “Tamazight”, French, Spanish, and English.
• The regular discourse features of informal texts, such as the use of emoticons and character repetition for emphasis.
• Code-switching is a common phenomenon in multilingual communities, where speakers switch from one language or dialect to another within the same context. Code-switching is very common in Arabic dialects.

III. RELATED WORKS
A considerable amount of previous work has been published on sentiment analysis. These studies mostly report analysis of sentiments in messages in English. Fewer studies have been done on sentiments in Arabic. The available Arabic datasets and lexicons for sentiment analysis are still limited in size, availability and dialect coverage. For instance, the highest proportion of available resources and research is devoted to Modern Standard Arabic. The objective of the present section is to review the most important research studies that have dealt with sentiment analysis in dialectal Arabic.
A sentiment analysis system for Tunisian social network texts was developed in [7]. The authors collected 260 statuses posted by Tunisians. They used SVM and NB as classifiers; for feature extraction they used unigrams, bigrams, trigrams and parts of speech. They showed that the accuracy is high when using bigrams as features. SVM (83.16% accuracy) outperformed NB (81.14% accuracy) when unigrams were used.
An opinion mining system called Ara’a was proposed by Azmi and Alzanin [8]. It targets the Saudi Najdi dialect. The dataset used was a collection of 815 comments on online news articles that were manually annotated using four polarities: strongly positive, positive, negative, strongly negative. All the words in the dataset were removed except those with explicit connotations. A vocabulary was constructed containing all the distinct keywords from the training set. An NB classifier was used along with a revised n-gram for words not found in the vocabulary. The classifier reached an accuracy of 82%.
Through the study of related work done in recent years in the Arabic sentiment analysis domain, we can say that this is a vast domain in which a range of techniques is available for classifying the sentiment of social media data, and in which a lot of research is still ongoing. The accuracy and complexity of the methods depend on the dataset and the number of dataset features.

IV. METHODOLOGY
In this section, we present the methodology used for the task of classifying sentence orientations. It specifies our text models, the datasets used and the classifiers applied. We also detail our pre-processing schemes and the normalization techniques used to deal with the informal nature of dialectal Arabic, especially Moroccan Arabic “Darija”. We also present the measures used to evaluate the performance of sentiment classification. Our methodology can be summarized as follows. First, we describe the datasets used in this work. Second, we apply different pre-processing stages (including normalisation, noise elimination, conversion of emoticons into text and more) to the collected data, which in turn improves the polarity classification performance. Third, we describe the techniques used for text representation. Finally, we classify the Moroccan dialect text using machine learning classifiers and compare the results.

A. Data Collection
We collected two different datasets. The first one is balanced (the number of positive comments is approximately equal to the number of negative ones) and contains comments on different domains; the second one is unbalanced and contains comments on only one domain, politics.
The first dataset was collected by A. Oussous et al. [6] from internet discussion forums and blogs that are hosted in Morocco or extensively used by Moroccans. The dataset is a combination of reviews and comments from Facebook, Twitter, YouTube and Hespress. It is a multi-domain corpus whose text covers a broad vocabulary from the sport, social and politics domains. The collected corpus is called MSAC (Moroccan Sentiment Analysis Corpus). After cleaning, the final MSAC dataset contains about 1,014 positive reviews and 1,022 negative ones written in the Moroccan dialect “Darija”.
The second dataset was gathered from various social media websites such as Facebook, Twitter, YouTube and Hespress, and was annotated manually. It contains only comments related to politics, especially the Moroccan general election of 2016, written in the Moroccan dialect “Darija”. The dataset is called ElecMorocco2016 and contains 3,673 positive comments and 6,581 negative ones.

B. Data pre-processing
Pre-processing is an essential step in SA for Arabic text, especially for dialectal Arabic text because of its unstructured form. Indeed, the posts and texts generated on social media include informal writing, errors, abbreviations, missing punctuation, and disregard for grammatical rules. We therefore need to process unstructured text that lacks grammatical standardization, and to eliminate spelling mistakes and noise. To minimize the effect of these issues and improve the results of SA, we pre-process the Arabic posts before classification with our own text pre-processing scheme, designed to deal with the informal nature of the language. We describe below the different pre-processing tasks, which we implemented using the Python programming language. These tasks are normalization and tokenization, stop word removal and lexicon development.
• Noise reduction: removing all user names (e.g. @username), hashtags (e.g. #topic), URLs (e.g. www.example.com), retweet signs (e.g. RT), punctuation, additional white space, and non-Arabic words and letters from the text.
• Emoji replacement: replacing the emojis encountered in the reviews with their sentiment. To do so, we prepared a list mapping known emojis to their corresponding sentiment (positive or negative). This way we do not lose the sentiment expressed by those emojis.
• Tokenization: splitting texts into a sequence of tokens based on whitespace characters. This step allows us to model a text as a word vector.
• Repeated-letter removal: removing the repeated letters that appear especially in social media to express affirmation and accentuation.
• Stop word removal: this step filters Arabic stop words by removing every token equal to an item from the stop word list.
• Stemming: the process of stemming aims at normalizing word variations by removing prefixes and suffixes and reducing words to their original root. Two types of stemming approaches can be cited: light stemming and root extraction. Root extraction transforms each term to its three-letter root. Light stemming reduces each term by removing its prefixes and suffixes without reducing it to its root. We investigated both types in this research.

C. Features extraction
After the pre-processing task, texts (reviews) were represented as vectors. We used unigrams (bag of words), but we also combined unigrams and bigrams in our experiments to take the order of words in the text into consideration. To get a numerical representation of the text data we used two weighting schemes: TF-IDF (Term Frequency-Inverse Document Frequency) and Term Occurrence.
• The TF-IDF value increases proportionally to the number of times a word (feature) occurs in the text, but is adjusted by the frequency of the feature in the dataset, which reduces the weight of words that appear more frequently in general.
$$d = (w_1, \ldots, w_{|V|})$$
$$w_i = tf_i \times \log(D / df_i)$$
where $tf_i$ is the number of times the $i$-th vocabulary word occurs in the text $d$, $df_i$ is the number of texts that contain that word, and $D$ is the total number of texts.
• The Term Occurrence weighting counts the occurrences of each term in the text.
$$d = (w_1, \ldots, w_{|V|})$$
where $d$ is the text and $|V|$ is the size of the vocabulary of the dataset; $w_i$ is the number of times the $i$-th word occurs in the text.

D. Classification Models
The performance of sentiment analysis is strongly dependent on the applied classifier. Depending on the data size, various model validation techniques can be used. Cross-validation is commonly used for sentiment analysis evaluations. The annotated dataset is split into k equal parts; the first part is treated as the testing data and the rest as training data, and this selection process is repeated for each of the parts. Each part is used exactly once as the testing data.
• Support Vector Machine is a machine learning method based on vector spaces, where the goal is to find a decision boundary between two classes that represents the maximum margin of separation in the training data. SVM can construct a non-linear decision surface in the original feature space by mapping the data instances non-linearly to an inner product space where the classes can be separated linearly with a hyperplane.
• Naive Bayes classifier assumes conditional independence to make the calculations easier; that is, given the class attribute value, the other feature attributes become conditionally independent. This assumption, though unrealistic, performs well in practice and greatly simplifies calculation.
• Logistic regression uses the sigmoid function $h_w(x) = f(x) = 1 / (1 + e^{-w^T x})$ as a learning model, then it optimizes a cost function that measures the likelihood of the data given the classifier’s class probability estimates.
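The TF-IDF weighting of Section IV-C and the k-fold cross-validation split described at the start of this subsection can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the experiments: the function names `tfidf_vectors` and `k_fold_splits`, the whitespace tokenization, and the placeholder reviews are assumptions made for the example, and the folds differ in size by at most one sample when the dataset size is not divisible by k.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Weight each term as w_i = tf_i * log(D / df_i) (formula of Section IV-C)."""
    tokenized = [text.split() for text in texts]   # whitespace tokenization
    total_texts = len(tokenized)                   # D: total number of texts
    doc_freq = Counter()                           # df_i: texts containing term i
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)                # tf_i: Term Occurrence counts
        vectors.append({
            term: count * math.log(total_texts / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices); each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Hypothetical placeholder reviews (not taken from MSAC or ElecMorocco2016).
reviews = ["good good film", "bad film"]
weights = tfidf_vectors(reviews)
folds = list(k_fold_splits(len(reviews), k=2))
```

Note that with this unsmoothed formula a term occurring in every text gets weight zero, since log(D / df_i) = log(1) = 0; library implementations often add smoothing, so their values may differ from this sketch.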
E. Evaluation
The efficiency of each classifier can be measured by calculating the accuracy, precision, recall and F-measure, which are defined as:
• Accuracy: The most common metric for classification, which is the fraction of samples predicted correctly.
$$\text{Accuracy} = \frac{\text{TruePositive} + \text{TrueNegative}}{\text{Total}}$$
• Precision: The fraction of predicted positive events that are actually positive.
$$\text{Precision} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}}$$
• Recall: Also known as sensitivity; it represents the fraction of positive samples predicted correctly.
$$\text{Recall} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}}$$
• F1-Score: The harmonic mean of precision and recall.
$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

V. RESULTS
In this section, we compare the performance of the machine learning classification methods (SVM, Naive Bayes, and Maximum Entropy, i.e. Logistic Regression). For each machine learning technique, different combinations of the pre-processing steps have been applied.
The results obtained for the MSAC dataset show that the best result achieved by SVM is 84.31% accuracy, obtained when using TF-IDF weighting, light stemming and the unigram model. For the NB classifier, the best results are given by TF-IDF weighting, light stemming and the combination of unigrams and bigrams, which achieves 86.02% accuracy. Finally, the best result achieved by the Logistic Regression classifier is 84.56% accuracy, obtained with TF-IDF, light stemming and unigram features. Overall, for the MSAC dataset the best accuracy is obtained with Naive Bayes using TF-IDF, light stemming and the combination of unigrams and bigrams. We also notice that using light stemming and TF-IDF yields the best performance regardless of the algorithm used.

Fig. 3. Results using light stemming and a combination of unigram and bigram on MSAC dataset

The results obtained for the elections dataset show that the best result achieved by SVM is 80.93% accuracy, obtained when using TF-IDF weighting, light stemming and the unigram model. For the NB classifier, the best results are given by Term Occurrence weighting, light stemming and the combination of unigrams and bigrams, which achieves 80.64% accuracy. Finally, the best result achieved by the Logistic Regression classifier is 81.17% accuracy, obtained with TF-IDF, light stemming and unigram features. Overall, for the elections dataset the best accuracy is obtained with Logistic Regression using TF-IDF, light stemming and the unigram model. We also notice that using light stemming yields the best performance regardless of the algorithm used.

Fig. 4. Results using light stemming and a combination of unigram and bigram on Election dataset

VI. CONCLUSIONS
This project aimed at investigating the classification of sentiments for the Moroccan dialect. The first goal of this project was to study the sentiment analysis workflow in detail, describing the most used approaches in this field and presenting some related works on Arabic sentiment analysis. The second goal was to vary some settings in order to discover the ones that lead to the best results. The variables considered were the weighting of terms, the type of stemming and the n-gram word model. In order to compare these variables, we used the three most popular classifiers recognized for their effectiveness in the classification task, namely Naive Bayes, Support Vector Machine and Logistic Regression. Another goal of this project was to compare the effectiveness of these three classifiers on two different datasets for the Moroccan dialect: the first one is MSAC, which covers various topics, and the second one is about the 2016 Moroccan elections.
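For readers who wish to reproduce such a comparison, the measures of Section IV-E (accuracy, precision, recall and F1-score) can be computed directly from the confusion-matrix counts. The following Python sketch is illustrative only and is not the code used in this work; the function name `classification_metrics`, the label strings and the toy labels are assumptions, and ratios with an empty denominator are reported as 0.0 by convention.

```python
def classification_metrics(true_labels, predicted_labels, positive="positive"):
    """Compute accuracy, precision, recall and F1 for a binary sentiment task."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (e.g. no positive predictions at all).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only (not results from the paper).
gold = ["positive", "positive", "positive", "negative"]
pred = ["positive", "positive", "negative", "negative"]
scores = classification_metrics(gold, pred)
```

On an unbalanced corpus such as the elections dataset, accuracy alone can be dominated by the majority class, which is one reason to report precision, recall and F1 alongside it.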
VII. FUTURE WORKS
Future work will involve the investigation of other methods, in particular deep learning approaches. To use deep learning approaches we would need to increase the amount of data, because deep learning models need large datasets. We also aim to investigate few-shot learning, transfer learning and data augmentation models for text.
REFERENCES
[1] N. Boudad, R. Faizi, R. Oulad Haj Thami, R. Chiheb, Sentiment analysis in Arabic: A review of the literature, Ain Shams Eng. J. (2017, in press). https://doi.org/10.1016/j.asej.2017.04.007
[2] Dialectal Arabic Processing Using Deep Learning, inaugural dissertation for the degree of Doctor of Philosophy (Dr. phil.), Faculty of Philosophy, Heinrich-Heine-Universität Düsseldorf.
[3] N. Y. Habash, Introduction to Arabic natural language processing, Synthesis Lectures on Human Language Technologies 3.1, 2010, pp. 1–187.
[4] N. Al-Twairesh, H. Al-Khalifa, A. Al-Salman, in: 2014 IEEE/ACS 11th International Conference on Computer Systems and Applications (AICCSA), IEEE, 2014, pp. 148–155.
[5] R.M. Duwairi, R. Marji, N. Sha’ban, S. Rushaidat, Sentiment analysis in Arabic tweets, presented at the 5th International Conference on Information and Communication Systems (ICICS), 2014, pp. 1–6.
[6] A. Oussous, A. Ait Lahcen, S. Belih, Improving Sentiment Analysis of Moroccan Tweets Using Ensemble Learning, 2018.
[7] A.S. Ghadeer, I. Aljarah, H. Alsawalqah, Enhancing the Arabic sentiment analysis using different preprocessing operators, New Trends Inf. Technol., 113 (April 2017), pp. 113–117.
[8] S.R. El-Beltagy, A. Ali, Open issues in the sentiment analysis of Arabic social media: a case study, presented at the 9th International Conference on Innovations in Information Technology (IIT), 2013, pp. 215–220.
