Topic
Sentiment analysis of Arabic comments from online newspaper readers
Class of 2017/2018
Acknowledgements
Before anyone, I thank ALLAH almighty, who guided me and gave me the strength and courage to
complete this research work.
I would like to express my sincere thanks to my supervisor, Dr. Sadik Bessou, for his continuous
encouragement and support, and for giving me the freedom to develop my own ideas.
I am very thankful to my many friends for their support, encouragement, help, listening, and the
stimulating discussions during this thesis as well as before it: Dehel Amel, Boudissa Roukia, Kedad Walid,
Bahmed Assia and Sarri Racha.
Moreover, I wish to convey my gratitude to all the people who have helped me directly or indirectly during
the last five years, and especially to my friend Diboune Nadia.
A special thanks and gratitude goes out to my lovely family for their help, wise counsel and affection.
Finally, thanks to all those who participated in this work and made it possible.
Abstract
Information technologies have firmly entered our lives, and it is impossible to imagine life without gadgets
or the Internet. Today, sentiment analysis is one of the fastest growing research areas in computer science.
Within the last couple of years, sentiment analysis in Arabic has gained considerable interest from the
research community, because it can help analyse trending topics such as political crises, elections and
disasters, and predict them before they occur.
In this thesis, we present the details of collecting and constructing a large dataset of Arabic comments.
The techniques used in cleaning and pre-processing the collected dataset are explained. Sentiment can be
divided into three classes: positive, negative and neutral.
We have studied six text classification algorithms: Multinomial Naive Bayes, Support Vector Machines,
Random Forest, Logistic Regression, Multi-layer Perceptron and k-Nearest Neighbors. The application
of these algorithms revealed that the Naïve Bayes algorithm performs well for the classification of texts
with small training data, and we were able to achieve an accuracy of 85.57% on two-class classification and
65.64% on three-class classification.
Keywords: Natural language processing, Sentiment analysis, Arabic text classification, polarity classification, feature selection, Naïve Bayes, Machine learning.
Table of Contents
1 INTRODUCTION 11
1.1 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2 THEORETICAL BACKGROUND 14
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Natural language processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Some Theoretical Developments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.3 Natural Language Text Processing Systems . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.4 Applications of natural language processing . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.5 Levels of Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Machine Learning process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Machine Learning types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.4 Applications of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.5 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.5.1 Naïve Bayesian Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.5.2 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.5.3 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2.3.1 CountVectorizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.2.3.2 TfidfVectorizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2.4 N-grams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2.5 Final Data representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2.6 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.3 Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.1 Python . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.2 Jupyter Notebook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.3 Pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.4 Matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.5 Seaborn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3.6 Scikit-learn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.3.7 NLTK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
List of Figures
4.1 The Top 10 Spoken Languages in the World with their Corresponding Percentages (2014) . . . . 41
4.2 The Top Ten Languages on the Internet in Millions of Users (2017) . . . . . . . . . . . . . . . . 41
List of Tables
Chapter 1
INTRODUCTION
The rapid growth of the Internet and computer technologies has led to billions of electronic text
documents being created, edited, and stored digitally. This situation poses a great challenge to the public,
specifically to computer users, in searching, organizing, and storing these documents.
Text classification therefore occupies a considerable place in the set of data analysis tools. It consists of
assigning a textual document to a class. Such a task is indispensable both for information retrieval and for
knowledge extraction.
Sentiment Analysis (SA), also known as opinion mining, is the process of classifying the emotion conveyed
by a text, for example as negative, positive or neutral. It is one of the most vital research fields of Natural
Language Processing (NLP) nowadays. Information gained from applying sentiment analysis has many
potential uses: for instance, to help marketers evaluate the success of an advertising campaign, to identify
how different demographics have received a product release, to predict user behavior, or to forecast election
results.
This project aims to implement sentiment analysis for Arabic text. Arabic is a very rich language, with
a structure very different from, and more difficult than, that of other languages; it is therefore important
to build an Arabic text classifier. It is also the fastest growing language on the web, ranked fourth among
languages used online.
In our context, we will use machine learning (ML) to build a classifier that can learn from a set of
pre-classified examples. We are interested in multiple classification techniques, namely Support Vector
Machine, Naive Bayes, Logistic Regression, Multi-layer Perceptron and k-Nearest Neighbors. We will
compare them in terms of accuracy and processing time.
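The planned comparison can be outlined in code. The following is a rough sketch only, assuming scikit-learn; the English comments and labels below are hypothetical stand-ins for the Arabic dataset used in the thesis:

```python
# Rough sketch of the planned comparison using scikit-learn; the English
# comments below are hypothetical stand-ins for the Arabic dataset.
import time
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["great article", "very good news", "excellent report",
               "bad article", "very poor news", "terrible report"]
train_labels = [1, 1, 1, 0, 0, 0]          # 1 = positive, 0 = negative
test_texts = ["good report", "terrible news"]
test_labels = [1, 0]

vec = CountVectorizer()                    # bag-of-words features
X_train = vec.fit_transform(train_texts)
X_test = vec.transform(test_texts)

results = {}
for name, clf in [("MultinomialNB", MultinomialNB()),
                  ("LinearSVC", LinearSVC()),
                  ("LogisticRegression", LogisticRegression()),
                  ("kNN", KNeighborsClassifier(n_neighbors=3))]:
    start = time.perf_counter()
    clf.fit(X_train, train_labels)         # training time ...
    elapsed = time.perf_counter() - start
    acc = clf.score(X_test, test_labels)   # ... and test accuracy
    results[name] = (acc, elapsed)
```

Each entry of `results` then holds the (accuracy, training time) pair used for the comparison.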
Chapter 2 « THEORETICAL BACKGROUND »: Describes different methods that can be applied to the
sentiment classification task. In this chapter, we give an overview of natural language processing and of
machine learning and its basics, present its types, and explain common machine learning algorithms.
Chapter 4 « ARABIC TEXT CLASSIFICATION »: This chapter focuses on the Arabic language and
its characteristics, including the importance of Arabic text classification and its challenges.
Chapter 6 « RESULTS AND DISCUSSION »: Contains the experiments and results achieved in this
research work. Based on the discussed results, an analysis is performed in order to determine which of the
algorithms performs better for comment classification.
Chapter 7 « CONCLUSION AND FUTURE WORK »: Conclusion and future research perspectives are
presented in this chapter.
Chapter 2
THEORETICAL BACKGROUND
2.1 Introduction
In this chapter, we present some generalities concerning natural language processing: its characteristics,
its different applications and its levels of language. We then give a brief overview of machine learning and
its algorithms.
2.2 Natural language processing

2.2.1 Definition
Natural Language Processing (NLP) is an area of research and application that explores how computers
can be used to understand and manipulate natural language text or speech to do useful things. NLP is
essentially multidisciplinary: it is closely related to linguistics.
NLP researchers aim to gather knowledge on how human beings understand and use language so that
appropriate tools and techniques can be developed to make computer systems understand and manipulate
natural languages to perform the desired tasks. [1]
Natural Language Processing can basically be classified into two parts, Natural Language Understanding
and Natural Language Generation, which involve the tasks of understanding and generating text, respectively (Figure 2.1). [2]
2.2.2 Some Theoretical Developments

The most recent theoretical developments that have influenced research in NLP can be grouped into four
classes:
— Statistical and corpus-based methods in NLP;
— Recent efforts to use WordNet for NLP research;
— The resurgence of interest in finite-state and other computationally lean approaches to NLP;
— The initiation of collaborative projects to create large grammars and NLP tools. [1]
2.2.3 Natural Language Text Processing Systems

Manipulation of texts for knowledge extraction, for automatic indexing and abstracting, or for producing
text in a desired format, has been recognized as an important area of research in NLP.
This is broadly classified as the area of natural language text processing that allows structuring of large
bodies of textual information with a view to retrieving particular information or to deriving knowledge
structures that may be used for a specific purpose.
Automatic text processing systems generally take some form of text input and transform it into an output
of some different form. [1]
2.2.4 Applications of natural language processing

There are many important applications of natural language processing that answer real-world challenges.
The following list is not exhaustive:
— Sentiment Analysis: Identifying sentiments and opinions stated in a text.
— Better search engines.
— Speech recognition: Recognizing a spoken language and transforming it into text.
2.2.5 Levels of Language
— Morphology: the structure of words. For instance, unusually can be thought of as composed of
a prefix un-, a stem usual, and an affix -ly. Composed is compose plus the inflectional affix -ed: a
spelling rule means we end up with composed rather than composeed.
— Syntax: the way words are used to form phrases.
— Semantics: compositional semantics is the construction of meaning (generally expressed as logic)
based on syntax. This is contrasted with lexical semantics, i.e., the meaning of individual words.
— Pragmatics: meaning in context, although linguistics and NLP generally have very different
perspectives here. [3]
2.3.1 Definition
Lets take some definition of Machine Learning: In 1959 Arthur Samuel defined machine learning as:”Machine
Learning is the field of study that gives computers the ability to learn without being explicitly programmed.”
[5]
In 1997, Tom Mitchell defined machine learning as follows: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P if its performance at tasks in
T, as measured by P, improves with experience E."
Basically, machine learning is the ability of a computer to learn from experience. Experience is usually
given in the form of input data. Looking at this data, the computer can find dependencies in the data that
are too complex for a human to discern.
Machine learning can be used to reveal a hidden class structure in unstructured data, or it can be
used to find dependencies in structured data to make predictions. [6]
2.3.2 Machine Learning process

1. Collecting data: This can be raw data from Excel, Access, text files, etc. This step (gathering past
data) forms the foundation of future learning: the better the variety, density and volume of relevant
data, the better the learning prospects for the machine.
2. Preparing the data: Any analytical process depends on the quality of the data used. One needs
to spend time determining the quality of data and then taking steps for fixing issues such as missing
data.
3. Training a model: This step involves choosing the appropriate algorithm and representation of
the data in the form of a model. The cleaned data is split into two parts, train and test; the first part
(training data) is used for developing the model, and the second part (test data) is used for testing
the accuracy of the model.
4. Evaluating the model: To test the accuracy, the second part of the data (test data) is used. This
step determines the precision of the choice of algorithm based on the outcome. A better test of a
model's accuracy is to examine its performance on data that was not used at all during model
building.
5. Improving the performance: This step might involve choosing a different model altogether or
introducing more variables to augment efficiency. That is why a significant amount of time needs
to be spent on data collection and preparation.
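Steps 3 and 4 above can be sketched in a few lines of plain Python. The data here is a deliberately trivial, hypothetical set of labelled items, and the "model" is just a majority-class baseline, which is enough to show the split-then-evaluate pattern:

```python
# Trivial illustration of steps 3 and 4: split labelled items 80/20 into train
# and test parts, "train" a majority-class baseline, then measure test accuracy.
labelled = [("text%d" % i, "pos" if i % 3 else "neg") for i in range(10)]

split = int(0.8 * len(labelled))            # 80% train / 20% test
train, test = labelled[:split], labelled[split:]

# "Training": the model just memorises the most frequent training label
train_labels = [lab for _, lab in train]
majority = max(set(train_labels), key=train_labels.count)

# Evaluation: fraction of test examples labelled correctly
accuracy = sum(1 for _, lab in test if lab == majority) / len(test)
print(majority, accuracy)  # -> pos 0.5
```

A real pipeline would replace the majority baseline with one of the classifiers discussed below, but the train/test discipline is the same.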
2.3.3 Machine Learning types

Machine learning is divided into three categories: supervised learning, unsupervised learning, and reinforcement learning.
— Supervised Learning (Predictive): is based on learning a model from labeled training data D
that allows us to make predictions about future data. The main drawback of this type of learning is
the necessity of having lots of labeled data, which is not always available. [7]
— Unsupervised Learning (Descriptive): aims to design a model structuring the information. The
difference here is that the behaviors (or categories or classes) of the learning data are not known;
the model must discover this structure on its own.
2.3.4 Applications of Machine Learning

Machine learning has proven itself to be the answer to many real-world challenges. In this section, we
discuss some applications of machine learning with examples:
— Computer-Aided Diagnosis: Pattern recognition algorithms used in computer-aided diagnosis can
assist doctors in interpreting medical images in a relatively short time. A few examples are:
pathological brain detection, breast cancer, colon cancer, bone metastases, coronary artery disease,
congenital heart defect, and Alzheimer's disease. [9]
— Speech Recognition: The field of speech recognition aims to develop methodologies and technologies
that enable computers to recognize and translate spoken language into text. This technology can help
people with disabilities. With the passage of time, the accuracy of speech recognition engines is
increasing.
— Security applications: access control systems use face recognition as one of their components; that
is, given a photo (or video recording) of a person, they recognize who this person is.
— Text Mining: it has been observed that most enterprise-related information is stored in text
format. The challenge is how to use this unstructured data or text.
Text mining is helpful in a number of applications including:
— Business intelligence
— National security
— Life sciences
2.3.5 Machine Learning Algorithms

There are many machine learning algorithms. Some are used for supervised learning, e.g., Decision
Trees, Rule-Based Classifiers, Naïve Bayesian Classification, Neural Networks, and Support Vector Machines.
Others are used for unsupervised learning, e.g., k-Means Clustering, Principal Component Analysis,
and Hidden Markov Models. The choice of algorithm depends strongly on the task to be solved
(classification, estimation of values, etc.). In this thesis we detail some supervised learning algorithms.
2.3.5.1 Naïve Bayesian Classification

1. Definition Naive Bayes is one of the simplest existing classifiers. It is based on Bayes' theorem
and is particularly suited to inputs of high dimensionality. Parameter estimation for naive
Bayes models uses the method of maximum likelihood. This classifier is a powerful algorithm used
for real-time prediction, text classification / spam filtering, and recommendation systems. [10]
2. Bayes theorem
p(c_j | d) = p(d | c_j) p(c_j) / p(d)    (2.1)
— p(c_j | d) = probability of instance d being in class c_j. This is what we are trying to compute.
— p(d | c_j) = probability of generating instance d given class c_j. We can imagine that being in class
c_j causes you to have feature d with some probability.
— p(c_j) = probability of occurrence of class c_j. This is just how frequent the class c_j is in our
database.
— p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for
all classes.
Naive Bayes considers some set of candidate classes C and tries to find the most probable class c in C
given the observed data D. The most probable class is called the maximum a posteriori (MAP) class [6]:
c_MAP = argmax_{c ∈ C} P(D | c) P(c)    (2.2)
The Naive Bayes classifier also assumes conditional independence to make the calculations easier;
that is, given the class attribute value, the other feature attributes become conditionally independent.
This assumption, though unrealistic, works well in practice and greatly simplifies calculation. [6]
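The MAP decision rule with the conditional independence assumption can be sketched in plain Python. The tiny corpus below is a hypothetical stand-in for real comments, and add-one (Laplace) smoothing is assumed to avoid zero probabilities for unseen word/class pairs:

```python
# Plain-Python sketch of a multinomial Naive Bayes classifier with Laplace
# smoothing; the tiny corpus below is a hypothetical stand-in for real comments.
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (text, label). Returns class priors, per-class word
    counts, and the vocabulary."""
    labels = [lab for _, lab in docs]
    priors = {lab: labels.count(lab) / len(labels) for lab in set(labels)}
    counts = {lab: Counter() for lab in priors}
    for text, lab in docs:
        counts[lab].update(text.split())
    vocab = {w for c in counts.values() for w in c}
    return priors, counts, vocab

def predict_nb(text, priors, counts, vocab):
    """MAP decision (equation 2.2): argmax_c log P(c) + sum_w log P(w|c),
    with add-one (Laplace) smoothing of the word likelihoods."""
    scores = {}
    for lab, prior in priors.items():
        total = sum(counts[lab].values())
        score = math.log(prior)
        for w in text.split():
            if w in vocab:  # unseen words are simply ignored
                score += math.log((counts[lab][w] + 1) / (total + len(vocab)))
        scores[lab] = score
    return max(scores, key=scores.get)

docs = [("good great good", "positive"), ("great product", "positive"),
        ("bad poor bad", "negative"), ("poor product", "negative")]
priors, counts, vocab = train_nb(docs)
prediction = predict_nb("good product", priors, counts, vocab)  # -> "positive"
```

Logs are summed instead of multiplying probabilities, which avoids numeric underflow on long documents without changing the argmax.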
Advantages
Disadvantages
2.3.5.2 Support Vector Machine

1. Introduction A Support Vector Machine (SVM) is a classifier which separates the various classes
of data by means of a hyper-plane. An SVM is built from the training data and outputs the
hyper-plane used to classify the test data. The SVM model tries to find the region in the data space where
the different classes of data can be most widely differentiated, and draws a hyper-plane there.
There is more than one way to draw a hyper-plane, so which one is optimal? The optimal hyper-plane
is the one which maximizes the margin between the classes. A hyper-plane need not always be linear:
an SVM can also work as a non-linear classifier using a technique known as the kernel trick.
2. Basic concepts
— Hyper-plane Take the case of binary classification (i.e., examples are divided into two
classes). A separating hyper-plane is a hyper-plane which separates the two classes, in particular it
separates their learning points. The class of hyper-planes considered is w^T x + b = 0, with
weight vector w and bias b (i.e., the linear hyper-planes). (Figure 2.8)
— Support Vectors To determine the separating hyper-plane, an SVM needs to use only the
closest points from the total set (i.e., the points on the boundary between the two data classes);
these points are called support vectors. (Figure 2.8)
— Margin We can define a unique separating hyperplane using the margin: the minimal Euclidean
distance of a pattern to the decision boundary. The hyperplane that maximises this margin
m = 2 / ||w|| is called the maximal margin hyperplane. [11] (Figure 2.8)
3. Linear and Nonlinear For SVM models, we find two cases: linearly and non-linearly separable.
— Linearly Separable Case This case is simpler, because we can easily find a linear classifier. But
in most real problems there is no linear separation possible between the data, and the
maximum-margin classifier cannot be used, because it only works if the classes of learning data are
linearly separable. [12]
Suppose we have a binary classification problem. The SVM works to put a hyperplane in
the middle of the two classes, so that the distance to the nearest positive or negative example is
maximized. The SVM discriminant function has the form [13]
f(x) = w^T x + b    (2.3)

where w is the parameter vector, and b is the bias or offset scalar. The classification rule is
sign(f(x)), and the linear decision boundary is specified by f(x) = 0. The labels are y ∈ {−1, 1}; if f
separates the data, the geometric distance between a point x and the decision boundary is:

y f(x) / ||w||    (2.4)
Given training data (x_i, y_i), i = 1...n, we want to find a decision boundary (w, b) that maximizes
the geometric distance of the closest point, i.e.:

max_{w,b} min_{i=1...n} y_i (w^T x_i + b) / ||w||    (2.5)

The above objective is difficult to optimize directly. The optimization of equation 2.5 is actually
over equivalence classes of (w, b) up to scaling; therefore, we can reduce the redundancy by
fixing the scale so that the closest point satisfies y_i (w^T x_i + b) = 1.
This converts the complex problem of equation 2.5 into a constrained but simpler problem:

max_{w,b} 1/||w||    subject to y_i (w^T x_i + b) ≥ 1,  i = 1...n    (2.8)

Maximizing 1/||w|| is equivalent to minimizing (1/2)||w||^2; this is called the decision boundary
problem, and it can be solved through the following constrained optimization problem:

min_{w,b} (1/2)||w||^2    subject to y_i (w^T x_i + b) ≥ 1,  i = 1...n    (2.9)
— Linearly Non-Separable Case To overcome the limits of the linearly separable setting, SVMs
change the data space: a non-linear transformation of the data can allow linear separation of the
examples in a new space, called the "re-description space". A non-linear separation problem in the
representation space is thus transformed into a linear separation problem in a re-description space
of larger dimension. This non-linear transformation is performed via a kernel function. [13] [12]
However, many real datasets are not linearly separable, and then the previous problem of equation
2.9 has no solution. To handle non-separable data, we relax the constraints by making the
inequalities easier to satisfy. This is done with slack variables ξ_i ≥ 0, one for each constraint:

y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1...n    (2.10)
Now a point x_i can satisfy the constraint even if it is on the wrong side of the decision boundary, as
long as ξ_i is large enough. Of course, all constraints can be trivially satisfied this way. To prevent
this, we penalize the sum of the ξ_i and arrive at the new primal problem:

min_{w,b,ξ} (1/2)||w||^2 + C Σ_{i=1}^{n} ξ_i
subject to y_i (w^T x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1...n    (2.11)
where C is a weight parameter, which needs to be set carefully (e.g., by cross-validation¹).
We now look at the dual problem of equation 2.11, obtained by introducing Lagrange multipliers.
We arrive at:

max_α − (1/2) Σ_{i,j=1}^{n} α_i α_j y_i y_j x_i^T x_j + Σ_{i=1}^{n} α_i    (2.12)

The simplest kernel is the linear kernel:

K(x, x′) = x^T x′    (2.13)

RBF (radial basis function) kernels have the form

K(x, x′) = f(d(x, x′))    (2.14)

where d is a metric over X and f is a function. An example of an RBF kernel is the Gaussian kernel:

K(x, x′) = exp(−||x − x′||² / (2σ²))    (2.15)
1. Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two
segments: one used to learn or train a model and the other used to validate the model.
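The linear and Gaussian kernels above can be written directly as small functions. This is a plain-Python sketch; sigma is the width parameter of the Gaussian:

```python
# Plain-Python sketches of the linear kernel (2.13) and the Gaussian RBF
# kernel (2.15); sigma is the width parameter of the Gaussian.
import math

def linear_kernel(x, xp):
    """K(x, x') = x^T x'"""
    return sum(a * b for a, b in zip(x, xp))

def gaussian_kernel(x, xp, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq_dist / (2 * sigma ** 2))
```

Note that the Gaussian kernel equals 1 when the two points coincide and decays toward 0 as they move apart, which is what makes it behave like a similarity measure.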
— Regularization
The regularization parameter (often termed the C parameter in Python's sklearn library) tells the
SVM optimization how much you want to avoid misclassifying each training example.
For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane
does a better job of getting all the training points classified correctly. Conversely, a very small
value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if
that hyperplane misclassifies more points [10]. The figure below shows an example of two different
regularization parameter values.
Figure 2.11: Left: low regularization value, right: high regularization value
— Gamma
The gamma parameter defines how far the influence of a single training example reaches, with low
values meaning 'far' and high values meaning 'close'. In other words, with low gamma, points far
away from the plausible separation line are considered in the calculation of the separation line,
whereas high gamma means only the points close to the plausible line are considered [10].
— Margin
The margin is a very important characteristic of an SVM classifier; at its core, an SVM tries to
achieve a good margin.
A good margin is one where the separation is large for both classes, allowing the points to be in
their respective classes without crossing to the other class; the figure below gives a visual example
of good and bad margins [10].
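The C and gamma parameters discussed above map directly onto scikit-learn's `SVC`. The following sketch assumes scikit-learn and uses two hypothetical, well-separated 2-D clusters:

```python
# Sketch (assuming scikit-learn): an RBF-kernel SVC on two hypothetical,
# well-separated 2-D clusters; C and gamma are the knobs discussed above.
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [5, 5], [5, 6]]           # cluster 0 near origin, cluster 1 far
y = [0, 0, 1, 1]

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # C: regularization, gamma: reach
clf.fit(X, y)

pred = clf.predict([[0, 0.5], [5, 5.5]])       # one point inside each cluster
```

Raising C would trade margin width for fewer training misclassifications; raising gamma would shrink each support vector's region of influence.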
2.3.5.3 Artificial Neural Networks
Neural networks were originally called perceptrons. In 1943, McCulloch and Pitts introduced the
artificial neuron, the basic unit of Artificial Neural Networks, on which Rosenblatt's original perceptron
was later built. A single artificial neuron with n inputs, as depicted in the figure above, can be represented
by the formula that follows. The weights w allow each of the n inputs to contribute a greater or lesser
amount to the sum of input signals. The activation function f transforms the neuron's combined input
into an output signal; a common choice is the sigmoid function:
y = g(x) = 1 / (1 + e^(−x))    (2.18)
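The sigmoid activation and the weighted-sum neuron it serves can be sketched as:

```python
# Sketch of equation (2.18) and of a single artificial neuron that feeds its
# weighted input sum through this activation.
import math

def sigmoid(x):
    """g(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + math.exp(-x))

def neuron(inputs, weights, bias=0.0):
    """Weighted sum of the n inputs, passed through the activation g."""
    return sigmoid(sum(w * v for w, v in zip(weights, inputs)) + bias)
```

For instance, g(0) = 0.5, and a neuron with weights (2, −2) applied to inputs (1, 1) also outputs 0.5, since its combined input is 0.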
2.4 Conclusion
In this chapter, we introduced some generalities about natural language processing and machine
learning: basic definitions, the types of learning, and a few common algorithms. The next chapter
discusses the background necessary for sentiment analysis.
Chapter 3
BACKGROUND ON SENTIMENT ANALYSIS (SA)
3.1 Introduction
Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions,
sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services,
organizations, individuals, issues, events, topics, and their attributes. It represents a large problem space.
3.2 Definition
Sentiment analysis is a series of methods, techniques, and tools for detecting and extracting subjective
information, such as opinions and attitudes, from language. [20]
Traditionally, sentiment analysis has been about opinion polarity, i.e., whether someone has a positive,
neutral, or negative opinion towards something. [21]
The growth of sentiment analysis coincides with that of social media; in fact, sentiment analysis is now
right at the center of social media research. Hence, research in sentiment analysis not only has an important
impact on NLP, but may also have a profound impact on management sciences, political science, economics,
and the social sciences, as they are all affected by people's opinions. [22]
much as personal recommendations. According to a 2011 study, 74% of customers' confidence is based on
online personal recommendation reviews; the figure was 60% in a 2012 study and 57% in a 2013 study.
But this percentage increased in a 2014 study: 94% of customers trust online sentiment reviews. [23]
3.4 Evolution of SA
We have seen a massive increase in the number of papers focusing on sentiment analysis and opinion
mining in recent years. According to our data, nearly 7,000 papers on this topic have been published
and, more interestingly, 99% of them have appeared after 2004, making sentiment analysis one of the
fastest growing research areas. [24]
The number of papers in sentiment analysis is increasing rapidly as can be observed from Figure 3.1. [25]
3.6 Three Major Techniques of SA

The machine learning approach trains an algorithm on a predefined dataset before applying it to the
actual dataset. Machine learning techniques first train the algorithm on particular inputs with known
outputs so that it can later work with new, unknown data.
The rule-based approach defines rules for obtaining the opinion: every sentence in each document is
tokenized, and every token, or word, is tested for its presence in a sentiment database. If the word is present
and has a positive sentiment, a +1 rating is applied to it. Every post starts with a neutral score of zero;
the final polarity is positive if the score is greater than zero, or negative if the score is less than zero [27].
The output of the rule-based approach is then checked for correctness. If the input sentence contains a word
that is not present in the database but can help the analysis (for example, of a movie review), that word
is added to the database. [26]
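A minimal sketch of this rule-based scoring, using an invented two-word-per-class sentiment database rather than a real curated word list:

```python
# Hypothetical sentiment database; a real system would load a curated word list.
SENTIMENT_DB = {"good": +1, "excellent": +1, "bad": -1, "terrible": -1}

def classify_post(text):
    """Score starts at zero; each known word adds its rating; the sign gives polarity."""
    score = 0
    for token in text.lower().split():
        score += SENTIMENT_DB.get(token, 0)   # unknown words contribute nothing
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_post("excellent film with a good story"))  # two positive hits -> positive
```

Words absent from the database leave the score unchanged, which is exactly why the approach described above grows its database with useful unseen words.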
Lexicon-based techniques work on the assumption that the collective polarity of a sentence or document
is the sum of the polarities of its individual phrases or words. This methodology relies on sentiment
dictionaries built for each domain. Each domain lexicon is then replenished with the appraisal
words of the relevant training collection that have the highest weight, calculated by the RF (Relevance
Frequency) strategy. A word-modifier changes (increases or decreases) the weight of the subsequent appraisal
word by a given share. [28]
3.7 SA Applications
Opinions are central to almost all human activities because they are key influences on our behavior.
Whenever we need to make a decision, we want to know others' opinions. In the past, when individuals
needed opinions, they asked friends and family.
With the explosive growth of social media on the Web, individuals and organizations are increasingly
using the content in these media for decision making. [22]
So sentiment analysis has applications in vital sectors. The following points highlight some of these
applications: [29]
— Marketing: Since social media has become a unique platform for customer interactions, sentiment
analysis can easily take marketing to a whole new level. Companies have realized that the emotions
expressed on social media shape their brand's image. Sentiment analysis tools therefore give
marketers a way to measure their effectiveness, and help consumers who are trying to research a
product or a service.
— Politics: Political organizations can derive many valuable uses from fully understanding social
media sentiment. Social media feedback has been used to inform political leaders of potential threats,
problems or issues within their organizations. In addition, sentiment analysis has played an essential
role in predicting elections and in gauging citizens' responses to important issues such as
rising prices and constitutional changes.
— Healthcare: Medical web blogs are all over the internet these days. These blogs cover
medicine and health-care issues such as diseases, medical treatments and medications. Due to the
health-related experiences and medical histories these web pages provide for practitioners and patients,
sentiment analysis tools had to be developed for use in medical fields.
— Finance: Sentiment Analysis can also be used in the financial world. Investors can easily follow their
favorite companies and monitor their sentiment data in real time. With sentiment analysis, business
investors can acquire business news easier and aggregate this information to make better financial
decisions.
3.8 Sentiment Analysis Tasks
Subjectivity classification aims to categorize sentences into two classes, subjective and objective. For
example, "The weather is cold." is classified as an objective sentence, while "I like cold
weather." is identified as a subjective sentence [30].
This approach detects the opinions that are embedded in sentences or documents. An opinionated
sentence is the smallest complete unit that sentiments can be extracted from. Thus, to extract opinions
and determine their polarities, the sentiment words, the opinion holders, and the contextual information are
considered as indications [31].
The goal of this approach is to find out how the public changes its views or opinions over time. Opinion
tracking systems can track opinions in different sources according to various requests. The results of this
approach are very useful for companies, institutes, the concerned public and especially for governments. [31]
The four steps of opinion summarization consist of: detecting subject, retrieving relevant sentences, iden-
tifying the opinion-oriented sentences, and summarizing. The opinion summarization divides the opinionated
Sentiment classification is an area in sentiment analysis which determines the overall polarity of a text,
or a sentence or a feature. It categorizes the opinionated statements into different classes by using a function
that is called classifier. [31]
A comparative opinion compares more than one entity, e.g., "IPhone is better than Blackberry".
A regular opinion has the following basic components:
3. Polarity: denotes the emotion expressed; it can be (positive, negative) or (positive, negative,
neutral).
4. Aspect: the part of the target towards which the sentiment is expressed. [32]
The task of document-level sentiment analysis is to determine whether an opinionated document that
comments on an object expresses an overall positive or negative opinion. For example, a sentiment analysis
system classifies the overall polarity of a customer review about a specific product. [33]
Document level analysis assumes a piece of text expresses sentiment towards a single target. So it is not
applicable to documents in which opinions are expressed on multiple products. [33]
The sentence level of sentiment analysis involves determining whether each sentence expresses a neutral,
positive, or negative opinion.
The subjectivity classification is very important, because it filters out those sentences that contain no
opinions. The sentence level of sentiment classification assumes that one sentence expresses a single opinion
from a single opinion holder. This task requires both local and global contextual information. [33]
The aspect level of sentiment analysis focuses on the opinions themselves instead of the constructs of
documents, such as paragraphs, sentences and phrases. It is not enough just to find the polarity of the
opinions; identifying the opinion targets is also essential [33]. The goal of this level is to identify the opinion
or sentiment on entities and their different aspects.
Pang et al. [38] used a single Naive Bayes classifier on a movie review corpus to achieve results similar to
the previous study. Multiple Naive Bayes models were trained using different features such as part-of-speech
tags, unigrams, and bigrams. They achieved a classification accuracy of 77.3%, which was considered a
high performance for the Naive Bayes classifier on that domain.
Research in sentiment analysis continues to grow: proposed techniques, applications and algorithms are
steadily improved in accuracy and performance, making them more capable of capturing meaning and
features. Still, challenges remain in analysing the text of reviews and documents and in evaluating
sentiment scores.
3.12 Conclusion
In this chapter we discussed sentiment analysis and its components. In the next chapter we will look in
more detail at Arabic text classification.
Chapter 4
Arabic Text Classification (ATC)
4.1 Introduction
Text Categorization (TC) is a very important and fast-growing applied research field, because
more and more text documents are available online. Text categorization is a necessity due to the very
large number of text documents that humans have to deal with daily, and it is used to extract useful
information from this large amount of data.
The main goal of text mining and classification is to extract valuable information from unstructured
textual resources. It also deals with operations such as retrieval, classification and summarization.
4.2. ARABIC TEXT CLASSIFICATION CHAPTER 4. ARABIC TEXT CLASSIFICATION (ATC)
Figure 4.1: The Top 10 Spoken Languages in the World with their Corresponding Percent 2014
Internet World Stats presents its latest estimates of internet users by language. The top ten languages'
internet statistics, updated on December 31, 2017, are represented in Figure 4.2 below; we can observe
that Arabic occupies the fourth rank in the world.
Figure 4.2: The top ten languages on the internet, in millions of users (2017)
Although Arabic is a widely spoken language, there are relatively few studies on the retrieval and mining
of Arabic text documents in the literature. This is due to the unique nature of Arabic morphology,
which gives the language a very different and more difficult structure than other languages.
4.7 Conclusion
In this chapter we discussed Arabic text classification and its characteristics, including the importance
of Arabic text classification and its challenges. In the next chapter we will present the data collection,
preprocessing steps and frameworks.
Chapter 5
Datasets and Implementation Frameworks
5.1 Introduction
This chapter describes our datasets and presents the main steps that have to be performed to carry
out sentiment classification, namely pre-processing and feature extraction, which allow us to use machine
learning methods.
5.2. DATA COLLECTION CHAPTER 5. DATASETS AND IMPLEMENTATION FRAMEWORKS
In this phase, we collect the dataset that will be used for building and testing the classifier module. We
used the newspaper “El Chorouk online” because it is a daily newspaper in Algeria, published Saturday
through Thursday; it is the most read newspaper in the country and was the third most visited website
in the MENA (Middle East and North Africa) region in 2010.
We collected articles on economic, political and social issues, violence, culture and art, etc. Our corpus
consists of 1633 documents with 63055 tokens; each comment has a sentiment label (polarity): positive,
neutral or negative, with 31392 tokens labelled negative, 21248 neutral and 9975 positive.
NB: A few terms should be defined:
— Document: This could be a text message, tweet, comment, email, book, lyrics to a song. This is
equivalent to one row or observation.
— Corpus: a collection of documents. This would be equivalent to a whole data set of rows/observations.
— Token: a word or symbol derived from a document through the process of tokenization. For
example, the document ’How are you’ has the tokens ’How’, ’are’, and ’you’.
To collect this corpus we went through three steps: searching, organizing and finally storing:
1. Searching: we searched for articles with more than 20 comments.
2. Organizing: this was the main step; we read the articles with their comments and judged their
labels as positive, negative or point of view (neutral). Some of the words that helped us decide the
label of a comment are listed in Tables 5.1, 5.2 and 5.3 below.
Positive words
Negative words
Neutral words
3. Storing: we then separated each label into its own file, giving three files.
Finally, our data was ready and we converted it to CSV format; the statistical distribution of our
data is shown in Table 5.4.
According to Table 5.4 we obtain the percentages shown in Figure 5.2 below:
We also observed a difference in the length of comments between the positive, negative and neutral
classes. Figure 5.3 represents this difference.
In general, the large amounts of data collected through various sources such as the internet, surveys and
experiments are full of missing values, noise and distortions.
Pre-processing is crucial in terms of computation time and classifier performance, because noisy data can
slow the learning process and decrease the efficiency of the system in general. Therefore, a pre-processing
phase is a fundamental step to improve the quality of the input data.
In this phase, we will follow various intermediate processing steps to get to the final text format that will
be used in the final learning step.
Pre-processing includes the following:
— Removal of URLs
Comments frequently contain web links sharing some additional information. The content of the
links is not analyzed, and the link address itself does not provide any useful information; its
elimination also reduces the feature size, which is why URLs are removed from the comments.
— Tokenization
Tokenization is the process of breaking up a given text into individual elements called tokens. Tokens
may be words, numbers or punctuation marks. It is a mandatory step before any other kind of
processing. A basic tokenizer like NLTK's will split our text into sentences and our sentences into
typographic tokens. Table 5.5 below shows an example of tokenization:
input: [an Arabic comment; the text was garbled during extraction]
output: [the same comment split into tokens]
Table 5.5: Tokenization example
— Normalization
Arabic information retrieval systems normalize Arabic words to increase retrieval effectiveness. Nor-
malizing an Arabic word means replacing specific letters within the word with other letters
according to a predefined set of rules, as shown in Table 5.6 below.
Other kinds of impurity encountered in this step are numeric data, punctuation, spaces and single
letters, as well as diacritics (Tashdid, Fatha, Damma, etc.).
None of these impurities carries any polarity, so they should be removed from the dataset, as in the
example shown in Figure 5.4 below:
— Stop Words
Stop words are extremely frequent words that are considered valueless as features, such as pronouns,
conjunctions, prepositions and proper names. Stop words are deemed irrelevant for searching purposes
because they occur frequently in the language for which the indexing engine has been tuned. In order
to save both space and time, these words are dropped at indexing time and then ignored at search
time. Figure 5.5 below provides an example of stop word removal.
— Stemming
Stemming is the process of removing some affixes from words, and reducing these words to their
roots. After reducing words to their roots, these roots can be used in compression, spell checking,
text searching, and text analysis.
The main goal of a stemmer is to map different forms of the same word to a common representation
called “stem”. Stemming can significantly improve the efficiency of the classification by reducing the
number of terms being input to the classification.
Many stemming methods have been developed for the Arabic language. The two most widely used
stemmers are:
1. The root extraction stemmer developed by Khoja et al., which transforms each surface Arabic
word in the document into its root. It is commonly called heavy stemming.
2. The light stemmer developed by Larkey et al., which removes prefixes and suffixes.
In this project we used a light stemmer that removes only word prefixes, so as not to create non-words:
Arabic is highly derivational, and tens or even hundreds of words can be formed from a single stem.
Figure 5.12 shows the result of the stemming phase in our project.
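The pre-processing steps above (URL removal, tokenization, normalization, stop-word removal, light prefix stemming) can be sketched as one pipeline; the normalization rules, stop-word list and prefix list below are small illustrations, not the exact rules used in this project:

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")   # Fathatan .. Sukun, including Tashdid
STOP_WORDS = {"في", "من", "على"}              # tiny illustrative stop-word list
PREFIXES = ("ال", "و")                        # illustrative prefixes only

def preprocess(comment):
    """Return the cleaned, lightly stemmed tokens of an Arabic comment."""
    comment = re.sub(r"https?://\S+", " ", comment)       # remove URLs
    comment = DIACRITICS.sub("", comment)                 # strip diacritics
    comment = re.sub(r"[إأآ]", "ا", comment)              # normalize alef variants
    comment = re.sub(r"[^\u0621-\u064A ]", " ", comment)  # drop digits/punctuation/latin
    tokens = comment.split()                              # basic tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 1]
    stemmed = []
    for t in tokens:                                      # light stemming: prefixes only
        for p in PREFIXES:
            if t.startswith(p) and len(t) > len(p) + 2:
                t = t[len(p):]
                break
        stemmed.append(t)
    return stemmed
```

For instance, preprocess("في البيت http://x.com") drops the URL and the stop word and strips the definite article, leaving a single stemmed token.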
5.2.3 Feature extraction
After pre-processing is completed, the text still requires special preparation before we can use it for
training classifiers and predictive modeling.
It needs to be encoded as integers or floating-point values, because most machine learning algorithms
cannot take in raw text; we therefore create a matrix of numerical values to represent our text for use as
input to a machine learning algorithm. This step is called feature extraction.
This raises the question of how to prepare text documents for machine learning. The two most common
ways of doing this are CountVectorizer and TfidfVectorizer, which extract numerical features from the
text content.
5.2.3.1 CountVectorizer
CountVectorizer takes what is called the Bag of Words approach. The bag-of-words model is a simplifying
representation used in natural language processing (NLP) and information retrieval (IR). In this
model, a text is represented as the bag of its words, disregarding grammar and even word order but keeping
multiplicity. The bag-of-words model is commonly used in document classification methods.
The Bag of Words model learns a vocabulary from all of the documents, then models each document by
counting the occurrences of each word W and storing them in a matrix X.
For further clarification, consider the following three sentences:
— S0: [Arabic sentence; the text was garbled during extraction]
— S1: [Arabic sentence; the text was garbled during extraction]
— S2: [Arabic sentence; the text was garbled during extraction]
To get our bags of words, we count the number of times each word occurs in each sentence. The feature
vectors for the three sentences above are represented in Table 5.7 below:
We may note that most values in X will be zeros, for this reason we say that bags of words are typically
high-dimensional sparse datasets. We can save a lot of memory by only storing the non-zero parts of the
feature vectors in memory.
5.2.3.2 TfidfVectorizer
TfidfVectorizer is an alternative to CountVectorizer. It also creates a document-term matrix from our
documents, but it calculates the term frequency-inverse document frequency (TF-IDF) value for each word.
TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or
corpus.
The TF-IDF is the product of two weights:
— Term frequency: a weight representing how often a word occurs in a document. Several
occurrences of the same word in one document increase the TF-IDF.
— Inverse document frequency: another weight representing how common a word is across docu-
ments. If a word is used in many documents, the TF-IDF decreases.
Let’s see the same example that we used in CountVectorizer but we will use TfidfVectorizer. The following
table 5.8 represents the result.
NB:
To extract the features, we used the scikit-learn library, importing CountVectorizer and TfidfVectorizer
as shown below:
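The imports and basic usage can be sketched as follows; the three English toy sentences stand in for the Arabic comments of our corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus standing in for the Arabic comments.
docs = ["the weather is cold", "i like cold weather", "cold cold weather"]

count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)   # sparse matrix of raw term counts

tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)    # sparse matrix of TF-IDF weights

print(sorted(count_vec.vocabulary_))       # the learned vocabulary
```

Both vectorizers return sparse matrices, which matches the earlier observation that most entries of X are zeros and only the non-zero parts need to be stored.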
5.2.4 N-grams
In natural language processing, an n-gram is a contiguous sequence of n items from a given sample of
text or speech. The items can be syllables, letters, or words, depending on the application. N-grams are
basic features of CountVectorizer and TfidfVectorizer.
N-grams are typically collected from a text or speech corpus. If n = 1, the n-gram is called a "unigram";
if n = 2, a "bigram"; if n = 3, a "trigram"; for n > 3, the letter n is replaced by its numerical value, as in
four-gram, five-gram, etc. So the main difference lies in the chosen n. The figure below shows an example
split of a sentence into n-grams.
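Extracting word n-grams amounts to sliding a window of size n over the token list, which can be sketched in a few lines (in scikit-learn, the vectorizers expose the same idea through their ngram_range parameter):

```python
def ngrams(tokens, n):
    """Return the contiguous n-item sequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "sentiment analysis of arabic comments".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```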
Let us now show the most frequent unigrams, bigrams and trigrams extracted from our data:
We converted our text data into a matrix of integer or floating-point features using the vector models
explained earlier. In these models each comment is represented by a row of the matrix, the columns
represent the existing words, and the values represent the occurrences of each word in the comment. This
gives the final data representation that will be used to create the classifier.
The following figures show the final data representation by CountVectorizer using unigrams and bigrams,
and by TfidfVectorizer using unigrams.
5.2.6 Classification
In this step we use different classification algorithms. For each of them, the features extracted from the
training set in the pre-processing phase are fed to the algorithm to build a classification model, which
allows us to calculate the accuracy of each of them, as well as the processing time required
for model building for each algorithm. We will see the details of this phase in the following chapter.
5.3 Frameworks
Different frameworks are used for this project. All of them are free and open source.
5.3.1 Python
Python is a high-level general-purpose programming language and is very widely used in all types of
disciplines such as general programming, web development, software development, data analysis, machine
learning, etc. Python is used for this project because it is very flexible and easy to use, and it can perform
the same tasks with fewer lines of code than most other mainstream programming languages. [44]
5.3.2 Jupyter Notebook
The Jupyter Notebook is an open-source web application that allows you to create and share documents
that contain live code, equations, visualizations and explanatory text. Uses include data cleaning and
transformation, numerical simulation, statistical modeling, machine learning and much more. The Notebook
has support for over 40 programming languages, including those popular in Data Science such as Python,
R, Julia and Scala.
5.3.3 Pandas
Pandas is open-source, BSD-licensed software written specially for the Python programming language. It
provides a complete set of data analysis tools for Python and is the best competitor to the R programming
language. Operations like reading a data frame, reading CSV and Excel files, slicing, indexing, merging,
handling missing data, etc., can be easily performed with Pandas. One of its most important features is
support for time-series analysis. [45]
5.3.4 Matplotlib
Matplotlib is a library for making 2D plots of arrays in Python. Although it has its origins in emulating
the MATLAB graphics commands, it is independent of MATLAB and can be used in a Pythonic, object-
oriented way. Although matplotlib is written primarily in pure Python, it makes heavy use of NumPy and
other extension code to provide good performance even for large arrays. [46]
5.3.5 Seaborn
Seaborn is a library for data visualization, created using the Python programming language. It is a
high-level library stacked on top of matplotlib. Seaborn is more attractive and informative than matplotlib
and very easy to use and is tightly integrated with NumPy and Pandas. Seaborn and matplotlib can be used
essentially side by side to derive conclusions from the datasets. [47]
5.3.6 Scikit-learn
Scikit-learn is a free software library for the Python programming language. It is very easy to use and
includes the tools and algorithms needed for most machine learning tasks. It features various classification,
regression and clustering algorithms, including support vector machines, random forests, gradient boosting
and k-means, and is designed to interoperate with the Python numerical and scientific libraries NumPy
and SciPy. [48]
5.3.7 NLTK
The Natural Language Toolkit (NLTK) is a platform used for building programs for text analysis. The
platform was originally released by Steven Bird and Edward Loper in conjunction with a computational
linguistics course at the University of Pennsylvania in 2001.
It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a
suite of text processing libraries for categorizing text, tokenization, stemming, tagging, parsing, and semantic
reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. NLTK is available
for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.
[49]
5.4 Conclusion
In this chapter, we focused on the conceptual aspects of our project. We presented our data collection,
explained the data preprocessing steps, and then presented the tools that we used. The next chapter will
be devoted to the realization of our application.
Chapter 6
Results and Discussion
6.1 Introduction
This chapter describes the performance results obtained on our dataset. In our experiments, after all the
pre-processing, we tested six machine learning classification methods that are commonly used in sentiment
analysis: Artificial Neural Network (Multi-layer Perceptron), Multinomial Naive Bayes, Support Vector
Machines, Logistic Regression, K-Nearest Neighbors and Random Forest.
For each classifier we split the dataset into training and testing subsets (80% and 20%) in order to build
the best model and get the best performance results. We also measured the computational time of the
training and testing stages, after which we decided which model was best for our data. We tested all
methods with two classes ("positive, negative") and with three classes ("positive, negative and neutral").
Finally, to achieve an overall comparison, we also tested bag-of-words features using unigrams, bigrams,
and the two together.
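The experimental protocol above (80/20 split, several classifiers, accuracy and training time) can be sketched as follows; the synthetic feature matrix stands in for our real bag-of-words matrix, and only two of the six classifiers are shown:

```python
import time
from sklearn.datasets import make_classification   # stand-in for the real features
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data; abs() keeps features non-negative, as MultinomialNB requires.
X, y = make_classification(n_samples=400, n_features=30, random_state=0)
X = abs(X)

# 80% training / 20% testing split, as in our experiments.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for model in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    start = time.perf_counter()
    model.fit(X_train, y_train)                    # timed training stage
    elapsed = time.perf_counter() - start
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: accuracy={acc:.3f}, train_time={elapsed:.4f}s")
```

The remaining four classifiers would simply be added to the tuple; the same loop then produces the accuracy/time comparison reported in the experiments below.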
To calculate these metrics we use the confusion matrix, which contains the estimated and actual distri-
butions of labels, as shown in Table 6.1; each column corresponds to the actual label and each row to the
estimated (predicted) label.
6.2. EVALUATION METRICS OF PERFORMANCE CHAPTER 6. RESULTS AND DISCUSSION
— True Positives (TP): the number of sentences that are actually positive and were predicted as
positive.
— True Negatives (TN): the number of sentences that are actually negative and were predicted as
negative.
— False Positives (FP): the number of sentences that are actually negative but were predicted as
positive.
— False Negatives (FN): the number of sentences that are actually positive but were predicted as
negative.
6.2.1 Accuracy
Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations
to the total observations. Accuracy is a great measure, but only when we have balanced (symmetric)
datasets; otherwise, we have to look at other metrics to evaluate the performance of our model. It can be
estimated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (6.1)
6.2.2 Precision
Precision is the ratio of correctly predicted positive observations to the total predicted positive obser-
vations. High precision relates to a low false positive rate. Precision can be estimated using the following
formula:

Precision = TP / (TP + FP)    (6.2)
6.2.3 Recall
Recall is the ratio of correctly predicted positive observations to all observations in the actual positive
class; it shows the classifier's ability to find all positive instances. It is computed with the formula:

Recall = TP / (TP + FN)    (6.3)
6.2.4 F1 score
F1 score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives
and false negatives into account. F1 score can be calculated using:

F1 Score = 2 * Precision * Recall / (Precision + Recall)    (6.4)
Note:
— F1 score is usually more useful than accuracy, especially if you have an uneven class distribution.
— Accuracy works best if false positives and false negatives have similar cost.
— If the cost of false positives and false negatives are very different, it’s better to look at both Precision
and Recall.
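Formulas (6.1)-(6.4) can be computed directly from the confusion-matrix counts; the counts below are invented purely for illustration:

```python
def metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example counts (invented for illustration).
acc, prec, rec, f1 = metrics(tp=80, tn=70, fp=20, fn=30)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```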
6.3 Results and Evaluation
In the following experiments we classify the comments into two classes: positive and negative.
First experiment:
The first experiment was conducted on the model trained using unigram features; the evaluation results
are presented in Table 6.1 below:
From Table 6.1 it can be seen that the best accuracy achieved is 84.71%, by Multinomial Naive Bayes,
which is also the best classifier on the other metrics. Figure 6.2 below compares the machine learning
classification algorithms in terms of accuracy and the time taken to train on the dataset.
Second experiment:
The second experiment uses bigram features; the results are presented in Table 6.2.
Figure 6.3 below shows the average accuracy and the average processing time of each of the six
algorithms for these runs. It is clear that NB has the best accuracy and the smallest processing time.
Third experiment:
This experiment measures the accuracy of the previous algorithms using unigram and bigram features
together; the results are shown in Table 6.3.
For Multinomial Naive Bayes the accuracy is 85.57%, a bit better than what was obtained using
unigrams or bigrams alone.
Figure 6.4 below represents the accuracy of each classification method together with its computation
time:
6.3.1.1 Discussion
Comparing the results of the different methods and feature selections (unigrams, bigrams, and their
combination), it is clear that bigram features alone give poor results for all methods. For Multinomial
Naive Bayes, the highest accuracy, 85.57%, is reached when unigrams and bigrams are used together; the
Multi-layer Perceptron also achieves a very good accuracy of 83.05%, but takes much longer. KNN did not
perform well in any experiment, so it is a poor method for our dataset. Overall, our method combined with
unigrams and bigrams, or with unigrams alone, gives very good results. Our final result for two-class
classification was 85.57%.
Having finished with two-class classification, we then tried to classify the comments into three classes:
positive, negative and neutral.
Fourth experiment: The fourth experiment was conducted on the model trained using unigram
features; the evaluation results are presented in Table 6.4 below:
In Table 6.4, Multinomial Naive Bayes has the highest classification accuracy and the smallest processing
time. Compared with SVC, Random Forest, KNN, Logistic Regression and Multi-layer Perceptron,
Multinomial Naive Bayes is the best classifier, with an average value of 64.41%.
Figure 6.6 below compares the machine learning classification algorithms in terms of accuracy and
training time.
Fifth experiment: The fifth experiment is performed using bigram features; the results are presented
in Table 6.5.
In this experiment, there is a marked reduction in accuracy for all methods. Multinomial Naive Bayes
and Logistic Regression have the same accuracy, 58.58%, but on the F1 score Multinomial Naive Bayes is
better than Logistic Regression, so Multinomial Naive Bayes remains the best one.
The figure below compares the machine learning classification algorithms in terms of accuracy and
training time.
Sixth experiment:
This experiment measures the accuracy of the previous algorithms using unigram and bigram features
together; the results are shown in Table 6.6.
For Multinomial Naive Bayes, it is clear that this configuration achieves its best value compared with the
previous cases. Figure 6.7 below compares the machine learning classification algorithms on accuracy and
the time taken to train on the dataset.
Comparing accuracy alone, Multinomial Naive Bayes and the Multi-layer Perceptron achieve the highest
values. Taking computation time into consideration as well, Multinomial Naive Bayes is the fastest, with an
accuracy of 65.64%.
6.3.2.1 Discussion
Classifying sentiment into three classes is more difficult than two-class classification, so the accuracy is
not as good as in the two-class case. This is perfectly normal and was expected, because more classes require
a larger dataset; with more data, very good results could be achieved. The last three experiments show the
same pattern as the first three: selecting unigrams and bigrams together, or unigrams alone, gives good
results compared with bigrams alone. Our final result for three-class classification was 65.64%, obtained
with Multinomial Naive Bayes.
6.4 Conclusion
This chapter presented the results of the experiments conducted with six classifiers. It can be observed
that each algorithm has an intrinsic capacity to outperform the others depending on the situation. The
Multinomial Naive Bayes approach gave quite good results in both time and accuracy on our dataset.
Figure 6.8 below shows some new examples categorized by Multinomial Naive Bayes.
Chapter 7
Conclusion and Future Works
7.1 Conclusion
The goal of machine learning is to turn data into information based on past experience and to build
decision systems that can act on that information. This goal has attracted interest from various domains,
and today machine learning solutions have become indispensable tools in many fields of science, business
and engineering.
The primary objective of this thesis was to create a new dataset for the classification of Arabic sentiment
comments and to compare algorithms on different performance metrics using machine learning, in order to
see which algorithm is best for our dataset.
We chose the Arabic language because this work could serve as a practical guide for future annotation
projects, and the corpus will be made available to the research community.
To achieve this research objective, six machine learning algorithms were tested: Support Vector Machines,
Multinomial Naïve Bayes, Random Forest, Logistic Regression and others. Every algorithm performed better
in some situations and worse in others, but Multinomial Naïve Bayes is the model most likely to work best
on our dataset in this study.
The best accuracy achieved was 85.57% for two-class classification and 65.64% for three-class classification.
7.2 Future Work