
PEOPLE'S DEMOCRATIC REPUBLIC OF ALGERIA

MINISTRY OF HIGHER EDUCATION AND SCIENTIFIC RESEARCH

UNIVERSITE FERHAT ABBAS SETIF 1

Faculty of Sciences

Department of Computer Science

End-of-studies thesis submitted for the degree of MASTER
OPTION: Foundations and Engineering of Information and Image

Theme
Sentiment analysis of Arabic comments by readers of online newspapers

Presented by: Rania Aberkane
Supervised by: Dr. Sadik Bessou

Class of 2017/2018
Acknowledgements
Before anyone, I thank ALLAH Almighty, who guides me and gave me the strength and courage to complete this research work.
I would like to express my sincere thanks to my supervisor, Dr. Sadik Bessou, for his continuous encouragement and support, and for giving me the freedom to develop my own ideas.
I am very thankful to my many friends for their support, encouragement, help, listening, and stimulating discussions during this thesis as well as before it: Dehel Amel, Boudissa Roukia, Kedad Walid, Bahmed Assia and Sarri Racha.
Moreover, I wish to convey my gratitude to all the people who have helped me directly or indirectly during the last five years, especially my friend Diboune Nadia.
Special thanks and gratitude go to my lovely family for their help, wise counsel and affection.
Finally, thanks to all those who participated in this work and made it possible.

Abstract

Information technologies have firmly entered our lives, and it is impossible to imagine life without gadgets or the Internet. Today, sentiment analysis is one of the fastest growing research areas in computer science. Within the last couple of years, sentiment analysis in Arabic has gained considerable interest from the research community, because it can help analyse trending topics such as political crises, elections and disasters, and predict them before they occur.
In this thesis, we present the details of collecting and constructing a large dataset of Arabic comments, and we explain the techniques used in cleaning and pre-processing it. Sentiment is divided into three classes: positive, negative and neutral.
We studied six text classification algorithms: Multinomial Naive Bayes, Support Vector Machines, Random Forest, Logistic Regression, Multi-layer Perceptron and k-nearest neighbors. Applying these algorithms revealed that the Naive Bayes algorithm performs well for text classification with small training sets; we achieved an accuracy of 85.57% on two-class classification and 65.64% on three-class classification.

Keywords:
Natural language processing, Sentiment analysis, Arabic text classification, polarity classification, feature selection, Naïve Bayes, Machine learning.


Table of Contents

1 INTRODUCTION
1.1 Research Objectives
1.2 Thesis Organization

2 THEORETICAL BACKGROUND
2.1 Introduction
2.2 Natural language processing
2.2.1 Definition
2.2.2 Some Theoretical Developments
2.2.3 Natural Language Text Processing Systems
2.2.4 Applications of natural language processing
2.2.5 Levels of Language
2.3 Machine Learning
2.3.1 Definition
2.3.2 Machine Learning process
2.3.3 Machine Learning types
2.3.4 Applications of Machine Learning
2.3.5 Machine Learning Algorithms
2.3.5.1 Naïve Bayesian Classification
2.3.5.2 Support Vector Machine
2.3.5.3 Artificial Neural Networks
2.4 Conclusion

3 BACKGROUND ON SENTIMENT ANALYSIS (SA)
3.1 Introduction
3.2 Definition
3.3 The importance of sentiment analysis
3.4 Evolution of SA
3.5 Methods for Sentiment Analysis
3.5.1 Machine learning Approach
3.5.2 Rule Based Approach
3.5.3 Lexical Based Approach
3.6 Three major techniques of SA
3.7 SA Applications
3.8 Sentiment analysis tasks
3.8.1 Subjectivity classification
3.8.2 Opinion extraction
3.8.3 Opinion tracking
3.8.4 Opinion summarization
3.8.5 Sentiment classification
3.9 Two main types of opinions
3.9.1 Regular opinions
3.9.2 Comparative opinions
3.10 Levels of sentiment analysis
3.10.1 Document level
3.10.2 Sentence level
3.10.3 Aspect level
3.11 Related Work in SA
3.12 Conclusion

4 Arabic Text Classification (ATC)
4.1 Introduction
4.2 ARABIC TEXT CLASSIFICATION
4.3 Characteristics of Arabic Language
4.4 The importance of ATC
4.5 Challenges of using Arabic language
4.6 Research Motivation
4.7 Conclusion

5 DATASETS AND IMPLEMENTATION FRAMEWORKS
5.1 Introduction
5.2 Data Collection
5.2.1 Arabic corpus collection
5.2.2 Data Preprocessing
5.2.3 Features Extraction
5.2.3.1 CountVectorizer
5.2.3.2 TfidfVectorizer
5.2.4 N-grams
5.2.5 Final Data representation
5.2.6 Classification
5.3 Frameworks
5.3.1 Python
5.3.2 Jupyter Notebook
5.3.3 Pandas
5.3.4 Matplotlib
5.3.5 Seaborn
5.3.6 Scikit-learn
5.3.7 NLTK
5.4 Conclusion

6 RESULTS AND DISCUSSION
6.1 Introduction
6.2 Evaluation metrics of performance
6.2.1 Accuracy
6.2.2 Precision
6.2.3 Recall
6.2.4 F1 score
6.3 Results and evaluation
6.3.1 Two Classes: Positive and Negative
6.3.1.1 Discussion
6.3.2 Three Classes: Positive, Negative and Neutral
6.3.2.1 Discussion
6.4 Conclusion

7 CONCLUSION AND FUTURE WORK
7.1 Conclusion
7.2 Future work
List of Figures

2.1 Classification of NLP
2.2 Example of spelling correction
2.3 Example of information extraction
2.4 Example of Machine Translation
2.5 Machine Learning process
2.6 Types of machine learning
2.7 Which hyper-plane is optimal?
2.8 Basic concepts of SVM
2.9 Linear vs nonlinear problems
2.10 Mapping features
2.11 Left: low regularization value, right: high regularization value
2.12 High Gamma
2.13 Low Gamma
2.14 Good margin
2.15 Bad margin
2.16 Simple example of a neural network
2.17 The basics of neural networks
2.18 Most popular types of Activation functions
2.19 Back propagation algorithm for neural networks

3.1 Number of papers published per year in Scopus

4.1 The top 10 spoken languages in the world with their corresponding percentages (2014)
4.2 The top ten languages on the internet in millions of users (2017)

5.1 Text classification phases
5.2 Standard dataset statistics
5.3 Comment length
5.4 Remove numbers, punctuation and noise
5.5 Example of stop words
5.6 Example of stemming phase
5.7 Package of features extracted
5.8 Split of a phrase into unigrams, bigrams and trigrams
5.9 Most frequent unigrams, bigrams and trigrams
5.10 Final data representation by CountVectorizer using unigrams
5.11 Final data representation by CountVectorizer using bigrams
5.12 Final data representation by TfidfVectorizer using unigrams

6.1 Confusion matrix
6.2 Comparison of accuracy and time for first experiment
6.3 Comparison of accuracy and time for second experiment
6.4 Comparison of accuracy and time for third experiment
6.5 Comparison of accuracy and time for fourth experiment
6.6 Comparison of accuracy and time for fifth experiment
6.7 Comparison of accuracy and time for sixth experiment
6.8 Examples of classification with Multinomial Naive Bayes
List of Tables

3.1 Comparison of various sentiment analysis approaches

5.1 Some frequent words in positive documents
5.2 Some frequent words in negative documents
5.3 Some frequent words in neutral documents
5.4 Number of documents and tokens in each label
5.5 Tokenization example
5.6 Normalization rules for Arabic words
5.7 Bag of Words representation
5.8 Example of TfidfVectorizer

6.1 Evaluation of algorithms using unigrams (1)
6.2 Evaluation of algorithms using bigrams (1)
6.3 Evaluation of algorithms using unigrams and bigrams (1)
6.4 Evaluation of algorithms using unigrams (2)
6.5 Evaluation of algorithms using bigrams (2)
6.6 Evaluation of algorithms using unigrams and bigrams (2)
Chapter 1

INTRODUCTION

The rapid growth of the internet and computer technologies has led to billions of electronic text documents being created, edited, and stored digitally. This situation poses a great challenge to the public, and specifically to computer users, in searching, organizing, and storing these documents.

Text classification therefore occupies a considerable place among data analysis tools. It consists of assigning a textual document to a class, a task that is indispensable both for information retrieval and for knowledge extraction.

Sentiment Analysis (SA), also known as opinion mining, is the process of classifying the emotion conveyed by a text, for example as negative, positive or neutral. It is one of the most vital research fields of Natural Language Processing (NLP) nowadays. Information gained from applying Sentiment Analysis has many potential uses, for instance, to help marketers evaluate the success of an advertisement campaign, to identify how different demographics have received a product release, to predict user behavior, or to forecast election results.

This project aims to implement Sentiment Analysis for Arabic text, because Arabic is a very rich language whose structure is very different from, and more difficult than, that of other languages; it is therefore important to build an Arabic text classifier. Arabic is also the fastest growing language on the web, ranked fourth among the languages used online.

In our context, we use machine learning (ML) to build a classifier that can learn from a set of pre-classified examples. We are interested in using multiple classification techniques, namely Support Vector Machines, Naive Bayes, Logistic Regression, Multi-layer Perceptron and k-nearest neighbors, and we compare them in terms of accuracy and processing time.


1.1 Research Objectives


The main objectives of this research are to:
— Explain the concepts of machine learning algorithms.
— Create a new dataset from the online newspaper EL CHOUROK.
— Present the different techniques used in Sentiment Analysis.
— Discover efficient algorithms that can be used for sentiment analysis, as well as provide improvements to existing solutions.
— Study the effect of different feature selection and preprocessing techniques on the accuracy of a machine learning model.
— Analyze the computational time of the algorithms and the computational resources that a particular algorithm requires.

1.2 Thesis Organization


This manuscript is divided into six main chapters as follows:

Chapter 2 « THEORETICAL BACKGROUND »: describes different methods that can be applied to the sentiment classification task. In this chapter, we give an overview of natural language processing and of machine learning and its basics, present its types, and explain common machine learning algorithms.

Chapter 3 « BACKGROUND ON SENTIMENT ANALYSIS »: gives the background necessary to perform Sentiment Analysis. In this chapter, sentiment analysis in general, its components, its evolution, and related work are discussed.

Chapter 4 « ARABIC TEXT CLASSIFICATION »: focuses on the Arabic language and its characteristics, including the importance of Arabic text classification and its challenges.

Chapter 5 « DATASETS AND IMPLEMENTATION FRAMEWORKS »: presents information about the data collection, explains each step of data preprocessing, presents the feature selection mechanism, and describes the development environment.


Chapter 6 « RESULTS AND DISCUSSION »: contains the experiments and results achieved in this research work. Based on the discussed results, an analysis is performed to determine which of the algorithms performs better for comment classification.

Chapter 7 « CONCLUSION AND FUTURE WORK »: conclusions and future research perspectives are presented in this chapter.

Chapter 2

THEORETICAL BACKGROUND

2.1 Introduction
In this chapter, we present some generalities concerning natural language processing: its characteristics, its different applications and its levels of language. Then we give a quick overview of machine learning and its algorithms.

2.2 Natural language processing

2.2.1 Definition

Natural Language Processing (NLP) is an area of research and application that explores how computers
can be used to understand and manipulate natural language text or speech to do useful things. NLP is
essentially multidisciplinary: it is closely related to linguistics.
NLP researchers aim to gather knowledge on how human beings understand and use language so that
appropriate tools and techniques can be developed to make computer systems understand and manipulate
natural languages to perform the desired tasks. [1]
Natural Language Processing can basically be classified into two parts, Natural Language Understanding and Natural Language Generation, which involve the tasks of understanding and generating text, respectively (Figure 2.1). [2]


Figure 2.1: Classification of NLP

2.2.2 Some Theoretical Developments

The most recent theoretical developments that have influenced research in NLP can be grouped into four
classes:
— Statistical and corpus-based methods in NLP;
— Recent efforts to use WordNet for NLP research;
— The resurgence of interest in finite-state and other computationally lean approaches to NLP;
— The initiation of collaborative projects to create large grammars and NLP tools. [1]

2.2.3 Natural Language Text Processing Systems

Manipulation of texts for knowledge extraction, for automatic indexing and abstracting, or for producing
text in a desired format, has been recognized as an important area of research in NLP.
This is broadly classified as the area of natural language text processing that allows structuring of large
bodies of textual information with a view to retrieving particular information or to deriving knowledge
structures that may be used for a specific purpose.
Automatic text processing systems generally take some form of text input and transform it into an output
of some different form. [1]

2.2.4 Applications of natural language processing

There are many important applications of natural language processing that answer real-world challenges. The following list is not exhaustive:
— Sentiment Analysis : Identifying sentiments and opinions stated in a text.
— Better search engines
— Speech recognition : Recognizing a spoken language and transforming it into a text.


— Spelling correction, grammar checking: Suggesting alternatives for the errors.

Figure 2.2: Example of spelling correction

— Information extraction : Includes named-entity recognition

Figure 2.3: Example of information extraction

— Machine Translation: Translating a text from one language to another

Figure 2.4: Example of Machine Translation


2.2.5 Levels of Language

— Morphology: the structure of words. For instance, unusually can be thought of as composed of
a prefix un-, a stem usual, and an affix -ly. Composed is compose plus the inflectional affix -ed: a
spelling rule means we end up with composed rather than composeed.
— Syntax: the way words are used to form phrases.
— Semantics. Compositional semantics is the construction of meaning (generally expressed as logic)
based on syntax. This is contrasted to lexical semantics, i.e., the meaning of individual words.
— Pragmatics: meaning in context, although linguistics and NLP generally have very different per-
spectives here.[3]

2.3 Machine Learning


Machine Learning is a sub-field of artificial intelligence that involves the development of self-learning algorithms that gain knowledge from data in order to make predictions.
Machine learning is present in almost every area of our lives, from groundbreaking and life-saving medical research to discovering fundamental physical aspects of our universe; from providing us with better, cleaner food to web analytics and economic modeling; and in robust e-mail spam filters, convenient text and voice recognition software, reliable Web search engines, and many more promising applications. [4]

2.3.1 Definition

Let us consider some definitions of machine learning. In 1959, Arthur Samuel defined machine learning as: "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed." [5]
In 1997, Tom Mitchell defined machine learning as follows: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
Basically, machine learning is the ability of a computer to learn from experience. Experience is usually given in the form of input data. Looking at this data, the computer can find dependencies in the data that are too complex for a human to find.
Machine learning can be used to reveal a hidden class structure in unstructured data, or it can be used to find dependencies in structured data in order to make predictions. [6]

2.3.2 Machine Learning process

There are 5 basic steps used to perform a machine learning task:


1. Collecting data: the data can be raw data from Excel, Access, text files, etc. This step (gathering past data) forms the foundation of the future learning: the better the variety, density and volume of relevant data, the better the learning prospects for the machine become.

2. Preparing the data: any analytical process depends on the quality of the data used. One needs to spend time determining the quality of the data and then taking steps to fix issues such as missing data.

3. Training a model: this step involves choosing the appropriate algorithm and representation of the data in the form of a model. The cleaned data is split into two parts, train and test; the first part (training data) is used for developing the model, and the second part (test data) is used for testing its accuracy.

4. Evaluating the model: to test the accuracy, the second part of the data (test data) is used. This step determines the precision of the chosen algorithm based on the outcome. A better test of model accuracy is to see its performance on data that was not used at all during model building.

5. Improving the performance: this step might involve choosing a different model altogether or introducing more variables to augment efficiency. This is why a significant amount of time needs to be spent on data collection and preparation.

Figure 2.5 below represents these basic steps:

Figure 2.5: Machine Learning process
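As a minimal sketch of steps 3 and 4 with scikit-learn (the library used later in this thesis); the dataset here is synthetic and purely illustrative:

    # Steps 3-4 of the process: train a model on one part of the data,
    # evaluate it on the held-out part. Synthetic data, for illustration.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # Stand-in for steps 1-2: an already-cleaned dataset.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    # Step 3: split into train/test parts and fit a model on the train part.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)

    # Step 4: evaluate on data never seen during training.
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))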

2.3.3 Machine Learning types

Machine Learning is divided into 3 categories: supervised learning, unsupervised learning, and reinforce-
ment learning.
— Supervised Learning (Predictive): is based on learning a model from labeled training data D
that allows us to make predictions about future data. The main drawback of this type of learning is
the necessity of having lots of labeled data, which is not always available.[7]
— Unsupervised Learning (Descriptive): aims to design a model that structures the information. The difference here is that the behaviors (or categories, or classes) of the learning data are not known; they are what we are trying to find.


— Reinforcement Learning: The input data is the same as for supervised learning, however, learning
is guided by the environment in the form of rewards or penalties given according to the error committed
during the learning. [8]
Figure 2.6 below summarizes the three categories of machine learning:

Figure 2.6: Types of machine learning

2.3.4 Applications of Machine Learning

Machine learning has proven itself to be the answer to many real-world challenges. Therefore, in this section we discuss some applications of machine learning with examples:
— Computer-Aided Diagnosis: pattern recognition algorithms used in computer-aided diagnosis can assist doctors in interpreting medical images in a relatively short time. A few examples are as follows: pathological brain detection, breast cancer, colon cancer, bone metastases, coronary artery disease, congenital heart defect, Alzheimer's disease. [9]
— Speech Recognition: The field of speech recognition aims to develop methodologies and technologies
that enable computers to recognize and translate spoken language into text. This technology can help
people with disabilities. With the passage of time, the accuracy of speech recognition engines is
increasing.
— Security applications: access control systems may use face recognition as one of their components; that is, given a photo (or video recording) of a person, recognize who this person is.
— Text Mining: it has been observed that most enterprise-related information is stored in text format; the challenge is how to use this unstructured data or text.
Text mining is helpful in a number of applications including:
— Business intelligence
— National security
— Life sciences


— Those related to sentiment classification


— Automated placement of advertisement
— Automated classification of news articles
— Social media monitoring
— Spam filtering. [9]

2.3.5 Machine Learning Algorithms

There are many machine learning algorithms. Some are used for supervised learning, e.g. Decision Trees, Rule-Based Classifiers, Naïve Bayesian Classification, Neural Networks, and Support Vector Machines. Others are used for unsupervised learning, e.g. k-Means Clustering, Principal Component Analysis, and Hidden Markov Models. The choice of algorithm depends strongly on the task to be solved (classification, estimation of values, etc.). In this thesis we detail some supervised learning algorithms.

2.3.5.1 Naïve Bayesian Classification

1. Definition: Naive Bayes is one of the simplest existing classifiers. It is based on Bayes' theorem and is particularly suited to problems where the dimensionality of the input is high. Parameter estimation for naive Bayes models uses the method of maximum likelihood. This classifier is a powerful algorithm used for real-time prediction, text classification/spam filtering, and recommendation systems. [10]

2. Bayes theorem

$$p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)} \qquad (2.1)$$
— $p(c_j \mid d)$: probability of instance d being in class $c_j$; this is what we are trying to compute.
— $p(d \mid c_j)$: probability of generating instance d given class $c_j$; we can imagine that being in class $c_j$ causes you to have feature d with some probability.
— $p(c_j)$: probability of occurrence of class $c_j$; this is just how frequent the class $c_j$ is in our database.
— $p(d)$: probability of instance d occurring; this can actually be ignored, since it is the same for all classes.
Naive Bayes considers some set of candidate classes C and is interested in finding the most probable class c in C given the observed data D. The most probable class is called the maximum a posteriori (MAP) class: [6]

$$c_{MAP} = \operatorname{argmax}_{c \in C} P(D \mid c)\, P(c) \qquad (2.2)$$

The Naive Bayes Classifier also assumes conditional independence to make the calculations easier;
that is, given the class attribute value, other feature attributes become conditionally independent.
This condition, though unrealistic, performs well in practice and greatly simplifies calculation.[6]
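As a hedged illustration, the following sketch trains scikit-learn's MultinomialNB (the Naive Bayes variant used later in this thesis) on a tiny hypothetical set of labelled comments:

    # Minimal Naive Bayes text classification sketch; the tiny corpus and
    # labels below are hypothetical, for illustration only.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = ["great article well done", "excellent news good work",
                   "bad decision terrible news", "awful article poor work"]
    train_labels = ["positive", "positive", "negative", "negative"]

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_texts)  # word-count features

    clf = MultinomialNB()
    clf.fit(X_train, train_labels)                   # estimates p(c) and p(d|c)

    # Predict the maximum a posteriori (MAP) class for a new comment.
    X_new = vectorizer.transform(["good news well done"])
    print(clf.predict(X_new))                        # -> ['positive']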

3. Advantages/Disadvantages of Naïve Bayes

Advantages
— Fast to train (single scan) and fast to classify
— Not sensitive to irrelevant features
— Handles real and discrete data
— Handles streaming data well

Disadvantages
— Assumes independence of features

2.3.5.2 Support Vector Machine

1. Introduction: A Support Vector Machine (SVM) is a classifier which separates the various classes of data by means of a hyper-plane. The SVM is modelled on the training data and outputs the hyper-plane for the test data. The SVM model tries to find the region of the data space where the different classes can be most widely differentiated, and draws a hyper-plane there.
There is more than one way to draw a hyper-plane, so which one is optimal? The optimal hyper-plane is the one that maximizes the margin between the classes. A hyper-plane need not always be linear: a hyper-plane in SVM can also act as a non-linear classifier, using a technique known as the kernel trick.

Figure 2.7: Which hyper-plane is optimal?

2. Basic concepts
— Hyper-plane: consider a case of binary classification (i.e. each example is classified into one of two classes). A separating hyper-plane is a hyper-plane which separates the two classes; in particular, it separates their training points. The class of hyper-planes considered is $w^T x + b = 0$, with weight vector w and bias b (i.e., the linear hyper-planes). (Figure 2.8)
— Support Vectors: to determine the separating hyper-plane, it suffices to use only the closest points (i.e. the points on the boundary between the two data classes) from the whole set; these points are called support vectors. (Figure 2.8)


— Margin: we can define a unique separating hyperplane using the margin: the minimal Euclidean distance of a pattern to the decision boundary. The hyperplane that maximises this margin $m = \frac{2}{\|w\|}$ is called the maximal margin hyperplane. [11] (Figure 2.8)

Figure 2.8: Basic concepts of SVM

3. Linear and Nonlinear: for SVM models, we find two cases: linearly separable and non-linearly separable.
— Linearly Separable Case: this case is simpler, because we can easily find the linear classifier. But in most real problems there is no possible linear separation between the data, and the maximum-margin classifier cannot be used because it only works if the training data classes are linearly separable. [12]
Suppose we have a binary classification problem. The SVM places a hyperplane in the middle of the two classes, so that the distance to the nearest positive or negative example is maximized. The SVM discriminant function has the form [13]

$$f(x) = w^T x + b \qquad (2.3)$$

where w is the parameter vector and b is the bias or offset scalar. The classification rule is $\mathrm{sign}(f(x))$, and the linear decision boundary is specified by $f(x) = 0$. The labels are $y \in \{-1, 1\}$; if f separates the data, the geometric distance between a point x and the decision boundary is:

$$\frac{y f(x)}{\|w\|} \qquad (2.4)$$

Given training data $(x, y)_{1:n}$, we want to find a decision boundary w, b that maximizes the geometric distance of the closest point, i.e.:

$$\max_{w,b} \; \min_{i=1 \ldots n} \; \frac{y_i (w^T x_i + b)}{\|w\|} \qquad (2.5)$$

The above objective is difficult to optimize directly. The optimization of equation (2.5) is actually over equivalence classes of w, b up to scaling. Therefore, we can reduce the redundancy by requiring the closest point to the decision boundary to have:

$$y f(x) = y(w^T x + b) = 1 \qquad (2.6)$$

which implies that all points satisfy

$$y f(x) = y(w^T x + b) \geq 1 \qquad (2.7)$$

This converts the complex problem of equation (2.5) into a constrained but simpler problem:

$$\max_{w,b} \frac{1}{\|w\|} \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1, \; i = 1 \ldots n \qquad (2.8)$$

Maximizing $\frac{1}{\|w\|}$ is equivalent to minimizing $\frac{1}{2}\|w\|^2$. This is called the decision boundary problem, and it can be solved via the following constrained optimization problem:

$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1, \; i = 1 \ldots n \qquad (2.9)$$

— Linearly Non-Separable Case: to overcome the limitations of the linearly separable setting, the SVM changes the data space; this non-linear transformation of the data can allow a linear separation of the examples in a new space, called the "re-description space". A problem of non-linear separation in the representation space is thus transformed into a problem of linear separation in a re-description space of larger dimension. This non-linear transformation is performed via a kernel function. [13] [12]

Figure 2.9: Linear vs nonlinear problems

Many real datasets are not linearly separable, in which case the problem of equation (2.9) has no solution. To handle non-separable data, we relax the constraints by making the inequalities easier to satisfy. This is done with slack variables $\xi_i \geq 0$, one for each constraint:

$$y_i(w^T x_i + b) \geq 1 - \xi_i, \quad i = 1 \ldots n \qquad (2.10)$$

Now a point $x_i$ can satisfy the constraint even if it is on the wrong side of the decision boundary, as long as $\xi_i$ is large enough. Of course, all constraints can be trivially satisfied this way. To prevent this, we penalize the sum of the $\xi_i$ and arrive at the new primal problem:

$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0, \;\; i = 1 \ldots n \qquad (2.11)$$

where C is a weight parameter which needs to be carefully set, e.g. by cross-validation (a statistical method of evaluating and comparing learning algorithms by dividing the data into two segments: one used to train a model and the other used to validate it).
Looking at the dual problem of equation (2.11), obtained by introducing Lagrange multipliers, we arrive at:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^T x_j \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \; i = 1 \ldots n, \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \qquad (2.12)$$

The discriminant function is now

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, x_i^T x + b$$
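A small sketch of the soft-margin linear SVM of equations (2.11)-(2.12), using scikit-learn, which solves the optimization internally; the 2-D points are hypothetical:

    # Soft-margin linear SVM on toy 2-D data; scikit-learn solves the
    # primal/dual problems of (2.11)-(2.12) internally.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[0, 0], [1, 1], [1, 0], [3, 3], [4, 3], [3, 4]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1.0)  # C is the slack penalty of (2.11)
    clf.fit(X, y)

    print("w =", clf.coef_, "b =", clf.intercept_)  # hyperplane w^T x + b = 0
    print("support vectors:", clf.support_vectors_)
    print(clf.predict([[0, 1], [4, 4]]))            # -> [-1  1]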

4. Tuning parameters: Kernel, Regularization, Gamma and Margin

— Kernel
The learning of the hyperplane in a linear SVM is done by transforming the problem using some linear algebra; this is where the kernel plays its role [10]. In the following we discuss some of the kernels that are used in support vector machines.
— Linear Kernels
If a classification problem is linearly separable in the input space, we need not map the input space into a high-dimensional space. In such a situation we use linear kernels [14]:

$$K(x, x') = x^T x' \qquad (2.13)$$

— Radial Basis Function Kernels
Radial basis function (RBF) kernels are kernels that can be written in the form (2.14):

$$K(x, x') = f(d(x, x')) \qquad (2.14)$$

where d is a metric over X and f is a function. An example of an RBF kernel is the Gaussian kernel

$$K(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right) \qquad (2.15)$$

where $\sigma$ is a positive real number which represents the bandwidth of the kernel.
— Sigmoid Function Kernels
Sigmoid kernels, with parameters k and $\delta$:

$$K(x, x') = \tanh(k\, x^T x' - \delta) \qquad (2.16)$$

Figure 2.10: Mapping features
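As a small numerical check of the Gaussian kernel (2.15): scikit-learn's rbf_kernel uses the equivalent parameterization gamma = 1/(2σ²), so both computations below should agree:

    # Verifying the Gaussian kernel formula (2.15) against scikit-learn's
    # rbf_kernel, which uses gamma = 1 / (2 * sigma**2).
    import numpy as np
    from sklearn.metrics.pairwise import rbf_kernel

    x = np.array([[1.0, 2.0]])
    x_prime = np.array([[2.0, 0.0]])
    sigma = 1.5

    by_formula = np.exp(-np.sum((x - x_prime) ** 2) / (2 * sigma ** 2))
    by_sklearn = rbf_kernel(x, x_prime, gamma=1 / (2 * sigma ** 2))[0, 0]
    print(by_formula, by_sklearn)  # both print the same value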

— Regularization
The Regularization parameter (often termed as C parameter in python’s sklearn library) tells the
SVM optimization how much you want to avoid misclassifying each training example.
For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane
does a better job of getting all the training points classified correctly. Conversely, a very small
value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if
that hyperplane misclassifies more points [10]. The figure below shows two examples with different
regularization parameters.

Figure 2.11: Left: low regularization value, right: high regularization value

— Gamma
The gamma parameter defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'. In other words, with low gamma, points far away from the plausible separation line are considered in the calculation of the separation line, whereas with high gamma only the points close to the plausible line are considered [10]. (A short tuning sketch for C and gamma follows the margin figures below.)

Figure 2.12: High Gamma

Figure 2.13: Low Gamma

— Margin
The margin is a very important characteristic of an SVM classifier; at its core, SVM tries to achieve a good margin. A good margin is one where the separation is large for both classes. The figures below give visual examples of good and bad margins. A good margin allows the points to be in their respective classes without crossing into the other class [10].


Figure 2.14: Good margin

Figure 2.15: Bad margin
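The C and gamma parameters described above map directly onto scikit-learn's SVC; a hedged sketch of comparing settings by cross-validation on synthetic data:

    # Tuning C and gamma of an RBF-kernel SVM by cross-validation;
    # synthetic dataset, for illustration only.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=5, random_state=0)

    for C in (0.1, 1.0, 10.0):          # small C: wider margin, more slack allowed
        for gamma in (0.01, 0.1, 1.0):  # small gamma: far-away points still count
            score = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5).mean()
            print(f"C={C}, gamma={gamma}: accuracy={score:.3f}")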

2.3.5.3 Artificial Neural Networks

1. What is a neural network?

An Artificial Neural Network is modelled after biological neural networks and attempts to allow computers to learn in a manner similar to human learning. [15] A neural network is an interconnected assembly of simple processing elements, units or nodes, whose functionality is loosely based on the animal neuron. The processing ability of the network is stored in the inter-unit connection strengths, or weights, obtained by a process of adaptation to, or learning from, a set of training patterns. A neuron is thus an information-processing unit that is fundamental to the operation of a neural network. [16]


Figure 2.16: Simple example of a neural network

2. The Basics of Neural Networks


Neural networks are typically organized in layers. Layers are made up of a number of interconnected
’nodes’ which contain an ’activation function’. Patterns are presented to the network via the ’input
layer’, which communicates to one or more ’hidden layers’ where the actual processing is done via a
system of weighted ’connections’. The hidden layers then link to an ’output layer’ where the answer
is output as shown in the graphic below.[15]

Figure 2.17: The basics of neural networks

Neural networks were originally called perceptrons. McCulloch and Pitts introduced the basic mathematical model of the artificial neuron in 1943, on which Rosenblatt's original perceptron, the basic form of artificial neural network, was later built. A single artificial neuron with n inputs, as depicted in the figure above, can be represented by the formula that follows. The weights w allow each of the n inputs to contribute a greater or lesser amount to the sum of input signals. The activation function f transforms the neuron's combined input signals into a single output signal to be broadcast to the output layer.

$$y(x) = f\!\left(\sum_{i=1}^{n} w_i x_i\right) \qquad (2.17)$$
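As a numerical reading of equation (2.17) for a single neuron, with hypothetical weights and inputs and the sigmoid of the next subsection as the activation f:

    # Forward pass of one artificial neuron, y = f(sum_i w_i * x_i),
    # with hypothetical weights/inputs and a sigmoid activation f.
    import numpy as np

    x = np.array([0.5, -1.0, 2.0])  # n = 3 inputs
    w = np.array([0.8, 0.2, -0.5])  # one weight per input

    z = np.dot(w, x)                # combined input signal: -0.8
    y = 1.0 / (1.0 + np.exp(-z))    # activation f (sigmoid): ~0.31
    print(z, y)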

3. What is an Activation Function?

Activation functions are really important for an Artificial Neural Network to learn and make sense of complicated, non-linear functional mappings between the inputs and the response variable. They introduce non-linear properties to the network. Their main purpose is to convert the input signal of a node in an ANN to an output signal; that output signal is then used as an input in the next layer of the stack. [17]
Many kinds of activation functions have been proposed over the years, such as: [17] [18]

(1) Sigmoid or Logistic

A sigmoid function is a special case of the logistic function, having a characteristic "S"-shaped curve. It is especially used in models where we have to predict a probability as output. The range of the sigmoid is between 0 and 1. The logistic function is defined by the formula:

$$y = g(x) = \frac{1}{1 + e^{-x}} \qquad (2.18)$$

(2) Tanh (Hyperbolic Tangent)

The hyperbolic tangent (tanh) function is an alternative to the sigmoid function. The range of the tanh function is from −1 to 1, and tanh is also sigmoidal (S-shaped). It is defined by the formula:

$$\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}} \qquad (2.19)$$

(3) ReLU (Rectified Linear Unit)

The ReLU is currently the most widely used activation function, since it appears in almost all convolutional neural networks and deep learning models. The range of ReLU is from 0 to ∞. It is defined by the formula:

$$y = g(x) = \max(0, x) \qquad (2.20)$$


Figure 2.18: Most popular types of Activation functions
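The three activation functions (2.18)-(2.20) are one-liners in NumPy; a small sketch checking their output ranges:

    # Sigmoid, tanh and ReLU activations of equations (2.18)-(2.20),
    # implemented directly in NumPy.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))                     # range (0, 1)

    def tanh(x):
        return (1 - np.exp(-2 * x)) / (1 + np.exp(-2 * x))  # range (-1, 1)

    def relu(x):
        return np.maximum(0, x)                             # range [0, +inf)

    z = np.array([-2.0, 0.0, 2.0])
    print(sigmoid(z))  # [0.119 0.5   0.881]
    print(tanh(z))     # [-0.964  0.     0.964]
    print(relu(z))     # [0. 0. 2.]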

4. How neural networks work


The way the neural network trains itself is by first computing the cost function for the train dataset
for a given set of weights for the neurons. Then it goes back and adjusts the weights, followed by
computing the cost function for the train dataset based on the new weights. The process of sending
the errors back to the network for adjusting the weights is called back propagation.[19]
The back propagation algorithm looks for the minimum of the error function in weight space using a technique called the delta rule, or gradient descent. The weights that minimize the error function are then considered to be a solution to the learning problem. Let us look at the steps involved in training the neural network with Stochastic Gradient Descent [19] (a code sketch follows the list):
— Initialize the weights to small numbers very close to 0 (but not 0)
— Forward propagation – the neurons are activated from left to right, by using the first data entry
in our train dataset, until we arrive at the predicted result y
— Measure the error which will be generated
— Back propagation – the error generated will be back propagated from right to left, and the weights
will be adjusted according to the learning rate
— Repeat the previous three steps, forward propagation, error computation and back propagation
on the entire train dataset
— This would mark the end of the first epoch, the successive epochs will begin with the weight values
of the previous epochs, we can stop this process when the cost function converges within a certain
acceptable limit.
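The loop just described is what scikit-learn's MLPClassifier (the multi-layer perceptron used later in this thesis) runs internally; a hedged sketch on synthetic data:

    # The forward/back-propagation SGD loop described above, as run
    # internally by scikit-learn's MLPClassifier; synthetic dataset.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = make_classification(n_samples=400, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    mlp = MLPClassifier(hidden_layer_sizes=(32,),  # one hidden layer
                        activation="relu",         # the ReLU of equation (2.20)
                        solver="sgd",              # stochastic gradient descent
                        learning_rate_init=0.01,   # learning rate of the weight updates
                        max_iter=500,              # upper bound on the number of epochs
                        random_state=0)
    mlp.fit(X_train, y_train)  # forward propagation + back propagation per epoch
    print("test accuracy:", mlp.score(X_test, y_test))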


Figure 2.19: Back propagation algorithm for neural networks

2.4 Conclusion
In this chapter, we introduced definitions and basics: we explained some generalities about natural language processing and about machine learning, presented its types, and mentioned a few of its algorithms. The next chapter covers the background necessary for Sentiment Analysis.

Chapter 3

BACKGROUND ON SENTIMENT
ANALYSIS (SA)

3.1 Introduction
Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions,
sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services,
organizations, individuals, issues, events, topics, and their attributes. It represents a large problem space.

3.2 Definition
Sentiment analysis is a series of methods, techniques, and tools for detecting and extracting subjective information, such as opinions and attitudes, from language. [20]
Traditionally, sentiment analysis has been about opinion polarity, i.e., whether someone has a positive, neutral, or negative opinion towards something. [21]
The growth of sentiment analysis coincides with that of social media; in fact, sentiment analysis is now right at the center of social media research. Hence, research in sentiment analysis not only has an important impact on NLP, but may also have a profound impact on management sciences, political science, economics, and the social sciences, as they are all affected by people's opinions. [22]

3.3 The importance of sentiment analysis


There are millions of online users around the world who write and read online. Daily online sentiment has become a significant input to decision making. A survey conducted by Dimensional Research examined the percentage of people who trust online customer reviews as much as personal recommendations: according to the 2011 study, 74% of customers' confidence was based on online personal recommendation reviews, 60% in the 2012 study, and 57% in the 2013 study; the percentage then increased in the 2014 study, with 94% of customers trusting online sentiment reviews. [23]

3.4 Evolution of SA
We have seen a massive increase in the number of papers focusing on sentiment analysis and opinion
mining in recent years. According to our data, nearly 7,000 papers on this topic have been published and, more interestingly, 99% of the papers have appeared after 2004, making sentiment analysis one of the fastest growing research areas. [24]
The number of papers in sentiment analysis is increasing rapidly as can be observed from Figure 3.1. [25]

Figure 3.1: Number of papers published per year in Scopus

3.5 Methods for Sentiment Analysis


There exist many algorithms and methodologies for sentiment analysis, and many researchers are still working on developing new effective methods or improving existing ones [26]. There are three main techniques:

3.5.1 Machine learning Approach

The machine learning approach trains an algorithm on a predefined dataset before applying it to the actual dataset. Machine learning techniques first train the algorithm on particular inputs with known outputs, so that it can later work with new, unknown data.


3.5.2 Rule Based Approach

The rule-based approach works by defining various rules for obtaining the opinion: every sentence in each document is tokenized, and every token, or word, is tested for its presence in the lexicon. If the word is present and has a positive sentiment, a +1 rating is applied to it. Every post starts with a neutral score of zero, and is considered positive if the final polarity score is greater than zero, or negative if the score is less than zero [27]. Once the rule-based approach produces its output, it is checked whether the output is correct or not. If the input sentence contains any word that is not present in the database but that could help the analysis (e.g. of a movie review), such words are added to the database. [26]
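A minimal sketch of such a rule-based scorer; the tiny word lists are hypothetical:

    # Minimal rule-based polarity scorer as described above: tokenize,
    # look each word up, sum +1/-1 ratings. Hypothetical word lists.
    POSITIVE = {"good", "great", "excellent", "happy"}
    NEGATIVE = {"bad", "terrible", "awful", "sad"}

    def polarity(text):
        score = 0                      # every post starts at a neutral zero
        for token in text.lower().split():
            if token in POSITIVE:
                score += 1             # +1 rating for a positive word
            elif token in NEGATIVE:
                score -= 1             # -1 rating for a negative word
        if score > 0:
            return "positive"
        return "negative" if score < 0 else "neutral"

    print(polarity("great news but terrible timing"))  # -> neutral (+1, -1)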

3.5.3 Lexical Based Approach

Lexicon-based techniques work on the assumption that the collective polarity of a sentence or document is the sum of the polarities of the individual phrases or words. This methodology relies on sentiment dictionaries for each domain. Each domain lexicon is then enriched with the appraisal words of the applicable training collection that have the highest weight, calculated by the Relevance Frequency (RF) method. A word-modifier changes (increases or decreases) the weight of the subsequent appraisal word by a certain percentage. [28]

3.6 Three major techniques of SA


The three major techniques used in sentiment analysis are analyzed based on their performance and accuracy. The major advantages and disadvantages of using each approach are also discussed. The comparison of all these techniques is shown in Table 3.1 below: [26]


Approach: Machine Learning Method
Classification: both supervised and unsupervised learning
Advantage: supports feature learning and parameter optimization for best results
Disadvantage: large data requirement; works on a single domain
Methods: SVM, Naïve Bayes, FDOSA

Approach: Rule based approach
Classification: both supervised and unsupervised learning
Advantage: higher accuracy; requires less data but needs expert human labor
Disadvantage: rules must be defined, and accuracy/performance is highly rule dependent
Methods: fine-grained dictionary, booster words

Approach: Lexicon Based Approach
Classification: unsupervised learning
Advantage: labelled data and a learning procedure are not required
Disadvantage: relies excessively on the emotional dictionary
Methods: corpus, dictionary

Table 3.1: Comparison of various sentiment analysis approaches

3.7 SA Applications
Opinions are central to almost all human activities because they are key influencers of our behaviors.
Whenever we need to make a decision, we want to know others' opinions. In the past, when an individual
needed opinions, he/she asked friends and family.
With the explosive growth of social media on the Web, individuals and organizations are increasingly
using the content in these media for decision making. [22]
So sentiment analysis has applications in vital sectors. The following points highlight some of these
applications: [29]
— Marketing: Since social media has become a unique platform of customer interactions, the use of
sentiment analysis can easily take marketing to a whole new level. Companies have realized that
emotions expressed on social media are shaping their brand's image. Therefore, sentiment analysis tools give
marketers a way to measure their effectiveness, and help consumers who are trying to research a
product or a service.
— Politics: Many valuable uses can be obtained for political organizations by fully understanding social
media sentiment. Social media feedback has been used to inform political leaders of potential threats,
problems or issues with their organizations. In addition, an essential role for sentiment analysis
appeared earlier in predicting elections and acquiring citizens' responses on important issues such as
rising prices and constitutional changes.
— Healthcare: Medical web blogs are all over the internet these days. These blogs deal with
medicine and health-care issues such as diseases, medical treatments and medications. Due to the
health-related experiences and medical histories these web pages provide for practitioners and patients,
sentiment analysis tools had to be developed for use in medical fields.
— Finance: Sentiment Analysis can also be used in the financial world. Investors can easily follow their
favorite companies and monitor their sentiment data in real time. With sentiment analysis, business
investors can acquire business news more easily and aggregate this information to make better financial
decisions.

3.8 Sentiment analysis tasks


Watching "How the people think?" leads the governments and companies to understand "What the people
need" and it always helps them to improve the quality of their products and services [11].Today, there are
several sources of people’s views on the Web, and there are several techniques and methods to understand,
categorize, and analyses the opinions.
These techniques lead organizations and individuals to gain the necessary information easily and rapidly.
Some of the important tasks in sentiment analysis are presented in the following:

3.8.1 Subjectivity classification

The goal of this task is to categorize sentences into two classes: subjective and objective. For example,
"The weather is cold." is an objective sentence, while "I like cold weather." is a subjective sentence [30].

3.8.2 Opinion extraction

This approach detects the opinions that are embedded in sentences or documents. An opinionated
sentence is the smallest complete unit from which sentiments can be extracted. Thus, to extract opinions
and determine their polarities, the sentiment words, the opinion holders, and the contextual information are
considered as indications [31].

3.8.3 Opinion tracking

The goal of this approach is to find out how the public changes its views or opinions over time. Opinion
tracking systems can track opinions in different sources according to various requests. The results of this
approach are very useful for companies, institutes, concerned public and especially for governments. [31]

3.8.4 Opinion summarization

The four steps of opinion summarization consist of: detecting the subject, retrieving relevant sentences,
identifying the opinion-oriented sentences, and summarizing. Opinion summarization divides the opinionated
documents into two categories: positive and negative. [31]

3.8.5 Sentiment classification

Sentiment classification is an area in sentiment analysis which determines the overall polarity of a text,
or a sentence or a feature. It categorizes opinionated statements into different classes using a function
called a classifier. [31]

3.9 Two main types of opinions

3.9.1 Regular opinions

Sentiment/opinion expressions on some target entities:


— Direct opinions: “the touch screen is really cool”.
— Indirect opinions: “after taking the drug, my pain has gone”.

3.9.2 Comparative opinions

Comparisons of more than one entity. Ex: “IPhone is better than Blackberry”.
A regular opinion has the following basic components:

1. Holder: represents the person that holds the sentiment

2. Target: denotes the subject of sentiment

3. Polarity: denotes the emotion expressed; it can be (positive, negative) or (positive, negative,
neutral)

4. Aspect: the part of target that the sentiment is expressed towards. [32]

3.10 Levels of sentiment analysis


Sentiment analysis has been mainly investigated at three different levels: document level, sentence level
and aspect level.

3.10.1 Document level

The task of document-level sentiment analysis is to determine whether an opinionated document that
comments on an object expresses an overall positive or negative opinion. For example, a sentiment analysis
system classifies the overall polarity of a customer review about a specific product. [33]
Document level analysis assumes a piece of text expresses sentiment towards a single target. So it is not
applicable to documents in which opinions are expressed on multiple products. [33]


3.10.2 Sentence level

The sentence level of sentiment analysis involves determining whether each sentence expresses a neutral,
positive, or negative opinion.
The subjectivity classification is very important, because it filters out those sentences that contain no
opinions. The sentence level of sentiment classification assumes that one sentence expresses a single opinion
from a single opinion holder. This task requires both local and global contextual information. [33]

3.10.3 Aspect level

The aspect level of sentiment analysis focuses on the opinions themselves instead of looking at the constructs of
documents, such as paragraphs, sentences and phrases. It is not enough just to find out the polarity of the
opinions; identifying the opinion targets is also essential [33]. The goal of this level is to identify the opinion
or sentiment on entities and their different aspects.

3.11 Related Work in SA


Sentiment Analysis is a vast domain which requires the study of related work as well as a good knowledge
of theoretical background. In the following we describe relevant works in Sentiment Analysis as a problem
domain.
Reviews of mobile device products were analyzed by Zhang et al. in [34]. Such analysis is useful for
judging product quality and market status. The research used three machine learning algorithms
(Naïve Bayes classifier, K-nearest neighbors, and random forest) to calculate sentiment classification
accuracy; random forest improved the performance of the classifier.
Some research has been carried out on sentiment in the microblog domain. Shamma et al. [35]
examined a variety of aspects of debate modeling using Twitter and annotated a corpus of 3,269 tweets posted
during the 2008 presidential debate between Barack Obama and John McCain. Later, Diakopoulos and
Shamma [36] used manual annotations to characterize the sentiment reactions to various issues in a debate
between the two candidates in the lead-up to the 2008 US presidential election, finding that
sentiment is a useful measure for identifying controversial moments in a debate. In these studies, Twitter
proved to be a valuable source of data for identifying important topics and the associated public reaction.
Research on opinion mining on YouTube was performed by Jin and Ho [37], discussing how social
media can be utilized to radicalize a person. The idea is that crawling a global
social networking platform such as YouTube has the potential to unearth content and interactions aimed at
radicalizing those with little or no apparent prior interest in violent terrorism. Their work shows that such
an approach is indeed fruitful: they gathered a large dataset from a collection within YouTube that was
recognized as potentially having a radicalizing agenda, and analyzed the data using social network analysis
and sentiment analysis tools.


Pang et al. [38] used a single Naive Bayes classifier on a movie review corpus to achieve similar results as
the previous study. Multiple Naive Bayes models were trained using different features such as part-of-speech
tags, unigrams, and bigrams. They achieved a classification accuracy of 77.3%, which was considered a
high performance for the Naive Bayes classifier on that domain.
Research in sentiment analysis continues, aiming to improve the accuracy and performance of the
proposed techniques, applications, and algorithms and to make them better at capturing meaning and
features. Still, there remain problems and challenges in analyzing the text of reviews/documents and
evaluating sentiment scores.

3.12 Conclusion
In this chapter we talked about Sentiment Analysis and its components. In the next chapter we will
look in more detail at Arabic text classification.

Chapter 4

Arabic Text Classification (ATC)

4.1 Introduction
Text Categorization (TC) is a very important and fast-growing applied research field, because
more and more text documents are available online. Text categorization is a necessity due to the very large
amount of text documents that humans have to deal with daily, and it is used to extract useful information
from large amounts of data.
The main goal of text mining and classification is to extract valuable information from unstructured
textual resources, and to deal with operations such as retrieval, classification and summarization.

4.2 ARABIC TEXT CLASSIFICATION


The Arabic language is the fifth most widely used language in the world. It is spoken by more than 280
million people as a first language and by 250 million as a second language (Nationsonline, 2014; Wikipedia,
2014) (see Figure 4.1). [39]


Figure 4.1: The Top 10 Spoken Languages in the World with their Corresponding Percent 2014

Internet World Stats presents its latest estimates of Internet users by language. The Top Ten Languages
Internet statistics, updated on December 31, 2017, are represented in Figure 4.2 below, where we can observe
that Arabic occupies the fourth rank in the world.

Figure 4.2: The top ten languages on the Internet in millions of users (2017)

Although Arabic is a widespread language, there are relatively few studies on the retrieval/mining of Arabic
text documents in the literature. This is due to the unique nature of Arabic morphological principles,
which give the language a very different and more difficult structure than other languages.


4.3 Characteristics of Arabic Language


Arabic is a very rich language with complex morphology. Arabic language belongs to the family of
Semitic languages. It differs from Latin languages morphologically, syntactically and semantically. The
writing system of Arabic has 25 consonants and three long vowels that are written from right to left and
change shapes according to their position in the word.
In addition, Arabic has short vowels (diacritics) written above and under a consonant to give it its desired
sound and hence give a word a desired meaning.
The Arabic language consists of three types of words: nouns, verbs and particles. Nouns and verbs are
derived from a limited set of about 10,000 roots. Arabic is thus highly derivational: tens or even
hundreds of words can be formed using only one root, and a single word may be derived from
multiple roots. [40]

4.4 The importance of ATC


The importance of ATC comes from the following main reasons:
— Due to historical, geographical and religious reasons, the Arabic language is very rich in documents.
— A study of the world market, commissioned by the Miniwatts Marketing Group, shows that the number
of Arab Internet users in the Middle East and Africa jumped from 2.5 million in 2000 to 32 million
in 2008, and to more than 90 million users in June 2012; the growth of Arab Internet users in the
Middle East region (for the same period, 2000-2012) is about 2,640%, far above the growth of
world Internet users.
— Research has pointed out that 65% of Arabic-speaking Internet users cannot read English pages.
— The big growth of Arabic Internet content in recent years has raised the need for Arabic
language processing tools.

4.5 Challenges of using Arabic language


Arabic is a morphologically rich language; this can be explained by the following points: [29] [41]
— Many people tend to use the dialect of their country instead of MSA, e.g. the dialectal شاف or شوفت
instead of the MSA شاهد.
— Repeating a letter more than once to intensify the meaning or feeling (a common style found in
informal channels), such as "جداااااااااااا", which in MSA is written as "جدا"
and means "very much".


— Arabic has various diacritics; based on the presence or absence of such diacritics, the meaning of
words can be totally different. For example, "teacher" (مُدَرِّسَة) and "school" (مَدْرَسَة)
are both read as the same word when written without diacritics (مدرسة).
— Negation words are used to negate past or present tense verbs, which changes the meaning of the
verb to exactly the opposite, e.g. "لم أعجب بهذا الكتاب" ("I didn't like this book").
— Arabic has a very complex morphology compared to English. For example, to convey the
possessive, a word has the letter "ي" attached to it as a suffix; there is no disjoint
Arabic equivalent of "my".
— Broken plurals are common. Broken plurals are somewhat like irregular English plurals except that
they often do not resemble the singular form as closely as irregular plurals resemble the singular in
English. Because broken plurals do not obey normal morphological rules, they are not handled by
existing stemmers.
— Arabic synonyms are widespread. Arabic is considered one of the richest languages in the world. This
makes exact keyword matching inadequate for Arabic retrieval and classification.
Given these challenges, Arabic text needs a set of pre-processing routines to make it suitable for
building an Arabic text classifier; we will study these routines in the next chapter.

4.6 Research Motivation


Over the years, the amount of electronic text has been growing steadily through organizations and the
Internet. Arabic text documents are among the most common documents and need text mining methods
applied to them just like any other language. Text classification can be applied in important operations
such as real-time sorting of files into folder hierarchies, topic identification, dynamic task-based interests,
automatic meta-data organization, text filtering and document organization for databases and web pages.
There is a lot of research on classifying English text, but for Arabic text it is limited. [42]


4.7 Conclusion
In this chapter we talked about Arabic language classification and it’s characteristics including the im-
portance of Arabic text classification and his Challenges. Next chapter we will present the data collection,
preprocessing steps and frameworks.

Chapter 5

DATASETS AND IMPLEMENTATION


FRAMEWORKS

5.1 Introduction
This chapter describes our datasets and presents the main steps that have to be performed to carry
out sentiment classification, namely pre-processing and feature extraction, which allow us to use machine
learning methods.

5.2 Data Collection


There are many standard datasets for English text classification that are freely available, but for Arabic
text classification there is unfortunately no free standard dataset. Most researchers in the field of Arabic
text classification collect their own corpus from online web sites. [43]
The project contains three main phases, as shown in Figure 5.1 below, and each phase can be divided
into a sequence of steps:
— Arabic documents collection
— Data pre-processing
— Classification


Figure 5.1: Text classification phases

5.2.1 Arabic corpus collection

In this phase, we collect the dataset that will be used for building and testing the classifier module. We
use the newspaper "El Chorouk online" because it is a daily Algerian newspaper published Saturday to
Thursday, it is the most read, and it was the third most visited website in the MENA (Middle East and
North Africa) region in 2010.
We collected different articles on economic, political and social issues, violence, culture and art, etc. Our
corpus consists of 1633 documents with 63055 tokens; each comment has a sentiment label (polarity):
positive, neutral or negative, with 31392 tokens for negative, 21248 for neutral and 9975 for positive.
NB: there are a few terms we should define:
— Document: This could be a text message, tweet, comment, email, book, lyrics to a song. This is
equivalent to one row or observation.
— Corpus: a collection of documents. This would be equivalent to a whole data set of rows/observations.
— Token: this is a word or symbols derived from a document through the process of tokenization. For
example the document ’How are you’ would have tokens of ’How’, ’are’, and ’you’.
To collect this corpus we went through three steps: searching, organizing and finally storing:

1. Searching: we searched for articles that have more than 20 comments.

2. Organizing: this was the main step; we read the articles with their comments and judged their
labels: positive, negative or point of view (neutral). Tables 5.1, 5.2 and 5.3 below recall some of
the words that helped us decide the label of a comment.

Positive words

معك حق | النية الحسنة | فأهلا وسهلا | كريم ونظيف

Table 5.1: Some frequent words in positive documents


Negative words

والله حرام عليكم | لعبة قذرة | والرداءة والفساد | قبيح وبشع

Table 5.2: Some frequent words in negative documents

Neutral words

أظن | نعتقد | ممكن | ربما

Table 5.3: Some frequent words in neutral documents

3. Storing: after that we separate the comments of each label into their own file, giving three files.

Finally our data is ready; we convert it to CSV format, and the statistical distribution of our
data is shown in Table 5.4 below.

Sentiment label Number of documents Number of tokens


Positive 453 9975
Negative 760 31392
Neutral 420 21248
Total 1633 63055

Table 5.4: Number of documents and tokens in each label

According to Table 5.4 we obtain the percentages shown in Figure 5.2 below:

Figure 5.2: Standard dataset statistics.


We also observed that there is a difference in comment length between positive, negative and neutral
comments. Figure 5.3 below represents this difference.

Figure 5.3: Comment length

5.2.2 Data Preprocessing

In general, the large amounts of data collected through various sources such as the Internet, surveys and
experiments are full of missing values, noise and distortions.
Pre-processing is crucial in terms of computation time and classifier performance, because noisy data can
slow the learning process and decrease the efficiency of the system in general. Data pre-processing is
therefore a fundamental step to improve the quality of the input data.
In this phase, we follow various intermediate processing steps to arrive at the final text format that will
be used in the final learning step.
Pre-processing includes the following:
— Removal of URLs
Comments frequently contain web links that share some additional information. The content of the
links is not analyzed, hence the address of the link itself does not provide any useful information, and its
elimination reduces the feature size; this is why URLs are removed from the comments.
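As a sketch, URL removal can be done with a regular expression (the exact pattern used in the project is not documented, so this one is an assumption):

import re

def remove_urls(comment):
    # Strip http(s) links and bare www. addresses from a comment.
    return re.sub(r"(https?://\S+|www\.\S+)", "", comment)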
— Tokenization
is the process of breaking up the given text into individual elements called tokens. The tokens may
be words, numbers or punctuation marks. It is a mandatory step before any kind of processing.
A basic tokenizer like the one in NLTK will split our text into sentences and our sentences into typographic
tokens. Table 5.5 below shows an example of tokenization:


input
حراك الله ونسأل الله أن يجمع شمل المسلمين عامة
output
حراك | الله | ونسأل | الله | أن | يجمع | شمل | المسلمين | عامة

Table 5.5: Tokenization example
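A minimal sketch of this step with NLTK could look as follows (the sample comment is illustrative, and the 'punkt' resource is assumed to be installed):

from nltk.tokenize import word_tokenize  # assumes nltk.download('punkt') was run

comment = "نسأل الله أن يجمع شمل المسلمين"  # an illustrative comment
tokens = word_tokenize(comment)
print(tokens)  # e.g. ['نسأل', 'الله', 'أن', 'يجمع', 'شمل', 'المسلمين']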

— Normalization
Arabic information retrieval systems normalize Arabic words to increase retrieval effectiveness.
Normalization of an Arabic word means replacing specific letters within the word with other letters
according to a predefined set of rules, as shown in Table 5.6 below.

Letter | Replacement | Word | Normalized word

أ | ا | أزمة | ازمه
إ | ا | إعلان | اعلان
آ | ا | آلة | اله
ة | ه | معلمة | معلمه
ى | ي | على | علي
ؤ | و | مؤمن | مومن

Table 5.6: Normalization rules for Arabic words
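A small Python function implementing the rules of Table 5.6 might look like this (a sketch, not the project's exact code):

import re

def normalize_arabic(text):
    # Apply the replacement rules of Table 5.6.
    text = re.sub("[أإآ]", "ا", text)  # all alef variants -> bare alef
    text = text.replace("ة", "ه")      # teh marbuta -> heh
    text = text.replace("ى", "ي")      # alef maqsura -> yeh
    text = text.replace("ؤ", "و")      # waw with hamza -> waw
    return text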

In this step other kinds of impurities are also encountered: numeric data, punctuation, extra spaces,
single letters, and noise such as diacritics (Tashdid, Fatha, Damma, ...).
None of these impurities carries any polarity by itself, so they should be removed from the dataset,
as in the example shown in Figure 5.4 below:

Figure 5.4: Remove number, punctuation and noise
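A possible implementation of this cleaning step, as a sketch (the Unicode ranges cover the Arabic diacritics and the tatweel; the project's exact rules are assumed):

import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # Fathatan..Sukun plus tatweel

def remove_noise(text):
    text = DIACRITICS.sub("", text)                 # strip Tashdid, Fatha, Damma, ...
    text = re.sub(r"[0-9\u0660-\u0669]", "", text)  # Western and Arabic-Indic digits
    text = re.sub(r"[^\w\s]", "", text)             # punctuation marks
    return re.sub(r"\s+", " ", text).strip()        # collapse the leftover spaces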

49
5.2. DATA COLLECTION CHAPTER 5. DATASETS AND IMPLEMENTATION FRAMEWORKS

— Stop Words
Stop words are extremely frequent words that are considered valueless as features: pronouns,
conjunctions, prepositions, proper names. Stop words are deemed irrelevant for searching purposes
because they occur frequently in the language for which the indexing engine has been tuned. In order
to save both space and time, these words are dropped at indexing time and then ignored at search
time. Figure 5.5 below provides an example of stop-word removal.

Figure 5.5: Example of stop words
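As a sketch, stop-word removal can rely on NLTK's Arabic stop-word list (assuming a recent NLTK version that ships it, with the 'stopwords' resource downloaded):

from nltk.corpus import stopwords  # assumes nltk.download('stopwords') was run

ARABIC_STOPWORDS = set(stopwords.words("arabic"))

def remove_stopwords(tokens):
    # Keep only the tokens that are not in the Arabic stop-word list.
    return [t for t in tokens if t not in ARABIC_STOPWORDS]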

— Stemming
Stemming is the process of removing some affixes from words, reducing these words to their
roots. After reducing words to their roots, these roots can be used in compression, spell checking,
text searching, and text analysis.
The main goal of a stemmer is to map different forms of the same word to a common representation
called the "stem". Stemming can significantly improve the efficiency of classification by reducing the
number of terms input to the classifier.
Many stemming methods have been developed for the Arabic language. The two most widely used
stemmers are:

1. The root-extraction stemmer developed by Khoja et al., which transforms each surface
Arabic word in the document into its root. It is commonly called heavy stemming.

2. The light stemmer developed by Larkey et al., which removes prefixes and suffixes.

In this project we used a light stemmer that removes only word prefixes, in order not to create non-real
words, since Arabic is highly derivational and tens or even hundreds of words can be formed from a
single stem. Figure 5.6 below shows the result of the stemming phase in our project.


Figure 5.6: Example of stemming phase
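A minimal light-stemming sketch in the spirit of this phase could strip a few common prefixes (the prefix list below is an assumption, not the project's exact list):

# Hypothetical prefix list; longer prefixes are tried first.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]

def light_stem(word):
    for prefix in PREFIXES:
        # Strip the prefix only if enough of the word remains to be a stem.
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            return word[len(prefix):]
    return word

print(light_stem("والكتاب"))  # -> كتاب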

5.2.3 Features Extraction

After pre-processing is completed, the text data requires special preparation before we can start using it
to train the classifiers for predictive modeling.
It needs to be encoded as integers or floating-point values, because most machine learning algorithms
cannot take in raw text; we therefore create a matrix of numerical values to represent our text for use as
input to a machine learning algorithm. This is called feature extraction.
Hence the question: how do we prepare text documents for machine learning? The two most common
ways of doing this are CountVectorizer and TfidfVectorizer, which extract numerical features from the
text content.

5.2.3.1 CountVectorizer

CountVectorizer takes what’s called the Bag of Words approach. The bag-of-words model is a simplifying
representation used in natural language processing (NLP) and information retrieval (IR). In this
model, a text is represented as the bag of its words, disregarding grammar and even word order but keeping
multiplicity.The bag-of-words model is commonly used in methods of document classification.
The Bag of Words model learns a vocabulary from all of the documents, then models each document by
counting the occurrence of each word W and storing it in a matrix X.
For further clarification, let us consider the following example with three sentences:
— S0: "التعلم الآلي هو أحد فروع الذكاء الاصطناعي المهمة"
— S1: "المهمة الأساسية للتعلم الآلي هو استخراج معلومات قيمة من البيانات"
— S2: "يتضمن التعلم الآلي عددا كبيرا من حقول التطبيقات"
To get our bag of words, we count the number of occurrences of each word in each sentence. The feature
vectors for the three sentences above are represented in Table 5.7 below:

  | أحد | استخراج | الآلي | الأساسية | الاصطناعي | البيانات | التطبيقات | التعلم | الذكاء
0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1
1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0
2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0

  | المهمة | حقول | عددا | فروع | قيمة | كبيرا | معلومات | يتضمن
0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0
1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0
2 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1

Table 5.7: Bag of Words representation (rows are the sentences S0-S2, columns the vocabulary words)

We may note that most values in X will be zeros, for this reason we say that bags of words are typically
high-dimensional sparse datasets. We can save a lot of memory by only storing the non-zero parts of the
feature vectors in memory.
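A minimal sketch of this model with scikit-learn, applied to the three sentences above (get_feature_names was the vocabulary accessor in scikit-learn versions of that period; newer releases use get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "التعلم الآلي هو أحد فروع الذكاء الاصطناعي المهمة",
    "المهمة الأساسية للتعلم الآلي هو استخراج معلومات قيمة من البيانات",
    "يتضمن التعلم الآلي عددا كبيرا من حقول التطبيقات",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)     # sparse document-term count matrix

print(vectorizer.get_feature_names())  # learned vocabulary (alphabetical)
print(X.toarray())                     # per-document word counts, as in Table 5.7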

5.2.3.2 TfidfVectorizer

TfidfVectorizer is an alternative to CountVectorizer. It also creates a document term matrix from our
documents. It calculates term frequency-inverse document frequency value for each word (TF-IDF). The
TF-IDF is a statistical measure used to evaluate how important a word is to a document in a collection or
in a corpus.
The TF-IDF is the product of two weights:
— Term frequency: is a weight representing how often a word occurs in a document. If we have several
occurrences of the same word in one document we can expect the TF-IDF to increase.
— Inverse document frequency: another weight representing how common a word is across docu-
ments. If a word is used in many documents then the TF-IDF will decrease.
Let’s see the same example that we used in CountVectorizer but we will use TfidfVectorizer. The following
table 5.8 represents the result.

52
5.2. DATA COLLECTION CHAPTER 5. DATASETS AND IMPLEMENTATION FRAMEWORKS

d‫ا— أﺣ‬r‫ﺨ‬tF‫ا‬ ‫ اﻵﻟﻲ‬TyFAF‫ﻋﻲ اﻷ‬AnW}‫‹ اﻻ‬A‫ﻧ‬Ayb‫‹ اﻟ‬AqybWt‫اﻟ‬ ‫ﻢ‬l‫ﻌ‬t‫اﻟ‬ ‫ء‬A‫اﻟ@ﻛ‬


0 0.435357 0.000000 0.257129 0.000000 0.435357 0.000000 0.000000 0.257129 0.435357
1 0.000000 0.399169 0.235756 0.399169 0.000000 0.399169 0.000000 0.235756 0.000000
2 0.000000 0.000000 0.247433 0.000000 0.000000 0.000000 0.41894 0.247433 0.000000

Tmhm‫اﻟ‬ ‫ل‬wq‫ﺣ‬ ‫دا‬d‫ﻋ‬ ‫وع‬r‫ﻓ‬ Tmy‫ﻗ‬ ‫ا‬ryb‫ﻛ‬ ‹A‫ﻣ‬wl‫ﻣﻌ‬ ‫ﻦ‬mSt‫ﻳ‬


0 0.331100 0.000000 0.000000 0.435357 0.000000 0.000000 0.000000 0.000000
1 0.303578 0.000000 0.000000 0.000000 0.399169 0.000000 0.399169 0.000000
2 0.000000 0.41894 0.41894 0.000000 0.000000 0.41894 0.000000 0.41894

Table 5.8: Example of TfidfVectorizer

NB:
To extract the features we used the scikit-learn library, importing CountVectorizer and TfidfVectorizer
as shown below:

Figure 5.7: Package of features extracted
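A minimal usage sketch of the two vectorizers (parameters are left at their defaults, as the exact settings used in the project are not shown; the docs list stands in for the pre-processed comments):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["معك حق", "قبيح وبشع جدا"]  # stand-ins for the pre-processed comments

count_vec = CountVectorizer()       # raw occurrence counts (bag of words)
tfidf_vec = TfidfVectorizer()       # TF-IDF weighted features

X_counts = count_vec.fit_transform(docs)
X_tfidf = tfidf_vec.fit_transform(docs)  # weights as in Table 5.8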

5.2.4 N-grams

In natural language processing, an n-gram is a contiguous sequence of n items from a given sample of
text or speech. The items can be syllables, letters, or words depending on the application. N-grams are the
basic features of CountVectorizer and TfidfVectorizer.

The n-grams are typically collected from a text or speech corpus. If n = 1, the n-gram is called a
"unigram"; if n = 2, a "bigram"; if n = 3, a "trigram";
if n > 3, the letter n is replaced by its numerical value, as in four-gram, five-gram, etc. So the main
difference lies in the chosen n. The figure below shows an example split of a sentence into n-grams.


Figure 5.8: Split of a phrase into unigrams, bigrams and trigrams.

Now let’s us show the most frequent n-grams unigrams, bigrams and trigrams extracted from our data:

Figure 5.9: Most frequent unigrams bigrams and trigrams
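As a sketch, the ngram_range parameter of the vectorizers controls which n-grams are extracted, and the most frequent ones can be ranked by summing the count matrix (an illustration with made-up comments, not the project's exact code):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["الشعب يريد التغيير", "الشعب يريد العدل"]  # illustrative comments

# ngram_range=(1, 2) extracts unigrams and bigrams together;
# (1, 1) would give unigrams only and (2, 2) bigrams only.
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)

# Rank the n-grams by their total frequency over the corpus.
totals = np.asarray(X.sum(axis=0)).ravel()
top = sorted(zip(vec.get_feature_names(), totals), key=lambda p: -p[1])
print(top[:10])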

5.2.5 Final Data representation

We converted our text data into a matrix of features of integer or floating-point values using the vector
models explained earlier. In these models each comment is represented by a row of the matrix, the
columns represent the existing words, and the values represent the occurrences of each word in the comment.
This gives the final data representation that will be used to create the classifier.

The following figures show the final data representation by CountVectorizer using unigrams, by
CountVectorizer using bigrams, and by TfidfVectorizer using unigrams.


Figure 5.10: Final data representation by CountVectorizer using unigrams

Figure 5.11: Final data representation by CountVectorizer using bigrams

Figure 5.12: Final data representation by TfidfVectorizer using unigrams

5.2.6 Classification

In this step we use different classification algorithms. For each of them, the features extracted from the
training set in the pre-processing phase are fed to the classification algorithm to build the classification
model, which allows us to compute the accuracy of each algorithm and the processing time required to
build its model. We will see the details of this phase in the following chapter.

5.3 Frameworks
Different frameworks are used for this project. All of them are free and open source.

5.3.1 Python

Python is a high-level general-purpose programming language and is very widely used in all types of
disciplines such as general programming, web development, software development, data analysis, machine
learning, etc. Python is used for this project because it is very flexible and easy to use, and it can perform
the same tasks with fewer lines of code than other mainstream programming languages. [44]

5.3.2 Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents
that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and
transformation, numerical simulation, statistical modeling, machine learning and much more.The Notebook
has support for over 40 programming languages, including those popular in Data Science such as Python,
R, Julia and Scala.

5.3.3 Pandas

Pandas is an open-source, BSD-licensed library written for the Python programming language. It
provides a complete set of data analysis tools for Python and is the best competitor of the R programming
language. Operations like reading data frames, reading CSV and Excel files, slicing, indexing, merging,
handling missing data, etc., can be easily performed with Pandas. A most important feature of Pandas
is that it can perform time series analysis. [45]

5.3.4 Matplotlib

matplotlib is a library for making 2D plots of arrays in Python. Although it has its origins in emulating
the MATLAB graphics commands, it is independent of MATLAB and can be used in a Pythonic, object-
oriented way. Although matplotlib is written primarily in pure Python, it makes heavy use of NumPy and
other extension code to provide good performance even for large arrays. [46]

5.3.5 Seaborn

Seaborn is a library for data visualization created using the Python programming language. It is a
high-level library stacked on top of matplotlib. Seaborn is more attractive and informative than matplotlib,
very easy to use, and tightly integrated with NumPy and Pandas. Seaborn and matplotlib can be used
essentially side by side to derive conclusions from datasets. [47]

5.3.6 Scikit-learn

Scikit-learn is a free software library for the Python programming language. It is very easy to use.
Scikit-learn includes all the tools and algorithms needed for most machine learning tasks. It features
various classification, regression and clustering algorithms, including support vector machines, random forests,
gradient boosting and k-means, and is designed to interoperate with the Python numerical and scientific
libraries NumPy and SciPy. [48]

5.3.7 NLTK

The Natural Language Toolkit (NLTK) is a platform used for building programs for text analysis. The
platform was originally released by Steven Bird and Edward Loper in conjunction with a computational
linguistics course at the University of Pennsylvania in 2001.
It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a
suite of text processing libraries for categorizing text, tokenization, stemming, tagging, parsing, and semantic
reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. NLTK is available
for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.
[49]

5.4 Conclusion
In this chapter, we focused on the conceptual aspects of our project. We presented information about
our data collection, explained the steps of data preprocessing, and then presented the tools that we used.
The next chapter will be devoted to the realization of our application.

Chapter 6

RESULTS AND DISCUSSION

6.1 Introduction
This chapter introduces and describes the performance results obtained on our dataset. In our
experiments, after all pre-processing, we tested six machine learning classification methods that are
commonly used in sentiment analysis: Multi-layer Perceptron (an artificial neural network), Multinomial
Naive Bayes, Support Vector Machines, Logistic Regression, K-Nearest Neighbors and Random Forest.

For each classifier we split the dataset into training and testing subsets (80% and 20%) in order to
build the best model and obtain the best performance results. We also measured the computational time
of the training and testing stages, and then decided which model was the best for our data. We tested
all methods with two classes ("positive", "negative") and with three classes ("positive", "negative",
"neutral"). Finally, to achieve an overall comparison, we also tested bag-of-words features using
unigrams, bigrams, and the two together.
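As a sketch, this setup could look as follows in scikit-learn (X and y stand for the feature matrix and labels from chapter 5; hyperparameters are left at library defaults, which is an assumption):

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# X: feature matrix from chapter 5; y: the sentiment labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

classifiers = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Random forest": RandomForestClassifier(),
    "LinearSVC": LinearSVC(),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Multi-layer Perceptron": MLPClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))  # accuracy on the held-out 20%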

6.2 Evaluation metrics of performance


The effectiveness of classification algorithms is usually estimated with metrics such as precision,
recall, F1 score, and accuracy.

To calculate these metrics we use the confusion matrix, which contains the estimated and actual distri-
butions of labels as shown in Figure 6.1 below; each column corresponds to the actual label and each
row corresponds to the estimated (predicted) label.


Figure 6.1: Confusion matrix

— True Positives (TP): the number of sentences that are actually positive and were estimated as
positive.
— True Negatives (TN): the number of sentences that are actually negative and were estimated as
negative.
— False Positives (FP): the number of sentences that are actually negative but were estimated as
positive.
— False Negatives (FN): the number of sentences that are actually positive but were estimated as
negative.

6.2.1 Accuracy

Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted obser-
vations to the total observations. Accuracy is a great measure, but only when we have symmetric datasets;
otherwise we have to look at other parameters to evaluate the performance of our model. It can be estimated
as:

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{6.1} \]

6.2.2 Precision

Precision is the ratio of correctly predicted positive observations to the total predicted positive obser-
vations. High precision relates to the low false positive rate. Precision can be estimated using following
formula:
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{6.2} \]


6.2.3 Recall

Recall is the ratio of correctly predicted positive observations to all the observations in the actual class;
it shows the ability of the classifier to find all positive instances. It uses the formula:

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{6.3} \]

6.2.4 F1 score

F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both false positives
and false negatives into account. F1 score can be calculated using:

\[ \text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6.4} \]

Note:
— F1 score is usually more useful than accuracy, especially if you have an uneven class distribution.
— Accuracy works best if false positives and false negatives have similar cost.
— If the cost of false positives and false negatives are very different, it’s better to look at both Precision
and Recall.
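A sketch of how these metrics can be computed with scikit-learn (the weighted averaging and the clf/X_test/y_test names are assumptions carried over from the setup sketch above):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_pred = clf.predict(X_test)  # clf: a classifier fitted as in the earlier sketch

print(confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
# 'weighted' averaging handles the two- and three-class settings alike.
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall   :", recall_score(y_test, y_pred, average="weighted"))
print("F1 score :", f1_score(y_test, y_pred, average="weighted"))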

6.3 Results and evaluation


This section describes all the results obtained from the study and identifies the best performer
according to the various performance metrics. First, performance was obtained using two classes, with
different feature settings; second, performance was obtained using three classes.

6.3.1 Two Classes: Positive and Negative

In the following experiments we will classify the comments into two classes positive and negative.

First experiment:
The first experiment was conducted on the model trained using unigram features; the result of
the evaluation is shown in Table 6.1 below:


Algorithms used Accuracy Precision Recall F1-score


Multinomial Naive Bayes 84.71% 85% 85% 84%
Random forest 75.61% 76% 76% 76%
LinearSVC 72.31% 74% 72% 73%
KNN 43.80% 72% 44% 35%
Logistic Regression 77.68% 78% 78% 78%
Multi-layer Perceptron 79.33% 79% 79% 79%

Table 6.1: Evaluation of algorithms using unigrams (two classes)

From Table 6.1 it can be seen that the best accuracy achieved is 84.71%, by Multinomial Naive Bayes,
which is also the best on the other metrics. Figure 6.2 below compares the machine learning
classification algorithms in terms of accuracy and the time taken to train on the dataset.

Figure 6.2: Comparison of accuracy and time for first experiment

Second experiment:
The second experiment was performed using bigram features; the results are presented
in Table 6.2.


Algorithms used Accuracy Precision Recall F1-score


Multinomial Naive Bayes 75.20% 76% 75% 73%
Random forest 51.23% 71% 51% 48%
LinearSVC 51.75% 72% 52% 48%
KNN 37.19% 64% 37% 22%
Logistic Regression 74.79% 76% 75% 72%
Multi-layer Perceptron 55.75% 75% 56% 54%

Table 6.2: Evaluation of algorithms using bigrams (two classes)

Figure 6.3 below shows the average accuracy and the average processing time of each of the six
algorithms for the bigram runs. It is clear that NB has the best accuracy as well as the smallest
processing time.

Figure 6.3: Comparison of accuracy and time for second experiment

Third experiment:
This experiment reports the accuracy of the previous algorithms using unigram and bigram features
together; the results are shown in Table 6.3.


Algorithms used Accuracy Precision Recall F1-score


Multinomial Naive Bayes 85.57% 86% 86% 85%
Random forest 70.66% 73% 71% 71%
LinearSVC 70.66% 74% 71% 72%
KNN 40.49% 69% 40% 29%
Logistic Regression 79.75% 80% 80% 80%
Multi-layer Perceptron 83.05% 83% 83% 83%

Table 6.3: Evaluation of algorithms using unigrams and bigrams (two classes)

For Multinomial Naive Bayes the accuracy is 85.57%, which is a bit better than what was obtained
using unigrams or bigrams alone.

Figure 6.4 below presents the accuracy of each classification method that we used, together with its
computation time:

Figure 6.4: Comparison of accuracy and time for third experiment

6.3.1.1 Discussion

Comparing the results of the different methods and feature selections (unigrams, bigrams, and their
combination), it is clear that using bigram features alone gives poor results for all methods. For
Multinomial Naive Bayes the highest accuracy, 85.57%, is reached when using unigrams and bigrams
together; the Multi-layer Perceptron also achieves a very good accuracy of 83.05% but takes a lot of
time. KNN did not show good performance in any experiment, so it is a poor method for our dataset.
Overall, we see that our method combined with unigrams and bigrams, or just unigrams, can give very
good results. Our final result for two-class classification was 85.57%.


6.3.2 Three Classes: Positive, Negative and Neutral

After finishing with two-class classification, we tried to classify the sentiment of comments into three
classes: positive, negative and neutral.
Fourth experiment: The fourth experiment was conducted on the model trained using unigram
features; the result of the evaluation is shown in Table 6.4 below:

Algorithms used Accuracy Precision Recall F1-score


Multinomial Naive Bayes 64.41% 65% 64% 63%
Random forest 62.57% 63% 63% 58%
LinearSVC 60.73% 59% 61% 59%
KNN 37.11% 40% 37% 30%
Logistic Regression 62.88% 61% 63% 61%
Multi-layer Perceptron 63.20% 61% 63% 61%

Table 6.4: Evaluation of algorithms using unigrams (three classes)

In Table 6.4, Multinomial Naive Bayes has the highest classification accuracy and the smallest processing
time. Compared with LinearSVC, Random forest, KNN, Logistic Regression and Multi-layer Perceptron,
Multinomial Naive Bayes is the best classifier, with a 64.41% average value.
Figure 6.5 below compares the machine learning classification algorithms in terms of accuracy and
the time taken to train on the dataset.

Figure 6.5: Comparison of accuracy and time for fourth experiment

Fifth experiment: The fifth experiment was performed using bigram features; the
results are presented in Table 6.5.


Algorithms used Accuracy Precision Recall F1-score


Multinomial Naive Bayes 58.58% 60% 59% 56%
Random forest 40.18% 65% 40% 35%
LinearSVC 44.78% 68% 45% 41%
KNN 37.11% 42% 30% 19%
Logistic Regression 58.58% 68% 59% 53%
Multi-layer Perceptron 46.93% 61% 47% 44%

Table 6.5: Evaluation of algorithms using bigrams (three classes)

In this experiment there is a marked reduction in accuracy for all methods. We can see that Multinomial
Naive Bayes and Logistic Regression have the same accuracy, 58.58%, but looking at the F1 score,
Multinomial Naive Bayes is better than Logistic Regression, so Multinomial Naive Bayes is the best one.
Figure 6.6 below compares the machine learning classification algorithms in terms of accuracy
and the time taken to train on the dataset.

Figure 6.6: Comparison of accuracy and time for fifth experiment

Sixth experiment:
This experiment reports the accuracy of the previous algorithms using unigram and bigram features
together; the results are shown in Table 6.6.


Algorithms used Accuracy Precision Recall F1-score


Multinomial Naive Bayes 65.64% 67% 66% 64%
Random forest 60.42% 64% 60% 58%
LinearSVC 61.34% 61% 61% 59%
KNN 31.90% 42% 32% 23%
Logistic Regression 62.88% 61% 62% 59%
Multi-layer Perceptron 63.19% 68% 67% 63%

Table 6.6: Evaluation of algorithms using unigrams and bigrams (three classes)

For Multinomial Naive Bayes, it is clear that it obtained its highest value compared to the previous cases.
Figure 6.7 below compares the machine learning classification algorithms in terms of accuracy and the
time taken to train on the dataset.

Figure 6.7: Comparison of accuracy and time for sixth experiment

If we compare accuracy only, we find that Multinomial Naive Bayes and Multi-layer Perceptron have
the highest values, close to each other; but if we take the computation time into consideration, Multinomial
Naive Bayes is the fastest, with an accuracy of 65.64%.

6.3.2.1 Discussion

Classifying sentiment into three classes is more difficult than two-class classification, so the accuracy
is not as good as for two classes; this is perfectly normal and was expected, because more classes require
a larger dataset, and with more data we could achieve better results. The same remark holds for the last
three experiments as for the first three: selecting unigrams and bigrams together, or just unigrams, gives
good results compared to selecting bigrams alone. Our final result for three-class classification was 65.64%,
obtained with Multinomial Naive Bayes.


6.4 Conclusion
This chapter presented the results of the experiments conducted using six classifiers. It can be observed
that every algorithm has an intrinsic capacity to outperform the others depending on the situation.
The Multinomial Naive Bayes approach gave quite good results in both time and accuracy for our dataset.
Figure 6.8 below presents some new examples categorized by Multinomial Naive Bayes.

Figure 6.8: Examples of classification with Multinomial Naive Bayes
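A sketch of how such new comments would be classified with the trained model (the names vectorizer and clf refer to the hypothetical objects from the earlier sketches; the predicted labels are illustrative):

# Classify new, unseen comments with the trained model.
new_comments = ["معك حق", "قبيح وبشع"]
X_new = vectorizer.transform(new_comments)  # reuse the vectorizer fitted on training data
print(clf.predict(X_new))                   # e.g. ['positive' 'negative']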

Chapter 7

CONCLUSION AND FUTURE WORK

7.1 Conclusion
The goal of machine learning is to turn data into information based on past experience and build deci-
sion systems that can act on that information. This goal has attracted interest from various domains, and
today, machine learning solutions have become indispensable tools in many fields of science, business and
engineering.

The primary objective of this thesis was to create a new dataset for the classification of Arabic sentiment
comments and to compare algorithms with different performance metrics using machine learning, in order
to see which algorithm is better for our dataset.

We chose the Arabic language because this work could serve as a practical guide for future annotation
projects, and the corpus will be available to the research community.

To achieve this research objective, six machine learning algorithms were tested: Support Vector
Machines, Multinomial Naïve Bayes, Random Forest, Logistic Regression and others. Every algorithm
performed better in some situations and worse in others, but Multinomial Naïve Bayes is the model most
likely to work best on our dataset in this study.

The best accuracy achieved was 85.57% for two-class classification and 65.64%
for three-class classification.


7.2 Future work


Future work will involve investigating other approaches for preprocessing comments to achieve
higher accuracy, precision, etc. Several directions can be pursued:
— We would like to use deep learning algorithms, because a CNN can perform better than a Naive
Bayes classifier, but it requires solid computational resources and a large amount of training samples.
— Comments may contain a lot of spelling mistakes; hence, a spelling corrector could be applied to
exclude typos.
— Moreover, it would be interesting to extend our data to more classes while keeping good accuracy.

Bibliography

[1] Gobinda G. Chowdhury. Natural Language Processing. PhD thesis, Dept of Computer and Information
Sciences University of Strathclyde, Glasgow G1 1XH, UK.

[2] Natural language processing. Technical report, Natural Language Processing RSS. N.p., n.d. Web, 2017.

[3] Natural language processing. Copyright c Ann Copestake, 2003–2004.

[4] Designing Machine Learning Systems with Python. Packt Publishing, 2016. ISBN
1785882953,9781785882951.

[5] Arthur L Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research
and Development, 1960.

[6] Machine Learning. McGraw-Hill series in computer science, 1997. ISBN ISBN
9780070428072,0070428077.

[7] Stephen Marshland. Machine learning : an algorithmic perspective. Chapman and Hall/CRC machine
learning and pattern recognition series. CRC Press, 2015.

[8] L. P. Kaelbling, M. Littman, and A. Moore. Reinforcement learning: A survey. Journal of Artificial
Intelligence Research, 1996.

[9] Eihab Bashier Mohammed Bashier Mohssen Mohammed, Muhammad Badruddin Khan. Machine learn-
ing: Algorithms and applications. CRC Press Reference, 2016.

[10] Savan Patel. Computer engineer. Technical report, Just started exploring machine learning, 2017.

[11] Marref Nadia. Apprentissage incremental and machines a vecteurs supports. Master’s thesis, Universite
HADJ LAKHDAR BATNA, 2013.

[12] The Top Ten Algorithms in Data Mining. data mining and knowledge discovery series. CRC Press,
2009. ISBN 1 edition , ISBN 9781420089646,1420089641.

[13] Xiaojin Zhu. Advanced natural language processing-support vector machines. Spring, 2010.


[14] Abe Sh. Support vector machines for pattern classification. 2005.

[15] A basic introduction to neural networks. Technical report, Computer Science Department Darmouth
College.

[16] Kevin Gurney. An introduction to neural networks. University of Sheffield UCL Press, 1997.

[17] Anish Singh Walia. A data nerd in deep love with machine learning. Statistics Data Science, 2017.

[18] Isaac Changhau. Activation Functions in Artificial Neural Networks. PhD thesis, 2017.

[19] Paul King. Computational neuroscientist. Technical report, Data Scientist, Technology Entrepreneur,
2016.

[20] Handbook of Natural Language Processing, chapter Sentiment Analysis and Subjectivity. New York,
NY, USA„ 2009.

[21] S. Lawrence K. Dave and D. M. Pennock. Mining the peanut gallery: Opinion extraction and semantic
classification of product reviews. In in Proceedings of the 12th international conference on World Wide
Web.

[22] Bing Liu. Sentiment Analysis and Opinion Mining. PhD thesis, University of Toronto.

[23] Advances in The Human Side of Service Engineering. 5th International Conference on Applied Human
Factors and Ergonomics, Volume Set,Proceedings of the 5th AHEE Conference, 2014.

[24] Mika V. Mäntylä, Daniel Graziotin, and Miikka Kuutila. The evolution of sentiment analysis: A review
of research topics, venues, and top cited papers.

[25] R. Stagner. The cross-out technique as a method in public opinion analysis. The Journal of Social
Psychology, 1940.

[26] Anchal Kathuria et al. A novel review of various sentimental analysis techniques. International Journal
of Computer Science and Mobile Computing, 2017.

[27] R. Nithya and D. Maheswari. Sentiment analysis on unstructured review. In International Conference
on Intelligent Computing Applications.

[28] H. Dalal C. Bhadane and H. Doshi. Sentiment analysis: Measuring opinions. Procedia Comput, 2015.

[29] Sally Rushaidat Raed Marji, Narmeen Sha’ban. Sentiment analysis in arabic tweets. Irbid 22110, 2014.

[30] B. Liu. Sentiment analysis and subjectivity. Handb. Nat. Lang . Process, 2010.

[31] L.-W. Ku, Y.-T. Liang, and H.-H. Chen. Opinion extraction, summarization and tracking
in news and blog corpora. 2006.


[32] Ronen Feldman. Introduction to sentiment analysis. Technical report, Based on slides from Bing Liu.

[33] A Framework and practical implementation for sentiment analysis and aspect exploration. PhD thesis,
Alliance Manchester Business School, 2016.

[34] W. Kasper and M. Vela. Sentiment analysis for hotel reviews.

[35] David A. Shamma, Lyndon Kennedy, and Elizabeth F. Churchill. Tweet the debates: Understanding
community annotation of uncollected sources. In Proceedings of the First SIGMM Workshop on Social Media, 2009.

[36] Nicholas A. Diakopoulos and David A. Shamma. Characterizing debate performance via aggregated
twitter sentimen. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems,
2010.

[37] B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based
on minimum cuts. Association for Computational Linguistics (ACL), 2004.

[38] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Sentiment classification using machine learning
techniques.

[39] Tari Brahimi B, Touahria M. Data and text mining techniques for classifying arabic tweet polarity.
Journal of Digital Information Management, 2016.

[40] Aitao C. Building an arabic stemmer for information retrieval.

[41] A. W. Saad M. Arabic text classification using decision trees. 2010.

[42] Mohammed N. Azarah. Arabic Text Classification Using Learning Vector Quantization. PhD thesis,
The Islamic University – Gaza Denary of Higher Studies Faculty of Information Technology, 2012.

[43] M.S. Khorsheed and A.O Al-Thubaity. Comparative evaluation of text classification techniques using a
large diverse arabic dataset. Springer Science + Business Media Dordrecht, 2013.

[44] Python programming documentation. URL https://www.python.org/about/.

[45] Pandas documentation. URL http://pandas.pydata.org.

[46] Matplotlib Release 2.2.2. 2018.

[47] Michael Waskom. An introduction to seaborn. URL http://seaborn.pydata.org/introduction.html.

[48] Scikit-learn user guide Release 0.18.2. 2017.

[49] Nltk 3.2.5 documentation. Technical report, NLTK project, 2017. URL http://www.nltk.org.
