Memoir Presentation Tex

Sentiment Analysis of Arabic Comments by Readers
of Online Newspapers
Aberkane Rania
UNIVERSITY FERHAT ABBAS SETIF 1
Faculty of Sciences
Department of Computer Science
Master's thesis

Supervised by: Dr. Sadik Bessou
June 16, 2018
Aberkane Rania Thesis of master June 16, 2018 1 / 27

Table of Contents
1 Introduction
2 Theoretical background
3 Preprocessing
4 Implementation
5 Results
6 Conclusion

Introduction
The rapid growth of the internet and computer technologies has produced billions of electronic text
documents that are created, edited, and stored digitally. As a result, manual procedures for text
classification have become laborious, time-consuming, and potentially unreliable, so automatic
techniques are needed to assign texts to categories.
Text classification has drawn more and more attention recently and has been applied in different
domains, including web mining, opinion mining, and sentiment analysis.

Sentiment analysis
Today, sentiment analysis is one of the fastest growing research areas in computer science. So what is
sentiment analysis?
Definition
Sentiment analysis is a set of techniques that help to extract subjective information and to analyze
the sentiments of people interacting online through social channels such as Facebook, Twitter,
Instagram, comment sections, and other social networking sites.
People's opinions can be positive, negative, or neutral towards a particular product, brand, service,
or campaign of a particular company or organization.

Sentiment analysis applications
Sentiment analysis has a profound impact on all topics affected by people's opinions.
Business: To measure sales and improve marketing strategies.
Political campaigns: For example, the Obama administration used sentiment analysis in the 2012
presidential election.
Sport: It is important to understand crowd sentiment and to adapt sports strategy accordingly.

Evolution of Sentiment analysis
Given the importance of sentiment analysis, the number of papers on the topic is increasing rapidly,
as the figure shows.

Why did we choose the Arabic language?
There is a great deal of research on classifying English text, but research on Arabic text is limited,
even though Arabic is a widely used language. The number of Arab Internet users jumped from 2.5 million
in 2000 to more than 90 million in 2012, and by 2017 Arabic had reached fourth rank.
The rapid growth of Arabic internet content in recent years has raised the need for Arabic language
processing tools. So, how can we classify Arabic text, and which tools can help us do that?

Machine learning
What is machine learning?
How can a machine learn?

Machine learning is the ability of a computer to learn from experience. Experience is usually given in
the form of input data. By examining this data, the computer can find dependencies in it that are
too complex for a human to identify.

What are the steps of machine learning?
There are 5 basic steps used to perform a machine learning task:

1 Collecting data: This step forms the foundation of the future learning. The better the quality and
quantity of relevant data, the better the learning prospects for the machine.
2 Preparing the data: Any analytical process depends on the quality of the data used.
3 Training a model: The cleaned data is split into two parts, train and test. We used the training
data to develop the model.
4 Evaluating the model: This step tests the accuracy of the model. We used the test data, because the
best way to check a model's accuracy is to measure its performance on data that was not used at all
while building the model.
5 Improving the performance: This step involves choosing a different model altogether or
introducing more variables to improve the results.
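Steps 3 and 4 can be illustrated with a minimal sketch in plain Python. The function names `holdout_split` and `accuracy`, the 20% test ratio, and the fixed seed are assumptions made for this illustration; the slides do not state the split that was actually used.

```python
import random


def holdout_split(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one sample in the test part (assumed ratio, see lead-in).
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)
```

The model is fitted on the first returned part only, and `accuracy` is computed on predictions for the second part, which the model has never seen.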

Data collection
We collected our data set from the online newspaper El Chorouk, because most researchers in the
field of Arabic text classification collect their own corpus.
The figure represents the statistical distribution of our data set.

Data Preprocessing
Before we can use the collected data, we need to do some preprocessing to remove unnecessary information:
Removal of URLs.
Tokenization.
Normalization (replacing specific letters, numeric data, punctuation, spaces, and single letters).
Stop-word removal (such as pronouns, conjunctions, prepositions, and names).
Stemming.
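The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis code: the stop-word list is a tiny hypothetical sample, the letter replacements (alef variants, alef maqsura, ta marbuta) are common choices that the slides do not spell out, normalization is applied before whitespace tokenization for simplicity, and stemming is left out because it needs a dedicated Arabic stemmer.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Arabic diacritics (U+064B..U+0652) and the tatweel character (U+0640).
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0640]")
# Anything that is not an Arabic letter or whitespace (digits, punctuation, Latin).
NON_ARABIC_RE = re.compile(r"[^\u0621-\u064A\s]")


def normalize(text):
    """Replace specific letters and drop digits and punctuation (assumed rules)."""
    text = DIACRITICS_RE.sub("", text)
    text = re.sub("[\u0625\u0623\u0622]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")  # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")  # ta marbuta -> ha
    return NON_ARABIC_RE.sub(" ", text)


# Hypothetical sample list; normalized so it matches normalized tokens.
STOP_WORDS = {normalize(w) for w in ["في", "من", "على", "هذا"]}


def preprocess(comment):
    """Remove URLs, normalize, tokenize, and drop stop words and single letters."""
    text = URL_RE.sub(" ", comment)
    tokens = normalize(text).split()
    return [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]
```

Calling `preprocess` on a raw comment returns a list of cleaned tokens that can be joined back into a string before feature extraction.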

Prepare Data For Machine Learning
After preprocessing is completed, the following question comes to mind: how do we prepare data for
machine learning?
Machine learning cannot work with raw text directly; the text must be converted into numbers. It
should be represented like this:

             Feature 1   Feature 2   ...   Feature N   Label
Document 1   F11         F12         ...   F1N         Yes
Document 2   F21         F22         ...   F2N         No
...          ...         ...         ...   ...         ...
Document M   FM1         FM2         ...   FMN         No

So, how can we transform the data into numbers, and what can the features represent? That is what we
will discover in the next slides.

The bag-of-words
A bag-of-words model describes the occurrence of words in a document. It is a way of extracting
features from text for use in machine learning models.
Steps of building a bag-of-words:
1 Learning a vocabulary from all of the documents.
2 Counting the occurrences of each word.
3 Storing the counts in a matrix.
To do this in scikit-learn we use CountVectorizer or TfidfVectorizer.
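The three steps can be sketched with scikit-learn's CountVectorizer. The documents below are made-up English placeholders chosen so the counts are easy to read; they are not taken from the thesis data set, and the default tokenizer settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy placeholder documents (not the thesis data).
documents = ["good news", "bad news", "good good news"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and counts the words in one call.
matrix = vectorizer.fit_transform(documents)

# Vocabulary ordered by column index in the matrix.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
# One row per document, one column per vocabulary word.
counts = matrix.toarray().tolist()

print(vocabulary)
print(counts)
```

Each row of `counts` is the feature vector of one document. Swapping CountVectorizer for TfidfVectorizer keeps the same interface but stores weighted values instead of raw counts.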

Features
N-grams are the basic features of a bag-of-words. Features can be single words (unigrams), pairs of
consecutive words (bigrams), or triples of consecutive words (trigrams).
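As a small illustration of how n-grams are formed from a tokenized comment, consider the sketch below. The helper name `ngrams` and the sample tokens are hypothetical.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["the", "news", "is", "good"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Here `bigrams` holds "the news", "news is", and "is good", while `trigrams` holds "the news is" and "news is good".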

Machine Learning Algorithms
Now our data is ready for machine learning algorithms.
We tested six machine learning classification methods that are commonly used in sentiment analysis:
Multinomial Naive Bayes.
Support Vector Machines (SVM).
Logistic Regression.
K-Nearest Neighbors (KNN).
Random Forest.
Artificial Neural Networks.
Each algorithm goes through the same steps to build the classifier.
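A minimal sketch of such a classifier in scikit-learn chains feature extraction and a model into one pipeline. The training comments and labels are toy placeholders, the helper `build_classifier` is hypothetical, and all hyperparameters are library defaults; none of this is confirmed as the configuration used in the thesis.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy placeholder comments and labels (not the thesis data).
train_texts = ["good news", "great article", "bad news", "terrible article"]
train_labels = ["positive", "positive", "negative", "negative"]


def build_classifier(model, ngram_range):
    """Chain bag-of-words feature extraction and a classifier."""
    return make_pipeline(CountVectorizer(ngram_range=ngram_range), model)


# ngram_range (1, 1) = unigrams, (2, 2) = bigrams, (1, 2) = unigrams and bigrams.
classifiers = {
    "Multinomial Naive Bayes": build_classifier(MultinomialNB(), (1, 2)),
    "Logistic Regression": build_classifier(LogisticRegression(), (1, 2)),
}

predictions = {}
for name, clf in classifiers.items():
    clf.fit(train_texts, train_labels)
    predictions[name] = clf.predict(["good article"])[0]
```

Changing `ngram_range` is enough to switch between the unigram, bigram, and combined feature sets, and any other scikit-learn classifier can be passed as `model` in the same way.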

Sentiment Analysis Best Practices:

Two Classes: Positive and Negative

Algorithm                 Unigrams   Bigrams   Unigrams and Bigrams
Multinomial Naive Bayes   84.71%     75.20%    85.57%
Random Forest             75.61%     51.23%    70.66%
LinearSVC                 72.31%     51.75%    70.66%
KNN                       43.80%     37.19%    40.49%
Logistic Regression       77.68%     74.79%    79.75%
Multi-layer Perceptron    79.33%     55.75%    83.05%

Three Classes: Positive, Negative and Neutral

Algorithm                 Unigrams   Bigrams   Unigrams and Bigrams
Multinomial Naive Bayes   64.41%     58.58%    65.64%
Random Forest             62.57%     40.18%    60.42%
LinearSVC                 60.73%     44.78%    61.34%
KNN                       37.11%     37.11%    31.90%
Logistic Regression       62.88%     58.58%    62.88%
Multi-layer Perceptron    63.20%     46.93%    63.19%

Conclusion
Multinomial Naive Bayes is the model most likely to work best on our dataset.
We tested different features and found that combining unigrams and bigrams works best.
By monitoring attitudes and opinions about any topic, we are able to detect shifts in opinion and
adapt readily to meet changing needs.

Future work
We would like to develop our data set to support different categorizations with good accuracy.
We would like to extend the classifier to work with Arabic.
We would like to work on speech.

