Sentiment Analysis of Arabic Comments by Readers of Online Newspapers

Aberkane Rania

UNIVERSITY FERHAT ABBAS SETIF 1

Faculty of Sciences
Department of Computer Science

Master's thesis


Supervised by: Dr. Sadik Bessou

June 16, 2018

Aberkane Rania Thesis of master June 16, 2018 1 / 27


Table of Contents

1 Introduction

2 Theoretical background

3 Preprocessing

4 Implementation

5 Results

6 Conclusion


Introduction

The rapid growth of the Internet and computer technologies has produced billions of electronic text
documents that are created, edited, and stored digitally. As a result, manual procedures for text
classification have become laborious, time-consuming, and potentially unreliable, so automatic
techniques are needed to assign texts to categories.

Text classification has drawn increasing attention recently and has been applied in domains
including web mining, opinion mining, and sentiment analysis.


Sentiment analysis

Today, sentiment analysis is one of the fastest-growing research areas in computer science. So what is
sentiment analysis?
Definition
Sentiment analysis is a set of techniques for extracting subjective information and analyzing the
sentiments of people interacting online through social channels such as Facebook, Twitter,
Instagram, comment sections, and other social networking sites.

People's opinions can be positive, negative, or neutral towards a particular product, brand, service,
or campaign run by a particular company or organization.


Sentiment analysis applications

Sentiment analysis has a profound impact on all topics affected by people's opinions.

Business: to measure sales and improve marketing strategies.

Political campaigns: for example, the Obama administration used sentiment analysis in the 2012
presidential election.

Sport: understanding crowd sentiment is important for adapting sports strategy accordingly.


Evolution of Sentiment analysis

Given the importance of sentiment analysis, the number of papers on the topic is increasing rapidly,
as can be observed from the figure.


Why did we choose Arabic language ?

There is a lot of research on classifying English-language text, but work on Arabic-language text
is limited, even though Arabic is a widely used language. The number of Arab Internet users jumped
from 2.5 million in 2000 to more than 90 million in 2012, and by 2017 Arabic ranked fourth among
languages on the Internet.

The rapid growth of Arabic Internet content in recent years has raised the need for Arabic language
processing tools. So, how can we classify Arabic text, and what tools can help us do that?


Machine learning

What is machine learning?

How can a machine learn?


Machine learning is the ability of a computer to learn from experience. Experience is usually given in
the form of input data. By examining this data, the computer can find dependencies that are too
complex for a human to identify.


What are the processes of Machine Learning ?

There are five basic steps in a machine learning task:


1 Collecting data: This step forms the foundation of future learning. The better the quality and
quantity of relevant data, the better the learning prospects for the machine.
2 Preparing the data: Any analytical process depends on the quality of the data used.
3 Training a model: The cleaned data is split into two parts, train and test. We used the
training data to develop the model.
4 Evaluating the model: This step tests accuracy. We used the test data, because the best check of
a model's accuracy is its performance on data that was not used at all while building the model.
5 Improving the performance: This step involves choosing a different model altogether or
introducing more variables to improve efficiency.
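
Steps 3 and 4 can be illustrated with a minimal sketch. It assumes a simple hold-out split and uses a toy majority-class "model" only to show where training and evaluation happen; the function names, the 80/20 ratio, and the seed are illustrative assumptions, not the setup used in the thesis.

```python
import random
from collections import Counter


def train_test_split(samples, test_ratio=0.2, seed=0):
    """Shuffle (text, label) pairs and split them into train and test parts."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def train_majority(train):
    """Toy 'model': remember the most frequent label in the training data."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(predicted_label, test):
    """Share of held-out samples whose label matches the prediction."""
    correct = sum(1 for _, label in test if label == predicted_label)
    return correct / len(test)
```

Because the test part is never seen during training, the accuracy it yields is an estimate of performance on new comments rather than a measure of memorization.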


Data collection

We collected our data set from the online newspaper El Chorouk, because most researchers in the
field of Arabic text classification collect their own corpus.

The figure represents the statistical distribution of our data set.


Data Preprocessing

Before we can use the collected data, we need some preprocessing to remove unnecessary information:

Removal of URLs.
Tokenization.
Normalization (replacing specific letters, numeric data, punctuation, spaces, and single letters).
Stop-word removal: pronouns, conjunctions, prepositions, and names.
Stemming.
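
The steps above could be sketched as follows, using only the Python standard library. This is an illustration under stated assumptions, not the thesis code: the stop-word list is a tiny hypothetical sample, the "stemmer" only strips a leading definite article, the letter replacements are common Arabic normalization choices, and normalization is applied before whitespace tokenization for simplicity.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real list is much longer.
STOP_WORDS = {"في", "من", "هذا", "هو", "عن"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0640]")  # tashkeel marks and tatweel
NON_ARABIC_RE = re.compile(r"[^\u0621-\u064A\s]")     # digits, punctuation, Latin text


def normalize(text):
    """Unify letter variants and replace non-letter characters with spaces."""
    text = DIACRITICS_RE.sub("", text)
    text = re.sub("[\u0623\u0625\u0622]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0629", "\u0647")                # ta marbuta -> ha
    text = text.replace("\u0649", "\u064A")                # alef maksura -> yeh
    return NON_ARABIC_RE.sub(" ", text)


def light_stem(token):
    """Toy stemmer (assumption): strip a leading definite article 'al'."""
    if token.startswith("\u0627\u0644") and len(token) > 4:
        return token[2:]
    return token


def preprocess(comment):
    """URL removal, normalization, tokenization, stop-word removal, stemming."""
    text = URL_RE.sub(" ", comment)
    text = normalize(text)
    tokens = text.split()
    # Drop single letters and stop words.
    tokens = [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]
    return [light_stem(t) for t in tokens]
```

A full root-based Arabic stemmer would replace `light_stem` in practice; the toy version is kept only so the sketch stays self-contained.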


Prepare Data For Machine Learning

After preprocessing is complete, the following question comes to mind: how do we prepare data for
machine learning?

Machine learning algorithms cannot work with raw text directly; the text must be converted into
numbers. It should be represented like this:

             Feature 1   Feature 2   ...   Feature N   Label
Document 1   F11         F12         ...   F1N         Yes
Document 2   F21         F22         ...   F2N         No
...          ...         ...         ...   ...         ...
Document M   FM1         FM2         ...   FMN         No

So, how can we transform the data into numbers, and what can the features represent? That is what we
will discover in the next slides.


The bag-of-words

A bag-of-words model describes the occurrence of words in a document. It is a way of extracting
features from text for use in machine learning models.

Example: steps of bag-of-words

1 Learning a vocabulary from all of the documents.
2 Counting the occurrences of each word.
3 Storing the counts in a matrix.

To implement this in scikit-learn we use CountVectorizer or TfidfVectorizer.
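
To make the three steps concrete, here is a minimal pure-Python sketch. It is not the scikit-learn implementation; the function name `bag_of_words` and the whitespace tokenization are assumptions made for illustration.

```python
def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[d][w] counts word w in document d."""
    tokenized = [doc.split() for doc in documents]

    # Step 1: learn a vocabulary from all of the documents.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    index = {word: i for i, word in enumerate(vocabulary)}

    # Steps 2 and 3: count each word and store the counts row by row.
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for tok in doc:
            row[index[tok]] += 1
        matrix.append(row)
    return vocabulary, matrix
```

Each row of the matrix is one document and each column one vocabulary word, which matches the document-by-feature layout the learning algorithms expect.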


Features

N-grams are the basic features of a bag-of-words model. Features can be single words (unigrams),
pairs of consecutive words (bigrams), or sequences of three consecutive words (trigrams).

Example:
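
A minimal sketch of n-gram extraction from an already tokenized comment follows. The helper names `ngrams` and `unigrams_and_bigrams` are hypothetical and chosen only for this illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigrams_and_bigrams(tokens):
    """Combine single words and word pairs into one feature list."""
    return ngrams(tokens, 1) + ngrams(tokens, 2)
```

A text shorter than n tokens simply yields no n-grams of that size, so short comments contribute unigram features only.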


Machine Learning Algorithms

Now our data is ready for machine learning algorithms.

We tested six machine learning classification methods that are commonly used in sentiment analysis:

Multinomial Naive Bayes.
Support Vector Machines (SVM).
Logistic Regression.
K-Nearest Neighbors (KNN).
Random Forest.
Artificial Neural Networks.

Each algorithm goes through the same steps to build the classifier.
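
To show what one of these classifiers does with word counts, here is a small pure-Python sketch of Multinomial Naive Bayes with add-one (Laplace) smoothing. It is a textbook formulation written for illustration, not the library implementation used in the experiments; the class name, the `alpha` parameter, and the choice to skip unseen words are assumptions.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Textbook Multinomial Naive Bayes over lists of tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (1.0 = add-one smoothing)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_label / n_docs)
            for w in tokens:
                if w not in self.vocab:
                    continue  # words never seen in training are skipped
                count = self.word_counts[label][w]
                score += math.log(
                    (count + self.alpha) / (total + self.alpha * len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that is absent from one class from zeroing out that class entirely.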


Sentiment Analysis Best Practices:


Two Classes: Positive and Negative

Algorithms                Unigrams features   Bigrams features   Unigrams and Bigrams
Multinomial Naive Bayes   84.71%              75.20%             85.57%
Random forest             75.61%              51.23%             70.66%
LinearSVC                 72.31%              51.75%             70.66%
KNN                       43.80%              37.19%             40.49%
Logistic Regression       77.68%              74.79%             79.75%
Multi-layer Perceptron    79.33%              55.75%             83.05%


Three Classes : Positive, Negative and Neutral

Algorithms                Unigrams features   Bigrams features   Unigrams and Bigrams
Multinomial Naive Bayes   64.41%              58.58%             65.64%
Random forest             62.57%              40.18%             60.42%
LinearSVC                 60.73%              44.78%             61.34%
KNN                       37.11%              37.11%             31.90%
Logistic Regression       62.88%              58.58%             62.88%
Multi-layer Perceptron    63.20%              46.93%             63.19%


Conclusion

Multinomial Naive Bayes is the model most likely to work best on our dataset.

We tested different features and found that combining unigrams and bigrams works best.

By monitoring attitudes and opinions about any topic, we are able to detect shifts in opinion and
adapt readily to meet changing needs.


Future work

We would like to develop our data set for different categorizations with good accuracy.

We would like to extend the classifier to work with Arabic.

We would like to work on speech.
