
Amharic Political Sentiment Analysis Using Deep

Learning Approaches
Fikirte Alemayehu
Haramaya University
Million Meshesha
Addis Ababa University
Jemal Abate (abatejemal@gmail.com)
Haramaya University

Article

Keywords:

Posted Date: June 27th, 2023

DOI: https://doi.org/10.21203/rs.3.rs-3060010/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract
This research aims to develop a sentiment analysis system specifically designed for the Amharic
language. The study employs four deep learning algorithms to achieve this goal: Convolutional Neural
Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), Gated Recurrent Unit (GRU), and a
combination of CNN and BiLSTM. The CNN algorithm is utilized for its effectiveness in extracting
relevant features from the input data. By applying filters and pooling operations, the CNN can identify
important patterns and structures within the Amharic text. The BiLSTM algorithm is chosen for its ability
to process sequential information by considering both past and future contexts. It incorporates a memory
cell that enables the model to retain important information and understand the dependencies between
different parts of the text. Additionally, the GRU algorithm is employed as it offers similar capabilities to
BiLSTM but with fewer computational requirements. This allows for more efficient processing without
sacrificing performance. The experimental results obtained from the sentiment analysis system indicate
that the combination of CNN and BiLSTM yields promising outcomes. The system achieved an accuracy
rate of 91.60%, demonstrating its ability to correctly classify sentiments expressed in Amharic text.
Furthermore, the precision rate of 90.47% indicates a high level of accuracy in identifying positive and
negative sentiments, while the recall rate of 93.91% suggests that the system effectively captures
relevant sentiment instances. In summary, this study successfully designs a sentiment analysis system
specifically tailored for the Amharic language. By leveraging the capabilities of deep learning algorithms
such as CNN, BiLSTM, and GRU, the system demonstrates strong performance in accurately classifying
sentiments expressed in Amharic text, as evidenced by the achieved accuracy, precision, and recall rates.

1. INTRODUCTION
In 2020, more than 3.6 billion people worldwide used social media, a 7% increase from the previous January. Instances of social media interaction include comments, likes, and shares that express people's opinions. This enormous amount of unstructured data gives data scientists and information scientists the ability to look at social interactions at an unprecedented scale and level of detail. Analyzing and evaluating this information becomes more complicated as the number of people using social networking sites grows. For example, Facebook, Instagram, e-commerce websites, and blogs improve customer satisfaction and the overall shopping experience by allowing customers to rate or comment on the products they have purchased or are planning to purchase.

Sentiment analysis, also known as opinion mining, is the study of people's attitudes and sentiments toward entities such as products and services, as well as their attributes. Because of the high volume of data and comments they host, social media platforms and sites are undoubtedly a valuable resource for sentiment analysis and opinion mining. In this research, the term opinion mining is used interchangeably with sentiment analysis. Social media sites such as Facebook, Twitter, and YouTube have been used to assist in a country's political reform process. The public has used social media to inform, mobilize, and rebel against undemocratic, corrupt governments. The ruling government and activists, in turn, use it not only to provide information but also to get public feedback on the political situation.

One instance that illustrates the effectiveness of social media involvement in politics is the 2008 United States election of former President Barack Obama, who utilized social media to engage with the younger generation, hear their thoughts, and address their concerns, which contributed to his election win.

In 2015, the increased use of social media in Ethiopia for political reform marked the end of nearly 27 years of oppression and autocratic governance. In Ethiopia, Facebook is more popular than any other social media platform; it is used by 56.94 percent of social media users. Social media sites nowadays contain a large amount of Amharic opinion, either as reviews or comments, which can be difficult to comprehend when the content is long or diverse. The general objective of this study is to construct a deep learning sentiment analysis model for Amharic political sentiment.

2. RELATED WORKS
Sentiment analysis is the practice of classifying sentiments into predefined categories such as positive or negative. Various studies have been conducted on sentiment analysis locally in different domains, such as entertainment analysis using machine learning, aspect-level sentiment classification of reviews collected from different social media sites, and deep learning for Amharic sentiment classification.

Ghoroghi et al. [1] proposed an integrated architecture of a Convolutional Neural Network and a Long Short-Term Memory network to identify the polarity of words on the Google cloud, performing the computations on Google Colaboratory. The first deep learning model for the Amharic language was designed by [2]. The model showed better performance than previous models, but it was built using standard LSTMs that consider only past data, which do not provide full context, and combinations of different deep learning models were not considered.

From the related works, sentiment analysis has come a long way: from manually categorizing sentiments, to rule-based methods, then machine learning, and now the state of the art, deep learning. Previous work shows that combining CNN with Bi-LSTM gives better accuracy compared to the alternatives. Amharic is a morphologically rich language with a complex structure, which makes it hard to analyze for NLP tasks, but researchers have faced these problems and moved from lexicon-based methods to machine learning and now to deep learning.

Currently, due to social media, the web is filled with people's opinions that can be used for better understanding, decision making, product improvement, and improvements in many other areas. Sentiment analysis helps analyze the vast number of opinions on the web. Prior research has provided a variety of solutions for different types of sentiment analysis tasks using approaches such as supervised and unsupervised methods. A supervised machine learning approach needs a labelled dataset for training the model and testing its performance. Earlier works used a variety of supervised machine learning algorithms together with manual feature extraction. Since then, deep learning has developed into a strong machine learning technique that has achieved state-of-the-art performance in a wide range of applications, such as computer vision, speech recognition, and natural language processing, and deep learning for sentiment analysis has lately gained tremendous popularity. Deep learning for the Amharic language is gaining interest from the research community, but there is still much work to be done.

3. RESEARCH METHODOLOGY
3.1. Research Design
This study implemented an experimental research method. Experimental research design is a scientific method of investigation in which one or more independent variables are altered and applied to one or more dependent variables to determine their impact on the latter. In experimental research, one needs to define an experimental setup, such as how many trials to run and which parameters, weights, methodologies, and datasets to employ. Three major steps were followed to conduct the experimental research: sentiment data collection and preparation, selection of implementation tools for the design and development of a sentiment analysis model, and evaluation of the proposed model.

3.2. Data Collection and Preparation


To execute both training and testing tasks for deep learning in NLP, such as sentiment analysis, annotated datasets are required. Amharic is an under-resourced language that lacks standardized and well-organized datasets for textual research, especially for sentiment analysis, which needs a collection of people's opinions on a given area of interest.

For this research, 5,000 comments focusing on the current political situation in the country were manually collected from different sources with the help of a linguistic expert. During data collection, in cooperation with the linguistic expert, sentiment lexicons provided by previous researchers, Selama and Tulu, were used as criteria for selecting comments as well as for labelling them as positive or negative.

3.3. Implementation Tools


For this research, deep learning models were selected based on the capabilities they offer for Amharic sentiment analysis. Among the criteria considered: CNN is known for its feature extraction capabilities, while Bi-LSTM preserves long-term dependencies, mitigates the vanishing gradient problem, and can capture different contexts.

There are many open-source deep learning development tools for natural language processing that can be easily customized for diverse applications. For this research, Python version 3.8.8 was used for preprocessing, data analysis, building and evaluating the model, and data visualization via the available libraries.

3.4. Implemented Deep Learning Approaches

There are different deep learning models available for sentiment classification. For this research, the models were chosen based on the capabilities they offer for Amharic sentiment analysis. The criteria considered include feature extraction capability, preservation of long-term dependencies, robustness to the vanishing gradient problem, the ability to capture the different contexts of the Amharic language, and models with fewer parameters and faster convergence [3][4][5].

3.4.1. Convolutional Neural Network (CNN)


Convolutional Neural Network (CNN) models use convolutional and pooling layers to extract high-level features. For this research, a 1D CNN, which treats a sentence as a one-dimensional array of word vectors, analogous to pixels in an image, was employed. The CNN is used to find hidden connections between words in the nearby region, and it is recognized for extracting features accurately while minimizing the number of input features. Figure 1 presents the architecture of the CNN model used for text classification.

The CNN model was built by applying the following steps [5]. First, embedded words are fed into the convolutional layer, which extracts features; the pooling layer then performs dimensionality reduction on the features extracted by the previous layer; finally, the combined features are passed into the fully connected layer, where the output is determined by a Sigmoid function that maps it onto the two classes (i.e., positive and negative).

3.4.2. Bidirectional-LSTM
Over long sequences, parts of the gradient vector may exponentially grow or decay, making it challenging for an RNN to capture long-term dependencies. The LSTM design overcomes the simple RNN's difficulty in learning long-term dependencies by incorporating a memory cell that can hold a state over a long period. Bidirectional long short-term memory networks can incorporate context information from both past and future inputs [6]. In effect, the Bidirectional-LSTM combines a forward hidden layer with a backward hidden layer (see Fig. 2) to exploit both previous and future input.

As can be seen from Fig. 2, Bi-LSTM can learn in both directions and integrate the pieces of knowledge to make a prediction. The embedded words were used as input for the bidirectional LSTM model, and a Bi-LSTM layer was added using Keras: TensorFlow's Keras provides a Bidirectional wrapper class that can be used to construct a bidirectional LSTM and then fit the model to the data.

3.4.3. Gated Recurrent Unit (GRU)


GRU uses gating units that control the flow of information within the unit to address the vanishing gradient problem of a regular RNN, and it handles large texts well. Like LSTM, GRU has gating units that regulate data flow, but unlike LSTM it needs no additional designated memory cells. The update and reset gates are the two crucial gates of GRU that decide what information should be passed to the output [8].

The architecture depicted in Fig. 3 shows how GRU uses the two gates to determine the output. The reset gate determines which parts of the prior hidden state should be combined with the present input to formulate a candidate hidden state. The update gate decides how much of the prior hidden state should be kept and how much of the candidate hidden state produced via the reset gate should be included in the final hidden state. When the update gate is first multiplied with the prior hidden state, it chooses which pieces of that state to preserve in memory and dismisses the rest. It then uses the complement of the update gate to weight the candidate hidden state from the reset gate, filling in the required pieces of information [4].
3.4.4. Hybrid CNN- Bidirectional-LSTM
This hybrid technique combines the strengths of the CNN and Bidirectional-LSTM models (see Fig. 4). CNN models use convolutional and pooling layers to extract features, whereas Bidirectional-LSTM models preserve long-term dependencies between word sequences [3]. Hence, CNN-Bidirectional-LSTM models are well suited to sentiment classification.

The inputs are preprocessed and embedded before being passed to the CNN. The convolutional layers extract features from different parts of the text, and the pooling layer reduces the number of features in the input. The features obtained from the pooling layer are then passed to the Bidirectional-LSTM to extract contextual information. Finally, the last states of the BiLSTM are concatenated and passed into the Sigmoid activation function, which squashes the final value into the range between 0 and 1. The researcher proposes a CNN-Bidirectional-LSTM to address the individual weaknesses of CNN and Bidirectional-LSTM.

4. PROPOSED ARCHITECTURE AND DESIGN


The general architecture of Amharic sentiment analysis using a deep learning approach is shown in Fig. 5 below.

4.1. Data Preprocessing


Data preprocessing is the process of removing distortion from data to make a classification task (in our case, sentiment classification) easier and to improve the performance of the model. It is critical to apply data preprocessing, because the cleaner the data, the more accurate the deep learning model will be. Figure 6 below shows the different preprocessing tasks performed to clean the comments and opinions collected from social media, such as Facebook and Twitter.

4.1.1. Short-form Expansion


Every language uses shortened forms of written words or phrases, also called abbreviations. Short forms are usually used to save space and time, to avoid repeating lengthy words and phrases, or simply to adhere to standard practice [9]. In Amharic, there are many short forms that need to be expanded to their full-length words, because the words themselves are used to train the model. Some of the short forms used frequently in writing comments and opinions in Amharic are shown in Table 1 below; a small expansion sketch follows the table.

Table 1
Amharic short forms in writing
Short-form Expanded form Meaning

ዶ/ር ዶክተር Doctor

እ/ር እግዚአብሄር God

ኢ/ያ ኢትዮጵያ Ethiopia

ጠ/ሚኒስቴር ጠቅላይ ሚኒስትር Prime Minister

ወ/ሪት ወይዘሪት Miss
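The expansion step can be sketched as a simple dictionary lookup. The mapping below uses only the pairs from Table 1 (the study's full mapping is larger), and the dictionary and function names are illustrative.

```python
# A minimal sketch of dictionary-based short-form expansion, using the pairs from Table 1.
SHORT_FORMS = {
    "ዶ/ር": "ዶክተር",              # Doctor
    "እ/ር": "እግዚአብሄር",           # God
    "ኢ/ያ": "ኢትዮጵያ",             # Ethiopia
    "ጠ/ሚኒስቴር": "ጠቅላይ ሚኒስትር",   # Prime Minister
    "ወ/ሪት": "ወይዘሪት",            # Miss
}

def expand_short_forms(text):
    # Replace every known short form with its full-length word.
    for short, full in SHORT_FORMS.items():
        text = text.replace(short, full)
    return text
```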

4.1.2. Data Cleaning


The Amharic sentiment dataset was collected from social media websites. Social media text is not clean: it includes special characters, punctuation, symbols, and emojis that are not used to express sentiment. In this preprocessing stage, any special characters, symbols, and emojis that are not needed are eliminated, starting with the removal of all non-Amharic characters and the special characters shown below in Table 2; a cleaning sketch follows the table.

Table 2
Removed words, numbers, and punctuation

English words and numbers: [a-zA-Z], [0-9]
Amharic punctuation: [፤።፡፣:,.?/()•“”*፨]+
Special characters: [@#$%^&=?×!,;:_.(){}`'/+*<>\"¤—„\ ®¯™¡¡»€«·‘§”¬¦…""f÷\~¨©±¥£¶–°•˜’“|]
Geez numerals: [፩ ፪ ፫ ፬ ፭ ፮ ፯ ፰ ፱ ፲ ፳ ፴ ፵ ፶ ፷ ፸ ፹ ፺ ፻]
Emojis: 🙂 😀 😃 🥰 😍 💗 and the Unicode range \U0001F300-\U0001F5FF
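A minimal cleaning sketch based on the character classes in Table 2 might look as follows; the exact regular expressions used in the study may differ, and the emoji ranges shown cover only the common blocks.

```python
import re

def clean_comment(text):
    # Remove the character classes listed in Table 2 (patterns are illustrative).
    text = re.sub(r"[a-zA-Z0-9]+", " ", text)               # English letters and Arabic numerals
    text = re.sub(r"[፤።፡፣:,.?/()•“”*፨]+", " ", text)          # Amharic and common punctuation
    text = re.sub(r"[፩-፼]+", " ", text)                      # Geez numerals
    text = re.sub(r"[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F]+", " ", text)  # emoji blocks
    return re.sub(r"\s+", " ", text).strip()                 # collapse leftover whitespace
```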

4.1.3. Normalization
Normalization is the process of transforming a collection of orthographically disparate words into a more consistent or common form, and it is one of the techniques for extracting clean data from unstructured text. In Amharic, different characters have the same sound but are written in different forms, for example [ሃኅኃሐሓኻ], which are all replaced with 'ሀ' to give the dataset consistency [10], as sketched below.
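A minimal sketch of this replacement, using only the single variant set given above (a full normalization table would add further variant sets in the same way):

```python
import re

def normalize(text):
    # Map orthographic variants with the same sound to one canonical character.
    return re.sub("[ሃኅኃሐሓኻ]", "ሀ", text)
```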

4.1.4. Tokenization
Tokenization is the division of a sentence into a list of tokens or words. Larger chunks of a text document can be tokenized into a list of sentences, and sentences into lists of words. The list of words identified by the tokenizer function is then used for training and testing. To be comprehensible to the deep learning system, the tokens are also transformed into vector format.

4.1.5. Stop-words detection and removal


Stop words are words that appear frequently in a given text but contribute little to its meaning and have little or no impact on classification [10]. They have to be removed to reduce the dimensionality of the word vectors, because they make no contribution to determining emotion or sentiment. Some of the most common stop words in the Amharic language are ነው፣ ናቸው፣ ግን፣ ሆኖም፣ እና፣ ነበር፣ ና፣ ከ፣ ስለ, etc.; a tokenization and stop-word removal sketch is given below.
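A minimal sketch of whitespace tokenization followed by stop-word removal, using only the stop words listed above (the study's full stop-word list is larger):

```python
# Stop words taken from the examples above; the real list is more extensive.
STOP_WORDS = {"ነው", "ናቸው", "ግን", "ሆኖም", "እና", "ነበር", "ና", "ከ", "ስለ"}

def tokenize_and_filter(sentence):
    tokens = sentence.split()                        # split the sentence into word tokens
    return [t for t in tokens if t not in STOP_WORDS]  # drop stop words
```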

4.1.6. Padding
Deep learning networks expect datasets to consist of vectors with equal dimensions. However, not all sentences are the same size after preprocessing; some sentences are longer or shorter in terms of the words they contain. To make the documents uniform in size, zeros are added before (pre-padding) or after (post-padding) the shorter sentence matrices, which is called padding.

TensorFlow cannot operate on raw words, so the numerical representation of words is used as input, with zero reserved for padding. For this research, sentences of numerically represented words, together with a maximum sentence length, are used as input. If a sentence is shorter than the maximum length, post-padding is applied, adding zeros at the end of the sentence until it equals the maximum length, which in this research is 20. Figure 7 shows sample output after padding is applied to the dataset; a padding sketch follows.
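The post-padding step described above corresponds to Keras's pad_sequences utility; the sketch below uses toy word-index sequences and the study's maximum length of 20.

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 20                                      # maximum sentence length used in this study
sequences = [[12, 7, 45], [3, 88, 9, 21, 5]]      # toy word-index sequences
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")  # zeros appended at the end
print(padded.shape)                               # (2, 20)
```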

4.2. Word-embedding
Word embedding is a feature learning technique in which each word or phrase in the vocabulary is mapped to an N-dimensional real-number vector. The goal of word embedding is to convert all words in the dictionary into lower-dimensional vectors. To build a word representation of the data for the deep learning model, the researcher employed Word2Vec as the embedding model. After preprocessing and converting the datasets into an analyzable format, the words in each sentence must be represented as vectors so that Word2Vec can compute similarity, analogy, and other operations. The embedding layer converts the input into an N×M dimensional matrix, where N is the length of the longest sentence in the dataset and M is the embedding dimension. The Word2Vec model has a vector space that relates all of the words in the posts and comments. This model was used to extract features from the dataset's text by taking the average of all word vectors. Figure 8 shows the algorithm used for the Word2Vec model.

The input for the Word2Vec model was the collected dataset, which was preprocessed to avoid noise in the model. Because Gensim and TensorFlow did not run in the same environment, the researcher created a separate environment for Gensim. After preprocessing, the Word2Vec model was trained and saved, and the pre-trained model was used as the embedding layer for the deep learning model; a training sketch follows.
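A minimal sketch of this training step with Gensim (version 4 or later, where the dimension parameter is vector_size); the toy corpus, embedding dimension, and file name are illustrative, and sg=1 selects the skip-gram algorithm used in this study.

```python
from gensim.models import Word2Vec

# Toy corpus; the real input is the full preprocessed, tokenized comment dataset.
tokenized_comments = [["አብይ", "ዶክተር"], ["በርታ", "አይዞህ"]]

w2v = Word2Vec(sentences=tokenized_comments, vector_size=100, window=5,
               sg=1, min_count=1)                  # sg=1 -> skip-gram
w2v.save("amharic_word2vec.model")                 # saved model later reused as the embedding layer
```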

5. EXPERIMENTATION AND PERFORMANCE EVALUATION
In this study, various experiments were conducted at the sentence level. The researcher used Word2Vec for word embedding, a vector representation of words used for feature learning, then applied four distinct deep learning models to train and test the data. The experimental results and evaluation, as well as a discussion of the results, are presented in this section.

5.1. Dataset Description


A large amount of data is required to execute the experiments and obtain good results as well as a high-performing classification model for Amharic sentiment analysis. Deep learning techniques, by their very nature, necessitate vast amounts of data and perform better with enormous datasets. The data used in this study was composed of unstructured Amharic sentiment texts collected from various sources. The researcher used sentiment-bearing lexical words for the Amharic language specified in prior research, such as [11] and [12], to collect sentiment sentences and annotate them with the correct polarity. Comments were selected if they contained at least one sentiment lexicon entry. Finally, linguistics specialists analyzed the classified sentences to validate that the data was assigned to the correct class.

As social media content is not clean, the prepared dataset contained Geez numerals, emojis, Arabic numerals, and English characters that had to be cleaned. During preprocessing, words in their short forms were expanded to full forms; stop words, which do not contribute to the contextual construction of texts, were deleted; and words with multiple written forms but the same meaning were converted into a normalized form.

Once comment sentences are tokenized to word level, their numerical representation is used as input for the deep learning algorithms when training and testing a model. Because the deep learning model's input is stored in uniform matrices, the words are post-padded in this study into a uniform matrix representation. The maximum number of words within a sentence is taken during padding, and the remaining sentences with fewer words are padded to the same dimension as the maximum. Care is needed, because variation in the number of words per sentence affects the sparseness of the padded structure: if the difference between the maximum and minimum number of words in a sentence is large, the shortest sentences are padded with many zeros to reach the common dimension, in this case a 20-dimensional vector, and the model performs worse when the zero-padded part of a sentence is large. As a result, for this research the padding size was kept under control by reducing the gap between the minimum and maximum sentence lengths: sentences were restricted to between 5 and 20 words, so that all sentences become 20-dimensional vectors. Once the words are in the same dimensions, representing them as vectors is the task of the embedding layer within the model.

5.2. Word Embedding


Word embedding is a key approach in natural language processing (NLP) in which the words in sentences are transformed into numerical vectors. Put simply, word embedding is the representation of words as vectors, and it aims to group words with similar meanings together in vector space. In some works, a one-hot vector is used to represent words, but it has a major drawback: the words in a given document are treated as independent of each other, which fails to capture the relationships between different words. Word2Vec is an approach for creating word embeddings that captures context, the semantic and syntactic similarity of words, and the relationship of one word with another.

In Word2Vec, words that are similar in meaning are clustered near each other. Word2Vec is a simple neural network with an input layer, one hidden layer, and an output layer. In this study, the input to Word2Vec is the collected comments, and the output is a set of feature vectors that serve as the input for the deep learning model. The goal of applying Word2Vec to the dataset is for words with the same contexts to be gathered together in vector space. There are two main algorithms for Word2Vec: CBOW and skip-gram. Here, the skip-gram algorithm is used (see Fig. 9). According to [13], skip-gram represents uncommon words well and works well with small amounts of data. Because it can capture several interpretations of a single word, skip-gram is well suited to a complicated language like Amharic.

For this research, the Gensim Word2Vec model was used. Gensim requires tokenized words as a list for the input layer. In skip-gram, the target word serves as the input and the context words are the output of the model, as shown in Fig. 9 above. Table 3 below presents the top four related words for the Amharic words አብይ and በርታ, followed by a query sketch.

Table 3
Top four related words for the Amharic words አብይ and በርታ

No. | Similar word to አብይ | Similarity (%) | Similar word to በርታ | Similarity (%)
1.  | ዶክተር                | 98.43          | አይዞህ                | 99.90
2.  | አህመድ                | 98.01          | ጀግናችን               | 99.86
3.  | ሚኒስቴር               | 97.88          | ምርጥ                 | 99.83
4.  | ክቡር                 | 97.82          | አስተዋይ               | 99.73
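Neighbour lists like those in Table 3 come from a nearest-neighbour query on the trained model; a sketch, assuming `w2v` is the skip-gram model trained in Section 4.2 and reading the similarity scores as percentages:

```python
# Query the trained Word2Vec model for the four nearest neighbours of a word.
for word, score in w2v.wv.most_similar("አብይ", topn=4):
    print(word, round(score * 100, 2))   # e.g. ዶክተር 98.43
```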

Word similarity is essential for NLP tasks such as sentiment analysis, where the closeness between two chunks of text must be found from their contextual meaning. Training the vectors to capture contextual and semantic similarity gives words with similar meaning or influence in a sentence similar values. Capturing word similarity improves the classification model's performance, because it provides the model with the context and relationships between words, easing feature extraction.

6. EXPERIMENTAL RESULT

Four experiments were conducted using the collected dataset. The dataset was divided into three parts: 4,000 sentences for training, 500 for validation, and 500 for testing. The first experiment was conducted using a Convolutional Neural Network; the second and third used Bidirectional LSTM and a Gated Recurrent Unit network. Lastly, the researcher experimented with combining CNN with the bidirectional model in order to take advantage of the strengths of the two techniques.

6.1. Experimenting Using CNN


Once the data has been preprocessed and the embedding component constructed, the CNN layer receives the vector representation of the data as input to learn features. Feature extraction is done in the CNN layer, which used 128 filters with a kernel size of 5 and ReLU as the activation function. The output was subsequently transmitted to the pooling layer, in this case GlobalMaxPooling1D, which downsamples the input representation by taking the maximum value across time and converts the output from 2D to 1D. These values were then handed to the fully connected output layer, where a sigmoid activation function keeps the output value between 0 and 1 for the negative and positive sentiments. Because the task is a binary classification problem, binary cross-entropy was used as the loss function, as sketched below.
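A minimal Keras sketch of the layer stack just described; vocab_size, embed_dim, and max_len are illustrative, and in the actual setup the Embedding layer would be initialized from the pre-trained Word2Vec vectors.

```python
import tensorflow as tf
from tensorflow.keras import Sequential, layers

vocab_size, embed_dim, max_len = 10000, 100, 20       # illustrative sizes
model = Sequential([
    layers.Embedding(vocab_size, embed_dim, input_length=max_len),
    layers.Conv1D(filters=128, kernel_size=5, activation="relu"),  # feature extraction
    layers.GlobalMaxPooling1D(),                       # max over time: 2D -> 1D
    layers.Dense(1, activation="sigmoid"),             # positive vs. negative
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
```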
The researcher employed Adam as the optimizer and measured accuracy, precision, and recall during training. The model was trained for 10 epochs (iterations over the training data). Once fitted, the model was assessed on the test dataset; accordingly, the result is presented below in Table 4:

Table 4
1st evaluation result for the CNN model (80% train, 10% validation, 10% test)

Metric     | Training | Validation | Test
Accuracy   | 99.43%   | 84.96%     | 84.32%
Precision  | 99.77%   | 83.66%     | 83.25%
Recall     | 99.08%   | 87.56%     | 84.87%

From Table 4 it is observed that the model achieved 99.43% accuracy on the training dataset, while it achieved 84.96% and 84.32% for validation and testing accuracy, respectively. This is clearly seen in the learning curve depicted in Fig. 10 below.

The training accuracy increases as the number of epochs increases, but the validation accuracy decreases, so it can be concluded that the model is overfitted. Compared to the work required to combat overfitting, building a model and executing the code is the easier part. The researcher applied several regularization approaches to the model, such as changing the seed (the random state) from 42 to 50. To reduce the model's vulnerability to overfitting, the researcher added one dense (hidden) layer with 64 neurons and the ReLU activation function, added a dropout layer after the convolutional layer before feeding it into the pooling layer, then added the dense layer, followed by another dropout layer feeding into the fully connected layer. Dropout proved incredibly useful, since it allows the model to avoid overfitting by dropping neurons at random. The batch size was increased from 64 to 100, and the number of epochs was decreased from 10 to 9. These adjustments were made by manual tuning, and the experimental result is presented in Table 5.

Table 5
Model result after regularization (80% train, 10% validation, 10% test)

Metric     | Training | Validation | Test
Accuracy   | 89.63%   | 87.11%     | 84.79%
Precision  | 79.62%   | 80.32%     | 80.39%
Recall     | 72.81%   | 73.57%     | 73.69%

As presented in Table 5, after regularization the model achieved an accuracy of 89.63%, 87.11%, and 84.79% for training, validation, and testing, respectively. The accuracy of the model improved, and the result shows only a minimal difference among training, validation, and test accuracy, which further indicates that the overfitting problem is solved compared to the result achieved before regularization. Figure 11 shows the learning curve of the CNN model after regularization. From the learning curve in Fig. 11, the model has no overfitting problem, since the gap that was previously visible between training and validation has decreased. The CNN model for the Amharic sentiment dataset finally registered an accuracy, precision, and recall of 84.79%, 80.39%, and 73.69%, respectively.
6.2. Experimenting Using Bidirectional-LSTM
Once the data has been preprocessed and the embedding component constructed, the Bidirectional-LSTM layer receives the vector representation of the data as input to learn features. Bidirectional LSTM (Bi-LSTM) can extract important contextual information from both past and future time steps. In contrast to LSTM, Bi-LSTM contains forward and backward layers for conducting additional feature extraction, which suits the Amharic language, because the language by its nature needs contextual information to understand a sentence. Bi-LSTM has one hidden layer for each direction: one copy of the hidden layer fits on the input sequence as in a traditional LSTM, while the other is placed on a reversed copy of the input sequence, and the results of the two LSTMs are concatenated by default. For both the forward and backward hidden layers of the model, the researcher used a bidirectional LSTM with 64 memory units, then added dropout rates of 0.4 and 0.5, a random state of 50, an embedding size of 32, a batch size of 100, and 3 epochs to minimize overfitting. Binary cross-entropy was used as the loss function and Adam as the optimizer, as sketched below.
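A minimal Keras sketch of this setup; vocab_size is illustrative, one plausible reading of the dropout pair 0.4/0.5 is shown, and the training call is commented out because x_train, y_train, x_val, and y_val are assumed to be prepared elsewhere.

```python
from tensorflow.keras import Sequential, layers

vocab_size, max_len = 10000, 20                      # illustrative sizes
model = Sequential([
    layers.Embedding(vocab_size, 32, input_length=max_len),   # embedding size 32
    layers.Bidirectional(layers.LSTM(64, dropout=0.4)),       # forward + backward, concatenated
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=100, epochs=3)
```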
The experimental result of Bi-LSTM is presented in Table 6.

Table 6
Bi-LSTM model evaluation (80% train, 10% validation, 10% test)

Metric     | Training | Validation | Test
Accuracy   | 90.76%   | 89.18%     | 85.27%
Precision  | 90.99%   | 90.06%     | 85.24%
Recall     | 90.88%   | 87.63%     | 81.67%

The Bi-LSTM model achieved an accuracy of 90.76%, 89.18%, and 85.27% for training, validation, and testing, respectively. Figure 12 presents the learning curve of Bi-LSTM. The learning curve in Fig. 12 shows that the difference between the training and validation accuracy is nominal, indicating that the model is not overfitted and is hence capable of generalizing to previously unseen data in the real world. The model shows a satisfactory fit to the dataset; to reach this state, the researcher employed regularization approaches such as dropout, as discussed above.

The accuracy, precision, and recall of the Bi-LSTM on the Amharic sentiment dataset were 85.27 percent, 85.24 percent, and 81.67 percent, respectively. The results show that the Bi-LSTM model performs better than the CNN model, which further indicates the capability of Bi-LSTM to improve classification performance by considering previous and future words during learning.

6.3. Experimenting Using Gated Recurrent Unit (GRU)
For the GRU, the researcher first created a suitable embedding layer with the maximum number of features and provided the output shape. Between the embedding layer and the hidden layer, the input values serve as weights, and the hidden layer is made up of gated recurrent units. The researcher used a GRU with two layers and obtained a representation of the entire sequence, which was then passed as input to the output layer, which used the sigmoid activation function to classify the sentiment as positive or negative, with Adam as the optimizer. The two GRU layers used 64 and 32 memory units, respectively. After building the model, the test result showed that the model was overfitted. To overcome overfitting, the researcher added dropout rates of 0.5 and 0.5, changed the random state from 50 to 42, and used a batch size of 128 and 10 epochs. The one hyperparameter that made the difference was lowering the Adam learning rate from its default value to 0.0001, as sketched below.
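A minimal Keras sketch of this two-layer GRU configuration; vocab_size and embed_dim are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import Sequential, layers

vocab_size, embed_dim, max_len = 10000, 100, 20      # illustrative sizes
model = Sequential([
    layers.Embedding(vocab_size, embed_dim, input_length=max_len),
    layers.GRU(64, return_sequences=True),           # first GRU layer passes the full sequence on
    layers.Dropout(0.5),
    layers.GRU(32),                                  # second GRU layer returns the final state
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])
```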
Table 7 below shows the experimental result of GRU.

Table 7
GRU model evaluation result (80% train, 10% validation, 10% test)

Metric     | Training | Validation | Test
Accuracy   | 97.73%   | 92.67%     | 88.99%
Precision  | 89.89%   | 90.63%     | 90.61%
Recall     | 88.82%   | 89.65%     | 89.67%

As presented in Table 7, the GRU model registers an accuracy of 97.73%, 92.67%, and 88.99% for training, validation, and testing, close to the results obtained for Bi-LSTM. Though the number of epochs needed for the GRU to reach this accuracy is twice that of Bi-LSTM, with some parameter tuning GRU greatly reduces the overfitting challenge compared to Bi-LSTM. Figure 13 depicts the learning curve of the GRU model. In the learning curve, the gap between the training and validation accuracy is minimal, but the model begins by underfitting; as the researcher increased the number of epochs, the accuracies started to increase, overcoming the underfitting. The loss was high at 64% in the first iteration but decreased to 32% by the last epoch. In the end, the GRU model converged to the solution quickly, without needing many iterations to arrive at the optimal values. In summary, the GRU model for the Amharic sentiment dataset achieved 88.99%, 90.61%, and 89.67% accuracy, precision, and recall, respectively.
6.4. Experimenting Using CNN-Bidirectional-LSTM
In combining CNN and Bi-LSTM, the intention was to take advantage of the best features of each model and develop a model that could comprehend and classify the Amharic sentiment dataset with better accuracy: combining the two models provides the best feature extraction together with context understanding. From the embedding layer, the input is passed to a convolutional layer with 64 filters, a kernel size of 3, and ReLU as the activation function. After the convolutional layer, there is a 1D max-pooling layer with a pool size of 4, whose output is passed into the bidirectional layer with 64 units. The output was then passed into the fully connected layer with sigmoid as the binary classifier. Adam was used as the optimizer and binary cross-entropy as the loss function, as sketched below.
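A minimal Keras sketch of the hybrid configuration; vocab_size and embed_dim are illustrative, and the Embedding layer would again be initialized from the pre-trained Word2Vec vectors.

```python
from tensorflow.keras import Sequential, layers

vocab_size, embed_dim, max_len = 10000, 100, 20      # illustrative sizes
model = Sequential([
    layers.Embedding(vocab_size, embed_dim, input_length=max_len),
    layers.Conv1D(filters=64, kernel_size=3, activation="relu"),  # CNN feature extraction
    layers.MaxPooling1D(pool_size=4),                # reduce the number of features
    layers.Bidirectional(layers.LSTM(64)),           # past and future context
    layers.Dense(1, activation="sigmoid"),           # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```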
The model was fitted with the data, and the result is shown below in Table 8:

Table 8
CNN-Bi-LSTM model evaluation result (80% train, 10% validation, 10% test)

Metric     | Training | Validation | Test
Accuracy   | 99.63%   | 88.13%     | 87.88%
Precision  | 99.50%   | 87.53%     | 87.43%
Recall     | 99.50%   | 84.38%     | 87.86%

Table 8 shows that the trained model registers accuracy, precision, and recall of about 99%, while the model performs poorly on the unseen validation and test datasets. This indicates that the model is memorizing the training data instead of learning, which results in overfitting. The learning curve depicted in Fig. 14 clearly shows the behavior of model accuracy versus model loss: the training loss is close to 0 while the loss on the validation set increases, which indicates overfitting. To overcome overfitting, the researcher first applied different regularization methods such as weight decay, adding dropout, adjusting the learning rate, batch size, and momentum of the model, and reducing the number of iterations. Various hyperparameters were tuned until the model's optimal values were reached, shifting it from overfitting to an ideal fit for the dataset. Table 9 shows the optimal values for CNN-Bi-LSTM.

Table 9
Optimal values for tuning the CNN-Bi-LSTM model

Dropout: 0.2 and 0.3
Learning rate: 0.0001
Momentum: 0.7
Epochs: 5
Batch size: 128
Random state: 0

Using the optimized hyperparameters listed in Table 9, the experimental result is shown below in Table 10.

Table 10
Evaluation result of the CNN-Bi-LSTM model after hyperparameter tuning (80% train, 10% validation, 10% test)

Metric     | Training | Validation | Test
Accuracy   | 99.73%   | 91.11%     | 91.60%
Precision  | 99.56%   | 94.37%     | 90.47%
Recall     | 99.89%   | 87.77%     | 93.91%

As shown in Table 10, accuracies of 99.73 percent, 91.11 percent, and 91.60 percent were achieved for training, validation, and testing, respectively. This hybrid model outperforms the previous models, and the marginal differences between training, validation, and testing are small, indicating how well the model works on unknown datasets and its generalization ability. Figure 15 depicts the learning curve of the hybrid CNN and Bi-LSTM model. Overall, for the given Amharic sentiment dataset, the CNN-Bi-LSTM model achieved 91.60 percent accuracy, 90.47 percent precision, and 93.91 percent recall.

7. Comparison of Models
As previously stated, the experiments were carried out using four distinct deep learning models, from which promising results for Amharic sentiment analysis were obtained. Figure 16 presents the comparison between the four models: when comparing their performance on the Amharic sentiment dataset, CNN-Bi-LSTM shows much better accuracy, precision, and recall. CNN-Bi-LSTM uses the capabilities of both models to classify the dataset: CNN is well recognized for feature selection, while Bi-LSTM enables the model to include context by providing past and future sequences. Combining these two models, the accuracy was 91.60%. Figure 17 shows the confusion matrix for CNN-Bi-LSTM.

Figure 17 provides the confusion matrix for CNN-Bi-LSTM; each entry in a confusion matrix denotes the number of predictions for which the model classified the classes correctly or incorrectly. Out of the 500 sentences available for testing, CNN-Bi-LSTM correctly predicted 458 of the sentiment sentences. The misclassification rate, also known as the classification error, is the fraction of predictions that were incorrect. It is calculated using Eq. 5.1:

Misclassification Rate = (FP + FN) / (TP + TN + FP + FN)    (5.1)

The misclassification rate for CNN-Bi-LSTM is calculated by adding the false positives and false negatives and dividing by the total size of the test set. FP for this model is 26 while FN is 16, which gives a misclassification rate of 8.4%, a low rate for the model; a quick arithmetic check follows.
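As a quick check of the arithmetic in Eq. 5.1, using the values reported above:

```python
# Worked check of Eq. 5.1 for the CNN-Bi-LSTM model.
fp, fn = 26, 16              # false positives and false negatives
total = 500                  # test-set size (TP + TN + FP + FN)
print((fp + fn) / total)     # 0.084 -> 8.4% misclassification rate
```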

The confusion matrix in Fig. 17 shows that the number of false positives is higher than that of false negatives. Table 11 shows Type I and Type II errors encountered by the model.

Table 11
Examples of misclassification by the model

Amharic sentiment comment | Actual class | Predicted class
በርታ አብቹ የአማራን ግድያም አስቁምልን ኢትዮጵያም ትለማለች | Positive | Negative
ጠላት አይኑ እያየ ይቃጠል አብያችን እንወድሀለን | Positive | Negative
በርታ አብቹ የኢትዮጵያ ጠላት ሁሌም ተቀብሮ እና ተጠልቶ ይቀራል | Positive | Negative
BBC ስለ ኢትዮጵያ ችግር ማውራት የሚያስደስታችሁ ሚዲያዋች አንድ ቀን እንኳን የተሰራውን መልካም ነገር አታውሩም | Negative | Positive
በወጣቱ መርገፍ በሰው ሞት በደም መፋሰስ እንኳን ደስ አለን ለማለት እንኳን ይከብዳል | Negative | Positive

Table 11 shows that the model gets confused by comments that contain sarcasm or figurative speech, or sentiment sentences that contain both positive and negative words in one comment. For example, in “በርታ አብቹ የአማራን ግድያም አስቁምልን ኢትዮጵያም ትለማለች” and “ጠላት አይኑ እያየ ይቃጠል አብያችን እንወድሀለን”, the first sentence contains positive words like በርታ and ኢትዮጵያም ትለማለች, while the second contains አብያችን እንወድሀለን. But the first also contains a word implying negative sentiment, ግድያም, and the second contains ጠላት አይኑ እያየ ይቃጠል. Both sentences belong to the positive class, yet the model predicted them as negative because of the negative words they contain, which caused the misclassification. In “በወጣቱ መርገፍ በሰው ሞት በደም መፋሰስ እንኳን ደስ አለን ለማለት እንኳን ይከብዳል”, the phrase “እንኳን ደስ አለን” implies positive sentiment while the overall sentiment of the comment is negative, causing the model to predict the sentiment as positive. From the CNN-Bi-LSTM model's classification errors, the model struggles to understand the sarcasm, figurative speech, and mixed sentiments present in the dataset.

8. Discussion of Results
This research addresses gaps in previous works through a comprehensive experimental study. The researcher studied the impacts of dataset preparation, word embedding, and deep learning models, with a focus on the problem of sentiment analysis. Four deep learning models (CNN, Bi-LSTM, GRU, and CNN-Bi-LSTM) for Amharic sentiment analysis were compared, and the experiments showed that combining CNN with Bi-LSTM produced a model that outperformed the others. Each model was compared at its specific optimal point, that is, when the models reached their good fit. CNN-Bi-LSTM takes advantage of the strengths of the two models: CNN is recognized for its ability to extract as many features as possible from a sentence, and Bi-LSTM keeps the chronological order between words from past and future, which enables the model to understand context.

Several factors influence the performance of deep learning models; for instance, data preparation, the size of the dataset, and the number of words within a sentence all impact the performance of the model. Training the model on 3,000 sentences with a limited number of words per sentence gave an accuracy of 85.00%. When the number of words per comment increased to more than five, performance improved from 85.00% to 88.66%, a 3.6% improvement, whereas increasing the size of the dataset to 5,000 showed an accuracy of 91.60%, roughly a 3% gain. These results show the impact that the size of the dataset, as well as the number of words within a single comment, has on the performance of the model. Other factors, like word embedding, filter size, kernel size, pool size, activation function, batch size, hyperparameter adjustment, and the optimization mechanism, also play a major role in model performance, and tuning them showed a significant improvement in deep learning model performance. But factors such as padding respond differently from model to model: for instance, applying pre-padding to CNN increases the model's performance by 4%, while other models perform poorly with pre-padding.

Kapočiūtė-Dzikienė et al. [15] claim that deep learning models tend to underperform on morphologically rich languages and hence recommend a traditional machine learning approach with manual feature engineering. Despite that conclusion, the recommendation does not hold when comparing the performance of the Amharic sentiment analysis models constructed in this study using deep learning with the machine learning models proposed by [16] and [12]. Findings from this study show that deep learning models bring improvements over traditional machine learning in terms of the work needed for feature extraction, performance, and scalability. Manual feature engineering was not used for this work, eliminating the extra effort that feature extraction would require, and in addition the models were capable of understanding the context of a given sentence. In terms of performance, a small (+1%) but significant increase was achieved. Scalability is the main challenge for standard machine learning models, while the deep learning models used in this research showed that model accuracy increases as the size of the dataset for training and testing increases.

Two prior studies attempted to design a deep learning model for Amharic sentiment analysis. The CNN model designed by [2] was overfitted and did not generalize well from training data to unseen data. This problem was solved in this research by adjusting the model's hyperparameters, shifting it from overfitted to a fit that generalizes well to unseen data. The CNN-Bi-LSTM model designed in this study outperforms the LSTM model of [17] with a 5% increase in performance. This work makes a major contribution in updating the state of the art of Amharic sentiment analysis with improved performance.

The proposed model achieved 91.60%, which is a 6.81%, 6.33%, and 2.61% improvement over CNN, Bi-LSTM, and GRU, respectively, a very promising result for sentiment analysis. Overfitting was frequently encountered in this research, but different hyperparameters were applied to control the learning process; hyperparameters like the learning rate, dropout, momentum, and random state shifted the models from overfitting to a good fit. If a model achieves high accuracy but is overfitted, it will not be useful in the real world, because it lacks generalization capacity.

9. Conclusion
Currently, in Ethiopia, a large number of comments are available on various social media sites, which must be gathered and analyzed in order to assess the general public's opinion. Finding and monitoring comments, and manually extracting the information contained in them, is a tough undertaking due to the huge range of opinions on the internet; a normal human reader will have trouble finding appropriate websites and accessing and summarizing the information within them. As a result, automated sentiment analysis methods are necessary. Different researchers have approached Amharic sentiment analysis with either lexicon-based or machine learning methods, both of which require the intervention of the programmer at one point or another, whereas deep learning minimizes human involvement. In this research, the researcher applied sentiment analysis to Amharic political sentences using four different deep learning approaches: CNN, Bi-LSTM, GRU, and a hybrid of CNN with Bi-LSTM. To the researcher's knowledge, this is the first work to apply Bi-LSTM, GRU, and CNN-Bi-LSTM to Amharic sentiment analysis.

Experimental results show that the hybrid CNN-Bi-LSTM model achieved a better performance of 91.60%, compared to 84.79%, 85.27%, and 88.99% for CNN, Bi-LSTM, and GRU, respectively. The researcher conducted a hyperparameter search to find appropriate values to solve the models' overfitting problems. While these results verify the main contribution of the study, there is still room for improvement.

While working on this research, manually collecting and annotating the dataset was a very tiring task. Even though a promising accuracy was achieved, the model was trained with a limited dataset, which meant it learned only limited features, and only binary classification was considered. The model struggles to distinguish sarcasm, figurative speech, and sentiment sentences that contain both positive and negative words. These challenges are areas that need further research.

10. Recommendation
The goal of this study was to design and develop a deep learning model for Amharic sentiment analysis, and overall the findings of this research were promising. Developing a full-fledged and more efficient sentiment analysis system takes more time and money, and it necessitates a well-coordinated team effort involving linguistic specialists and computer scientists. Therefore, based on the results of the current study, the following is a way forward for further work:

- Binary classification was considered for this research, which does not capture the degree to which a given text is positive or negative. Interpreting opinions as only positive or negative may not be good enough, and a multi-class classification approach provides a more refined breakdown than the positive-negative binary scheme. Multi-class classification is therefore recommended as future work.
- For the Amharic language, there is no standardized and well-organized corpus for sentiment analysis experimentation, so building one is a redundant and time-consuming task for each study. The construction of a standardized corpus is very important, because it would allow for more experiments, especially with deep learning models.
- This research focused on sentiment related to political situations, but it can be extended to other domains, such as agriculture, industry, tourism, sport, entertainment, and employee as well as customer satisfaction.
- Aspect-level sentiment analysis using deep learning is especially promising, because deep learning is good at learning complicated feature representations and can capture the semantic connections between a target word and the sentiment words, the complex relations needed for aspect-level sentiment analysis.
- Sarcastic comments are ironic or satirical remarks, a concept closely related to sentiment analysis. They are difficult to analyze because they usually change a sentence's literal meaning and greatly influence the sentiment the sentence expresses. Given deep learning's impressive success on many other NLP problems, sarcastic comment analysis is an area that has not yet been explored for the Amharic language.
- Further research is needed on integrating an attention-based mechanism with a deep learning model for sentiment analysis; attention focuses the model on the features that contribute most to the prediction task.
- Social media comments are written in different languages, so designing a bilingual or multilingual sentiment analysis system is another area that needs further research.

Declarations
Data Availability Statement
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

References
1. A. Ghoroghi, Y. Rezgui, I. Petri, and T. Beach, “Advances in application of machine learning to life
cycle assessment: a literature review,” Int. J. Life Cycle Assess., pp. 433–456, 2022, doi:
10.1007/s11367-022-02030-3.
2. A. Alemu and Y. Getachew, “Deep Learning Approach for Amharic Sentiment Analysis,” University Of
Gondar, Ethiopia, 2019.
3. B. Jang, M. Kim, G. Harerimana, S. U. Kang, and J. W. Kim, “Bi-LSTM model to increase accuracy in
text classification: Combining Word2Vec CNN and attention mechanism,” Appl. Sci., vol. 10, no. 17,
p. 5841, 2020.
4. G. Loye, "Gated Recurrent Unit (GRU) with PyTorch," FloydHub Blog, blog.floydhub.com, 2019.
5. Y. Cheng, L. Yao, G. Xiang, G. Zhang, T. Tang, and L. Zhong, “Text sentiment orientation analysis
based on multi-channel CNN and bidirectional GRU with attention mechanism,” IEEE Access, vol. 8,
no. 15, pp. 13497–34964, 2020.
6. G. Liu and J. Guo, “Bidirectional LSTM with attention mechanism and convolutional layer for text
classification,” Neurocomputing, vol. 337, pp. 325–338, 2019.
7. J. Mungalpara, “What Does It Mean By Bidirectional LSTM? | Analytics Vidhya | Medium,” 2021.
[Online]. Available: https://medium.com/analytics-vidhya/what-does-it-mean-by-bidirectional-lstm-
63d6838e34d9
8. S. Sachin, A. Tripathi, N. Mahajan, S. Aggarwal, and P. Nagrath, “Sentiment analysis using gated
recurrent neural networks,” SN Comput. Sci., vol. 1, no. 2, pp. 1–13, 2020.
9. Merriam-Webster, Merriam-Webster's Collegiate Dictionary, 10th ed. Merriam-Webster, 2002.
10. G. Moges, “Semantic-aware Amharic text classification using Deep learning approach,” Master’s
thesis, Addis Ababa University. Ethiopia, 2020.
11. S. Gebremeskel, "Sentiment mining model for opinionated Amharic texts," Master's thesis, Addis Ababa University, Ethiopia, 2010.
12. A. Getachew, "Opinion mining from Amharic entertainment texts," College of Natural Science, 2014.
13. T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.

14. X. Rong, "word2vec parameter learning explained," arXiv preprint arXiv:1411.2738, 2014.
15. J. Kapočiūtė-Dzikienė, R. Damaševičius, and M. Woźniak, "Sentiment analysis of Lithuanian texts using traditional and deep learning approaches," Computers, vol. 8, no. 1, p. 4, 2019.
16. W. Philemon, “A Machine Learning Approach to Multi-Scale Sentiment Analysis of Amharic Online
Posts,” 2015.
17. T. Fikre, “Effect of preprocessing on long short term memory based sentiment analysis for Amharic
language,” Master’s thesis, Addis Ababa University. Ethiopia, 2020.

Figures

Figure 1

Sample CNN Model Architecture for text classification [5]

Figure 2

Bidirectional-LSTM [7]

Figure 3

The internal structure of GRU [4]

Figure 4

The proposed model architecture [3]

Figure 5

Architecture of Sentiment Analysis for Amharic Language using Deep Learning

Figure 6

Data Pre-processing

Figure 7

Sample output from padding

Figure 8

Steps to train a Word2Vec Model

Figure 9

The Skip-gram model [14]

Figure 10

Learning curve for CNN Model

Figure 11

Learning curve for CNN Model after regularization

Figure 12

Learning curve for Bi-LSTM Model

Figure 13

Learning curve for GRU Model

Figure 14

Learning curve for hybrid CNN and Bi-LSTM Model

Figure 15

Learning curve for hybrid CNN and Bi-LSTM Model after hyperparameter tuning

Figure 16

Comparison of models constructed in this study

Figure 17
Confusion matrix for CNN-Bi-LSTM
