
Ampersand 12 (2024) 100169

Contents lists available at ScienceDirect

Ampersand
journal homepage: www.elsevier.com/locate/amper

Contextual word disambiguates of Ge'ez language with homophonic using machine learning

Mequanent Degu Belete a, Ayodeji Olalekan Salau b,c,*, Girma Kassa Alitasb a, Tigist Bezabh d
a School of Electrical and Computer Engineering, Debre Markos Institute of Technology, Debre Markos University, Debre Markos, Ethiopia
b Department of Electrical/Electronics and Computer Engineering, Afe Babalola University, Ado-Ekiti, Nigeria
c Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Chennai, Tamil Nadu, India
d ICT Center, Debre Markos City, East Gojjam Zone, Ethiopia

A R T I C L E  I N F O

Keywords:
Ge'ez language
WSD
Text vectorization
Machine learning

A B S T R A C T

According to natural language processing experts, there are numerous ambiguous words in languages. Without automated word sense disambiguation for a language, the development of natural language processing technologies such as information extraction, information retrieval, and machine translation remains a challenging task. Therefore, this paper presents the development of a word sense disambiguation model for duplicate alphabet words of the Ge'ez language using corpus-based methods. Because there is no WordNet or public dataset for the Ge'ez language, 1010 samples of ambiguous words were gathered. Afterwards, the words were preprocessed and the text was vectorized using bag of words, Term Frequency-Inverse Document Frequency, and word embeddings such as word2vec and fastText. The vectorized texts were then analysed using supervised machine learning algorithms such as Naive Bayes, decision trees, random forests, K-nearest neighbor, linear support vector machine, and logistic regression. Bag of words paired with random forests outperformed all other combinations, with an accuracy of 99.52%. However, when deep learning algorithms such as a deep neural network and Long Short-Term Memory were used on the same dataset, a 100% accuracy was achieved.

1. Introduction

Word sense disambiguation (WSD) is a fundamental challenge in natural language processing (NLP) applications such as text categorization and interpretation (Zhang et al., 2023; Basili et al., 1997). The difficulty of finding the right sense of lexical terms in raw texts applies to classification, machine translation, information retrieval, and any other language engineering activity (Rippeth et al., 2023). The ubiquitous ambiguity of words and their application in texts causes problems. Furthermore, the specificity of senses in the knowledge areas where words are employed tends to complicate the disambiguation process, impacting the completeness of most online sources, such as dictionaries and general purpose lexical resources. Disambiguation is the process of removing ambiguity by making the meaning of words obvious (Kritharoula et al., 2023). Disambiguation clarifies the meaning of words (Wassie et al., 2014). On the other hand, technological progress itself has resulted in faster communication and the production of a high amount of data. This has changed the way massive data and information are processed in personal and corporate settings. People can readily comprehend the meaning of language in varied settings in human communication. Word sounds and physical expressions may simply be added to the context to obtain the full meaning of the words. Humans, on the other hand, find it challenging to comprehend all of their natural language communication, as well as all written and recorded data, using traditional or manual techniques. This necessitates the use of computers to enhance and replace their communication procedures. Computers perform a range of tasks that make humans' lives simpler. Natural Language Processing (NLP) is one of the areas to which computers contribute. Even if people find it easy to acquire and utilize their language, computers cannot conduct Natural Language Processing unless they are properly configured to do so (Jurafsky and Martin, 2008; Navigli, 2009). NLP may be used to address the analysis or synthesis of spoken or written language in software or hardware components of a computer system (Jackson and Moulinier, 2002). Thus, NLP solves natural language issues by automatically analyzing and producing natural language (Joseph et al., 2016).

* Corresponding author. Department of Electrical/Electronics and Computer Engineering, Afe Babalola University, Ado-Ekiti, Nigeria.
E-mail addresses: mekuanentde@gmail.com (M.D. Belete), ayodejisalau98@gmail.com (A.O. Salau), girmakassa21@gmail.com (G.K. Alitasb), tigstbezabih9@gmail.com (T. Bezabh).

https://doi.org/10.1016/j.amper.2024.100169
Received 23 June 2023; Received in revised form 9 December 2023; Accepted 3 March 2024
Available online 4 March 2024
2215-0390/© 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

NLP is important for textual information retrieval, speech recognition, word sense disambiguation (WSD), and human-like artificial intelligence (Sarmah and Sama, 2016; Assegie et al., 2023). However, NLP encounters difficulties when a term has many meanings, a condition known as ambiguity. Natural language words can have several lexical meanings. Polysemous words are those that have several meanings or senses. Polysemous words can be characterized in a number of ways depending on the context in which they appear; this is known as the local or sentential context (Phong Thanh et al., 2005). To make computers understand polysemous words, we need the study of Word Sense Disambiguation (WSD), which is "the task of examining word tokens in context and determining which sense of each word is being used" (Agirre and Edmonds, 2007; Ejgu, 2017; Jurafsky and Martin, 2006). Disambiguating word senses has the potential to improve many natural language processing tasks.

Among the Ethiopic languages, one of the languages that faces ambiguity of words is Geez. Ge'ez (ግዕዝ, also known as "Ethiopic" by some) is an ancient South Semitic language that arose in northern Ethiopia and Eritrea on the Horn of Africa (Alemu et al., 2023a, 2023b). Later, it was adopted as the official language of the Kingdom of Aksum and the Ethiopian imperial court. Ge'ez is now solely used in the liturgy of the Ethiopian Orthodox Tewahedo Church, the Eritrean Orthodox Tewahedo Church, the Ethiopian Catholic Church, and the Beta Israel Jewish community. Tigrigna, Tigre, and Amharic are all linked to Ge'ez (Demilie et al., 2022). Geez is also considered an extinct sister language of Tigre and Tigrinya (Kassa, 2018). In Geez, there are letters that have one or more homophonic character representations; these letters have similar syllabic sounds, such as ሀ (ha): ሐ (ha): ኅ (ha'); አ (a): ዐ (a); ሰ (se): ሠ (se); and ጸ (tse): ፀ (tse). The existence of these letters is one of the causes of ambiguity of words. For example, letters with the same sound, such as ሐ and ኀ, or ሰ and ሠ, have different contextual meanings when they appear in writing. This is shown as:

መልኀ (melha) = ንጉሥ መልኀ መጥባህቶ ላዕለ ፀላዕቱ - means, the king brought his sword in front of his enemies.
መልሐ (melha) = ቅዱስ ያሬድ ድምፀ ቃሉ መልሐ በዜማ - means, Saint Yared made his vocal cord tasty with a song.
ሰብሐ (sebiha) = አዳም ሰብሐ አምላኮ በይነ ዘወሀቦ ለሔዋን - means, Adam praised the Lord for having Eve as his companion.
ሠብሐ (sebiha) = ለበአለ ገና ገዝአ ሶር ሠብሐ ስጋሁ - means, the oxen bought for Christmas becomes more fatty.

Because of these homophonic characteristics, language users are subject to typographic mistakes, which include the insertion, deletion, transposition, and substitution of letters. The substitution of these comparable syllabic sounds with their variant letters alters the meaning, resulting in a shift in the cognitive domain of the language. This ambiguity may be seen in computer knowledge representation as well. This is due to the absence of a competent natural language processing system for Ge'ez. This deficiency is evident in the language's information retrieval, information extraction, machine learning, and speech recognition contexts. The presence of word or letter variations causes ambiguity in many languages, including the Geez language. Word ambiguity occurs in Geez when one word has two or more meanings and there is a variation in meaning when reading the same sound. The ambiguity may arise in two ways. The first is when words of the same syllabi have two or more meanings, such as በኩር (bekur), which means first child, and በኩር, which means boss or administrator; or ግብር (gibr), which means governmental tax, and ግብር, which means work or activity. The second is when words of different syllabi but similar sounds refer to two different meanings, such as ጸብሐ (tsebha), which means the sun sets, and ጸብኅ (tsebha), which means make a spiced stew or curry.

To address such variants and ambiguity of terms, different researchers have conducted various studies on many Ethiopian languages. Browsing the available documents online, the following research outputs are observed: Amharic WSD using semi-supervised learning (Wassie et al., 2014); Amharic WSD using WordNet (Hassen, 2015); Amharic minimally supervised machine learning (Ejgu, 2017); an Amharic documents case study of WSD (Kassie, 2009); Amharic unsupervised WSD (Assemu, 2011); Ge'ez corpus-based WSD (Aschale and Anlay, 2021); Tigrigna machine learning WSD (Reda, 2017); Tigrigna semi-supervised machine learning (Muruts, 2018); Afaan Oromo knowledge-based WSD (Olika, 2018); and Afaan Oromo WSD using WordNet (Tesfaye, 2017). All of these studies demonstrate that the Ge'ez language has not received adequate attention. Even the paper by Aschale and Anlay (2021) does not cover equivalent syllabi of the language. Furthermore, it does not handle ambiguous words with more than two meanings. To the best of the authors' current knowledge, no computer-based methods exist for resolving syllabi of comparable words with the same sound but distinct meanings for the Ge'ez language. As a result, we created a Word Sense Disambiguation (WSD) model for the Ge'ez language using a supervised corpus-based technique to eliminate the ambiguity described above. For this study, twenty (ten pairs of homophones) polysemous words were selected to prepare the corpus. These homophones and polysemous words are: ነስሐ and ነስኀ, ሐየሰ and ኀየሰ, ጸመመ and ፀመመ, ቀሰመ and ቀሠመ, መሀረ and መሐረ, ኀለየ and ሐለየ, መልአ and መልዐ, ፈጸመ and ፈፀመ, ሐደመ and ሀደመ, and ሠርሐ and ሰርሐ; these pairs of homophones are pronounced as nesha, hayese, tsememe, qeseme, mehare, haleye, melea, fetseme, hademe, and serha, respectively.

The number of samples prepared to conduct the study is 1010 sentences (50–52 for each polysemous word), with up to nine meanings for each ambiguous term. This study is constrained by a tiny dataset, and just one ambiguous word per phrase is handled, as the language is under-resourced: there is no public dataset, while other works in this area are based on public datasets.

The major contributions of this paper are as follows:

✓ This paper presents a word sense disambiguation model for duplicate alphabet words of the Ge'ez language using corpus-based methods.
✓ 1010 samples of ambiguous words were acquired, because there is no WordNet or other publicly available dataset for the Ge'ez language.
✓ When using Naïve Bayes (NB) for training, Multinomial NB is employed for bag of words or Term Frequency-Inverse Document Frequency (scoring 93.91% for bag of words (BoW) and 90.38% for TFIDF).
✓ The Gaussian NB (simply NB) approach was found to be the worst when word embedding techniques are used for text vectorization.
✓ Utilizing the BoW feature extraction approach and the random forest algorithm, we achieved an accuracy score of 99.52%.

2. Literature review

2.1. Ge'ez language

In Ethiopia, there are about 83 languages, and one of them is Geez (Hagos and Asefa, 2020). Ge'ez, also known as the Ethiopic script, is one of the ancient languages of the country, possibly developed from the Sabaean/Minean script (Omniglot, 2018). Geez is widely used in the Ethiopian Orthodox church nowadays. It has letters (Fidel) forming a syllabary of 34 × 7 syllables. It is delivered as a course at higher education institutions such as BDU (at the postgraduate level) (Faculty_of_Humanities, 2021) and at DMU under a strategic-level plan (Haddis_Alemayehu_Cultural, 2021), and as a supplementary course in many primary schools. The earliest known inscriptions in the Ge'ez script date to the 5th century BC. The type of writing system in Geez is termed an abugida (አቡጊዳ). Ge'ez linguistically evolved and produced the Tigrigna and Amharic languages (Demilie and Salau, 2022; Hagos and Asefa, 2020).

2.2. Word sense disambiguation approaches

In a Word Sense Disambiguation (WSD) task, the three basic approaches are knowledge-based, corpus-based, and hybrid approaches.


2.2.1. Knowledge-based WSD

Disambiguation was performed using WordNets and other ontologies to discover the pairings of dictionary senses having the greatest word overlap in their meanings. This technique employs explicit lexica, such as Machine Readable Dictionaries (MRDs), thesauri, ontologies, collocations, and so on, to extract knowledge from word definitions and relationships across word senses, in order to assign the proper meaning to a word locally by specifying explicit sense distinctions (Temesgen, 2021; Zhang, 2020). The fundamental disadvantage of this strategy is that it requires a knowledge source in the relevant language and domain. However, it is preferred because it does not require training, can disambiguate random text, and can resolve confusing words by sentential context (Agirre and Edmonds, 2007).
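To make the sense-overlap idea concrete, the following is a minimal Lesk-style sketch in Python (our illustration, not code from any of the cited systems; the two-sense gloss inventory is invented):

    # Simplified Lesk-style disambiguation: pick the sense whose dictionary
    # gloss shares the most words with the sentence context (illustrative only).
    def lesk(context_tokens, sense_glosses):
        context = set(context_tokens)
        best_sense, best_overlap = None, -1
        for sense, gloss in sense_glosses.items():
            overlap = len(context & set(gloss.split()))
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense

    # Hypothetical two-sense inventory for an ambiguous word.
    glosses = {
        "bank/river": "sloping land beside a body of water",
        "bank/finance": "institution that accepts deposits and lends money",
    }
    print(lesk("they fished from the land beside the water".split(), glosses))
    # -> bank/river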
2.2.2. Corpus-based WSD

Corpus-based WSD uses machine-learning techniques to disambiguate words according to their context. In this approach, a corpus that contains sense samples is prepared, and a machine-learning algorithm is applied to train the required system, such that the system is utilized to identify the meaning of the polysemous word locally. This approach can be further categorized into supervised, unsupervised, and semi-supervised techniques.

2.2.3. Hybrid approaches

Both corpus- and knowledge-based techniques have advantages and disadvantages. The hybrid approach combines both corpus- and knowledge-based approaches, with the goal of gaining the benefit of having more knowledge sources together with machine learning (corpus) approaches, because knowledge-based methods help corpus-based methods achieve better performance in the disambiguation process, and vice versa (Shree and Shambhavi, 2015).
2.3. Related work

Teshome (1999) used a knowledge-based approach to construct an Amharic disambiguation algorithm and to examine the performance of an information retrieval system for the Amharic language when linguistic disambiguation is used. The author built a thesaurus using the Ethiopian Penal Code, which comprises 865 entries. The researcher compared his work to the Lucien algorithm and found that his work outperformed it. However, the author evaluated WSD using faux words rather than genuine sense-tagged words.

Solomon (2011) used a supervised corpus-based approach for Amharic language WSD. The author used a monolingual English corpus of 1045 English sense samples, gathered from the British National Corpus for five main words, metrat (መጥራት), mesal (መሳል), atena (አጠና), mesasat (መሳሳት), and keretse (ቀረጻ); these sense samples were then translated to Amharic, manually annotated, and preprocessed. The author concluded that the Naive Bayes (NB) method achieved the highest accuracy, at 83%. Solomon (2010), on the other hand, used an unsupervised corpus-based technique on the same dataset as Solomon (2011). The author ran five clustering methods (average link, full link, expectation maximization (EM), hierarchical agglomerative single link, and simple k-means) through their paces. The author determined that the basic k-means clustering algorithm produced the maximum accuracy of 79.4% among the techniques chosen. As a result, the results reveal that for corpus-based approaches, supervised techniques outperform unsupervised ones. However, for Amharic language challenges in general, the authors in Solomon (2010, 2011) lacked a standard sense-annotated corpus and other machine-readable language resources such as glossaries and thesauri. Meresa (2017) and Teklay (2018) applied unsupervised and semi-supervised corpus-based approaches to disambiguate polysemous words in Tigrigna texts, respectively.

Meresa (2017) used the ambiguous words ሓለፈ (Halefe), መደብ (medeb), ሃደመ (hademe), and ከበረ (kebere) and a total of 631 Tigrigna sense examples for the four ambiguous words. The author utilized the same machine learning algorithms as Solomon (2010), and achieved a maximum accuracy of 83.3% using the EM algorithm. Teklay (2018), meanwhile, used a total of 1250 Tigrigna sense examples to disambiguate the words kefele (ከፈለ), OareQe (ዓረቐ), seOare (ሰዓረ), Halefe (ሓለፈ), and medeb (መደብ), and achieved an accuracy of 93.7% using ADTree. To disambiguate polysemous words of the Ge'ez language for the WSD task in NLP, Aschale and Anlay (2021) evaluated three corpus-based approaches. The authors obtained a total of 2119 Ge'ez sense samples for six ambiguous words. SMO, Naïve Bayes, Bagging, EM, Simple K-means, Farthest First, and Hierarchical clustering machine learning algorithms were used to train the required systems. The authors reported that the AD Tree of the semi-supervised approach outperformed the other methods by achieving an accuracy of 91.1%. However, the authors did not handle ambiguous words that have more than two senses.

Wassie (2022) presented a machine translation system for the Ge'ez language. The author showed how the proposed translator effectively translated the Ge'ez language with high accuracy. A performance comparison of word sense disambiguation approaches for Indian languages was performed by Shree and Shambhavi (2015). Alemu and Fante (2021) created a Ge'ez language word meaning disambiguation prototype model utilizing a corpus-based method to identify training sets that minimize the quantity of needed human involvement. The review paper of Gujjar et al. (2023) provides an overview of several ways used to resolve ambiguity in Hindi words, including supervised, unsupervised, and knowledge-based methods. A multi-modal retrieval system was presented in Yin and Huang (2023) that makes the most use of pretrained Vision-Language models, as well as open knowledge bases and datasets. Their system is made up of the following components: (1) Gloss matching: a pretrained bi-encoder model is utilized to match situations with appropriate meanings of target words; (2) Prompting: matched glosses and additional textual information, such as synonyms, are inserted using a prompting template; (3) Image retrieval: utilizing prompts as queries, semantically matched images are recovered from vast open datasets; (4) Modality fusion: contextual information from multiple modalities is fused and utilized for prediction. Jarrar et al. (2023) used Target Sense Verification to create an end-to-end Word Sense Disambiguation system, which was used to assess three Target Sense Verification models that were available in the literature. Their top model has an accuracy of 84.2% when using Modern and 78.7% while utilizing Ghani.

3. Methodology

This section provides a brief overview of the proposed method for contextual word disambiguation in the Ge'ez language with duplicate alphabets. Fig. 1 depicts the primary components of the proposed system for completing the various activities. This section covers data collection and annotation, text preparation, data splitting, feature extraction, model development, and evaluation.

3.1. Data collection and data annotation

To create the necessary mechanism for disambiguating words with duplicate alphabets, we first selected words with the same pronunciation but different writing, because we had the same sounds but distinct morphs of Ge'ez letters.

The authors instructed a local Ge'ez instructor to choose duplicate-alphabet, confusable words and provide examples for the selected words. As a result, ten pairs of words with duplicate alphabets were chosen randomly. Possible meanings were identified for each polysemous word, and 1010 examples or phrases were constructed accordingly. The examples are written in such a way that the senses per word are balanced. A sketch of a possible storage layout for these annotated examples is given below.
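For concreteness, here is a minimal sketch of one way such sense-annotated examples could be stored (the CSV layout, file name, and sense labels are our assumptions; the authors' dataset is not public):

    # Hypothetical storage layout for the annotated corpus: one row per example
    # sentence, with the ambiguous word and its annotated sense label.
    import csv

    rows = [
        ("ንጉሥ መልኀ መጥባህቶ ላዕለ ፀላዕቱ", "መልኀ", "melha/draw_sword"),      # assumed label
        ("ቅዱስ ያሬድ ድምፀ ቃሉ መልሐ በዜማ", "መልሐ", "melha/make_melodious"),  # assumed label
    ]
    with open("geez_wsd_corpus.csv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["sentence", "ambiguous_word", "sense"])
        writer.writerows(rows)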


Fig. 1. Proposed methodology.

3.2. Data preprocessing

The gathered and tagged text data, referred to as the corpus hereafter, is prepared for the following stage, which covers data preparation and the handling of unbalanced samples per sense. Text-preprocessing operations such as tokenization and punctuation mark removal are performed for data preprocessing, since the data is text. We did not undertake stop word removal, normalization, or stemming for this study. Stemming and stop word removal were not performed due to a lack of resources and the authors' inability to write or understand the Ge'ez language, while normalization was not performed because the Ge'ez language is case sensitive.

• Removing punctuation marks: Punctuation marks are treated as noise. They are used to separate lists, sentences, and words in written language. Even though in today's written language the two dots are replaced by a single space to separate words, two dots are still used to divide words in Ge'ez literature. Punctuation marks, on the other hand, are unneeded in linguistic computations like WSD and are considered noise. As a result, punctuation marks should be eliminated.
• Tokenization: Text should be tokenized to vectorize it, that is, to transform words into numbers. The corpus is tokenized into sentences, and then the sentences are tokenized into words. A sketch of both steps follows this list.
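A minimal sketch of both preprocessing steps in Python (our illustration; the Ethiopic punctuation set below is an assumption based on common Ge'ez marks such as the word separator ፡ and the full stop ።):

    import re

    ETHIOPIC_PUNCT = "፡።፣፤፥፦፧፨"  # assumed list of Ethiopic punctuation marks

    def preprocess(sentence):
        # Replace Ethiopic punctuation (including the two-dot word separator)
        # with spaces, then drop any remaining punctuation treated as noise.
        for mark in ETHIOPIC_PUNCT:
            sentence = sentence.replace(mark, " ")
        sentence = re.sub(r"[^\w\s]", " ", sentence)
        # Tokenize the cleaned sentence into word tokens.
        return sentence.split()

    print(preprocess("ንጉሥ መልኀ መጥባህቶ ላዕለ ፀላዕቱ።"))
    # -> ['ንጉሥ', 'መልኀ', 'መጥባህቶ', 'ላዕለ', 'ፀላዕቱ']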

Handling imbalanced cases in the dataset is another duty of data preparation. Unbalanced data is addressed via sampling techniques known as samplers, which include Synthetic Minority Oversampling, undersampling, oversampling, and random oversampling. Because sampling approaches require numerical inputs, the textual data should first be transformed to vectors through the feature extraction process described in Section 3.3; the vectors are then input into the samplers. Except for the random oversampler, the vectors were mostly unsuitable for the conditions required by the listed samplers. To balance the data, the random oversampling technique was therefore used.
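A minimal sketch of the random oversampling step, assuming the imbalanced-learn package and a toy vectorized corpus (the data below is invented):

    from imblearn.over_sampling import RandomOverSampler
    from sklearn.feature_extraction.text import CountVectorizer

    # Toy stand-in: three examples of one sense, two of another.
    sentences = ["nesha sense1 aa", "nesha sense1 bb", "nesha sense1 cc",
                 "nesha sense2 dd", "nesha sense2 ee"]
    labels = ["s1", "s1", "s1", "s2", "s2"]

    X = CountVectorizer().fit_transform(sentences)  # vectorize first
    X_res, y_res = RandomOverSampler(random_state=42).fit_resample(X, labels)
    print(X_res.shape[0])  # 6: the minority sense is duplicated to balance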
3.3. Feature extraction

After the text has been analyzed, it is sent through a feature extractor to be represented in vector space, transforming words into numbers. Because feature extraction may increase learning algorithm accuracy and reduce time, it is critical to select the optimal feature extraction algorithms (Zareapoor, 2015; Singh et al., 2013; Liang et al., 2017). The technique of converting words into vectors is known as feature extraction. In this study, bag of words (BoW), TFIDF, and word embeddings such as word2vec and fastText are used to transform texts into vectors. We used the default settings while utilizing bag of words and TFIDF, since the default parameter combinations outperformed alternative combinations.
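For reference, the TFIDF weighting used alongside BoW follows the standard formulation (textbook version; library implementations such as scikit-learn's add smoothing terms):

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}

where tf(t, d) is the frequency of term t in sentence d, N is the number of sentences in the corpus, and df(t) is the number of sentences that contain t.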
However, with word2vec and fastText, we obtained superior representations when epoch = 200, vector size = 200, and window size = 10 were used for both the continuous bag of words (CBOW) and skip gram (SG) techniques, while the remaining parameters were left at their default values. The default values of epoch, vector size, and window size are 5, 100, and 5, respectively. We employed both the CBOW and SG methods for the word embedding techniques, since they are both competitive (Degu et al., 2023), and we took the better result of the CBOW and SG techniques for presentation in Table 2.
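A sketch of this embedding setup using the gensim library (assumed tooling; the paper does not name its implementation), with the reported epoch = 200, vector size = 200, and window size = 10:

    import numpy as np
    from gensim.models import FastText, Word2Vec

    # Toy tokenized corpus standing in for the preprocessed Ge'ez sentences.
    corpus = [["ንጉሥ", "መልኀ", "መጥባህቶ"], ["ቅዱስ", "ያሬድ", "መልሐ", "በዜማ"]]

    common = dict(vector_size=200, window=10, epochs=200, min_count=1)
    w2v_cbow = Word2Vec(corpus, sg=0, **common)   # CBOW variant
    w2v_sg = Word2Vec(corpus, sg=1, **common)     # skip gram variant
    fasttext = FastText(corpus, sg=0, **common)

    # One common way to get a sentence vector: average its word vectors.
    sent_vec = np.mean([w2v_cbow.wv[w] for w in corpus[0]], axis=0)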

3.4. Dataset splitting, development of models and evaluation

The converted text, now known as the dataset, is divided into train and test sets for training and evaluating the models, respectively, using the train_test_split method. From the 1010 instances obtained, 80% were utilized as train data and the remaining 20% as test data. Because the data collection is unbalanced, the training examples are oversampled. Machine learning algorithms are used to train the models, as detailed in section two. In this work, machine learning methods such as Naive Bayes, decision tree, random forest, K-nearest neighbor, support vector machine, and logistic regression were utilized to train the required models, yielding 24 models from the combinations of the four feature extraction approaches and the six machine learning algorithms. To attain the best model performance, we tuned additional influential parameters, such as the depth for random forest and the kernel for support vector machine, while training the WSD system. After training the models, the next step is to evaluate them, and the accuracy of the models is measured. Machine-learning models are evaluated using classification reports and confusion matrices. However, because we had 60 classes, handling confusion matrices, F-scores, precision, and recall scores of the models became too challenging. As a result, the accuracy score was used to evaluate the models presented in this study.
the data, random over sampling techniques were used. ቀሠመ 1 50 ፈፀመ 9 50
ሐደመ 2 51 ሠርሐ 3 50
3.3. Feature extraction ሀደመ 2 50 ሰርሐ 2 50

This section presents and analyzes the findings produced from various combi­
After the text has been analyzed, it is sent through a feature extractor nations of feature extraction techniques and training algorithms, as shown in
to be represented in vector space, transforming words to numbers. Table 2.


As indicated in Table 1, the total number of examples is 1010, obtained by summing the Examples columns. The number of senses of each (pair of) ambiguous words is not the same; hence, the training samples were oversampled from 808 (80% of 1010) to 2496. However, the test data (20% of the total examples, i.e., 202 sentences) was not resampled. The following presents and analyzes the findings produced from the various combinations of feature extraction techniques and training algorithms, as shown in Table 2.

Table 2
Accuracy scores of the models.

Feature extraction   NB      DT      RF      KNN     LSVM    LR
BoW                  0.9391  0.9727  0.9952  0.8990  0.9919  0.9919
TFIDF                0.9038  0.9711  0.9887  0.7980  0.9903  0.9823
Word2Vec             0.5865  0.9471  0.9775  0.9278  0.9807  0.9823
FastText             0.5945  0.9375  0.9775  0.9375  0.9791  0.9679

As presented in Table 2, BoW obtained a 99.52% performance when combined with the RF algorithm. TFIDF (99.03%), Word2Vec (98.23%), and FastText (97.91%) performed best when paired with LSVM, LSVM, and LR, respectively. However, when BoW (89.90%) and TFIDF (79.80%) were combined with KNN, these vectorizers performed poorly. In contrast, both Word2Vec (58.65%) and FastText (59.45%) fared poorly with the NB algorithm. Except for KNN, all of the machine learning algorithms performed at their best when paired with BoW, including RF (99.52%), LSVM (99.19%), LR (99.19%), DT (97.27%), and NB (93.91%). KNN, on the other hand, achieved its best score, 93.75%, with FastText. NB (58.65%), KNN (79.80%), RF (97.75%), LR (96.79%), and LSVM (97.91%) achieved their lowest accuracies when paired with Word2Vec, TFIDF, FastText (or Word2Vec), FastText, and FastText, respectively.

Fig. 2. Test accuracy scores of each model.

In general, as shown in Fig. 2, we found that:

• The LSVM and LR models showed consistency.
• The NB algorithm achieved the poorest results, especially for the word embeddings (the sketch after this list illustrates why).
• Unlike NB, KNN performed better for the word embeddings.
• The RF algorithm achieved the best accuracy, at 99.52%.
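Regarding the NB observation in the list above, Section 5.1 explains the cause: Multinomial NB rejects the negative feature values that word2vec and fastText produce, so Gaussian NB has to be used instead. A tiny illustration with synthetic features (assumed data):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB, MultinomialNB

    y = np.array([0, 1, 0, 1])
    counts = np.array([[1, 0], [0, 2], [2, 0], [0, 1]])  # BoW-style counts
    embeds = np.random.randn(4, 2)                       # mean word vectors

    MultinomialNB().fit(counts, y)   # fine: inputs are non-negative counts
    GaussianNB().fit(embeds, y)      # handles real-valued embedding features
    # MultinomialNB().fit(embeds, y) # would raise ValueError (negative values)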

4.2. Comparison with related works

Despite the fact that various authors utilized different datasets, and that not all of the works given in Table 3 are in the same language, we have compared the various systems' performances in Table 3. As shown in Table 3, this study achieved the highest accuracy score for WSD (99.5%) in the case of Ethiopian languages. In addition to the suggested approach, this work provides a corpus of 1010 Ge'ez sense sentences for the use of WSD.

Table 3
Comparison of the proposed approach with related works.

Author                     Method                       Algorithm           Accuracy (%)
Solomon (2010)             WSD for Amharic text         k-means clustering  79.4
Meresa (2017)              WSD for Tigrigna text        EM                  83.3
Aschale and Anlay (2021)   Corpus-based WSD for Ge'ez   AD Tree             91.1
Teklay (2018)              Tigrigna WSD                 AD Tree             93.7
Proposed                   Homophonic WSD for Ge'ez     Random forest       99.5

We also employed deep learning algorithms, namely a Deep Neural Network (DNN) and Long Short-Term Memory (LSTM), which achieved a high performance, as presented in Table 4. The results show that both the DNN and the LSTM achieved a 100% accuracy after 150 epochs; a sketch of such a setup follows.

Table 4
Comparison of deep learning approaches.

Deep learning algorithm   Accuracy (categorical accuracy)   Loss (sparse_categorical_crossentropy)   Epochs
DNN                       100%                              0.04 (4%)                                150
LSTM                      100%                              0.03 (3%)                                150
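A hedged sketch of such a deep learning setup in Keras (the loss function and epoch count follow Table 4, and the 60-class output follows Section 3.4; the layer sizes, vocabulary size, and sequence length are our assumptions, not the authors' published architecture):

    import numpy as np
    import tensorflow as tf

    num_classes, vocab_size, max_len = 60, 2000, 20  # assumed dimensions

    # Toy integer-encoded sentences standing in for the tokenized corpus.
    X = np.random.randint(1, vocab_size, size=(808, max_len))
    y = np.random.randint(0, num_classes, size=(808,))

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 128),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=150, batch_size=32, verbose=0)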
5. Conclusion and recommendations

5.1. Conclusion

This paper presented the development of word sense disambiguation (WSD) for the Ge'ez language. To achieve this, we began by preparing a corpus for the selected pairs of duplicated words. Next, the corpus was preprocessed by applying a punctuation mark remover and a tokenizer, through a process called text preprocessing. After text preprocessing, feature extraction, dataset splitting, training, and finally testing of the models were carried out.

The results show that the bag of words (BoW) text representation technique outperformed TFIDF. When we compared the word-embedding techniques, word2vec outperformed the fastText technique. On the other hand, when the frequency-based text vectorizers (bag of words and TFIDF) were compared with the word embeddings, both the bag of words and TFIDF based models outperformed the word embedding based models, except for KNN, when the machine learning algorithms were applied on a small dataset for training on the Ge'ez language.

When utilizing Naïve Bayes (NB) for training in the case of bag of words or TFIDF, Multinomial NB (scoring 93.91% for BoW and 90.38% for TFIDF) was used. In the case of word embeddings, however, Gaussian NB (scoring 59%) was used, because Multinomial NB does not accept negative inputs. As a result, we discovered that the Gaussian NB (simply NB) approach is the worst when word embedding techniques are used for text vectorization. Finally, utilizing the bag of words feature extraction approach and the random forest algorithm, we achieved an accuracy of 99.52%.

5.2. Recommendation

Texts are preprocessed in word disambiguation, text categorization, and similar tasks, and the process is known as natural language processing. Natural language preparation, on the other hand, is a significant difficulty for under-resourced languages such as Geez.


As a result, we suggest that Geez language specialists and software developers collaborate to create a Geez corpus and build Geez language-based text processing tools. A large dataset and stemmed words can help the suggested model perform better. As a result, this work may be extended by increasing the dataset and incorporating a word-stemming module into the system.

In addition, this paper tackles word disambiguation in the Geez language using machine learning. However, for text-related systems, recurrent neural network models such as bi-directional LSTM and bi-directional GRU are recommended. Consequently, future studies can use deep learning methods such as transfer learning and RNNs to concentrate on WSD for Ge'ez utilizing big datasets.

Funding

The authors declare no funding for this research.

Availability of data

The datasets generated during and/or analyzed during the current study are not publicly available, but are available from the corresponding author on reasonable request.

Code availability

Not applicable.

CRediT authorship contribution statement

Mequanent Degu Belete: Conceptualization, Data curation, Formal analysis, Methodology, Resources, Software, Visualization. Ayodeji Olalekan Salau: Data curation, Formal analysis, Investigation, Methodology, Supervision, Validation, Writing – review & editing. Girma Kassa Alitasb: Data curation, Formal analysis, Resources, Writing – original draft. Tigist Bezabh: Investigation, Methodology, Resources, Software, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

Not applicable.

References

Agirre, Eneko, Edmonds, Philip, 2007. Word sense disambiguation. Text, Speech and Language Technology 33, 1–366. https://doi.org/10.1007/978-1-4020-4809-8.
Alemu, A., Fante, K., 2021. A corpus-based word sense disambiguation for Geez language. Ethiopian Journal of Science and Sustainable Development 8 (1), 94–104. https://doi.org/10.20372/ejssdastu:v8.i1.2021.283.
Alemu, A.A., Melese, M.D., Salau, A.O., 2023a. Ethio-Semitic language identification using convolutional neural networks with data augmentation. Multimed. Tool. Appl. https://doi.org/10.1007/s11042-023-17094-y.
Alemu, A.A., Melese, M.D., Salau, A.O., 2023b. Towards audio-based identification of Ethio-Semitic languages using recurrent neural network. Sci. Rep. 13, 19346. https://doi.org/10.1038/s41598-023-46646-3.
Aschale, A., Anlay, K., 2021. Corpus-based word sense disambiguation for Ge'ez language. Ethiopian Journal of Science and Sustainable Development 8 (1), 94–104.
Assegie, T.A., Salau, A.O., Omeje, C.O., Braide, S.L., 2023. Multivariate sample similarity measure for feature selection with a resemblance model. Int. J. Electr. Comput. Eng. 13 (3), 3359–3366. https://doi.org/10.11591/ijece.v13i3.pp3359-3366.
Assemu, S., 2011. Unsupervised Machine Learning Approach for Word Sense Disambiguation of Amharic Words. Addis Ababa University, Addis Ababa, Ethiopia.
Basili, R., Rocca, M.D., Pazienza, M.T., 1997. Contextual word sense tuning and disambiguation. Appl. Artif. Intell. 11 (3), 235–262. https://doi.org/10.1080/088395197118244.
Degu Belete, M., Tesfahun, A., Takele, H., March 2023. Amharic Language Hate Speech Detection System from Facebook Memes Using Deep Learning. Unpublished.
Demilie, W.B., Salau, A.O., 2022. Automated all in one misspelling detection and correction system for Ethiopian languages. J. Cloud Comput. 11, 48. https://doi.org/10.1186/s13677-022-00299-1.
Demilie, W.B., Salau, A.O., Ravulakollu, K.K., 2022. Evaluation of part of speech tagger approaches for the Amharic language: a review. In: 2022 9th International Conference on Computing for Sustainable Global Development (INDIACom), pp. 569–574. https://doi.org/10.23919/INDIACom54597.2022.9763213.
Edmonds, Philip, Agirre, Eneko, 2007. Word Sense Disambiguation: Algorithms and Applications. Springer.
Ejgu, A., 2017. Minimally Supervised Machine Learning Word Sense Disambiguation for Amharic Text. University of Gondar, Gondar, Ethiopia.
Faculty_of_Humanities, 2021. Humanities to teach Ge'ez (BDU). Retrieved Feb 11, 2021, from https://bdu.edu.et/fh/?q=node/215.
Gujjar, V., Mago, N., Kumari, R., Patel, S., Chintalapudi, N., Battineni, G., 2023. A literature survey on word sense disambiguation for the Hindi language. Information 14, 495. https://doi.org/10.3390/info14090495.
Haddis_Alemayehu_Cultural, 2021. Haddis Alemayehu Cultural Studies and Ministry of Culture and Tourism develop strategic plan for the development of Geez language (DMU). Retrieved Feb 10, 2021, from http://www.dmu.edu.et/debre-markos-university-haddisalemayehu-cultural-studies-and-ministry-of-culture-and-tourism-develop-strategic-plan-for-the-development-of-geez-language/.
Hagos, Gebremeskel, Asefa, Abera, 2020. Linguistic evolution of Ethiopic language: a comparative discussion. International Journal of Interdisciplinary Research and Innovations 8 (1), 1–9.
Hassen, S., 2015. Amharic Word Sense Disambiguation Using WordNet. Addis Ababa University, Addis Ababa, Ethiopia.
Jackson, Peter, Moulinier, Isabelle, 2002. Natural Language Processing for Online Applications: Text Retrieval and Categorization. John Benjamins Publishing Company, Amsterdam.
Jarrar, M., Malaysha, S., Hammouda, T., Khalilia, M., 2023. SALMA: Arabic sense-annotated corpus and WSD benchmarks. arXiv preprint arXiv:2310.19029. https://doi.org/10.48550/arXiv.2310.19029.
Joseph, Sethumya R., Halomany, Halomani, Letsholo, Kelesto, Kaniwa, Freeson, 2016. Natural language processing: a review. International Journal of Research in Engineering and Applied Science 6 (3).
Jurafsky, Daniel, Martin, James H., 2006. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
Jurafsky, Daniel, Martin, James H., 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Retrieved Feb 2, 2021, from https://www.researchgate.net/publication/200111340_Speech_and_Language_Processing_An_Introduction_to_Natural_Language_Processing_Computational_Linguistics_and_Speech_Recognition.
Kassa, T., 2018. Morpheme-Based Bi-directional Ge'ez-Amharic Machine Translation. Addis Ababa University. Retrieved Feb 10, 2021.
Kassie, T., 2009. Word Sense Disambiguation for Amharic Text Retrieval: A Case Study for Legal Documents. Addis Ababa University, Addis Ababa, Ethiopia.
Kritharoula, A., Lymperaiou, M., Stamou, G., 2023. Large language models and multimodal retrieval for visual word sense disambiguation. arXiv preprint arXiv:2310.14025.
Liang, Hong, Sun, Xiao, Sun, Yunlei, Gao, Yuan, 2017. Text feature extraction based on deep learning: a review. EURASIP J. Wirel. Commun. Netw. 1, 1–12.
Meresa, M., 2017. Unsupervised Machine Learning Approach for Tigrigna Word Sense Disambiguation. Unpublished Master's thesis, University of Gondar.
Muruts, T., 2018. Word Sense Disambiguation for Tigrigna Language Using Semi-supervised Machine Learning Approach. Addis Ababa University, Addis Ababa, Ethiopia.
Navigli, Roberto, 2009. Word sense disambiguation: a survey. ACM Comput. Surv. 41 (2).
Olika, S., 2018. Word Sense Disambiguation for Afaan Oromo Using Knowledge Base. St. Mary's University, Addis Ababa, Ethiopia.
Omniglot, 2018. Ge'ez script: the online encyclopedia of writing systems and languages. Retrieved 2021, from https://omniglot.com/writing/ethiopic.htm.
Phong, Thanh, Tou, Hwee, Lee, Wee Sun, 2005. Word sense disambiguation with semi-supervised learning. In: AAAI'05: Proceedings of the 20th National Conference on Artificial Intelligence, Volume 3, July 2005, pp. 1093–1098. https://doi.org/10.5555/1619499.1619509.
Reda, M.M., 2017. Unsupervised Machine Learning Approach for Tigrigna Word Sense Disambiguation. University of Gondar, Gondar, Ethiopia.
Rippeth, E., Carpuat, M., Duh, K., Post, M., 2023. Improving word sense disambiguation in neural machine translation with salient document context. arXiv preprint arXiv:2311.15507.
Sarmah, J., Sama, S.K., 2016. Survey on word sense disambiguation: an initiative towards an Indo-Aryan language. I.J. Engineering and Manufacturing 3, 37–52.
Shree, M.R., Shambhavi, B.R., 2015. Performance comparison of word sense disambiguation approaches for Indian languages. In: 2015 IEEE International Advance Computing Conference (IACC), Bangalore, India, pp. 166–169. https://doi.org/10.1109/IADCC.2015.7154691.
Singh, V., Kumar, B., Patnaik, T., 2013. Feature extraction techniques for handwritten text in various scripts: a survey. Int. J. Soft Comput. Eng. 3 (1), 238–241.
Solomon, M., 2010. Word Sense Disambiguation for Amharic Text: A Machine Learning Approach. Unpublished Master's thesis, Addis Ababa University.
Solomon, A., 2011. Unsupervised Machine Learning Approach for Amharic Word Sense Disambiguation. Unpublished Master's thesis, Addis Ababa University, Ethiopia.


Teklay, M., 2018. Word Sense Disambiguation for Tigrigna. Unpublished Master's thesis, Addis Ababa University, Addis Ababa.
Temesgen, T., 2021. Word Sense Disambiguation for Wolaita Language Using Machine Learning Approach. Adama, Ethiopia, August.
Tesfaye, B., 2017. Afaan Oromo Word Sense Disambiguation Using WordNet. Addis Ababa University, Addis Ababa, Ethiopia.
Teshome, K., 1999. Word Sense Disambiguation for Amharic Text Retrieval: A Case Study for Legal Documents. Master's thesis, Addis Ababa University.
Wassie, A.K., 2022. Machine Translation for Ge'ez Language, pp. 1–10. https://arxiv.org/ftp/arxiv/papers/2311/2311.14530.pdf.
Wassie, Getahun, Babu, Ramesh, Teferra, Solomon, Meshesha, Million, 2014. A word sense disambiguation model for Amharic words using semi-supervised learning paradigm. Sci. Technol. Arts Res. J. 3 (3), 147–155.
Yin, Z., Huang, X., 2023. HKUST at SemEval-2023 Task 1: visual word sense disambiguation with context augmentation and visual assistance. arXiv preprint arXiv:2311.18273. https://doi.org/10.48550/arXiv.2311.18273.
Zareapoor, M., 2015. Feature extraction or feature selection for text. I.J. Information Engineering and Electronic Business 2, 60–65.
Zhang, S., Nath, S., Mazzaccara, D., 2023. GPL at SemEval-2023 Task 1: WordNet and CLIP to disambiguate images. In: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), pp. 1592–1597. Association for Computational Linguistics, Toronto, Canada.
Zhang, Y., 2020. Feature extraction with TF-IDF and game-theoretic shadowed sets. Information Processing and Management of Uncertainty in Knowledge-Based Systems 1237, 722–733. https://doi.org/10.1007/978-3-030-50146-4_53.
