Mihiret Bekele Proposal
SCHOOL OF INFORMATICS
DEPARTMENT OF COMPUTER SCIENCE
Proposal Title:
Word Sense Disambiguation for the Wolaita
Language Using a Machine Learning Approach
By: Mihiret Bekele
Computer Science, MSc (Regular)
Submitted to: Dr. Tewodros Abebe
Submission date: Jan 23, 2023
List of Tables
Table 1 Estimation of budgets
Table 2 Schedule for research completion
Contents
1 Introduction
1.1 Motivation of the Study
4 Significance of Research
5 Research Methodology
5.1 Literature Review
5.2 Rule based Approach
5.3 Data Collection
5.4 Tools and Techniques
5.5 Performance Analysis
5.6 Related Works
6 Budget
7 Schedule
8 Reference
1 Introduction
Applications of natural language processing (NLP) such as machine
translation, speech recognition, information retrieval and web-based
question answering have advanced many areas of life. However, these
applications share a common problem: ambiguity. Word Sense
Disambiguation (WSD) has been developed for many languages to address
it, and it has many applications. In speech synthesis, WSD helps
determine the correct pronunciation of a word so that the generated
speech sounds natural; this is difficult because some words are
pronounced in more than one way depending on their context. In
machine translation, WSD is required because a word in the source
language may have more than one possible translation in the target
language, so translating a text correctly requires knowing which
sense is intended. In information retrieval, search engines need WSD
to filter out documents whose senses are irrelevant to the query.

The goal of NLP is to get computers to perform useful tasks involving
human language, such as enabling human-machine communication,
improving human-human communication, or simply doing useful
processing of text or speech [1]. However, computers cannot
understand what human beings understand easily. One of the most
challenging issues in NLP is ambiguous words: a computer cannot by
itself identify the sense of each word in the context of a given text
or speech, which makes the high-level NLP tasks discussed above
difficult. WSD addresses this problem. It is a natural classification
problem: given a word and its possible senses, as defined by a
dictionary, classify an occurrence of the word in context into one or
more of its sense classes [2]. Three main approaches are used in WSD:
the corpus-based approach, the knowledge-based approach and the
hybrid approach. The corpus-based approach is further divided into
supervised, unsupervised and semi-supervised learning.

WSD was first formulated as a distinct computational task during the
early days of machine translation in the 1940s, making it one of the
oldest problems in computational linguistics. Warren Weaver [3], in
his famous 1949 memorandum on translation, first introduced the
problem in a computational context. Early researchers understood the
significance and difficulty of WSD well. In the 1970s, WSD was a
subtask of semantic interpretation systems developed within the field
of artificial intelligence. However, since WSD systems were at the
time largely rule-based and hand-coded, they were prone to a
knowledge acquisition bottleneck. By the 1980s large-scale lexical
resources, such as the Oxford Advanced Learner's Dictionary of
Current English (OALD), became available: hand-coding was replaced
with knowledge automatically extracted from these resources, but
disambiguation was still knowledge-based or dictionary-based. In the
1990s, the statistical revolution swept through computational
linguistics, and WSD became a paradigm problem on which to apply
supervised machine learning techniques. The 2000s saw supervised
techniques reach a plateau in accuracy, and so attention shifted to
coarser-grained senses, domain adaptation, semi-supervised and
unsupervised corpus-based systems, combinations of different methods,
and the return of knowledge-based systems via graph-based methods.
Still, supervised systems continue to perform best [4].

In Ethiopia, many researchers have worked on WSD for Amharic, Afan
Oromo and Tigrigna using supervised, unsupervised and semi-supervised
approaches. More recently, knowledge-based approaches have also been
applied [5] [6] [7] [2]. In this research, we are going to develop
WSD for the Wolaita language using a machine learning approach. Like
English, Amharic, Afan Oromo and other languages, Wolaita has
ambiguous words that are polysemous. As Web and AI technologies
develop rapidly, under-resourced languages may face difficulties. To
address this ambiguity problem, we intend to build WSD with a machine
learning approach using a manually annotated sample corpus.
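The classification view of WSD described above can be illustrated
with a small supervised sketch. It is not the proposal's chosen
algorithm: it assumes a Naive Bayes classifier over bag-of-words
context features, and the English word "bank", the sense labels and
the training sentences are hypothetical stand-ins for annotated
Wolaita examples.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Collect counts from (context_tokens, sense_label) pairs."""
    sense_counts = Counter()            # number of examples per sense
    word_counts = defaultdict(Counter)  # per-sense context word counts
    vocab = set()
    for tokens, sense in examples:
        sense_counts[sense] += 1
        for tok in tokens:
            word_counts[sense][tok] += 1
            vocab.add(tok)
    return sense_counts, word_counts, vocab


def predict_sense(model, tokens):
    """Return the sense with the highest log-probability for the context."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log prior of the sense
        denom = sum(word_counts[sense].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[sense][tok] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Hypothetical toy data: context words around the ambiguous word "bank".
TRAIN = [
    (["deposit", "money", "account"], "bank/finance"),
    (["loan", "interest", "money"], "bank/finance"),
    (["river", "water", "fishing"], "bank/river"),
    (["muddy", "river", "shore"], "bank/river"),
]

model = train_naive_bayes(TRAIN)
print(predict_sense(model, ["money", "loan"]))
print(predict_sense(model, ["river", "water"]))
```

In an actual study the training pairs would come from the manually
annotated corpus, with one such classifier trained per ambiguous
word.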
[10] and also speech recognition [11] and so on were conducted, and
new studies are being conducted now. We are therefore motivated to
play our part by developing WSD for the Wolaita language to address
the ambiguity problems that occur in NLP tasks such as machine
translation, speech recognition, question answering and information
retrieval. Incorporating WSD into these applications may resolve
those ambiguity problems.
conducted in many languages, such as English [12]. In Ethiopia, word
sense disambiguation research has been conducted for Amharic, Afan
Oromo and Tigrinya. For Wolaita, no prior work has been done on word
sense disambiguation, which may pose challenges for the high-level
NLP tasks discussed above. However, several NLP studies have been
conducted for the Wolaita language. Dr. Tewodros A. Gebreselassie [8]
developed a finite-state morphological transducer for Wolaita, which
is an important step towards further NLP development. Birhanesh Fikre
Shirko [10] developed a POS tagger using a transformation-based
machine learning algorithm, which may resolve syntactic ambiguity and
can play an important role in other NLP tasks. Semantic ambiguity is
most often resolved by word sense disambiguation, but this has not
yet been done for Wolaita.

Wolaita is one of the widely spoken Omotic languages in Ethiopia. It
has existed in written form since the 1940s, when the Sudan Interior
Mission first devised a system for writing it. The writing system was
later revised by a team led by Dr. Bruce Adams, which finished the
New Testament in 1981 and the entire Bible in 2002. Wolaita was one
of the first languages the Derg selected for its literacy campaign
(1979–1991), before any other southern language [13]. Furthermore,
the Wolaita English Dictionary (WED) and the Wolaita Holy Bible [14]
were recently developed by Wolaita linguists. The Wolaita language
therefore needs NLP applications that serve its community. Like many
languages, Wolaita has ambiguous words, which need to be
disambiguated in text. The development of WSD may also help other
Omotic languages, since they are similar to Wolaita in many respects.
Without WSD, the language may face difficulties in high-level NLP
applications. To this end, our study answers the following questions.
How can a WSD system be developed for the Wolaita language that
disambiguates ambiguous words in sentences?
How can ambiguous words be collected from an annotated corpus, and
how can machine learning algorithms be trained on them?
To what extent does the system disambiguate ambiguous words in
Wolaita texts?
3.1 Scope and Limitation
3.1.1 Scope of study
4 Significance of Research
Applications of natural language processing (NLP) such as machine
translation, speech recognition, information retrieval and web-based
question answering have advanced many areas of life. However, these
applications mainly face ambiguity problems, and Word Sense
Disambiguation (WSD) has been developed for many languages to address
them. WSD has many applications. In speech synthesis, it helps
determine the correct pronunciation of a word so that the generated
speech sounds natural; this is difficult because some words are
pronounced in more than one way depending on their context. In
machine translation, WSD is required because a word in the source
language may have more than one possible translation in the target
language, so translating a text correctly requires knowing which
sense is intended. In information retrieval, search engines need WSD
to filter out documents whose senses are irrelevant to the query. In
addition, WSD for the Wolaita language may help other researchers to
conduct higher-level research on the language. This would help
Wolaita develop towards the level of English and other well-resourced
languages, so that its community can benefit from the technology as
much as possible.
5 Research Methodology
Literature considered relevant to the research is reviewed to gain a
better understanding of the area and detailed knowledge of the
techniques essential for WSD systems. The review covers WSD for other
languages, machine learning, classifier algorithms, and the Wolaita
language and its structure.
than five ambiguous words will be selected by linguistic experts, and
the ambiguous words and their sense examples will be collected from
the Wolaita Language Department, Wogeta FM and freely available
software such as WED and Geeshsha Maxafaa, the Holy Bible developed
for Wolaita. An English corpus, the British National Corpus (BNC), is
used to acquire sense examples for the Wolaita ambiguous words, and
the examples are translated into Wolaita.
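One possible way to store the collected sense examples is sketched
here. The record layout, the field names and the placeholder values
("w1", "sense_a") are assumptions made for illustration; they are not
a format defined in this proposal, and no real Wolaita data is shown.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseExample:
    target: str    # the ambiguous word
    sense: str     # sense label assigned by the annotator
    sentence: str  # context sentence containing the target word
    source: str    # where the sentence was collected from


def sense_distribution(examples):
    """Count annotated examples per (target, sense) pair."""
    return Counter((ex.target, ex.sense) for ex in examples)


def under_represented(examples, min_count):
    """List (target, sense) pairs with fewer than min_count examples."""
    dist = sense_distribution(examples)
    return sorted(pair for pair, n in dist.items() if n < min_count)


# Placeholder records only; real entries would hold Wolaita sentences.
SAMPLE = [
    SenseExample("w1", "sense_a", "... w1 ...", "WED"),
    SenseExample("w1", "sense_a", "... w1 ...", "BNC (translated)"),
    SenseExample("w1", "sense_b", "... w1 ...", "Wogeta FM"),
]

print(sense_distribution(SAMPLE))
print(under_represented(SAMPLE, min_count=2))
```

Checking the distribution during collection shows which senses still
need more examples before any classifier is trained.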
model combining various Chinese and English knowledge resources by
word sense mapping is designed [16]. In Ethiopia, too, many studies
have been conducted for Amharic, Afan Oromo and Tigrinya, which are
under-resourced languages. Among them, Teshome [17] was the first
researcher to attempt WSD for Amharic, addressing lexical ambiguity.
He demonstrated word sense disambiguation based on semantic vector
analysis, which can improve the effectiveness of an Amharic
information retrieval system. Solomon Mekonnen [18] used a
corpus-based approach in which machine learning techniques are
applied to a corpus of Amharic sentences to acquire disambiguation
information automatically. A total of 1045 English sense examples for
five ambiguous words were collected from the British National Corpus
(BNC) and translated into Amharic using a dictionary. Getahun Wassie
[7] designed a WSD prototype model for Amharic words using a
semi-supervised learning method to extract training sets, which
reduces the human intervention required in supervised learning. Segid
Hassen Yesuf [19] proposed a knowledge-based word sense
disambiguation method that relies on the development of an Amharic
WordNet. Knowledge-based Amharic WSD extracts knowledge from word
definitions and from relations among words and senses. These earlier
works could disambiguate only one target word at a time. To overcome
this limitation, Mieraf Mulugeta [6] performed word sense
disambiguation for Amharic sentences using the WordNet hierarchy,
disambiguating target words at sentence level. Mulat Getaneh [20]
carried out automatic WordNet construction using word embeddings so
that other NLP applications such as WSD, information retrieval and
machine translation can use WordNet as a resource.

WSD has also been studied for Tigrinya. Meresa Mebrahtu Reda [21]
used unsupervised machine learning techniques to decide automatically
the correct sense of an ambiguous word in Tigrigna texts based on its
surrounding context.

WSD has also been applied to Afan Oromo texts. Tesfa Kebede Hundesa
[22] used a corpus-based approach with supervised machine learning
techniques for Afaan Oromo to acquire disambiguation information
automatically, using only five ambiguous words. To overcome the
scarcity of training data, Workineh Tesema [23] used an unsupervised
approach based on the Vector Space Model that exploits senses in an
unlabelled corpus. Yehuwalashet Bekele Tesema [24] designed and
tested a hybrid system that finds the meaning of words from their
surrounding contexts, combining unsupervised learning with a
rule-based approach to overcome the bottleneck of machine learning
algorithms. Workineh Tesema [25] also used a hybrid approach in which
the context of a given word is captured using term co-occurrences
within a defined window of words; similar contexts of the senses of
an ambiguous word are then clustered using hierarchical and
partitional clustering. Furthermore, Shibiru Olika Gonfa [5] applied
a knowledge-based WSD method that uses a database developed from
scratch from an Afaan Oromo dictionary to disambiguate polysemous
words in a sentence. Disambiguation is based on the word and sense
relations stored in this database, which is called WordNet.

To summarize, WSD is an NLP task that disambiguates words that have
more than one sense depending on context. According to the
literature, most research has used one of three approaches. The first
is the corpus-based approach, which comprises supervised,
unsupervised and semi-supervised learning. The second is the
knowledge-based approach, which relies on machine-readable databases
such as WordNet. The third is the hybrid approach, which combines the
corpus-based and knowledge-based approaches. According to the
literature, most researchers used five or more ambiguous words with
corresponding examples, from which the ambiguous words are
disambiguated according to their sense in context. Recently, however,
some researchers have used WordNet with a knowledge-based approach to
overcome the bottleneck of the corpus-based approach [5] [6] [19]
[20]. For the Wolaita language, no prior research on word sense
disambiguation exists. Because Wolaita lacks corpus resources, we are
going to annotate a sample corpus in which more than one thousand
examples, with their contexts, will be annotated for more than five
ambiguous words. Machine learning algorithms, namely supervised,
unsupervised and semi-supervised learning algorithms and neural
network algorithms, will then be applied.
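The idea, mentioned in the related work, of capturing a word's
context through term co-occurrences within a defined window can be
sketched as follows. This is an illustrative assumption about how
such features might be computed, not the method of any cited study;
the function names, the window size of 2 and the English sentences
are hypothetical.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=2):
    """Count words within `window` positions of each target occurrence."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the target word itself
                    vec[tokens[j]] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


s1 = "he sat on the bank of the river".split()
s2 = "fish swim near the bank of the river".split()
s3 = "she paid money into the bank account today".split()

v1, v2, v3 = (context_vector(s, "bank") for s in (s1, s2, s3))
# Contexts of the same sense are expected to be more similar.
print(cosine(v1, v2) > cosine(v1, v3))
```

Similarity scores of this kind are what a clustering step groups
into senses; a real system would typically remove stop words first,
since frequent function words otherwise dominate the counts.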
6 Budget
7 Schedule
The total time proposed to complete the research is 6 months. A
detailed activity chart with the associated durations is given in
Table 2 below.
Table 2: Schedule (Jan-March 2023)
No.  Proposal phase                                          Duration
1    Problem identification, title selection                 2 weeks
2    Proposal development                                    3 weeks
3    Review of literature                                    2 weeks
4    Research methodology (selection of methods and tools)   5 weeks
5    Design and evaluation framework                         3 weeks
6    Proposal completion                                     1 week
8 Reference
[1] Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2006.
[2] M. M. Reda, "Unsupervised Machine Learning Approach for Tigrigna WSD," Computer Engineering and Intelligent Systems, ISSN 2222-1719 (Paper), ISSN 2222-2863 (Online), vol. 9, p. 6, 2018.
[3] "Warren Weaver," Wikipedia, the free encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Warren_We...
... sense disambiguation," Wikipedia, the free encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Word-sense_disambiguationHistory.
[5] S. O. Gonfa, "WORD ...
... Supervised Learning Paradigm," Science, Technology and Arts Research Journal, 2014.
[8] T. A. ...
... State Morphological Analyzer for Wolaita," vol. LNICST, 05 July 2018.
[9] T. A. Gebresilase, "Go ...
... https://scholar.google.com/scholar?hl=enas_sdt=0
[10] B. F. Shirko, "Part of Speech Taggin ...
... http://etd.aau.edu.et/handle/123456789/14582.
[12] Udaya Raj Dhungana, Subarna Shakya ...
... https://en.wikipedia.org/wiki/Wolaytta_language.
[14] "WED," Google Play. [Online]. Availa ...
... https://play.google.com/store/apps/details?id=com.laxsil.dagmawi.woliticdictioneryhl=en_USgl=US.
[15] Bianca Scarlini, Tommaso Pasini, Roberto Navigli, "SensEmBERT: Context-Enhanced Sense Embeddings for Multilingual Word Sense Disambiguation," Associ ...
... Based Chinese Word Sense Disambiguation with Multi-Knowledge Integration," CMC, vol. 61 ...
... 06. [Online]. Available: http://localhost:80/xmlui/handle/123456789/14751.
[19] S. H. Yesuf ...
... https://en.wikipedia.org/wiki/Word-sense_disambiguation. [Accessed 20 Feb 2021].
[27] T. K. H ...
... https://www.tutorialspoint.com/natural_language_processing/natural_language_processing_int ...
... Extending Swahili Language Technology with Machine Learning," Faculty of Arts of the Univer ...
... https://en.wikipedia.org/wiki/Warren_Weaver. [Accessed 20 Feb 2021].
[31] "From Wikipedia, ...
... https://en.wikipedia.org/wiki/Yehoshua_Bar-Hillel. [Accessed 20 Feb 2021].