
WOLAITA SODO UNIVERSITY

SCHOOL OF INFORMATICS
DEPARTMENT OF COMPUTER SCIENCE
Proposal Title:
Word Sense Disambiguation for Wolaita Language by Using Machine Learning Approach
By: MIHIRET BEKELE
Computer Science, MSc (Regular)
Submitted to: Dr. Tewodros Abebe
Submission Date: Jan 23, 2023

January 24, 2023

List of Tables
Table 1 Estimation of budgets
Table 2 Schedule for research completion

List of Abbreviations and Acronyms


NLP    Natural Language Processing
WSD    Word Sense Disambiguation
WED    Wolaita English Dictionary
BNC    British National Corpus

Contents
1 Introduction
1.1 Motivation of the Study
2 Objective of the Study
2.1 General Objective
2.2 Specific Objectives
3 Statement of the Problem
3.1 Scope and Limitation
3.1.1 Scope of study
3.1.2 Limitation of study
4 Significance of Research
5 Research Methodology
5.1 Literature Review
5.2 Rule based Approach
5.3 Data Collection
5.4 Tools and Techniques
5.5 Performance Analysis
5.6 Related Works
6 Budget
7 Schedule
8 References

1 Introduction
Applications of Natural Language Processing (NLP) such as machine translation, speech recognition, information retrieval, and web-based question answering have advanced many areas of life. However, these applications are all confronted with ambiguity. Word Sense Disambiguation (WSD) was developed to address this problem and now exists for many languages. WSD has many applications. In speech synthesis, WSD is needed to determine the correct pronunciation of words so that the generated speech sounds natural; this is difficult because some words are pronounced in more than one way depending on their context. In machine translation, WSD is required because a word in the source language may have more than one possible translation in the target language, so to translate a text correctly we need to know which sense is intended. In information retrieval, search engines need WSD to filter out documents whose senses are irrelevant to the query.

The goal of NLP is to get computers to perform useful tasks involving human language, tasks like enabling human-machine communication, improving human-human communication, or simply doing useful processing of text or speech [1]. However, computers cannot understand what human beings understand easily. One of the most challenging issues in NLP is ambiguous words, for which a computer cannot identify the intended sense in the context of a given text or speech. This creates difficulties for the high-level NLP tasks discussed above, and WSD is the task that resolves it. It is a natural classification problem: given a word and its possible senses, as defined by a dictionary, classify an occurrence of the word in context into one or more of its sense classes [2]. Three main approaches are used in WSD: the corpus-based approach, the knowledge-based approach, and the hybrid approach. The corpus-based approach in turn comprises supervised, unsupervised, and semi-supervised learning.

WSD was first formulated as a distinct computational task during the early days of machine translation in the 1940s, making it one of the oldest problems in computational linguistics. Warren Weaver [3], in his famous 1949 memorandum on translation, first introduced the problem in a computational context. Early researchers understood the significance and difficulty of WSD well. In the 1970s, WSD was a subtask of semantic interpretation systems developed within the field of artificial intelligence. However, since WSD systems were at the time largely rule-based and hand-coded, they were prone to a knowledge acquisition bottleneck. By the 1980s, large-scale lexical resources such as the Oxford Advanced Learner's Dictionary of Current English (OALD) became available: hand-coding was replaced with knowledge automatically extracted from these resources, but disambiguation was still knowledge-based or dictionary-based. In the 1990s, the statistical revolution swept through computational linguistics, and WSD became a paradigm problem on which to apply supervised machine learning techniques. The 2000s saw supervised techniques reach a plateau in accuracy, and so attention shifted to coarser-grained senses, domain adaptation, semi-supervised and unsupervised corpus-based systems, combinations of different methods, and the return of knowledge-based systems via graph-based methods. Still, supervised systems continue to perform best [4].

In Ethiopia, WSD research has been carried out for Amharic, Afan Oromo, and Tigrinya using supervised, unsupervised, and semi-supervised approaches, and knowledge-based approaches have been applied more recently [5] [6] [7] [2]. In this research, we will develop WSD for the Wolaita language using a machine learning approach. Like English, Amharic, Afan Oromo, and other languages, Wolaita has ambiguous words that are polysemous. As web and AI technologies develop rapidly, under-resourced languages may face difficulties. To address this kind of ambiguity, we intend to perform WSD with a machine learning approach using a manually annotated sample corpus.
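
The classification view of WSD described above can be illustrated with a minimal sketch. It assumes a Naive Bayes classifier over the words surrounding the target, which is only one of many possible supervised methods and is not claimed to be the method this study will adopt. The names `context_window` and `NaiveBayesWSD` are hypothetical, and the toy sentences are English stand-ins for the ambiguous word "bank", not Wolaita data.

```python
import math
from collections import Counter, defaultdict


def context_window(tokens, target_index, size=2):
    """Return the words within `size` positions of the target word."""
    left = tokens[max(0, target_index - size):target_index]
    right = tokens[target_index + 1:target_index + 1 + size]
    return left + right


class NaiveBayesWSD:
    """Multinomial Naive Bayes over context words, with add-one smoothing."""

    def fit(self, contexts, senses):
        self.sense_counts = Counter(senses)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, sense in zip(contexts, senses):
            self.word_counts[sense].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, float("-inf")
        for sense, count in self.sense_counts.items():
            # log prior of the sense plus smoothed log likelihood of each word
            score = math.log(count / total)
            denom = sum(self.word_counts[sense].values()) + len(self.vocab)
            for w in words:
                score += math.log((self.word_counts[sense][w] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense


# Toy English examples: (tokens, index of "bank", sense label). Illustrative only.
train = [
    ("she deposited money at the bank today".split(), 5, "finance"),
    ("the bank approved the loan".split(), 1, "finance"),
    ("they sat on the river bank fishing".split(), 5, "river"),
    ("the bank of the river was muddy".split(), 1, "river"),
]
contexts = [context_window(tokens, i, size=3) for tokens, i, _ in train]
senses = [sense for _, _, sense in train]
model = NaiveBayesWSD().fit(contexts, senses)

test_tokens = "he fished from the bank of the river".split()
predicted = model.predict(context_window(test_tokens, 4, size=3))
print(predicted)
```

Each occurrence of the ambiguous word is reduced to a small bag of neighbouring words, and the sense with the highest smoothed probability is chosen. A real system would need many more annotated examples per sense than this toy set.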

1.1 Motivation of the Study


Several language-related software products have been developed for the Wolaita language, such as the Wolaita English Dictionary (WED), the Wolaita Holy Bible, and other applications. In addition, a number of NLP studies have been conducted for Wolaita, including a finite-state transducer for Wolaita morphology [8], a text-to-speech synthesizer [9], a POS tagger [10], and speech recognition [11], and new research is under way. We are therefore motivated to contribute by developing WSD for the Wolaita language to resolve the ambiguity problems that arise in NLP tasks such as machine translation, speech recognition, question answering, and information retrieval. Incorporating WSD into these applications may resolve their ambiguity problems.

2 Objective of the Study


2.1 General Objective
The general objective of this research work is to develop a Word Sense Disambiguation system for the Wolaita language using a machine learning approach.

2.2 Specific Objectives


• To study the general morphology of the Wolaita language.
• To develop a manually annotated corpus for selected ambiguous words of the language.
• To review WSD techniques adopted for other languages.
• To develop a WSD algorithm that best fits the Wolaita language.
• To develop a prototype of WSD for the Wolaita language.
• To evaluate the performance of the prototype.

3 Statement of the Problem


Nowadays, web technology and artificial intelligence are developing rapidly. These technologies rely heavily on Natural Language Processing in applications such as machine translation, information retrieval, question answering, and speech recognition. However, NLP faces an ambiguity problem, because computers cannot understand the sense of a target word in a text the way humans do. To address it, much research on word sense disambiguation has been conducted for many languages, such as English [12]. In Ethiopia, word sense disambiguation research has been conducted for Amharic, Afan Oromo, and Tigrinya. For Wolaita, no prior work on word sense disambiguation has been done, which may pose challenges for the high-level NLP tasks discussed above.

Other NLP research has, however, been conducted for the Wolaita language. Dr. Tewodros A. Gebreselassie [8] developed a finite-state morphological transducer for Wolaita, which is an important step towards further NLP development. Birhanesh Fikre Shirko [10] developed a POS tagger using a transformation-based machine learning algorithm, which may resolve syntactic ambiguity and can play an important role in other NLP tasks. Semantic ambiguity is usually resolved by word sense disambiguation, but this has not yet been done for Wolaita.

Wolaita is one of the widely spoken Omotic languages in Ethiopia. It has existed in written form since the 1940s, when the Sudan Interior Mission first devised a system for writing it. The writing system was later revised by a team led by Dr. Bruce Adams. They finished the New Testament in 1981 and the entire Bible in 2002. It was one of the first languages the Derg selected for its literacy campaign (1979–1991), before any other southern languages [13]. Furthermore, the Wolaita English Dictionary (WED) and the Wolaita Holy Bible [14] were also developed recently by Wolaita Linguistics.

The Wolaita language therefore needs many NLP applications for the benefit of its society. As in many languages, ambiguous words occur in Wolaita, so ambiguous words in text need to be disambiguated. The development of WSD may also help other Omotic languages, since they are similar to Wolaita in many respects. Without WSD, the language may face difficulties in high-level NLP applications. To this end, our study answers the following questions.
• How can WSD be developed for the Wolaita language so that it disambiguates ambiguous words in sentences?
• How can ambiguous words be collected from an annotated corpus, and how can machine learning algorithms be trained on them?
• To what extent does the system disambiguate ambiguous words in Wolaita texts?

3.1 Scope and Limitation
3.1.1 Scope of study

The current study focuses on introducing a WSD model for Wolaita using a machine learning approach. Because no sense-tagged dataset is available for the language, the study will prepare a new dataset, which will be preprocessed into a machine-readable form. The study addresses the ambiguity that occurs in the language. Given the time and resources available, this study:
• Deals only with textual information and does not accept data in voice or sound form.
• Deals only with semantic-level analysis; the system does not perform any grammar or spelling correction.

3.1.2 Limitation of study

• Due to the unavailability of a corpus for the Wolaita language, our study will be limited to a few ambiguous words rather than all ambiguous words in Wolaita texts.

4 Significance of Research
Applications of Natural Language Processing such as machine translation, speech recognition, information retrieval, and web-based question answering have advanced many areas of life. However, these NLP applications are all confronted with ambiguity, and Word Sense Disambiguation (WSD) has been developed for many languages to address it. WSD has many applications. In speech synthesis, WSD is needed to determine the correct pronunciation of words so that the generated speech sounds natural; this is difficult because some words are pronounced in more than one way depending on their context. In machine translation, WSD is required because a word in the source language may have more than one possible translation in the target language, so to translate a text correctly we need to know which sense is intended. In information retrieval, search engines need WSD to filter out documents whose senses are irrelevant to the query. In addition, WSD for the Wolaita language may help other researchers conduct high-level research on the language. This would help Wolaita develop in the way English and other well-resourced languages have, so that its society can benefit from the technology as much as possible.

5 Research Methodology

5.1 Literature Review


Literature relevant to the research will be reviewed to gain a better understanding of the area and detailed knowledge of the techniques essential for WSD systems. The review covers WSD for other languages, machine learning, classifier algorithms, and the Wolaita language and its structure.

5.2 Rule based Approach


Rule-based approaches fall short on portability and robustness, and identifying the rules that yield the best results requires a great deal of linguistic expertise.

5.3 Data Collection


In this research, machine learning algorithms will be applied. Because no sense-annotated corpus is available for Wolaita WSD, a sample corpus must be annotated manually. More than five ambiguous words will therefore be selected by linguistic experts, and the ambiguous words and their sense examples will be collected from the Wolaita Language Department, Wogeta FM, and freely available software such as WED and Geeshsha Maxafaa, the Holy Bible developed for Wolaita. An English corpus, the British National Corpus (BNC), will be used to acquire sense examples for the Wolaita ambiguous words, and the examples will be translated into the Wolaita language.
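
One possible way to store the manually annotated examples is sketched below. The record layout, the field names, and the helpers `sense_distribution` and `under_represented` are assumptions made for illustration; the proposal does not prescribe a corpus format. `WORD_A`, `sense_1`, and the sentences are placeholders, not real Wolaita data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseExample:
    target: str    # the ambiguous word
    sense: str     # sense label chosen by the annotator
    sentence: str  # sentence in which the target occurs
    source: str    # where the sentence was collected from


def sense_distribution(examples):
    """Count annotated examples per (target word, sense label) pair."""
    return Counter((ex.target, ex.sense) for ex in examples)


def under_represented(examples, minimum):
    """List (target, sense) pairs that have fewer than `minimum` examples."""
    counts = sense_distribution(examples)
    return sorted(pair for pair, n in counts.items() if n < minimum)


# Placeholder records only: hypothetical word, senses, and sentences.
corpus = [
    SenseExample("WORD_A", "sense_1", "placeholder sentence 1 with WORD_A", "WED"),
    SenseExample("WORD_A", "sense_1", "placeholder sentence 2 with WORD_A", "BNC (translated)"),
    SenseExample("WORD_A", "sense_2", "placeholder sentence 3 with WORD_A", "Geeshsha Maxafaa"),
]

counts = sense_distribution(corpus)
sparse = under_represented(corpus, minimum=2)
print(counts)
print(sparse)
```

Keeping the source of each sentence makes it possible to check later how many examples came from translated BNC material, and the per-sense counts show which senses still need more annotated examples.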

5.4 Tools and Techniques


WSD for the Wolaita language will be carried out using machine learning algorithms such as supervised algorithms, transformation-based learning, and neural networks. The algorithms will be trained on the collected data, and the algorithm that achieves the highest accuracy will be selected. For this research, we will use the Anaconda Python distribution.
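
The selection step described above, training several algorithms and keeping the most accurate one, might look roughly like the following sketch. The two candidates here (a most-frequent-sense baseline and a simple word-overlap classifier) are deliberately tiny stand-ins chosen so the example runs with the Python standard library alone; they are not the supervised, transformation-based, or neural models the study plans to compare. All function names and the toy English data are hypothetical.

```python
from collections import Counter


def train_majority(train):
    """Baseline: always predict the most frequent sense in the training data."""
    top = Counter(sense for _, sense in train).most_common(1)[0][0]
    return lambda context: top


def train_overlap(train):
    """Predict the sense whose training contexts share the most words with the input."""
    bags = {}
    for context, sense in train:
        bags.setdefault(sense, Counter()).update(context)

    def predict(context):
        return max(bags, key=lambda s: sum(bags[s][w] for w in context))

    return predict


def accuracy(predict, data):
    """Fraction of held-out examples whose predicted sense matches the gold sense."""
    return sum(predict(context) == sense for context, sense in data) / len(data)


def select_best(trainers, train, held_out):
    """Train every candidate and return the name with the highest held-out accuracy."""
    scores = {name: accuracy(fit(train), held_out) for name, fit in trainers.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy English data: (context words, sense label). Illustrative only.
train = [
    (["money", "loan"], "finance"),
    (["deposit", "money"], "finance"),
    (["cash", "loan"], "finance"),
    (["river", "water"], "river"),
    (["river", "mud"], "river"),
]
held_out = [
    (["loan", "cash"], "finance"),
    (["water", "mud"], "river"),
]

best, scores = select_best(
    {"majority": train_majority, "overlap": train_overlap}, train, held_out
)
print(best, scores)
```

Scoring on examples that were not used for training is the important design choice: accuracy measured on the training data itself would favour whichever model memorizes best.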

5.5 Performance Analysis


The performance of the proposed system will be evaluated using precision, recall, and F-measure on various types of sufficiently large test samples.
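
For a single sense, precision is the share of predictions of that sense that are correct, recall is the share of gold occurrences of that sense that are found, and the F-measure (here the balanced F1) is their harmonic mean. A minimal sketch of the computation follows; the function name `precision_recall_f1` and the labels `s1` and `s2` are hypothetical, and how scores will be averaged across senses and words is left open.

```python
def precision_recall_f1(gold, predicted, sense):
    """Compute precision, recall, and F1 for one sense label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == sense and p == sense for g, p in pairs)  # correctly predicted
    fp = sum(g != sense and p == sense for g, p in pairs)  # predicted but wrong
    fn = sum(g == sense and p != sense for g, p in pairs)  # missed occurrences
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical gold annotations and system output for one ambiguous word.
gold = ["s1", "s1", "s1", "s2", "s2"]
predicted = ["s1", "s1", "s2", "s2", "s1"]

p1, r1, f1_s1 = precision_recall_f1(gold, predicted, "s1")
p2, r2, f1_s2 = precision_recall_f1(gold, predicted, "s2")
print(p1, r1, f1_s1)
print(p2, r2, f1_s2)
```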

5.6 Related Works


Much research on WSD has been conducted for English, other European languages, and further languages. Most European languages have sufficient resources for NLP tasks, so many studies have addressed their NLP problems. For English WSD, Udaya Raj Dhungana and co-workers [12] developed a new model of WordNet that organizes the different senses of polysemous words, as well as single-sense words, in English based on clue words. Similarly, Bianca Scarlini, Tommaso Pasini and Roberto Navigli [15] proposed SENSEMBERT, a knowledge-based approach that brings together the expressive power of language modelling and the vast amount of knowledge contained in a semantic network to produce high-quality latent semantic representations of word meanings in multiple languages. In addition, a graph-based Chinese WSD method with multi-knowledge integration has been proposed; in particular, it designs a graph model that combines various Chinese and English knowledge resources through word sense mapping [16].

In Ethiopia, many studies have been conducted for Amharic, Afan Oromo, and Tigrinya, which are under-resourced languages. Among them, Teshome [17] was the first researcher to attempt WSD for Amharic, aiming to resolve lexical ambiguity. He demonstrated word sense disambiguation based on semantic vector analysis, which can improve the effectiveness of an Amharic information retrieval system. Solomon Mekonnen [18] used a corpus-based approach in which machine learning techniques are applied to a corpus of Amharic sentences to acquire disambiguation information automatically. A total of 1045 English sense examples for five ambiguous words were collected from the British National Corpus (BNC) and translated into Amharic using a dictionary. Getahun Wassie [7] designed a WSD prototype model for Amharic words using a semi-supervised learning method to extract training sets, which minimizes the amount of human intervention required in supervised learning. Segid Hassen Yesuf [19] developed a knowledge-based word sense disambiguation method that relies on the development of an Amharic WordNet; knowledge-based Amharic WSD extracts knowledge from word definitions and from relations among words and senses. These earlier works can disambiguate only one target word at a time. To overcome this limitation, Mieraf Mulugeta [6] conducted word sense disambiguation for Amharic sentences using the WordNet hierarchy, disambiguating target words at the sentence level. Mulat Getaneh [20], on the other hand, carried out automatic WordNet construction using word embeddings so that other NLP applications, such as WSD, information retrieval, and machine translation, can use WordNet as a resource.

WSD has also been conducted for Tigrinya: Meresa Mebrahtu Reda [21] used unsupervised machine learning techniques to address the problem of automatically deciding the correct sense of an ambiguous word in Tigrinya texts based on its surrounding context.

WSD has also been applied to Afan Oromo texts. Tesfa Kebede Hundesa [22] used a corpus-based approach with supervised machine learning techniques to acquire disambiguation information automatically for Afan Oromo, taking only five ambiguous words. To overcome the scarcity of training data, Workineh Tesema [23] used an unsupervised approach that exploits senses in an unlabelled corpus using the Vector Space Model. Yehuwalashet Bekele Tesema [24] designed and tested a hybrid system that finds the meaning of words from their surrounding contexts, combining unsupervised learning with a rule-based approach to overcome the bottleneck of machine learning algorithms. Workineh Tesema [25] also used a hybrid approach in which the context of a given word is captured using term co-occurrences within a defined window size of words; the similar contexts of the senses of an ambiguous word are then clustered using hierarchical and partitional clustering. Furthermore, Shibiru Olika Gonfa [5] applied a knowledge-based WSD method built on a database developed from scratch from an Afan Oromo dictionary to disambiguate polysemous words in a sentence. Disambiguation is performed using the word and sense relations stored in this database, which is called WordNet.

To summarize, WSD is an NLP task used to disambiguate ambiguous words, that is, words with more than one sense depending on context. According to the literature, most research has followed one of three approaches. The first is the corpus-based approach, which comprises supervised, unsupervised, and semi-supervised learning. The second is the knowledge-based approach, which relies on machine-readable databases such as WordNet. The third is the hybrid approach, which combines the corpus-based approach with the knowledge-based approach. Most researchers used five or more ambiguous words with corresponding examples, from which the ambiguous words are disambiguated according to their sense in context. Recently, however, some researchers have used WordNet for WSD with a knowledge-based approach to overcome the bottleneck of the corpus-based approach [5] [6] [19] [20]. For the Wolaita language, there is no prior research on word sense disambiguation. Because of the lack of corpus resources for Wolaita, we will annotate a sample corpus in which more than one thousand examples with their contexts are annotated for more than five ambiguous words, and machine learning algorithms such as supervised, unsupervised, semi-supervised, and neural network algorithms will be applied.
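
The window-based co-occurrence contexts mentioned in the review above can be illustrated with a short sketch. It assumes raw co-occurrence counts and cosine similarity as the comparison measure, which is a common choice in vector space models but is not confirmed here as the exact setup of the cited works. The function names and the toy English sentences are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_vector(tokens, target, window=2):
    """Count words appearing within `window` positions of each occurrence of `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), i + window + 1
            vec.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Toy English contexts for the ambiguous word "bank". Illustrative only.
a = cooccurrence_vector("the river bank was muddy".split(), "bank")
b = cooccurrence_vector("a muddy river bank flooded".split(), "bank")
c = cooccurrence_vector("the bank approved a loan".split(), "bank")

print(cosine(a, b), cosine(a, c))
```

Contexts that use the word in the same sense tend to share neighbouring words and so score a higher similarity, which is what a clustering step can then exploit to group occurrences by sense without labelled data.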

6 Budget

Table 1: Cost Breakdown

No.  Item                         Unit cost  Calculation       Total price
1    Data collection              325        26*325=8450       8450
2    Payment for data collectors  250        4*250*15=15000    15000
3    Transportation cost          250        4*250=1000        1000
4    Document binding             110        5*110=550         550
5    Total                        -          -                 25,000.00

7 Schedule
The total time proposed to complete the research is 6 months. A detailed activity chart with the associated durations is given in the table below.

Table 2: Schedule (Jan-March 2023)

No.  Proposal phase                                          Duration
1    Problem identification, title selection                 2 weeks
2    Proposal development                                    3 weeks
3    Review of literature                                    2 weeks
4    Research methodology (selection of methods and tools)   5 weeks
5    Design and evaluation framework                         3 weeks
6    Proposal completion                                     1 week

8 References
[1] Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2006.
[2] M. M. Reda, "Unsupervised Machine Learning Approach for Tigrigna WSD," Computer Engineering and Intelligent Systems, ISSN 2222-1719 (Paper), ISSN 2222-2863 (Online), vol. 9, p. 6, 2018.
[3] "Warren Weaver," Wikipedia, the free encyclopedia, [Online]. Available: https://en.wikipedia.org/wiki/Warren_We [...]
[4] [...] sense disambiguation," Wikipedia, the free encyclopedia, [Online]. Available: https://en.wikipedia.org/wiki/Word-sense_disambiguation#History.
[5] S. O. Gonfa, "WORD [...]
[...] Supervised Learning Paradigm," Science, Technology and Arts Research Journal, 2014.
[8] T. A. [...] State Morphological Analyzer for Wolaita," vol. LNICST, 05 July 2018.
[9] T. A. Gebresilase, "Go [...]
[...] https://scholar.google.com/scholar?hl=en&as_sdt=0
[10] B. F. Shirko, "Part of Speech Taggin [...]
[...] http://etd.aau.edu.et/handle/123456789/14582.
[12] Udaya Raj Dhungana, Subarna Shakya [...]
[...] https://en.wikipedia.org/wiki/Wolaytta_language.
[14] "WED," Google Play, [Online]. Availa [...]
[...] https://play.google.com/store/apps/details?id=com.laxsil.dagmawi.woliticdictionery&hl=en_US&gl=US.
[15] Bianca Scarlini, Tommaso Pasini, Roberto Navigli, "SENSEMBERT: Context-Enhanced Sense Embeddings for Multilingual Word Sense Disambiguation," Associ [...]
[...] Based Chinese Word Sense Disambiguation with Multi-Knowledge Integration," CMC, vol. 61 [...]
[...] 06. [Online]. Available: http://localhost:80/xmlui/handle/123456789/14751.
[19] S. H. Yesuf [...]
[...] https://en.wikipedia.org/wiki/Word-sense_disambiguation. [Accessed 20 Feb 2021].
[27] T. K. H [...]
[...] https://www.tutorialspoint.com/natural_language_processing/natural_language_processing_int [...]
[...] Extending Swahili Language Technology with Machine Learning," Faculty of Arts of the Univer [...]
[...] https://en.wikipedia.org/wiki/Warren_Weaver. [Accessed 20 Feb 2021].
[31] "From Wikipedia, [...]
[...] https://en.wikipedia.org/wiki/Yehoshua_Bar-Hillel. [Accessed 20 Feb 2021].
