
WOLLEGA UNIVERSITY

SCHOOL OF GRADUATE STUDIES


COLLEGE OF ENGINEERING AND TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE
PROGRAM: MASTER'S (REGULAR)
DECISION TREE TOPDOWN CHART PARSER AFAN OROMO
SENTENCE PARSING
A Thesis Submitted in Partial Fulfillment of the Requirement for the Degree

of Master of Science in Computer Science

M.Sc. Thesis
BY:

Dinka Getahun Mokonnen

Advisor: Kamal Mohammed (Asst. Professor)

June, 2023

Nekemte, Ethiopia

WALLAGA UNIVERSITY
SCHOOL OF GRADUATE STUDIES
P.O. Box: 395, Nekemte, Ethiopia.
APPROVAL SHEET FOR SUBMITTING FINAL THESIS
As members of the Board of Examiners of the final MSc thesis open defense, we certify
that we have read and evaluated the thesis prepared by Mr. Dinka Getahun Mokonnen
under the title "Decision Tree Topdown Chart Parser Afan Oromo Sentence Parsing" and
recommend that the thesis be accepted as fulfilling the thesis requirement for the Degree of
Master of Science in Computer Science.
Examining Committee Name Signature Date

1. Chairperson _______________________ _________ __________

2. Internal Examiner _______________________ _________ _________

3. External Examiner _______________________ _________ _________

Final Approval and Acceptance


Thesis Approved by

1. _______________________ _____________ _____________


Department PGC Signature Date

2. _______________________ _____________ _____________


Dean of College Signature Date
Certification of the Final Thesis
I hereby certify that all the corrections and recommendations suggested by the board of
examiners are incorporated into the final thesis entitled "Decision Tree Topdown Chart Parser
Afan Oromo Sentence Parsing" by Mr. Dinka Getahun Mokonnen.

3. ______________________ _____________ _____________


Dean of SGS Signature Date

Declaration
I dedicate this work to my dear father, Getahun, and my mother, Marame Debela, who
passed away unexpectedly three years ago.

As thesis research advisor, I hereby certify that I have read and evaluated this thesis,
prepared under my guidance and advice by Dinka Getahun, entitled "Decision Tree
Topdown Chart Parser Afan Oromo Sentence Parsing", and accept it in partial fulfillment
of the thesis requirement for the award of the Degree of Master of Science in Computer
Science. I recommend that it be submitted as fulfilling the thesis requirement.

Mr. Kamal Mohammed (Asst. Prof.)

Advisor Signature Date: _______________ ____________

STATEMENT OF THE AUTHOR

I, Mr. Dinka Getahun Mokonnen, hereby declare and affirm that the thesis
entitled "Decision Tree Topdown Chart Parser Afan Oromo Sentence Parsing" is my own
work conducted under the supervision of Mr. Kamal Mohammed (Asst. Professor). I
have followed all the ethical principles of scholarship in the preparation, data
collection, data analysis and completion of this thesis. All scholarly matter that is
included in the thesis has been given recognition through citation. I have adequately
cited and referenced all the original sources. I also declare that I have adhered to all
principles of academic honesty and integrity and I have not misrepresented, fabricated,
or falsified any idea / data / fact / source in my submission. This thesis is submitted in
partial fulfillment of the requirement for a degree from the Post Graduate Studies at
Wallaga University. I further declare that this thesis has not been submitted to any other
institution anywhere for the award of any academic degree, diploma or certificate.

I understand that any violation of the above will be cause for disciplinary action by the
University and can also invoke penal action from the sources which have not been
properly cited or from whom proper permission has not been taken when needed.
Dinka Getahun

Name Signature Date

ACKNOWLEDGMENT
First of all, I would like to thank Almighty God for giving me the endurance to
complete this thesis.
Next, I would like to express my heartfelt appreciation and gratitude to my thesis
advisor, Mr. Kamal Mohammed (Asst. Professor), for his valuable suggestions and
guidance throughout my study; whenever I faced serious problems, his advice made me
feel hopeful about finishing this thesis.
I would like to thank the Computer Science Department head, Mr. Tariku B., and the IT
Department head, Mr. Gemechu B., for the unreserved encouragement and support they
rendered to me during the entire period of my study.
I would also like to show my gratitude to all the Computer Science Department staff for
their comments and suggestions.
Finally, I also thank my family, especially my wife, Abaynesh Wana, for her
unlimited support and encouragement.

Abstract
Many sentence parsers have previously been developed for foreign languages such as English
and Arabic, as well as for Amharic among the local languages of Ethiopia. Parsing Afan
Oromo sentences is likewise needed, and it is a necessary mechanism for other natural
language processing applications such as machine translation, question answering,
knowledge extraction, and information retrieval.
The study of natural language processing is gaining popularity daily for both academic and
commercial purposes. Higher NLP systems, such as machine translation, can only be produced
once the lower-level ones, like part-of-speech taggers and syntactic parsers, have been
successfully developed; this functional reliance exists even among the more basic NLP
systems. This thesis can be seen as an effort to combine concepts and results from earlier
attempts at an Afan Oromo part-of-speech tagger in order to address the somewhat more
challenging problem of decision tree sentence parsing for the language. In this thesis, an effort
is made to extract features such as Afan Oromo word and phrase classes, sentence formalisms,
and sentence parsers that can be implemented using Afan Oromo decision trees. The study's
sample data came from sources that are often used in language instruction and language
learning. This data was manually examined, annotated, tagged, and processed before being
utilized as a corpus from which to extract the grammar rules and assign probabilities. We also
developed a simple lexicon generator algorithm to generate the lexical rules. The Python
programming language and NLTK were used as implementation tools for this study.
Experiments were then conducted on the parser using 300 sentences containing 3,029 words
in total. In this study, 20% (60) of the sentences were employed as the test data set and 80%
(240) as the training data set. The integrated part-of-speech tagger employed the 3,029
manually annotated words and a tag set of 28 categories; of these, 27 are regular tag
categories, while the remaining one, "X", is used for unidentified terms. The study's findings
revealed that the tagger achieved accuracy levels of 89.7% on the training set and 84% on the
test set. The decision tree sentence parsing experiments produced accuracy results of 80.0%
on the training set and 71.6% on the test set created for this purpose.
Keywords: NLP, parser, decision tree grammar, top-down chart parser, lexicon
generator, lexicon.

ABBREVIATIONS
ADJ An adjective
AdjP Adjectival phrase
ADV An adverb
ADVC An adverb not separated from a conjunction
AdvP Adverbial phrase
AUX Auxiliary verbs and all their other forms
DT Complement
CONJ A conjunction
ITJ Interjections
JC A conjunction not separated from an adjective
JNU A numeral used as an adjective
JP An adjective not separated from a preposition
JPN A noun not separated from a preposition and that functions as an adjective
N Noun in all forms
NC A conjunction not separated from a noun
NP A preposition not separated from a noun
NP Noun phrase
NUM Number
NV Verbal nouns
PP Prepositional phrase
PREP A preposition
PUNCT Punctuation
REL Relative clause
V Verb in all forms except auxiliary
VC A verb prefixed or suffixed by a conjunction
VCO Compound verbs
VP Verb phrase

Table of Contents

Declaration
STATEMENT OF THE AUTHOR
ACKNOWLEDGMENT
Abstract
ABBREVIATIONS
CHAPTER ONE
INTRODUCTION
1.1. Background
1.2. Statement of the Problem
1.3. Objective of the Study
1.4. Methodology
1.4.4. Parsing Techniques and Prototype Development
1.5. Application of Results and Beneficiaries
1.6. Scope of the Study
1.7. Limitation of the Study
1.8. Organization of the Thesis
CHAPTER TWO
REVIEW OF LITERATURE
2.1 Introduction
2.2 Decision Tree Grammar Sentence Parsing (DTP)
2.3 Approaches to Decision Tree Grammar Sentence Parsing
2.4 Knowledge Required by the Parser
2.5 Related NLP in Parsing
2.6 Related NLP in Afan Oromo
2.7 Related NLP Component Systems
CHAPTER THREE
THE STRUCTURE OF AFAAN OROMO
3.1 Introduction
3.2 Afan Oromo Alphabet and Writing System
3.3 Punctuation Marks in Afan Oromo
3.4 Word Categories in Afan Oromo
3.5 Phrasal Categories
3.6 Sentences
CHAPTER FOUR
DATA PREPARATION AND PCFG EXTRACTION
4.2 The Design Approach of the Parser
4.3 The Sample Corpus
4.4 The Morphological Pre-processing
4.5 The Part-of-Speech Tagger
4.6 Extraction of a Probabilistic Context-Free Grammar
4.7 Chomsky Normal Form (CNF) Representation
CHAPTER FIVE
PARSING ALGORITHM AND EXPERIMENTATION
5.2 The Parsing Algorithm
5.3 PCFG Parsing
5.5 The Experiment
CHAPTER SIX
CONCLUSION AND RECOMMENDATION
6.1 Conclusion
6.2 Recommendations
REFERENCES

List of Tables

Table 3.1: Personal pronouns
Table 5.4: Parsing result on the training set before error correction
Table 5.8: Parsing result on the training set before error correction
Table 5.9: Parsing result on the training set after some error correction
Table 5.10: Parsing result on the test set

List of Figures

Figure 2.1: Example of a parse tree
Figure 3.2: The structure of a noun phrase
Figure 3.3: The structure of a verb phrase
Figure 5.1: Tree structure diagram of the tree sentence
Figure 5.2: An 8-level chart
Figure 5.3: An algorithm for preprocessing an input sentence to the parser
Figure 5.5: The parse chart procedure to implement the inside-outside algorithm
Figure 5.6: The overall implementation of the parsing algorithm
Figure 5.7: Diagrammatic representation of the parser

CHAPTER ONE
INTRODUCTION
1.1. Background
A natural language or ordinary language is one that is spoken, written, or signed by
humans for everyday communication, as opposed to a formal language (such as a
computer programming language or the "languages" used in the study of formal logic) [1].
One of the most fundamental parts of human conduct is language, which is also a very
important part of our daily life. In its written form, it serves as a method of long-term
recording and transmission of information and knowledge from one generation to the next.
In its verbal form, it helps us communicate with people and organize our daily lives [2].

Human communication is fundamentally based on language, whether it is spoken or written.


So, any prospect of creating computer systems with intelligence close to that of a human
depends on the ability to facilitate natural language interaction. One notable example of
such systems, which sparks study in the field of computational linguistics, is the accurate and
quick machine translation of natural languages. Natural language processing is the set of
computing processes needed to allow a computer to process information using natural
language [4]. In order to communicate with computers in spoken and written environments
using natural human languages rather than computer languages, a branch of artificial
intelligence known as natural language processing (NLP) analyzes, comprehends, and
generates the languages that people naturally use. It also investigates issues in decision-making
and interpreting spoken languages [1].

NLP is a theoretically motivated set of computational approaches for analyzing and modeling
naturally occurring texts at one or more levels of linguistic analysis, in order to achieve
human-like language processing for a variety of tasks or applications [5]. The following levels
of natural language processing (NLP) need to be thoroughly researched in order to fully
realize its impressive capabilities: phonological (i.e., sounds or combinations of sounds),
morphological (processing of individual word forms), lexical (procedures operating on full
words), syntactic (grouping the words of a sentence into structural units), semantic (adding
contextual knowledge to the purely syntactic process in order to resolve meaning), and
pragmatic (using additional information about the social environment in which a given
document exists) [6]. Including details about the various levels of the data's underlying
organization, as previously defined,
aids in comprehending the nature of the language under consideration and, as a result, in
implementing the intended systems, as stated by Warner [6]. To successfully deploy
numerous NLP systems at various levels of processing, many applications in the NLP field
currently call for readily accessible linguistic knowledge. Systems have been created, for
instance, to process natural language at the phoneme, word, phrase, and pragmatic levels.
These systems are created in a way that allows the output of one level of the system to be
used as input for the level above it. For instance, the output of a sentence-level syntactic and
semantic parser could be fed into a word-level morphological synthesizer [2].
Many studies in this field have been conducted in a variety of languages, including English.
All of these initiatives are primarily intended to help computers understand human languages
[6]. To the best of the author's knowledge, however, very little research has so far been done
on any of the Ethiopian languages, including Afan Oromo, the subject of the current paper.
This study takes the occasion to shed some light on syntactic processing [7].
Therefore, the need for NLP systems such as a sentence parser is unquestionable for Afan
Oromo. Afan Oromo is the official language of the Oromia National Regional State. It is used
in offices, schools, colleges, universities, and the media. Thus, the availability of huge
amounts of electronic and non-electronic data motivated us to develop an NLP application.
"For computational linguists, parsing corresponds to producing some sort of structure
that fits and confirms a particular theory of syntax or language in general" [10]. We
have seen the purpose of parsers in terms of standard tools for NLP that do not represent
a final goal as such, but should contribute to improving other applications and serve many
tasks. Thus, we are motivated to develop an Afan Oromo sentence parser using the top-down
chart parsing approach. A flat input sentence is transformed into a hierarchical structure that
corresponds to the units of meaning in the sentence during syntactic processing (parsing,
from this point on). Syntactic parsing, according to Volk [11], "consists of partial functions
that link features to decision feature values or to syntactic categories." In order to create a
data structure that may be used to extract the meaning of the sentence, parsing is, broadly
speaking, the de-linearization of linguistic input, or the use of syntax to determine the
functions of words in the input sentence [12]. The relationship between the words in a
sentence is shown by the syntactic structure of the sentence. The main goal of this step is to
convert the potentially ambiguous input phrase into unambiguous forms.
Many parsing algorithms have been created since the early 1960s for a variety of languages,
including English [13]. The chart parser was invented by Earley; it was the first efficient
chart parser for the English language. Since then, numerous initiatives have been made to
bring decision tree sentence parsing to other languages around the globe [14]. Afan
Oromo is one of the languages that should have a decision tree sentence parser; however, to
the best of the author's knowledge, only one system of this kind has been created for the
language (Diriba [7]). Therefore, the creation of such a parsing mechanism is crucial. As
a result, this study addresses this issue, attempts to close the gap left by earlier research into
the creation of a tree sentence parser for the language, and sheds light on potential future
research directions.
1.2. Statement of the Problem
Afan Oromo is one of the major languages that are widely spoken in Ethiopia.
Currently, it is the official language of the regional state of Oromia (the largest regional
state in Ethiopia), being used as a working language in offices and as the medium of
instruction for primary and junior secondary schools; it is also given as a subject in
secondary schools (grades 9-12). As Mandafro reports in his work [11], at the country level
in Ethiopia, 8 public universities offer degree programs majoring in Afan Oromo, and
Addis Ababa University offers Afan Oromo at the Master's degree level.

Unlike Amharic, another major language and working language of Ethiopia, which belongs
to the Semitic family of languages, Afan Oromo is part of the lowland East Cushitic group
within the Cushitic family of the Afro-Asiatic phylum.

According to Abebe [9], Afan Oromo is not only spoken in Ethiopia; it is also spoken in
Somalia, Kenya, Uganda, Tanzania, and Djibouti. Although Afan Oromo is today spoken
by such a large number of people, few advances have been made in computational
linguistics or natural language processing for the language. "Computational approaches to
linguistic analysis of Afan Oromo have so far been hindered by the non-availability of
well-studied linguistic resources" [12].

Since Afan Oromo is the official language of the Oromia National Regional State, as
mentioned above, and is used in offices, schools, colleges, universities, and the media,
various written materials are being published both electronically and non-electronically
nowadays. This creates interest in NLP research for this language. For instance,
morphological synthesis [9], spell checking [13], grammar checking [14], part-of-speech
tagging [15][16][12], named entity recognition [1], news text summarization [17],
machine translation [8], word sense disambiguation [18], question answering [19], text
retrieval [20], and search engines [21] are some NLP applications among those that
require a sentence parser for successful and full-fledged implementation. Besides, a
sentence parser is a useful NLP application in the teaching and learning process, for
phrase identification and for understanding word relations in sentences of the Afan
Oromo language. It is also an important tool in NLP, serving as an intermediate
component for different higher-level applications like machine translation [4].
On the other hand, as mentioned in the section above, the Internet is one of the main
sources of information. The enormous amount of information on the Internet could be
used to enhance development by making it accessible to the public. To fully localize and
utilize these resources, translation of documents from one language to another may be
necessary. For example, many documents on the Internet are written in English; because
of this, English-to-Afan Oromo translation and vice versa may be required in syntax-based
machine translation [22]. Besides, according to [23], parsers have become efficient and
accurate enough to be useful in many natural language processing systems, most notably
in machine translation. Therefore, machine translation, which uses Afan Oromo sentences
as input and a sentence parser as a component, plays a great role in solving the translation
problem. Thus, we proposed to develop a sentence parser for the Afan Oromo language.
To this end, the researcher went through different literature to find whether there is any
sentence parser that can parse decision tree sentences in Afan Oromo. To the best of the
researcher's knowledge, there is no such Afan Oromo sentence parser. However, there is
one attempt by [5] at a sentence parser for the Afan Oromo language, using a supervised
learning technique for simple declarative Afan Oromo sentences. In that study, the chart
algorithm was used. In addition, an unsupervised learning algorithm was designed to guide
the parser in predicting unknown and ambiguous words in a sentence. It also adopted an
intelligent (rule-based learning module) approach to develop a prototype. The result
obtained was 80% on the training dataset, while results on the 20% test dataset were not
included; a morphological analyzer, which could have been used as a preprocessor to the
parser, was also missing. It was developed only for simple declarative sentences of the
Afan Oromo language.
Due to this fact, the researcher is motivated to develop a parser for decision tree Afan
Oromo sentences. The focus of this study is, therefore, designing and developing a
sentence parser for Afan Oromo text that handles decision tree sentences. Obviously, the
parser will have major significance for users of the language.
Moreover, as the nature and structure of sentence parsing (syntactic parsing) in Afan
Oromo differ from those of English, Amharic, and other languages, a sentence parser
developed for such languages cannot function for Afan Oromo. This is because the
language has a different syntactic and morphological nature, with its own grammatical
and word-formation techniques that differ from other languages. As a result, a sentence
parser developed for another language cannot be used for the Afan Oromo language,
which results in the need for an independent sentence parser. We therefore decided to
develop a sentence parser for Afan Oromo simple and complex sentences using the
top-down chart parsing algorithm.
Based on the above justification, this study attempts to answer the following questions:
• What are the properties and word orders of the Afan Oromo language?
• Is it possible to use other languages' sentence parsers for the Afan Oromo language?
• Does the adoption of other languages' parsing algorithms work for the Afan Oromo
language?
1.3. Objective of the Study

1.3.1. General Objective


The general objective of this research is to design a sentence parser for the Afan Oromo
language using the top-down chart parsing algorithm.

1.3.2. Specific Objectives
In order to achieve the general objective of this research, the following specific
objectives are formulated:
• To identify the properties of Afan Oromo sentences based on the knowledge base of
the language: the basic word order, word categories, morphological properties,
phrase structure, and sentence types that are useful for sentence parsing.
• To select sample sentences that would potentially serve for the experiment.
• To extract appropriate grammar rules to represent the structure of Afan Oromo
sentences.
• To design a general architecture of the Afan Oromo parser.
• To develop a simple algorithm for a lexicon generator in order to automatically
generate lexical rules from the sample corpus.
• To select and customize an appropriate parsing algorithm for the Afan Oromo
sentence parser.
• To evaluate the performance of the parser.
To meet these objectives, the study set out to:
• Review the basic word categories, morphological properties, phrase structures, and
the various kinds of sentences of Afan Oromo with the aim of investigating patterns
that allow computer representation;
• Collect sample simple and tree sentences to be used in the experiment;
• Build the database of the part-of-speech tagger using the stems of words taken from
the sample corpus, and calculate the lexical and transitional probabilities for them;
• Generate the grammar rules appropriate for the language;
• Use a combination of statistical lexical co-occurrence techniques in order to guess
the POS of unknown words;
• Customize the parsing algorithm that was used for the decision tree simple sentence
parser for Afan Oromo;
• Develop a prototype parser that implements the findings of the study, and test it to
measure its performance;
• Finally, forward recommendations based on the findings of the study.

1.4. Methodology
In order to develop a sentence parser for the Afan Oromo language, the characteristics of
the language and the different approaches that can be used for the development need to be
explored. The following are the methods that have been followed to achieve the general
and specific objectives of this thesis work.
1.4.1. Literature Review
A variety of relevant literature sources, including books, research reports, journal
articles, manuals, and other published and unpublished documents (including those from
the web), were reviewed for the purposes of this study. The review covered NLP-related
issues, especially the parsing of tree sentences (approaches, methods, strategies, etc.), and
the language issues under consideration (the basic word categories, morphological
properties, phrase structure, and different sentence types of Afan Oromo). Additionally,
the literature on parsing and general computational linguistics (such as the algorithms and
data structures used) was examined to gain a better understanding of how the language
works. This understanding made it possible to implement the features of the language
determined to be suitable for the study and to employ the parsing algorithms appropriately.
1.4.2. Discussion with Linguists
We also had fruitful discussions with linguists and experts on Afan Oromo sentences and
their subcomponents, particularly the phrase structure of the language. In-depth
conversations with linguists, both native and non-native speakers of Afan Oromo, helped
the researcher understand the well-formedness of Afan Oromo sentences and the
correctness of the parses.
1.4.3. Data Collection
Two types of sentences, 300 decision tree (complex) sentences and 10 simple sentences of
Afan Oromo, were gathered from published books, periodicals, and newspapers. The
sentences were chosen so that decision tree verb phrases and simple noun phrases were
present in the tree sentences, while the simple sentences contain simple verb and noun
phrases. As a result, adequate consideration was given to linguistic structures and types,
so that the entire chosen sentence set satisfies the necessary language structures, which is
advantageous for the research process.

1.4.4. Parsing Techniques and Prototype Development
In data preprocessing, 3,029 words were annotated with their respective tag categories. The
tagged stems (tagged using an HMM) were then used to determine the necessary lexical
dictionary and the Probabilistic Context-Free Grammar (PCFG) rules. Based on these, a
parsing algorithm was adopted and modified in order to generate the sentence parser.
The prototype was developed using NLTK and Tkinter, and the parsing algorithm was
implemented in Python.
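As an illustration of this pipeline, the following minimal sketch shows how NLTK represents
PCFG rules and finds the most probable parse with its Viterbi parser. The toy grammar and
sentence here are hypothetical stand-ins, not the rules actually extracted from the study's corpus.

    import nltk
    from nltk.parse import ViterbiParser

    # A toy PCFG in NLTK notation; the actual rules in this study were
    # extracted from the annotated Afan Oromo corpus.
    grammar = nltk.PCFG.fromstring("""
        S   -> NP VP        [1.0]
        NP  -> N            [0.7]
        NP  -> N ADJ        [0.3]
        VP  -> V            [0.6]
        VP  -> ADV V        [0.4]
        N   -> 'Tolaan'     [0.5]
        N   -> 'gaangee'    [0.5]
        ADJ -> 'guddaa'     [1.0]
        V   -> 'bite'       [1.0]
        ADV -> 'kalessa'    [1.0]
    """)

    parser = ViterbiParser(grammar)
    for tree in parser.parse(['Tolaan', 'kalessa', 'bite']):
        print(tree)          # the most probable parse, with its probability
        tree.pretty_print()  # ASCII diagram of the tree

The same grammar object also exposes the rules themselves (grammar.productions()), which
is how an automatically extracted rule set can be inspected.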
1.4.5. Testing Techniques
The project used two different kinds of data sets: a training set and a test set. Of the
300 total tree sentences, 80% of the sample corpus was randomly chosen as training data, and
the remaining 20% was used as test data. To verify that the algorithm also works for basic
sentences, more than 10 simple sentences were taken and evaluated. The experiment was
carried out in two stages, first on the training set and then on the test set, and the outcomes
were assessed.
The parse results were compared to sentences that had been manually parsed, and the
experiment was repeated until no further progress was apparent.
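A minimal sketch of this evaluation protocol, assuming a hypothetical corpus variable holding
(sentence, gold_parse) pairs and a parse function; only the 80/20 random split and the
comparison against manual parses follow the description above.

    import random

    def split_corpus(corpus, train_ratio=0.8, seed=42):
        """Randomly split (sentence, gold_parse) pairs into train and test sets."""
        items = list(corpus)
        random.Random(seed).shuffle(items)
        cut = int(len(items) * train_ratio)
        return items[:cut], items[cut:]

    def accuracy(parse, data):
        """Fraction of sentences whose best parse equals the manual gold parse."""
        correct = sum(1 for sentence, gold in data if parse(sentence) == gold)
        return correct / len(data) if data else 0.0

    # train_set, test_set = split_corpus(corpus)  # 240 / 60 items for 300 sentences
    # print(accuracy(my_parser, train_set), accuracy(my_parser, test_set))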
1.5. Application of Results and Beneficiaries
Syntactic parsers are among the essential elements of higher-level NLP systems. Hence,
parsing systems would be crucial in many NLP applications for the Afan Oromo language.
The results of this study will be extremely valuable to academics working to improve the
capacity of computers to understand the Afan Oromo language; undertaking research in the
field of NLP, especially sentence parsing, is therefore of utmost importance. The main
beneficiaries will be academics with an interest in conceptual parsing, machine translation,
phrase recognition, spell checkers, text summarization, etc. Besides, linguists and students in
the field of the Afan Oromo language might employ the outcome of this research to parse
phrases in the language. The result could also be applied in language instruction to identify
phrasal categories and to understand the relationships between words in a phrase.
1.6. Scope of the Study
The main focus of this work is the design and prototype development of a top-down chart
parser for Afan Oromo sentences. The prototype is designed by studying the word classes
of the Afan Oromo language and the types of sentences and their construction. However,
it is not within the scope of this work to incorporate the parser as a component of
higher-level NLP applications like grammar checkers, question answering, etc.
1.7. Limitation of the Study
This study has the following limitations:
1. The study did not incorporate all kinds of Afan Oromo sentences with their
attributes, like case, number, gender, person, and tense.
2. The prototype developed used a manually annotated morphological analysis
prepared for the purpose of this study, due to lack of the source code of the
previously developed morphological analyzer for Afan Oromo.
3. The tree sentences included in the sample do not exhibit tree noun phrases or
interrogatives.
4. Moreover, the researcher of this study believes that the size of the corpus used is
still very small.

1.8. Organization of the Thesis


There are six chapters in this work. Chapter 1 presents the definition of NLP and the purpose
of syntactic analysis, along with the problem statement and the study's objectives. Chapter 2
discusses several methods and tactics for creating decision sentence parsers, as well as
problems with sentence parsing. Chapter 3 describes the Afan Oromo language, including its
word classes, phrase categories, and tree sentence formalisms; the language's tree phrases and
sentence constructions are also covered there. Chapter 4 covers the design of the lexicon for
the Afan Oromo tree sentence parser. Chapter 5 reports on the algorithm created, the
experiments run, and the findings. Finally, Chapter 6 presents the conclusions and
recommendations based on the findings of the study.

CHAPTER TWO
REVIEW OF LITERATURE
2.1 Introduction
As already mentioned in the first chapter, the primary goal of the study is to build and
implement an Afan Oromo sentence parser for decision tree grammar sentences. When
developing parsers to examine how the syntactic structure of sentences can be computed, it is
common practice to take into account both the grammar and the parsing technique. The
grammar is a formal statement of the structures permitted in the language, while the parsing
approach is a way of examining a sentence to ascertain its structure, using the grammar as the
source of syntactic knowledge. This chapter discusses several sentence parsing methods
and strategies. An outline of decision grammar sentence parsing and its assessment standards
is provided in the first section. The second section reviews the various methods and
procedures for the task of decision sentence parsing. The lexicon and grammar rules, that is,
the knowledge needed by the parser, are covered in the third section of this chapter. In the
last section, several prior Afan Oromo NLP research projects that are connected to this study
are summarized.
2.2 Decision Tree Grammar Sentence Parsing (DTP)

A natural language system must use knowledge of the grammatical structure of the language:
what the words are, how they are put together to make sentences, what they mean, how
word meanings affect sentence meanings, and so forth [19].
The word "parsing" is derived from the Latin phrase "pars orationis" (part of speech), and it
describes the act of giving each word in a sentence a part of speech (such as noun or
adjective) and organizing the words into phrases (Allen [19]).
Parsing can be done at the word or the sentence level in natural language processing.
Word parsing is the process of tokenizing a word into its constituent parts, or individual
morphemes. The word must be tokenized into morphologically sound parts; these tokenized
parts are then examined further to determine how they contribute to the classification and
meaning of the entire word [2][20].
In sentence parsing (also known as syntactic parsing), grammatical rules are combined in
various ways to produce a tree that represents the structure of the input sentence. In other
words, according to Allen [19], it is a task in NLP where a flat input sentence is transformed
into a hierarchical structure that is consistent with the units of meaning in the sentence. The
parser receives the input string token by token. For each token, the parser calls the
morphological analyzer, which breaks words down into their roots and affixes in accordance
with the morphological rules of the language (here, Afan Oromo). Roots and affixes are
maintained in a lexicon, which is made up of a collection of records relating various kinds of
linguistic data. The parse tree, a diagrammatic representation of the input text, keeps track of
the linguistic rules and how they are applied. According to Allen [19], each node of a parse
tree corresponds either to an input word or to a non-terminal of the grammar. A different
grammar rule is applied at each level of the parse tree. The terminal symbols, however, are
linked to the input words via their lexical categories.
'Gaangeen, Tolaan kalessa bitee hara ganama duute'
("The mule that Tola bought yesterday died this morning")

(S
  (NP
    (NP (N Gaangeen))
    (ADVP (NP (N Tolaa)) (ADVP (ADV kaleesa) (V bitee))))
  (VP (ADVP (ADV hara) (ADP ganama))
      (VP (V duute))))

Figure 2.1: Example of a parse tree


Here S, NP, VP, ADVP, N, V, ADP, and ADV stand for sentence, noun phrase, verb phrase,
adverbial phrase, noun, verb, adposition, and adverb, respectively. Parsing may be done
either manually or automatically; as the amount of text to be parsed grows, the manual
procedure becomes time-consuming, error-prone, and expensive [5]. Decision tree grammar
sentence parsing eliminates such costs and is crucial to computers that understand natural
language [7]. Nowadays, there are a variety of methods for parsing sentences, which can be
broadly divided into rule-based methods and statistical methods. These strategies are covered
in the section that follows.
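The bracketed analysis in Figure 2.1 can be loaded and inspected directly with NLTK, the
toolkit used in this study; a minimal sketch:

    import nltk

    # The parse from Figure 2.1, in bracketed-tree notation.
    tree = nltk.Tree.fromstring("""
    (S
      (NP
        (NP (N Gaangeen))
        (ADVP (NP (N Tolaa)) (ADVP (ADV kaleesa) (V bitee))))
      (VP (ADVP (ADV hara) (ADP ganama))
          (VP (V duute))))
    """)

    tree.pretty_print()              # draw the tree as ASCII art
    print(tree.leaves())             # the words of the sentence, in order
    print(tree.label(), len(tree))   # root category 'S' and its two children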
2.3 Approaches to Decision Tree Grammar Sentence Parsing

Parsing is a decision-making process that involves the simultaneous exploration of
computational and psychological issues. When parsing is viewed as applying a mental
grammar, the forms of the grammars utilized have frequently been subject to strict
constraints, with the majority of these criteria mandating that the linguists' grammar be used
directly. There would be no requirement for a precise and intentional mapping between the
grammar and the parser if the study of grammars were not seen as the study of a mental
system [21][22]. No stricter relation would be required as long as the parser recovers the
same structural descriptions that the grammar assigns to the strings in the language. The only
pressing factors influencing the architecture of the parser and the methodologies are time and
space restrictions, which are independent of the grammar [13][21][23]. Sentence parsing
requires examining the ways that a sentence could have been derived from the start symbol
(usually written as S) [24][19][21]. Many studies have been conducted, and continue to be
conducted, to investigate effective sentence parsing methods. Nonetheless, the efforts made
thus far and the methods identified to address this issue fall into two broad categories:
stochastic versus rule-based [7].
2.3.1 Rule-Based Approach to Decision Tree Sentence Parsing
According to Mao [23], a rule-based system learns a set of rules from a set of tokens
(strings) and then parses sentences in accordance with these rules. This method does not
parse text using probabilities (or statistics); it is fully predicated on data from the
knowledge base and, if any, learning techniques for handling ambiguity and guessing
unfamiliar words.
A rule-based parser consists of various parts [19]. The grammar rules (sometimes called
rewrite rules or production rules) are the primary part of any parser. These are the rules that
the parser refers to before beginning to parse a sentence. The lexicon (or dictionary) is
another important part of a rule-based parser, according to Diriba [7]. A lexicon, or list of
all the grammatical categories of the words and phrases utilized in the parsing process, is
necessary for a parser. It offers distinctive coding for all word classes with distinctive
grammatical behavior. The lexicon is essential because the parser uses this dictionary to
parse sentences into syntactic tree structures as soon as it receives input tokens (strings).
The lexicon includes a list of every lexical category to which a word might be assigned.
In a rule-based approach, morphological rules are also helpful. The morphological rules
offer information that can be used to handle words that are not in the parser's dictionary. In
other words, these rules can be used to reasonably infer the grammatical categories of
unknown words [7]. Two parsing methods can be used in the rule-based approach:
top-down and bottom-up parsing [7].
2.3.2 Stochastic Approach to Decision Tree Sentence Parsing
Stochastic-based parsers use probability (sometimes known as statistics) to analyze the
parsing problem. The stochastic approach, often known as the corpus-based approach, is
founded on the Markov assumption in sentence parsing, Bayes' theorem, and the
independence of events. These ideas are used to identify each word's most likely lexical
sequence within a given sentence [7]. The corpus-based technique can be further divided
into supervised and unsupervised approaches, depending on the type of text corpora used
[19]. Unsupervised approaches use natural corpora, such as those found in books and
newspapers, while supervised approaches use annotated text corpora.
Systems for decision tree grammar syntactic analysis constructed using the supervised
technique are known as supervised parsers, and they use probability (i.e., statistics) to study
the parsing problem. In a supervised parser, the two key information sources are the lexicon,
which contains every word together with every potential lexical category and its estimated
lexical probabilities, and the list of contextual probabilities for each lexical category. The
list of contextual probabilities indicates the proper lexical category for a given
circumstance [19].
The two main issues in developing supervised parsers are the lack of manually parsed text
(corpora) and the requirement for manual parsing each time the parser is applied to a new
text [19]. Manual parsing is very expensive and time-consuming, but if pre-tagged corpora
are widely accessible, stochastic parsers in general, and the Hidden Markov Model (HMM)
technique in particular, can be easily adapted to new languages.
The training process for parsers created utilizing unsupervised stochastic approaches does
not require any pre-tagged material. The syntactic analysis technique is developed using
heuristics or probabilistic data obtained from the corpus [7][3]. These parsers share a
characteristic with their supervised counterparts in that both make the HMM assumption.
An HMM is a set of states (in this case, lexical categories) with directed edges and
transition probabilities that give the likelihood of moving to the state at the end of a
directed edge, given that one is currently in the state at its start. The states are also labelled
with a function that gives the probabilities of outputting different symbols in that state
(while in a state, one outputs a single symbol before moving to the next state).
In this case, the symbol output from a state (lexical category) is a word belonging to that
lexical category [3]. However, the unsupervised stochastic parser has unique features:
training takes place on unparsed or fresh text, it uses the Baum-Welch algorithm (which is
different from the Viterbi algorithm), and so on.
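As a concrete illustration of the HMM machinery described above, the following minimal
sketch uses NLTK's HMM trainer on a hypothetical two-sentence tagged corpus; the tokens
and tags are illustrative and are not this study's 3,029-word corpus.

    from nltk.tag.hmm import HiddenMarkovModelTrainer

    # Hypothetical pre-tagged sentences as (word, tag) pairs.
    train_data = [
        [('Tolaan', 'N'), ('kalessa', 'ADV'), ('bite', 'V')],
        [('Gaangeen', 'N'), ('duute', 'V')],
    ]

    trainer = HiddenMarkovModelTrainer()

    # Supervised mode: transition and emission probabilities are estimated
    # from the pre-tagged corpus; tagging new text then uses Viterbi decoding.
    tagger = trainer.train_supervised(train_data)
    print(tagger.tag(['Tolaan', 'bite']))

    # Unsupervised mode (not run here): trainer.train_unsupervised(...)
    # re-estimates the model from untagged text with the Baum-Welch algorithm.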
2.3.3 Parsing Strategies

Several approaches have been put forward to address parsing-related issues such as where to
begin, how to examine a string or a rule's right-hand side (RHS), and how to consider
alternatives. NLP researchers have offered many strategies as answers, including top-down,
bottom-up, left-to-right, right-to-left, depth-first, breadth-first, and chart parsing. The
subsections that follow describe a few of the most significant solutions offered at various
points in time.

2.3.3.1 Top-down vs Bottom-up Parsing

Top-down and bottom-up are competing ideas that have been put forward as alternative
answers to the strategy question regarding the direction of the parsing procedure. Top-down
parsing starts with the start symbol, which is typically a sentence S, and applies the grammar
rules forward until the symbols at the tree's terminals match the components of the sentence
being parsed. As an illustration, if the rule S → NP VP is applied and the parser begins in
state (S), the symbol list becomes (NP VP). If the rule NP → ART N is then applied, the
symbol list becomes (ART N VP), and so on. The parser may recursively proceed in this way
until it reaches the states of the terminal symbols, at which point it may check the input
sentence to see whether the word classes within it correspond to the written sequence of
terminal symbols [19]. Parsers created in this manner are called top-down parsers. To
determine its next step, this kind of parser forms an assumption about what it is looking for.
Thus, a top-down parser is characterized by a series of goals for ascertaining the remaining
words.
Contrarily, bottom-up parsing starts with the sentence to be parsed and applies the
grammar rules backward until a single tree has been formed [3], whose terminals are the
sentence's words and whose top node is the start symbol (often S, for sentence). To put
it another way, it begins with each word and assigns its grammatical category, working
up to the start symbol. The highest-level label sequence is used as the new string, and this
process is repeated for each state. The task of the parser is then to group words of
particular categories together (e.g., take a sequence ART ADJ N and identify it as an NP)
in a manner permitted by the grammar.
Top-down methods have the advantage of being highly predictive. A word might be
ambiguous when considered in isolation, but if some of its grammatical categories cannot
appear in a legal sentence, those categories may never even be considered [19].
A severe issue with this method is redundancy of effort: large constituents may need to be
constructed repeatedly when they are utilized in other rules. The bottom-up parser, in
contrast, builds each constituent exactly once and examines the input phrase once. The
bottom-up parser operates from left to right; it exhausts all of its options with one item
before moving on to the next, and so forth. In other words, the parser builds successive
layers of syntactic abstraction on top of the data provided, and it is fully driven by the data
presented to it.
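Both strategies have reference implementations in NLTK; the following minimal sketch
contrasts them on a hypothetical toy grammar (the grammar and the sentence are illustrative,
not taken from this study).

    import nltk

    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> ART N
        VP -> V NP
        ART -> 'the'
        N  -> 'mule' | 'farmer'
        V  -> 'bought'
    """)

    sentence = 'the farmer bought the mule'.split()

    # Top-down: starts from S and expands goals until terminals match the input.
    for tree in nltk.RecursiveDescentParser(grammar).parse(sentence):
        print(tree)

    # Bottom-up: shifts words and reduces completed right-hand sides toward S.
    for tree in nltk.ShiftReduceParser(grammar).parse(sentence):
        print(tree)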

Sadly, Allen [19] claims that whether a top-down or bottom-up implementation is chosen, the
cost is unaffordable, because the parser tends to repeatedly try the same matches, duplicating
a lot of its work. So there should be a method that enables the parser to save the results of the
matching it has already performed, in order to avoid such redundancy problems. This method
is known as chart-based parsing.

As a result, combining the two approaches may produce a better parser. A minor adjustment
to the bottom-up chart algorithm results in a method that is predictive, like top-down
approaches, while avoiding the work redundancy of pure bottom-up approaches.

2.3.3.2 Left-to-right vs Right-to-left
These are the opposing answers that have been put forward to the question of the proper
order in which to examine the substrings of an RHS. In contrast to right-to-left (i.e.,
end-to-beginning) parsing, left-to-right parsing processes the words of the sentence from left
to right (i.e., from beginning to end). In other words, it starts with the leftmost symbol and
moves on to the next symbol to its right. The parser will eventually function either way, so
logically it may not matter which direction the parsing process takes [25]. Right-to-left
parsing is perhaps less intuitive than left-to-right parsing, but there are times when employing
both tactics is advantageous.
If the sentence is damaged, for example, by the presence of a misspelled word, a parsing
technique that incorporates both left-to-right and right-to-left processing may be helpful: the
text to the right of the error can then still be parsed. The top-down method has trouble with
rules that exhibit left recursion when applied from left to right [25]. Left recursion happens
when the first category on the RHS of a rule is the same as the category on the LHS
(Left-Hand Side). In this case, it is possible to transform a left-recursive grammar into an
equivalent grammar that does not employ left-recursive rules and yields the same set of
strings (although it will not assign the same structures), as the sketch below illustrates.
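A small illustration of that transformation, using a hypothetical NP rule that is not part of
this study's grammar:

    import nltk

    # A left-recursive rule such as  NP -> NP 'PP' | 'N'  would send a
    # top-down (recursive descent) parser into an infinite loop.

    # An equivalent right-recursive grammar accepts the same strings
    # (N, N PP, N PP PP, ...) but assigns different tree structures.
    right_rec = nltk.CFG.fromstring("""
        NP   -> 'N' | 'N' REST
        REST -> 'PP' | 'PP' REST
    """)

    parser = nltk.RecursiveDescentParser(right_rec)
    for tree in parser.parse(['N', 'PP', 'PP']):
        print(tree)   # (NP N (REST PP (REST PP)))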
2.3.3.3 Depth-first vs Breadth-first

There are two competing approaches, depth-first and breadth-first, to the problem of
investigating the parse space (i.e., the alternative parses) of a given input sentence. In the
former, a single alternative parse is fully pursued at each choice point before another
alternative is attempted; the situation at the decision points must be remembered, and
choices may be abandoned after fruitless investigation. Breadth-first, on the other hand,
pursues every alternative parse at each decision point one step at a time. As each alternative
computation must be simultaneously remembered, it necessitates extensive bookkeeping.
While breadth-first search is guaranteed to be complete, depth-first search is typically
incomplete.

2.3.3.4 Chart Parsing

The chart is a data structure for representing fragments of the parsing process in a way that
lets them be reused later. The chart for an n-word sentence is made up of n+1 vertices and a
number of edges that connect the vertices. A chart parser is a type of parsing algorithm that
keeps a table of all the well-formed substrings it has so far discovered in the text it is parsing.
Although a variety of parsing algorithms can utilize chart techniques, they have most often
been applied to a particular bottom-up parsing algorithm [12].

This method's key premise is that increasing parsing efficiency is essential. As stated by
Russell and Norvig [28], there are three considerations to keep in mind for chart parsing
efficiency: do not do twice what can be done once, do not do once what can be avoided
entirely, and do not represent distinctions that are not needed.

Chart parsers keep track of every constituent that has been retrieved from the sentence so far.
In other words, they store intermediate results and keep track of rules that have matched but
are not yet fully satisfied. More specifically, once it is realized that "reenfi tiskee hoolota
loolaan ajjesee" ("the body of the shepherd that was killed by the flood") is a tree NP as used
in the sentence "reenfi tiskee hoolota loolaan ajjesee gara hospitalaatti ergame" ("the body of
the shepherd that was killed by the flood was sent to the hospital"), it is advisable to record
that result in a data structure known as a chart. Recording interim outcomes in this way is a
dynamic programming technique that prevents duplicate work.

Chart-based techniques use a combination of top-down and bottom-up processing, which
means they never have to consider constituents that could not lead to a complete parse, as
can be seen from the foregoing. (This also means that they can handle rules with empty
right-hand sides and left-recursive rules without entering an infinite loop.) The algorithm's
output is a packed forest of parse trees, each of whose parts is labeled with the appropriate
grammatical category.

A chart-based parser's fundamental operation entails combining an active arc (also known as
an edge) with a finished constituent. The outcome is either a new finished constituent or a new
arc that extends the initial active arc. New complete constituents are kept on an agenda list
until they themselves are added to the chart. For instance, [0, 5, S → NP VP •] indicates that
an S covering the string from position 0 to position 5 is made up of an NP followed by a VP;
the numbers give the start and end positions of the span in the input sentence.

The symbol • in an edge separates what has been discovered thus far from what remains to be
discovered. Edges that end in • are referred to as complete edges. The edge [0, 2, S → NP • VP]
states that an NP spans positions 0 to 2; the parser would have an S if it could discover a VP
to follow it. Active arcs are edges like this, with the dot before the end, as the traced sketch
below shows.

The same constituent is never produced more than once, making chart-based parsers more
efficient than pure search-based ones. To parse a phrase of length n, a pure top-down or
bottom-up search strategy could need up to C^n operations, where C is a constant that depends
on the particular algorithm utilized. This exponential growth quickly renders the method
useless, even if C is relatively small [19][16][29].

A chart-based parser, however, is said to need on the order of K·N² time and space, where N
is the sentence's length and K is a constant dependent on the algorithm, to build each element
as a lexical category between every pair of positions. It therefore significantly decreases the
number of parsing operations. To parse an n-word phrase using chart parsing, create a chart
with n+1 vertices and add edges one at a time, attempting to generate a complete edge of
category S that spans from vertex 0 to vertex n. There is no backtracking: whatever is entered
into the chart remains there. In general, there are two distinct problems that need attention.
The first covers strategies for increasing the effectiveness of parsing approaches by
decreasing the search while leaving the end result the same, and the second involves methods
for choosing among the many interpretations that a parser might identify. The following
observation motivates the usual solution. The bottom-up method, as already mentioned,
failed to store any intermediate findings; this is the main reason for its excessively
time-wasting behavior of repeatedly checking things that have previously been checked and
cannot have changed. This might qualify as amnesia!
Each new category is examined to determine whether it completes an RHS, and each new
neighboring pair of categories is examined likewise. The solution requires keeping track of
which categories the parser has found so far, which makes it slightly more complex to
implement [19][28][29].
Although any parser must store some state in order to remember what it is doing at any given
time, chart parsers in particular must remember the multiple hypothesis states that are currently
being considered. This issue of storing intermediate results is independent of the distinctions
already discussed; the secret to efficient parsing turns out to be precisely the storage of interim
findings. Chart-based parsers encode the intermediate findings discovered during a parse using
the chart data structure [19] [28][27].

Using strategies that express uncertainty can also make parsers more efficient, because they
will not have to make a hasty decision only to change their minds later. Instead, the
uncertainty is carried forward through the parse until all but one of the possibilities are
eliminated by the input. The efficiency of the method presented here stems from the fact
that all potential outcomes are taken into account beforehand, and the data is saved in a
table that controls the parser, allowing for significantly faster parsing. It is clear that chart
parsers outperform the other parsers discussed in terms of efficiency. To prevent redundant
effort, they encode interim results. Moreover, the chart parser anticipates the word category
of unidentified terms (those that are not in the knowledge base) and encodes uncertainty to
avoid committing to one reading prematurely. As a result, Allen's [19] [30] [31] chart parser
will be used in the current investigation to predict the category of unknown words and to
handle uncertainty.

2.4 Knowledge Required by the Parser


A precise specification of the language for which the parser is designed is necessary in order to
construct effective parsing algorithms. A significant portion of the knowledge needed for NLP
must originate from language studies. This aids in our comprehension of the structure and
functionality of the language. Understanding the language's grammar also aids in developing
an understanding of how the system ought to operate in a variety of situations. Because of
this, NLP systems will have to be built on the knowledge that linguists have about the
structure of languages [32][33]. A knowledge base is needed to direct the parser in the syntactic
analysis system that will be created. A morphological analyzer, a POS tagger, a lexicon, and a set of
grammar rules are the four primary parts of the system to be constructed. While the last two of
these knowledge base components, the lexicon and the grammar rules, are presented here, the
first two are covered in a later part of this chapter.
2.4.1 The Lexicon
A lexicon is a collection of details about the words of a language and the lexical categories to which
they belong. Lexicon is a linguistic name for a dictionary. It usually acts as the foundation of any NLP
system and, in order to direct processing, must include details about each possible word that the
system might encounter.

Typically, a lexicon is organized as a list of lexical entries, such as ("pig" N V ADJ). In
addition to its common usage as a noun, "pig" can also be used as a verb ("Jane pigged
herself on pizza") and an adjective ("pig iron"). A lexical entry will typically include more
details about the functions a word performs, such as feature information: whether a verb is
transitive, intransitive, or bi-transitive, or what form the verb takes, such as present participle
or past tense [12]. Allen [19] contends that as long as a lexicon is supplied, a grammar need
not include any lexical rules of the kind N → flower. Abebe [2] illustrates a simple decision
tree grammar lexicon for Afan Oromo as follows.

N – Nama, V - Deeme

Adj – guddaa, Adv - ariifatee

In this illustration, the words on the right side are classified by POS symbols on the left.
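As a minimal sketch of how such a lexicon might be stored for a parser (an illustration only, using the entries listed above), each surface form can map to the list of lexical categories it may take:

# A minimal lexicon sketch: each surface form maps to its possible
# lexical categories, mirroring the entries given above.
lexicon = {
    "nama": ["N"],          # noun
    "deeme": ["V"],         # verb
    "guddaa": ["Adj"],      # adjective
    "ariifatee": ["Adv"],   # adverb
}

def lookup(word):
    # Return all POS categories recorded for a word (empty if unknown).
    return lexicon.get(word.lower(), [])

print(lookup("Nama"))   # ['N']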
2.4.2 The Grammar Rule

The formal description of the rules and syntax that a language can use is known as grammar. The
most typical way to portray grammars is as a collection of grammar rules that generalize to group
words into what are frequently referred to as "parts of speech" or grammatical categories. Several
linguistic theories are based on grammar rules, and many natural language comprehension systems
are built on top of these ideas [7]. Afan Oromo grammar is organized in an LR (left-to-right) table.
The following is a basic grammar example for the sentence ―Tolaan gara mana barumsaa demee‖,
―Tola went to school‖:

S => NP VP
VP => PP VP
VP => V
NP => NAME
PP => P N
NAME => Tolaan
P => gara
N => manabarumssa
V => demee
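Assuming the NLTK library is available, this toy grammar can be written and run directly; the sketch below keeps ―manabarumssa‖ as a single token, exactly as listed above:

import nltk

# The toy grammar above, written in NLTK's CFG notation.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> PP VP
    VP -> V
    NP -> NAME
    PP -> P N
    NAME -> 'Tolaan'
    P -> 'gara'
    N -> 'manabarumssa'
    V -> 'demee'
""")

parser = nltk.ChartParser(grammar)
sentence = ["Tolaan", "gara", "manabarumssa", "demee"]
for tree in parser.parse(sentence):
    print(tree)
# (S (NP (NAME Tolaan)) (VP (PP (P gara) (N manabarumssa)) (VP (V demee))))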

There are various grammar specifications, often known as grammatical formalisms. The most
popular and widely used formalisms are Probabilistic Context Free Grammars [19], Decision
Tree Grammar [34], Transformational Grammar [34] by Chomsky, Transition Network Grammars
[35] by Woods, and Unification Based Grammar [36] by Kay. As a result, the grammar rules will
vary based on the theoretical foundation of the particular grammar.

2.4.2.1 Context Free Grammars

Context-free grammars (CFGs) are those that are made up solely of rules with a single symbol on the
left-hand side. A CFG is a formal system that delineates how every legal text can be derived from
a distinctive symbol known as an axiom, or sentence symbol, in order to represent a language. Since
CFG rules must be monotonic, a CFG rule can only be non-monotonic if its right-hand side is empty.
CFGs are crucial for two reasons: first, the formalism is strong enough to capture the
majority of natural language structure and, second, it is sufficiently constrained to enable the
development of effective parsers for sentence analysis [19]. This formalism is made up of a collection
of productions, each of which asserts that a certain symbol may be rewritten as a specific pattern of
symbols. One such production is S → NP VP, which claims that S can be substituted by the sequence
of NP and VP. NP and VP are in turn replaced by sequences of symbols (for example, NP → Adj N
and VP → V NP).

Non-terminals, also known as symbols that need to be replaced, are always represented by identifiers,
which are collections of letters and digits. In at least one production, every non-terminal must come
before a colon. The axiom is a non-terminal that only ever appears before the colon and never
between the colon and the period in any production. There must only be one non-terminal that
satisfies the requirements of the axiom. Terminals are symbols that cannot be changed; they can be
expressed by identifiers or literals (which are a sequence of characters bounded by apostrophes).

2.4.2.2 Transition Network Grammars


An additional formalism with a wide range of applications is the Transition Network Grammar. It is
built on the idea of a transition network with labeled arcs and nodes. Simple transition networks, also
known as finite state machines (FSMs), are comparable to regular grammars in terms of expressive
power and are therefore unable to adequately characterize all the languages that may be described by a
DTG [19]. To acquire the descriptive power of DTGs, the network grammar concept of recursion is
required. Recursive Transition Network Grammar (RTNG) is the name of the grammatical formalism
based on this idea.
Augmented Transition Network (ATN) is yet another linguistic formalism, developed by
Woods [35]. It is one of the most often used formalisms for developing natural language
grammars [37]. A transition network is a presentation of a regular (or finite-state) grammar.
The network is a directed graph with terminal symbols (words or word categories) for arc
labels. One node represents the graph's start state, while one or more nodes represent the
final states. A sentence is in the language defined by the network if there is a path from the
start state to some final state such that the labels on the arcs of the path match the words
of the sentence.
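A minimal sketch of a simple (non-recursive) transition network recognizer follows. It assumes arcs labeled with word categories and a small lexicon supplying each word's category; both the states and the toy entries are illustrative, not part of any cited formalism's implementation.

# A simple transition network: arcs map (state, category) -> next state.
# This toy network accepts NAME P N V sequences such as
# "Tolaan gara manabarumssa demee".
arcs = {
    ("q0", "NAME"): "q1",
    ("q1", "P"): "q2",
    ("q2", "N"): "q3",
    ("q3", "V"): "q4",
}
final_states = {"q4"}

categories = {"Tolaan": "NAME", "gara": "P",
              "manabarumssa": "N", "demee": "V"}

def accepts(words):
    state = "q0"
    for w in words:
        state = arcs.get((state, categories.get(w)))
        if state is None:            # no matching arc: reject
            return False
    return state in final_states     # must end in a final state

print(accepts("Tolaan gara manabarumssa demee".split()))  # True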
2.4.2.3 Decision Sensitive Grammars

The Decision Sensitive Grammar, a phrase structure grammar containing decision-sensitive
rules, is yet another widely used grammatical formalism. The following two definitions of a
decision-sensitive grammar rule are equivalent:

• Rules of the form x → y, where x and y are strings of alphabet symbols, with the
restriction that length(x) <= length(y).
• Rules of the form A → y / x _ z, where A is a non-terminal symbol, y is a sequence of one or
more terminal and non-terminal symbols, and x and z are sequences of zero or more
terminal and non-terminal symbols.
The meaning of the latter rule (or production) is that A can be rewritten as y if it appears in the
context x _ z, i.e. immediately preceded by the symbols x and immediately followed by the
symbols z [37]. Context-sensitive grammars are more powerful than CFGs, though the former
kinds of grammars are much harder to work with than the latter [12].
2.4.2.4 Unification-based Grammars

The term "unification-based grammars" refers to a grammar formalism that extensively uses feature
structures (such as case, gender, and tense), including the values reflected in the lexical entries of
words. The process of unification operates on these feature structures (i.e. the entire grammar can be
specified as a set of constraints between feature structures). The unification-based grammars, of
which DTGs are the most prevalent, can be supported by CFGs or any of the grammar formalisms
mentioned above. According to Joshi (NY), mentioned in Daniel [3], recursion can be embedded in
the feature structures, which is the major cause of the unification-based grammars' overwhelming
power.
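NLTK's feature structures offer a convenient way to experiment with unification. In this sketch the feature names (CASE, NUM) and values are illustrative assumptions, not a fixed inventory of the formalism:

from nltk.featstruct import FeatStruct

# Two partial descriptions of the same constituent: one supplies CASE,
# the other supplies NUM. Unification merges compatible information.
fs1 = FeatStruct(CASE="nominative")
fs2 = FeatStruct(NUM="sg", CASE="nominative")
print(fs1.unify(fs2))   # [CASE='nominative', NUM='sg']

# Incompatible values cause unification to fail (returns None).
fs3 = FeatStruct(CASE="accusative")
print(fs1.unify(fs3))   # None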

2.4.2.5 Probabilistic Decision Tree Grammars (PDTG)

DTGs can be generalized, just as finite state machines can be, to the probabilistic case. This
can be done by gathering usage statistics for the rules: simply counting the instances of each
rule in a corpus of parsed sentences and estimating the probability of each rule being used
from this statistical data. Given a category C and m grammar rules with left-hand side C, the
probability of using rule Rj to derive C can be calculated as

Pr(Rj | C) = count(# times Rj used) / Σ i=1..m count(# times Ri used)

The Probabilistic Decision Tree Grammar (PDTG) formalism refers to such Decision Tree
Grammars together with their associated probabilities. So a typical PDTG grammar based on
a parsed version of a given corpus contains counts for each LHS, counts for each rule, and a
probability for each rule produced. A PDTG is hence typically defined as a four-tuple
(W, N, S, R), where:

W = {w1, w2, …, wu} is a set of terminal symbols, such as the words in a sentence,

N = {N1, N2, …, Nv} is a set of non-terminal symbols,

S = {N1} is a set containing only the starting symbol, and

R = {R1, R2, …, Rw} is a set of grammar rules with probabilities. Each rule Rm ∈ R is a context-
free grammar (CFG) rule of the form Rm: Ni → ζj, with probability P(Rm) = P(Ni → ζj) [38].
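As a sketch of the counting scheme above, NLTK's induce_pcfg performs exactly this maximum-likelihood estimation over a list of productions; the two hand-written trees stand in for a parsed corpus and are assumptions made for the example:

from nltk import Tree, Nonterminal, induce_pcfg

# Two toy parsed sentences standing in for an annotated corpus.
trees = [
    Tree.fromstring("(S (NP (NAME Tolaan)) (VP (V demee)))"),
    Tree.fromstring("(S (NP (NAME Caalaan)) (VP (NP (N farda)) (V bite)))"),
]

# Collect every production used in the corpus, then estimate
# Pr(Rj | C) = count(Rj) / sum_i count(Ri) for each LHS category C.
productions = []
for t in trees:
    productions += t.productions()

grammar = induce_pcfg(Nonterminal("S"), productions)
for prod in grammar.productions():
    print(prod)     # e.g. VP -> V [0.5], VP -> NP V [0.5]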
In order to implement a PCFG, some independence assumptions regarding rule use must be
made. Specifically, it is assumed that the way a constituent is employed as a sub-constituent
has no bearing on how likely it is that the constituent was generated by a rule Rj. This
assumption allows for the development of a formalism based on the likelihood that a
constituent C produces the sequence of words Wi, Wi+1, …, Wj, denoted as Wij, where i and
j denote the positions of the words in the sentence.
Although they do not take context or lexical co-occurrence into account, PDTGs are particularly
helpful for sentence parsing since, compared to plain CFGs, they allow handling scenarios
like structural ambiguity, ungrammatical phrase analysis, and grammar learning. They introduce
statistical features of natural languages and provide a better probabilistic model for syntax analysis [38]. In
this study, Afan Oromo words are described and represented according to their accepted grammatical and
syntactic categories using the PDTG grammar formalism. This chapter's final section examines related NLP
systems that can serve as crucial building blocks for the creation of automatic sentence parsing.

2.5 Related NLP in Parsing

According to Win Win Thant, Tin Myat Htwe, and Ni Lar Thein [14], the challenge of assigning
function tags and context free grammar (CFG) to parse Myanmar phrases was addressed using
Naive Bayes. Due to Myanmar's free-phrase-order and grammatical morphological system,
statistical function labeling for Myanmar sentences can be difficult. Function tagging was utilized
as a pre-processing step before parsing. Assigning function tags to nodes in a syntactic parse tree is
a task that Mihai Lintean and Vasile Rus [29] outlined using two machine learning techniques,
naive Bayes and decision trees. They made use of a number of Blaheta and Johnson-inspired
elements [39]. The collection of functional tags in Penn Treebank and the set of classes they
utilized in their model are identical.

By using numerous dependence rules and segmentation, Yong-uk Park and Hyuk-chul Kwon [4]
attempted to disambiguate for a syntactic analysis system. Parsing involves segmentation. If there
are no syntactic connections between two adjacent morphemes, the syntactic analyzer creates a
new segment between the two morphemes, finds all potential partial parse trees of that
segmentation, and combines them into full parse trees.

2.6 Related NLP in Afan Oromo


As was covered in chapter one, little effort has been made in Afan Oromo NLP in the area of
automatic sentence parsing. The only effort to create a straightforward automatic sentence parser
for the Afan Oromo language was made by Diriba [7], who applied the chart algorithm to his
study with a few modifications. In order to make it easier to prepare texts in a file to be parsed
with appropriate lexical categories, a module for a morphological analyzer that divides words
into their root form and associated morphemes was also built. An unsupervised learning
technique was also created to assist the parser in predicting unknown and unclear terms in a
phrase. The design of grammar rules, lexicon, morphological rules, and contextual data was also
based on an analysis of the linguistic characteristics of Afaan Oromo grammatical categories. In
actuality, this system was a first for this language.

The study used an intelligent (rule-based + learning module) technique to create a prototype for the
language, an easy-to-use Oromo parser. It briefly explains the steps involved in the automated
sentence parsing of free texts. In other words, the goal was to create a prototype and use it for an
experiment. On the training set, the accuracy was 95%, and on the test set, it was 88.5%.
2.7 Related NLP Component Systems
2.7.1 Morphological Analyzer

Recognizing and distinguishing specific word forms from the input text is the first stage in every
NLP task [31]. A lexicon that simply lists all word forms along with their part of speech and
inflectional information, such as number and tense, can provide this information in some languages,
such as English. The number of forms that must be listed in such a lexicon is manageable because
such languages have an inflectional system that is relatively straightforward. However, for many
other highly inflectional languages, such as Afan Oromo, where each noun or verb has a number of
inflected forms, a full lexical listing is just not possible.

This is due to the fact that each lexical word may have literally thousands of unique surface forms, each
with different inflectional characteristics but identical vocabulary parts overall [40]. As a result, NLP for
these languages would only be useful if it included a morphological analyzer that could compute the
parts-of-speech (POS) and inflectional categories of words using the morphological information of the
language [41][39].

Hence, a morphological analyzer is a key component required to break down words into their
morphemic components and to identify the word classes (such as noun, verb, etc.) to which a
specific word may belong before the work of parsing is carried out. It involves the rules required
to treat words that are not in the parser's vocabulary and to produce useful information about them.

In other words, these rules can be used to make educated guesses about the
grammatical categories of unknown words. Moreover, a morphological analyzer
can help a part-of-speech tagger (POST), a key element of a syntactic
parsing system. The next part, which discusses this second crucial element of a sentence
parser, provides a basis for including a morphological analyzer in this study.

To this end, Abebe created a prototype morphological analyzer for Afan Oromo [2]. By
extracting prefixes, stems, and suffixes from a given corpus, he created a morphological
dictionary (also known as a signature) using a rule-based method for an Afan Oromo decision
tree morphological synthesizer.

In this study, it is assumed that the results from the prototype morphological analyzer for Afan Oromo created
by Abebe [2] won't have a big impact on how the input text is preprocessed before it is sent to the parser
together with the other NLP component system. Thus, this study will use manually processed words.

2.7.2 Part-of-Speech Tagger
A POS tagger, an NLP system that automatically assigns the potential parts-of-speech categories to a
given word in a sentence, is the other important and fundamental portion of a sentence analysis
system. Since a POST entails recognizing the syntactic categories of words in a text, one of the main
reasons for implementing POST into a given automatic sentence parser is to eliminate improbable
parses (false analyses of a sentence). That is, if we can correctly assign the POS tags, a given
statement, such as "Gaangeen tolaan kalessa bitee hara ganama du'e" or "The mule that Tolla bought
yesterday died this morning," will become clear.

„Gaangeen\N Tolaan\N kalessa\Adj bitee\V hara\Adv ganama\Adv dutee\V‟


Also, it is vital to have a mechanism in place to determine the proper syntactic category of a
word in a given location in a sentence, because it is quite possible that the same morphological
form will appear with a variety of syntactic categories.
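Chapter four relies on an HMM POS tagger. As a hedged sketch (the tiny training set below is hand-made for illustration, not the thesis corpus), NLTK's HiddenMarkovModelTrainer can be trained on sentences given as (word, tag) pairs in the style of the tagged example above:

from nltk.tag import hmm

# Toy training data: sentences as lists of (word, tag) pairs,
# in the style of the tagged example above.
train = [
    [("Gaangeen", "N"), ("Tolaan", "N"), ("kalessa", "Adj"),
     ("bitee", "V"), ("hara", "Adv"), ("ganama", "Adv"), ("dutee", "V")],
    [("Tolaan", "N"), ("kaleessa", "Adv"), ("dhufe", "V")],
]

trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train)
print(tagger.tag(["Tolaan", "dhufe"]))  # e.g. [('Tolaan', 'N'), ('dhufe', 'V')]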

CHAPTER THREE

THE STRUCTURE OF AFAAN OROMO

3.1 Introduction

The word classes, phrases, and sentences of Afan Oromo are discussed in this chapter, since
each of these units has an effect on the current topic. The chapter starts with a basic and
condensed explanation of the lexical categories of the language before delving into its
grammatical categories. In traditional grammar, the lexical categories are known as parts of
speech; however, this paper tends to use the term grammatical categories, since it is more
comprehensive. In order to distinguish between words and phrases, we employ lexical
categories to describe individual words and non-lexical or phrasal categories to describe
different kinds of phrases.
The lexical categories covered in this chapter include conjunctions, adverbs, adjectives,
verbs, and verb tenses. Although they are treated independently, pronouns are nevertheless
classified as nouns. This chapter also covers other Afan Oromo words, such as interjections
and numerals. The chapter opens with a quick overview of Afan Oromo's writing structure
and punctuation marks in order to aid in comprehending this portion.
The analyses and discussions in this chapter are based on information culled from Diriba [7],
Abebe [42], Baye [43] [44] [45], Askale [46], Tilahun [47], and Girma [10]. These sources
can be used to learn more about the topic.
3.2 Afan Oromo Alphabet and Writing System

The Afan Oromo writing system is a modification of the Latin writing system. Thus, the
language shares a lot of features with English writing, except for some modifications. The
study takes advantage of the Afan Oromo writing alphabet, commonly called ―qubee Afan
Oromo‖, which has been designed and used so far by the language experts in the area. The
writing system of the language is straightforward and is designed based on the Latin script;
the letters of the English language also exist in Afan Oromo, except for the way they are
written. Any literature pertaining to the language will provide a full description of the Afan
Oromo writing system; however, readers are advised to consult Diriba [7] and Girma [10]
for a more in-depth analysis.

3.3 Punctuation Marks in Afan Oromo

According to analysis of Afan Oromo literature, Afan Oromo punctuation marks all follow the
same pattern as English and other languages that utilise the Roman writing system. A period (.)
ends a statement, a question mark (?) ends an interrogative sentence, and an exclamation mark (!)
ends command and exclamatory sentences. A comma (,) separates lists of concepts, names,
things, etc. within a sentence, and a semicolon (;) separates clauses.
3.4 Word Categories in Afan Oromo

The grammatical categories of Afan Oromo have developed over time in terms of word
categories and other syntactic aspects, much like the grammatical categories of other languages,
such as English. As a result, the eight conventional grammatical categories of the language have
now been categorised and summarised into five groups. In conventional word categories (or
grammatical categories), Afan Oromo words are classified into the following eight: the noun,
pronoun, verb, adjective, adverb, adposition, conjunction, and interjection.
Contemporary syntacticians like Baye [43] [44] [45] divide Afan Oromo words into five
groups, placing pronouns and adjectives under the noun category, conjunctions under the
adposition (pre- and postposition) category, and adverbs under the verb category. Adjectives
and adverbs are classified under the same lexical category by some, such as Askale [46]. In
any case, there are five syntactic subcategories that operate as the heads of phrases. According
to the aforementioned classification, the language has five grammatical categories, each
headed by one of the five word categories. Interjections, which are "words" without syntactic
functions, are not taken into account as grammatical categories in this classification.
The classification system created by Baye [43] [44] [45] is used in the current investigation.
This is so that the parser to be created by this study does not become redundant due to the
repetition in the traditional classification scheme. Instead, a subcategorization system is
employed to make the grammar rules more condensed and expressive. As a prelude to the
tasks to be completed in chapters four and five, which are the core contributions of this
thesis, the following portions of this chapter delve into the grammatical categories of Afan
Oromo.
3.4.1 Categories of Nouns

Afan Oromo's definition of a noun is comparable to that of other languages like Amharic. Nouns
in the Afan Oromo language are used to name or identify specific instances of things, persons,
places, or concepts. The Afan Oromo noun categories used in this study are nouns, adjectives,
and pronouns. The positions held by words like "Fardaa" "Horse" are regarded as noun positions,
as in "Fardi marga dheeda" (the horse grazed grass). Moreover, two numbers, singular and
plural, are recognised in Afan Oromo nouns. A singular noun is marked by a zero morpheme,
while a plural noun is marked by a variety of forms. The instances that follow serve as
illustrations.
Singular (Gloss) Plural (Gloss) Plural marker

1. nama (man) namoota (men) {-(o)ota}
2. sangaa (ox) saangoota (oxen) {-(o)ota}
3. godina (zone) godinalee (zones) {-lee}
4. bineensa (hyena) bineeyyii (hyenas) {-yyii}
5. jabbii (calf) jabbiilee (calves) {-lee}
Nouns are sometimes used as adjectives, as in the following sentence:
Tulluun mana citaa ijaare. [Tullu built a thatched house]
Adjectives are classified into several categories. Adjectives in Afan Oromo can be either
derived or primitive, just like nouns. Adjectives follow the nouns they describe. For instance,
the adjective "guraacha," which means "black," appears after the noun it modifies in the
phrase "holaa guraacha." It is likewise true that not all words that come after nouns are
adjectives. For instance, the word "barumssa" "education" in "mana barumssa" "school" is
not an adjective but a noun. Moreover, nouns, but not adjectives, can be found in subject or
object positions.
According to the information above, Afan Oromo nouns are made plural by adding suffixes
like "-een," "-wan," "-(o)ota," "-yyii," and "-lee." Certain nouns allow the use of more than one
plural marker. For instance, the plural of "jabbii" might be either "jabbiilee" or "jabbiloota."
Nonetheless, most nouns favour one plural marker over the other (Abebe [42]). For instance,
the word "sagalee" can be made plural by adding the suffix "-(o)ota" or "-lee," but it prefers
the former over the latter.
In a variety of ways, adjectives are comparable to nouns. For instance, Baye [43] came to the
conclusion that adjectives and nouns both have number inflections. He went on to explain
that both can be made plural by adding the above-mentioned plural markers, particularly
"-(o)ota," except for derived adjectives, which are made plural by reduplication. For instance,
in Afan Oromo, the noun "nama" ("man") is made plural by dropping the final vowel and
adding the suffix "-oota," becoming "namoota" ("men"). Adjectives function similarly. For
instance, the adjective "guraacha" ("black") can be pluralized in the same way as nouns,
becoming "guraachota" ("the black ones").
In terms of gender inflection, the two subcategories also have a lot in common. It should be
noted, nevertheless, that adjectives cannot replace nouns in sentence structure. Since
pronouns can appear in the same location in an Afan Oromo phrase, they appear to be the
only words that can be substituted for nouns. Take the following sentences as examples,
where the last one is ungrammatical.

Tollaan barsiisaadha. ―Tolla is a teacher‖
Inni barsiisaadha. ―He is a teacher‖
*Guraacha barsiisaadha. ―Black is a teacher‖, which is ungrammatical.
There are also personal pronouns which are included under this category. See the following
table.
Table 3.1: personal pronouns

Person Accusative Nominative

1sg (a)na (me) an-i (I)

1pl Nu (Us ) nu-hi/nu-ti ( we)

2sg Si‘i (you) Ati (you)

2pl Isin (you) isin-φ (you)

3sgm Isa (him) in-ni (he)

3sgf Ishii (her) ishii-n (she)


3pl Isaan (they) isaa-φ (they)

3.4.2 Categories of Verbs
The discussion of this section is based on the information collected from Baye [43] [44] and
Askale [46] and Abebe [42]. These works consist of all the information required by the
current study. Verbs are forms which occur in clause final positions and belong to a distinct
category from that of nouns. For example in the following sentence,
Caalaan farda bite.―Chala bought a
horse‖ Leensaan dhufte. ―Lensa has
come‖ Tulluun dheeraadha. ―Tullu
is tal‖
The italicized parts are all verbs. Baye [45] divides verbs into anumber of sub categories
based on the type of constituents they are associated with. These are intransitive, transitive,
modals and auxiliaries verbs. The intransitive verbs are those verbs which do not take any
phrase as their complement. For example in the sentence ‗Abbabaan furdate‘ (Abebe got fat),
‗furdate‘ ―got fat‖ is an intransitive verb which has no complement. There is also what Sag
and Wasow [48] call strictly transitive verb. These types of verbs are those which take one
complement in Afan Oromo. Fore xample,
Inni [teechuma] NP cabse ―he broke the chair‖
Caalaan [mana] NP bite ―Chala bought a house‖
The NP in these two examples are complement to the verbs ‘cabse’broke and ‗bite’ bought.
For the detailed treatment of these sub categorizationssee Baye [45].
3.4.3 Categories of Adverbs

Afan Oromo adverbs are words which are used to modify verbs. Adverbs usually
precede the verbs they modify or describe. Example:

Tolaan kaleessa dhufe. “Tola came yesterday”

In this example, the adverb ‗kaleessa‘ ―yesterday‖ precedes the verb ‗dhufe‘ ―came‖ that it
modifies. However, it should be noted that a word that comes before a verb is not necessarily
an adverb. For instance, in ‗muka cabse‘ ―broke wood‖, the word ‗muka‘ ―wood‖ precedes
the verb ‗cabse‘ ―broke‖. In this case the word ‗muka‘ is a noun and is in turn modified by
the verb ‗cabse‘; hence, the verb functionally shares the feature of an adjective (modifier).
There are different types of adverbs: adverbs of time, place, manner, frequency, degree, etc.
In general, adverbs are treated as a subclass of verbs. The days of the week in the Afan
Oromo language may also be used either as nouns or as adverbs.

Adpositions in Afan Oromo
The term adposition refers to words which have meaning only when they are attached to or
used together with other words such as nouns, verbs, pronouns, and adjectives. Adpositions
are characterized by having no inflectional or derivational morphology and belong to the
closed system. Adpositions can appear as:

Simple adpositions that stand alone as separate words.
Examples: Toleraa walin ―with Tolera‖
Gara mana ―to the house‖

Simple adpositions prefixed or attached to other words (e.g. nouns and verbs).
Examples: harka-an ―by hand‖
Ummata-f ―to/for the public‖

Compound adpositions consisting of two parts, adpositional prefixes and postpositions put
after nouns. The postpositions can either be single adpositions that stand on their own or
adpositions not separated from a noun.
Example: sanduqa gubba-rra ―on top of the box‖, where ‗sanduqa‘ is the noun ―box‖ and
‗gubba-rra‘ carries the postpositions ―top‖ and ―over‖.


3.4.4 Conjunctions in Afan Oromo

Conjunctions in Afan Oromo are coordinating or subordinating. They coordinate words,
phrases, clauses, and sentences. A list of Afan Oromo coordinating conjunctions is found in
Hamiid [49], together with a detailed discussion of such coordinating and subordinating
conjunctions.
This paper adopts the current trend that conjunctions and adpositions appear in the same
category, the adposition grammatical category. One problem that arises from categorizing
adpositions and conjunctions into different categories is the problem of distinguishing
conjunctions from adpositions. The problem in distinguishing the two mainly arises from the
fact that the same words are mostly used as both adpositions and conjunctions. However, in
cases where it is possible to separate adpositions from conjunctions, they are parsed separately.
That is, when the parser is able to distinguish between the two subcategories, a distinct
category is given to each of them.

3.4.5 Numerals

These are words representing numbers. They can be cardinal or ordinal numbers. A list
of the Afan Oromo cardinal numbers is found in Hamiid [49]. In Afan Oromo, the ordinal
numbers are formed from the cardinal numbers by adding the suffix {-ffaa}.

Cardinal (Gloss) Ordinal (Gloss)
tokko (one) tokkooffaa (first)
sadi (three) sadaffaa (third)

Like in English, compound Afan Oromo numerals are written separately. The following are
examples to illustrate this.

Example: Dhiba lamma ―two hundred‖
Dhiba lammaffaa ―two hundredth‖
Dhiba lammaa-fi shan ―two hundred and five‖

In Afan Oromo, there are also numerals that indicate distribution. These numerals are
called distributive numerals.
Example: ‗sadi sadi‘ ―three three‖
There are also special numerals in Afan Oromo that correspond to the English ―half‖,
―quarter‖, etc. Examples of these include ‗walakkaa‘ ―half‖ and ‗siisoo‘ ―one third‖.
3.4.6 Interjections

Like English, Afan Oromo has many words or phrases used to express such emotions
as sudden surprise, pleasure, annoyance, and so on. Such Afan Oromo words are called
interjections. These interjections can stand alone by themselves outside a sentence or can
appear anywhere in a sentence.

Examples: ashuu! ―wonderful!‖
wayyoo ―my goodness‖
ani bade! ―my goodness‖

A long list of Afan Oromo interjections is found in Hamiid [49]. Based on the above lexical
categories, the next section explores the types of phrases found in Afan Oromo. The idea of
headedness discussed in this chapter indicates that the types of phrases found in the language
depend on its lexical categories. Moreover, Baye [45] and Sag and Wasow [48] divide the
types of phrases based on the lexical categories. This paper will therefore depend on this
classification for the purpose of the problem under consideration, while keeping the idea of
headedness in mind. Moreover, this paper depends entirely on Baye [45] and Sag and Wasow
[48] for the analysis of Afan Oromo phrasal categories.

3.5 Phrasal Categories

As indicated above, phrasal categories depend on the lexical categories of a language. They
use the lexical categories as the heads of their phrases; thus all phrasal categories are
hierarchical in nature. This hierarchical nature of categorization is very important, for it
enables us to classify feature structures in a more subtle way that allows intermediate-level
categories of various sorts. For example, verbs may be classified as intransitive or transitive,
and transitive verbs may further be subclassified as strictly transitive (those taking a direct
object and nothing else) or ditransitive. Thus the hierarchical system lets us talk about the
properties shared by two distinct types by associating a feature or a constraint with their
common supertype. But before talking about the phrases in Afan Oromo, let us define the
word phrase.
3.5.1 A Phrase
A phrase can be defined as a syntactic combination of a word with one or more other
words. A phrase is constrained or restricted by two things: its constituents and the lexical
categories like nouns, verbs, etc. Thus, we can determine the number of phrases by the
number of words. The question of how to check whether a structure is a phrase can be
answered using the following four guiding principles (Baye [45]):
1. The constituents of the phrase can be moved together to another place without
separation.
2. The phrase can be replaced by a pronoun (for noun phrases).
3. If one of the constituents of the phrase is missing, the meaning of the phrase will be
corrupted.
4. An insertion of another word in between the constituents of the phrase affects the
meaning.
Based on the types of lexical categories in Afan Oromo, there are five phrase types in the
language. They will be reviewed in the following subsections.
3.5.2 Noun phrases
A noun phrase is made of one noun and one or more other lexical categories, possibly including
another noun. For example, in the phrase ‗mana citaa‘ ―thatched house‖, there are two nouns
which make up the noun phrase: ‗mana‘ ―house‖ and ‗citaa‘ ―thatched‖.
Thus, noun phrases, and phrases in general, must meet the above criteria to be called a phrase.
In the sentence ‗Tolaan mana citaa qaba‘ ―Tola has owned a thatched house‖, ‗mana citaa‘
―thatched house‖ is a noun phrase. To check whether it is really a phrase or not, we can apply
the above criteria. The following arrangements show legal and illegal movements.

Qaba Tolaan mana citaa. (legal movement)
*Citaa Tolaan mana qaba. (illegal movement because of reasons 1 and 4 above)
*Mana Tolaan citaa qaba. (illegal because of rules 1 and 4)

The sentences with asterisks have illegal phrase constructions because of the above rules.
Thus we have checked that ―mana citaa‖ is a phrasal structure.

As indicated above, nouns can appear in a number of positions, such as the positions of the
three nouns in ‗Tolaan kitaaba Haawwiif bite‘ ―Tola bought Hawi a book‖. These same
positions allow sequences of a noun followed by an article, as in ‗Tolaan kitaabicha Haawwiif
kenne‘ ―Tola gave Hawi the book‖. Since the position of the article can also be filled by
demonstratives (‗kun‘, ‗sun‘, etc.), possessives (‗koo‘, ‗kee‘, ‗keessan‘, etc.), or quantifiers
(e.g. ‗xiqqoo‘), the more general term ―Determiner‖, abbreviated DET, is used.
Moreover, each constituent in a phrase has its own position and function. For example,
‗mana‘ and ‗citaa‘ are both constituents of the phrase ‗mana citaa‘. A phrase is usually headed
by one word. The head word is the core component of a phrase. Without a head a phrase cannot
be built; on the other hand, a head can stand alone by itself. A head word determines not only
the phrase type but also its lexical category: if the head is a noun, then the phrase is a noun
phrase, etc. (Sag and Wasow [48]; Levine and Green [50]).
The NP has many possible constituents in Afan Oromo. As indicated above, one of these
constituents is the determiner. Consider the following examples:

A) [Namni tokko]NP [saree ajjeese]VP ―A man killed a dog‖
B) [Namichi]NP [saree ajjeese]VP ―The man killed a dog‖

We can see that the NP consists of determiners of the article type: ‗tokko‘ ―a‖ in (A) and
‗-ichi‘ ―the‖ in (B). However, the position of these determiners in Afan Oromo is different
from English in that determiners come after the nouns they modify. Not only determiners but
all modifiers of nouns come after the noun in this language. An NP may also consist of two
nouns, like ‗mana citaa‘ ―thatched house‖. In Afan Oromo the order of words, especially the
head word and its modifiers and specifiers, is different from English. For example, an NP in
Afan Oromo may consist of one noun as head word and another noun plus an adjective as
modifiers and specifiers, as in the following example:

‗Mana citaa bareedaa‘ ―a beautiful thatched house‖

In addition to the above, an Afan Oromo NP may appear as accusative, nominative, genitive,
dative, or instrumental. This existence of nouns in different forms for different functions is
called case. It will be reviewed briefly in the following subsection; for a detailed treatment of
case in Afan Oromo, readers are referred to Abebe [42].
3.5.2.1 Accusative and Nominative Case
The accusative case form is the basic form of nouns and pronouns in Afan Oromo (Abebe [46];
Baye [43]; Gragg [51]). This means that nouns and pronouns in direct object position do not
have an overt case marker, as shown in the following sentences, while nominative nouns
(words in their subject position form) are inflected for case.

A. [Tulluu-n]NP [mana]NP ijaar-e ―Tulluu built (a) house.‖
B. Tulluu-n [farda adii]NP yaabbat-e ―Tulluu rode (a) white horse.‖
C. Tulluu-n [intala-tii]NP beellam-e ―Tulluu dated the girl.‖
D. Tulluu-n [intala-tii diimttuu]NP beellam-e ―Tulluu dated the white girl.‖
E. nam-ni of jaalat-a ―Man loves himself.‖
F. fard-i marga dheed-e ―A horse grazed grass.‖

In the above examples, the phrase ‗Tulluu-n‘ is an NP in the nominative case (subject case)
and ‗mana‘ [house] is an NP in the accusative case. Thus one can see that an NP as subject
has a case marker, i.e. ‗-n‘, ‗-ni‘, or ‗-i‘, while an NP in object position has no case marker
in a sentence.
The object NPs ‗mana‘ ―house‖ (head noun) in (A), ‗farda adii‘ ―white horse‖ (a head noun
and modifying adjective) in (B), ‗intala-tii‘ ―the girl‖ (head noun) in (C), and ‗intala-tii
diimttuu‘ ―the white girl‖ (a head noun along with a singulative marker and modifying
adjective) in (D) are all not overtly marked for accusative case.
Similarly, the personal pronouns ‗ana‘ ―me‖, ‗nu‘ ―us‖, ‗si‘ ―you‖ (second person singular),
‗isin‘ ―you‖ (2pl), ‗isa‘ ―him‖, ‗ishii‘ ―her‖, and ‗isaan‘ ―them‖ in object position do not
take an accusative case marker.
It can be noted from (A–F) in the above examples that the nominative case of Afan Oromo
nouns is marked by ‗-ni‘, ‗-i‘, ‗-n‘, and ‗φ‘ (zero). Abebe [46] generalized these subject
markers as in the following cases (a sketch of these rules in code is given after the list):
i. ‗-ni‘ occurs after a noun which ends with a short vowel that is dropped (e.g. in E
above),
ii. ‗-i‘ occurs after a noun that ends with a short vowel which is preceded by a
consonant cluster; the short vowel of the stem is again deleted (e.g. F),
iii. ‗-n‘ occurs after a noun that ends with a long vowel (e.g. A–D), and
iv. ‗φ‘ occurs after a noun that ends with a consonant.
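The following Python sketch applies rules (i)–(iv) literally, treating a doubled letter as a long vowel in the qubee orthography; it handles no phonological exceptions and is an illustration only:

VOWELS = set("aeiou")

def nominative(noun):
    """Attach the nominative marker following rules (i)-(iv) above.

    A sketch only: long vowels are taken to be doubled letters in the
    qubee orthography, and no phonological exceptions are handled."""
    if noun[-1] not in VOWELS:                      # (iv) ends in a consonant
        return noun                                 # zero marker
    if len(noun) > 1 and noun[-2] == noun[-1]:      # (iii) long (doubled) vowel
        return noun + "n"
    # Final short vowel: check for a consonant cluster before it.
    if (len(noun) > 2 and noun[-2] not in VOWELS
            and noun[-3] not in VOWELS):            # (ii) cluster + short vowel
        return noun[:-1] + "i"                      # drop the vowel, add -i
    return noun[:-1] + "ni"                         # (i) short vowel dropped, -ni

print(nominative("nama"))    # nam-ni  -> 'namni'
print(nominative("farda"))   # fard-i  -> 'fardi'
print(nominative("Tulluu"))  # Tulluu-n -> 'Tulluun'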
An adjective modifying a head noun in external argument position attaches suffixes similar
to those attached to the nouns in the above examples as subject markers.

A. nam-ni furdaa-n dhibee hin-danda‘-u ―A fat man cannot resist disease‖
B. [fard-i gurraach-i]NP collee-dha ―A black horse is smart‖
As can be observed in (A) and (B), the subject marker of the nominative case is copied onto
the modifying adjectives. The forms of the nominative marking suffixes on the adjectives are
phonologically conditioned in the same way as they are on nouns. A detailed treatment of
case in Afan Oromo is found in Abebe [46]. The same is true for personal pronouns in Afan
Oromo.
There are personal pronouns such as ‗nu‘ ―us‖ and ‗sii‘ ―you‖ which do not seem to fit the
rules above. ‗Sii‘ ―you‖ (accusative or object case) and ‗ati‘ ―you‖ (nominative) are different
forms from one another and may be considered suppletive. On the other hand, ‗nu‘ ―us‖ and
‗nuhi/nuti‘ ―we‖ share some common phonetic form that follows the rules summarized
above. In general, the NP constituents are:
1. A noun as head word
2. Specifiers like adjectives, adpositions, etc.
3. Quantifiers like numbers

Furthermore, the Afan Oromo simple noun phrase is head-final. A more detailed discussion
of the noun phrase will be presented in the subtopic ―Sentences in Afan Oromo‖. The last
point to make about the Afan Oromo sentence is that it has a discontinuous morpheme to
indicate the negative marker, for example, ‗Abbabaan hin dhufne‘ ―Abebe did not come‖.
3.5.3 Verb Phrases
It is important to introduce the word complement here before moving on to the discussion of verb
phrases (VP). In simple terms, a complement is a word or phrase that the head word takes as its
component to make it grammatical. Some verbs do not require complements, as in ―Abbabaan
dhufe‖ ―Abebe came‖; others require exactly one complement, as in ―Abbabaan muka cabse‖; and
yet others require two complements, as in ―Abbabaan konkoolataa naaf bite‖. Using this concept as
a guide, Afan Oromo verb phrases can be classified into three categories, illustrated by the
following examples: an intransitive verb, a strictly transitive verb, and a ditransitive verb,
respectively.

‗Abbabaan dhufe‘ ―Abebe came‖
‗Abbabaan teechuma cabse‘ ―Abebe broke a chair‖
‗Abbabaan konkoolataa naaf bite‘ ―Abebe bought me a car‖

All varieties of adverbs, adpositional phrases, and noun phrases can be found as components
of a VP. The subtopic ―Sentences in Afan Oromo‖ will provide a more thorough explanation.

Adjective Phrase
Adjectives serve as noun phrase specifiers. They typically follow the noun (typically the head
word) that they refer to, for instance ―mana guddaa‖ ―big house‖. Other categories can function
as adjectives, such as the noun ‗citaa‘ in ―mana citaa‖ (thatched house) or the verb ‗gubate‘ in
―mana gubate‖ (burnt house). The subtopic ―Sentences in Afan Oromo‖ contains still further
detail.
3.5.4 Adverb Phrase
Adverb phrases in Afan Oromo consist of one or more different lexical categories, including
the adverbs themselves, as modifiers and specifiers. It is possible, for instance, to have two
adverbs in an adverb phrase in Afan Oromo, as in the phrase ―kaleessa galgala‖, which means
―yesterday night‖. Adverbs and adverb phrases are employed to modify verbs, as was already
mentioned; they therefore come before verbs in a phrase. In general, an adverb phrase can be
made up of an adverb as the head word, a noun phrase, another adverb, etc. See Baye [45] for
a thorough explanation.
3.5.5 Adpositional Phrase
Adpositional phrases are combinations of nouns and adpositions. They usually specify verb
phrases. This phrasal category is sometimes called adpositional objects (Baye 1986).
A. ‗Inni gara mana deeme‘ ―He went to the house‖
B. ‗Lammaan kophee Caaltuu-f bite‘ ―Lemma bought shoes for Chaltu‖

Adpositions in an adpositional phrase can either stand independently, as in (A), or be affixed
to the adpositional object, as in (B) above.

3.6 Sentences
3.6.1 Afan Oromo Simple Sentences
A simple Afan Oromo sentence consists of a noun phrase NP, which is the subject, followed by
a verb phrase VP that comprises the predicate.
‗Namichi saree jaalata’ ―The man loves dog.‖
Baye [45] classifies simple sentences into four, namely: declarative sentences, interrogative
sentences, negative sentences and imperative sentences. Declarative sentences are used to
convey ideas and feelings that the speaker has about things, happenings, feelings, etc, that could
be physical, mental, real or imaginary.
Example: ‗Haawwin abokatoo taate.‘ ―Hawi became a lawyer‖
A sentence that questions about the subject, the complement, or the action the verb specifies, is
called an interrogative sentence.
Example: ‗Haawwin yoom dhuftee?‘ ―When did Hawi come?‖
Afan Oromo phrases are frequently constructed using interrogative pronouns like "eenyu" for
"who," "maal" for "what," "essa" for "where," "meeqa" for "how many/how much," and
"yoom" for "when." Then, other interrogative prepositional phrases can be created by
combining these interrogatives with prepositions, such as "eenyu irra" for "from whom,"
"maalif" for "why," etc.
Negative sentences simply contradict a declarative assertion that has been made.

Example: ‗Tolaan laqana nyaatee.‘ ―Tola ate his lunch‖
‗Tolaan laqana hin-nyaanee.‘ ―Tola did not eat his lunch‖

The verb in the two sentences in the above example is ‗nyaatee‘, ―ate‖. Negation is marked by
the discontinuous morpheme consisting of the prefix /hin-/ and the suffix /-nee/; the verb form
also indicates that the sentence's subject is masculine and third person singular. Simple
imperative sentences give instructions, and the second person pronoun they usually (but not
always) refer to is suggested by the verb's suffix.

In Afan Oromo, a sentence is made up of zero or more noun phrases and one or more verb phrases.
A sentence, on the other hand, is regarded as a unique category of phrase made up of a noun phrase
(NP) and a verb phrase (VP). Thus, while discussing sentences, we can use the term "phrases."

We need to take a moment to consider the standard nomenclature of parts of speech (POS)
before moving on. Some feature structures work better with certain POS (lexical categories)
than with others. For instance, in Afan Oromo, CASE is only appropriate for nouns, adjectives,
and pronouns, while the features PER(SON) and NUM(BER) are employed for nouns, verbs,
and determiners. Therefore, we must ensure that the appropriate feature corresponds to the
appropriate lexical categories in the parser to be constructed. Furthermore, it should be
mentioned that the lexical categories covered in this chapter are significant since they function
as the heads of phrases in an Afan Oromo sentence.

As has been mentioned thus far, the head indicates the role of a lexical category in Afan
Oromo. The head does the same work as the POS, but it adds further value by giving us a way
to take into consideration the features that each POS requires. Furthermore, HEAD permits us
to introduce features inside features, according to Sag and Wasow [48]. The ability to express
the relationship between a headed phrase and its head daughter simply will be of immediate
service. Let us define a tree structure, typical of practically all such analyses, in order to clarify
this terminology. Every sentence may be expressed with an upside-down tree diagram, as in
the sentence ―Namichi saree jaalata‖, which means ―The man loves a dog.‖

The lines joining nodes in a tree are referred to as branches. A node (in the example above, S)
is said to dominate another node (in the example above, NP or VP) when it is located above it.
The nodes at the base of the tree that do not dominate anything else are known as terminal (or
leaf) nodes. A node directly above another node is said to be its mother node and to
immediately dominate it; the node directly beneath is said to be its daughter. Two daughters
of the same mother are sisters.

Let us return to the concepts of headed phrase and head daughter while keeping in mind the above
straightforward definition of ―daughter‖. According to Afan Oromo grammar rules (and those of
any other language, for that matter), the mother and one of the daughters must have the same
(unified) values for both POS and features. The head daughter is the constituent on the right side
of the rule that has the unifying (matching) feature values.

According to Sag and Wasow's general principle [48], which applies to all trees constructed using
headed rules, the HEAD values of the mother and the head daughter in every headed phrase must be
unified (or have the same value). Furthermore, they claim that the rules governing phrase structure
are no different in kind from those governing word structure, other than the fact that grammar rules
rather than lexical entries govern them. So, based on grammar rules, we can state whether a sentence
(or a phrase) is grammatically correct (or well-formed). For instance, according to Sag and Wasow
[48], a sentence is well-formed simply in case each local subtree within it either:

I. matches a lexical entry, or

II. satisfies some grammar rules and principles.
The schematic representation of a sentence, as given by Sag and Wasow [48], is:

S → PHRASE [HEAD NOUN, AGR] PHRASE [HEAD VERB, AGR]

According to this structure, a sentence is made up of a noun-headed phrase and a verb-headed
phrase, united by their agreement (AGR) in the sentence; see Sag and Wasow [48]. It should
be mentioned that this is merely a broad illustration of sentences in general.
However, the goal of this study is to create a parser for basic statements that express real or
ideal, concrete or abstract thoughts, feelings, or behaviors. A period is used to end certain
types of phrases in Afan Oromo. Additionally, for the same reason, the paper will also
incorporate the feature of type agreement for all grammatical categories and past tense for
verbs in the development of the parser.
3.6.2 Afan Oromo Decision Sentences
In Afan Oromo, decision sentences are those that are made up of decision phrases such as a
main clause (MC) and a subordinate clause (SC). Each MC, in turn, consists of an NP, VP, or
AdjP. The pattern of combination may consist of a simple VP and decision phrases, a simple
NP and decision phrases, or both. Before examining how decision phrases combine to form
decision sentences, it is worthwhile to look at the structure of decision phrases.
A phrase that has a sentence embedded within it is known as a decision phrase. For example,
―Huccuun adii Tolaan kan bitee‖ ―The white cloth that Tola bought‖ is a complex MC/NP
with ―huccuu‖ ―cloth‖ as its head. This head was coupled with the complement ―adii‖
―white‖ to create the simple NP ―huccuu adii‖ ―a white cloth‖. The simple NP and the
dependent clause/subordinate phrase ―Tolaan kan bitee‖ ―that Tola bought‖ were then
combined to create the complex NP mentioned above. The clause is a subordinate clause and
cannot stand alone, since it contains the relativizer ―kan‖ ―that‖. A parse tree showing the
structure of this complex NP looks like this.

Figure 3.2: The Structure of a Complex Noun Phrase
As can be seen in the tree diagram, the relative clause ―Tolaan kan bitee‖, which means ―that
Tola bought‖, modifies the noun phrase ―Huccuun adii‖, which means ―white cloth‖. The
relativizer ‗kan‘ ―that‖ agrees with the head ―Huccuun adii‖: both are third person singular.
Therefore, ―Tolaan kan bitee‖ ―that Tola bought‖ is known as a relative phrase (Adugna [52]).
Similar to this, a sentence is a decision sentence if it includes more than one verb or clause. In
other words, a tree VP/SC contains an embedded sentence that serves as a complement or
modifier, much like a tree NP/MC does.

‗Toolaan akka ishee jaalatee Haawiin siritti bektee‘
―Hawi knew that Tola loves her‖

‗Toolaan akka ishee jaalatee‘ ―that Tola loves her‖ is the dependent clause in this sentence;
the subordinator ‗akka‘ ―that/as‖ is what makes the clause dependent. This clause serves as
an adverbial phrase of reason, since it explains why Hawi knew she was loved. The structure
of this VP can be shown with the following tree diagram:
(CS
  (MC (NP (N Toolaan))
      (PP (P akka))
      (VP (PR ishee) (V jaalatee)))
  (SC (NP (N Haawwiin))
      (VP (ADV siritti) (V bektee))))

The probability assigned to this parse is 0.00041472.

Figure 3.3: The Structure of a tree Verb Phrase
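The bracketed structure above can be loaded and displayed with NLTK's Tree class; this is merely a convenience for inspecting the parse, with the labels as used in this thesis:

from nltk import Tree

vp_tree = Tree.fromstring(
    "(CS (MC (NP (N Toolaan)) (PP (P akka)) (VP (PR ishee) (V jaalatee)))"
    " (SC (NP (N Haawwiin)) (VP (ADV siritti) (V bektee))))")

vp_tree.pretty_print()        # draws the tree in ASCII
print(vp_tree.leaves())       # the words of the sentence, in order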


In conclusion, simple sentences consist of simple noun phrases and simple verb phrases, but tree
sentences can have a tree NP and a simple VP, a simple NP and a tree VP, or a tree NP and a
tree VP. Due to time and resource limitations, only tree sentences that are made up of a simple
NP and a tree VP are taken into consideration in this study.

CHAPTER FOUR

DATA PREPARATION AND PCFG EXTRACTION


4.1 Introduction
The overall processing treats each sentence in the text as follows (see the sketch below):
1. Consider each sentence separately.
2. Determine the stem of each word in the sentence.
3. Obtain the tagged sentence stems from the HMM POS tagger.
4. Use the morphological synthesising function to update the tagger's category output.
5. Send the final string of tagged words to the parser.
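A sketch of this pipeline in Python is given below. All of the function parameters (stem, hmm_tag, synthesize, parse) are hypothetical placeholders for the components described in this chapter, not an actual implementation:

def process_text(sentences, stem, hmm_tag, synthesize, parse):
    """Run each sentence through the pipeline sketched above.

    stem, hmm_tag, synthesize and parse are placeholders for the
    morphological analyzer, the HMM POS tagger, the morphological
    synthesizing function and the chart parser, respectively."""
    results = []
    for sentence in sentences:              # consider each sentence separately
        stems = [stem(word) for word in sentence.split()]
        tagged = hmm_tag(stems)             # tagged sentence stems
        tagged = synthesize(tagged)         # update the tagger's category output
        results.append(parse(tagged))       # final tagged string to the parser
    return results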
The first section of the chapter describes the methodology used to construct the parser, prior
to analysing the sample text for the study.

A simple manual part-of-speech tagger and a morphological analyzer are covered in the third
and fourth sections, respectively. The subject of the fifth section is the extraction of
probabilistic context-free grammar rules from the tagged corpus.
4.2 The Design Approach of the Parser

The use of statistical methods has greatly accelerated development in a number of language
processing domains. Disambiguation, document database classification, speech recognition,
and grammar learning are a few areas where statistics has been helpful, and statistics has
proved an effective tool for examining the regularities of some linguistic phenomena [34].

Researchers in the field of natural language processing (NLP) are currently unable to extract
the statistical data that could help them understand the language, since little to no effort is
being made to make large Afan Oromo corpora accessible online. Because of this, this study
depends on materials that have been manually annotated and labelled.

As mentioned in the second chapter, the Afan Oromo tree sentence parsing system was
designed taking into account the PCFG bottom-up chart parsing technique. As was also
mentioned in chapter two, the majority of natural language structure may be described by
CFGs, and they are essential because they are sufficiently constrained to permit the creation
of efficient parsers for sentence analysis (Allen [19]). PCFGs, which define a language as a
probability distribution over strings and are used in many applications [40], are the
probabilistic equivalent of CFGs.

Because they can deal with frequent parsing concerns such as structural ambiguity, which
becomes more of a difficulty as a grammar becomes more complex, anticipating the parse
space, and ungrammatical phrase analysis, PCFGs tend to be more advantageous for sentence
parsing than CFGs.
If a sentence w1,n (where w1,n is a sequence of words w1, ..., wn) has T(n) possible parse trees ti (1 <= i <= T(n)), Yao and Lua [53] give the probability of a parse tree as the product of the probabilities of all rules used in that parse:

P(ti) = Π P(rj), over all rules rj used in ti ........................... Equation 1

and the probability of the sentence w1,n as the sum of the probabilities of all its possible parse trees:

P(w1,n) = Σ P(ti), for i = 1, ..., T(n) ................................. Equation 2

P(w1,n) indicates the potential grammaticality of w1,n in the language, while P(ti) indicates the likelihood of the i-th parse tree among all feasible parses. The larger P(ti) is, the more plausible the parse. It follows that, in order to obtain the best possible parse of a sentence, one needs only to find the tree ti that maximizes P(ti). P(w1,n) and P(ti) are two important quantities in syntax analysis: the first shows how well a sentence is justified by the PCFG, and the second how well an individual parse is. The P(ti) values are the major focus in finding the most probable parse structure.
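To illustrate, Equations 1 and 2 can be sketched in a few lines of Python; the representation of a parse as a list of rule probabilities and the sample values below are illustrative only, not the thesis implementation.

from math import prod

def tree_probability(rule_probs):
    # P(ti): product of the probabilities of all rules used in the parse
    return prod(rule_probs)

def sentence_probability(parses):
    # P(w1,n): sum of the probabilities of all possible parse trees
    return sum(tree_probability(p) for p in parses)

# two hypothetical parses of one sentence, given as lists of rule probabilities
parses = [[1.0, 0.5, 0.4, 0.6], [1.0, 0.5, 0.02, 0.6]]
print(sentence_probability(parses))         # P(w1,n) = 0.126
print(max(parses, key=tree_probability))    # the most probable parse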
4.3 The Sample Corpus
In order to conduct this study, 300 Afan Oromo simple and compound sentences selected from newspapers and widely used grammar books were used. Because no annotated text was available for grammar induction and training on the sample corpus, the manual morphological analysis of each word, the hand tagging, and the sentence parsing took a long time. The researcher admits that the sample size still seems somewhat small.

The sentences were taken from the books "Seerluga Afan Oromo" by Professor Baye Yimam [44] and "NATOO: Yaadrimee Caasluga Afan Oromo" by Berkesa Adugna [52], which were written as references for teaching the Afan Oromo language at the tertiary and secondary levels, respectively. Language consultants were consulted before the references were chosen. In addition, articles of the human rights law were chosen, since they are used as a model for natural language processing in the NLTK corpora.
The sentences were chosen so that, for tree sentences, they represent two or more phrase classes and embed one of the various clause types (such as a relative clause, reason clause, result clause, or time clause), and, for simple sentences, they contain one noun phrase and one verb phrase, all of which were covered in chapter three.
For the purposes of this study, the sentences were then manually annotated. Using the phrase structure rules of the Afan Oromo language, the researcher and an Afan Oromo lecturer from a Teachers Training College manually tagged and parsed the sentences; some of the example sentence parses were given in Baye [45]. The linguistic advisor for the thesis, as well as another authority on the Afan Oromo language, was then contacted for feedback and suggestions. The probability calculations for the terms in the sentences, the induction of grammar rules, and the probability assignment to the grammar rules were performed on 240 randomly selected sentences (approximately 80% of the sample corpus), which served as the training set. The remaining 60 sentences (20% of the corpus) served as the test set.

4.4 The Morphological Pre-processing

Even though it is possible to list every word that the system accepts in straightforward cases and small systems, doing so for a sentence parser that supports a large vocabulary would be quite difficult: not only are there numerous words, but each word can also be joined with related affixes to form new words. One way to deal with this is to preprocess the input sentence into a string of morphemes, as Allen [19] suggests. In the Afan Oromo language, a word may contain only one stem but multiple morphemes.
The bulk of words in Afan Oromo, which is an inflectional language, are made up of a stem and an affix (for instance, barat-a, "student", in the singular and barat-oota, "the students", in the plural). As was noted in chapter two, a morphological analyzer is one of the most important NLP components in the development of part-of-speech tagging and sentence parsing systems. Although Abeshu [42] made the only attempt so far to develop an Afan Oromo morphological analyzer, the researcher found it difficult to incorporate that prototype, since the requisite materials were not found in any of the archives. Therefore, efforts were made to include a manually annotated stem and affix (or affixes) specifically for the aim of this study.

4.5 The Part-of-Speech Tagger

As discussed in chapter two, Nedjo [54] developed a rudimentary POS tagger prototype for the Afan Oromo language using the Maximum Entropy Markov Model (MEMM); this work, in contrast, uses the Viterbi approach with an HMM. An HMM POS tagger was therefore incorporated solely for the purpose of this experiment, building on the manually annotated stems described above.
Word Code Table:

This table contains the words from the sample text, together with their matching word codes, listed in order for each word. The 3,029 words came from the 300 sentences of the sample corpus.
Category Code Table:
The word categories discovered using the universal part-of-speech tag set are kept in this table. These are the tag sets used:
ADJ, J    An adjective
AdjP      Adjectival Phrase
ADV       An adverb
ADVC      An adverb not separated from a conjunction
AdvP      Adverbial Phrase
AUX       Auxiliary verbs and all their other forms
DT        Complement
CONJ      A conjunction
ITJ       Interjections
JC        A conjunction not separated from an adjective
JNU       A numeral used as an adjective
JP        An adjective not separated from a preposition
JPN       A noun not separated from a preposition and that functions as an adjective
N         Noun in all forms
NC        A conjunction not separated from a noun
NP        A preposition not separated from a noun
NP        Noun Phrase
NUM       Number
NV        Verbal nouns
PP        Prepositional Phrase
PREP      A preposition
PUNCT     Punctuation
REL       Relative clause
VC        A verb prefixed or suffixed by a conjunction
VCO       Compound verbs
VP        Verb Phrase
X         to represent unknown words
This table contains the 28 word categories that were identified and assigned corresponding category codes: 27 categories were found in the corpus (the 27th of which, PUNCT, covers all punctuation), and one more category, X, is used for all uncertain terms. There have been a few minor changes, such as the use of "J" instead of "Adj", both of which refer to the class of adjectives, in the POS tagger and the parser respectively.
Lexical Probabilities Table:
The lexical probabilities table contains the likelihood of each word in the corpus having one of the supplied categories (tags), written P(wi|ci), p(word|category), or p(word|tag). In a pre-tagged corpus, p(Haallli|N) gives the (lexical) likelihood that the word Haallli, which is a common noun in the language, is used as a noun, whereas p(Haallli|ADV) gives the (lexical) likelihood that Haallli is used as an adverb. The lexical likelihood is calculated from the frequency of each word within a category:

P(wi|ci) = (number of times wi appears in category ci) / (total number of words with category ci) ............ Equation 3

Such probability values in the lexical table were recalculated using the new data entered in the word category code table.
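A minimal Python sketch of Equation 3, assuming the tagged corpus is available as a list of (word, tag) pairs; the pairs below are toy data, not the thesis corpus.

from collections import Counter

tagged = [("Haallli", "N"), ("siritti", "ADV"), ("Haallli", "ADV"), ("Toolaan", "N")]

word_tag_counts = Counter(tagged)                 # times word wi appears with tag ci
tag_counts = Counter(tag for _, tag in tagged)    # total words carrying tag ci

def lexical_prob(word, tag):
    # P(wi|ci) = count(wi tagged ci) / count(words tagged ci)
    return word_tag_counts[(word, tag)] / tag_counts[tag]

print(lexical_prob("Haallli", "N"))    # 0.5 with the toy data above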

Transition Probabilities Table:
The likelihood of a tag given one or more preceding tags is known as a transition probability and is written p(ci|ci-1, ..., ci-n+1). For instance, p(ci = Noun | ci-1 = Adjective) gives the likelihood that a noun follows an adjective.
Depending on the value of n (the maximum number of categories being examined), we can have bigram (n = 2), trigram (n = 3), or, generally, n-gram transition probabilities. The bigram model is denoted p(ci|ci-1) (n = 2), and the trigram model is denoted p(ci|ci-1, ci-2) (n = 3). These models assume that the chance of a specific category occurring depends only on the one or more categories that come right before it.
Given a database of texts tagged with parts of speech, the bigram (transitional) probabilities can be estimated simply by counting the number of times each pair of categories occurs, compared to the individual category counts. Mathematically this is written as:

P(ci = B | ci-1 = A) = (number of times A and B occur together in the corpus) / (number of times A occurs in the corpus) ............ Equation 4

where A and B are part-of-speech codes. Such probability values in the transitional probability table were recalculated using the categorical sequence of the newly entered data in the word code table.
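Equation 4 can be sketched the same way, assuming the tag sequence of the hand-tagged corpus is available as a flat list; the sequence below is illustrative.

from collections import Counter

tags = ["N", "ADJ", "N", "V", "N", "V"]

pair_counts = Counter(zip(tags, tags[1:]))   # times the pair (c_{i-1}, c_i) occurs
left_counts = Counter(tags[:-1])             # times c_{i-1} occurs as a left context

def transition_prob(prev_tag, tag):
    # P(ci|c_{i-1}) = count(c_{i-1}, ci) / count(c_{i-1})
    return pair_counts[(prev_tag, tag)] / left_counts[prev_tag]

print(transition_prob("N", "V"))   # 2/3 with the toy sequence above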
4.6 Extraction of a Probabilistic Context-Free Grammar
In the process of extracting the PCFGs, the sentences that are used as a training set were first
manually hand parsed and represented in the following manner.
'Ani huccuu adii Tolaan bitee kaleesa argee.'

(CS
  (MC (NP (N Ani)) (VP (NP (N huccuu) (Adj adii)) (NP (N Tolaan) (V bitee))))
  (SC (VP (Adv kaleesa) (V argee))))
These manual parses of the training set led to the development of the CFG rules. Then, in order to gather statistical information about the observed grammatical rules, the number of instances of each rule in the manually parsed training sentences was counted (which was found to be the simplest way for this purpose). The probability of adopting each rule was then determined using the statistical data (see also Allen [19]). Finally, each retrieved rule was assigned a probability value using the formula below:

P(Rj|C) = (number of times rule Rj occurs in the PCFG table) / (total occurrences of category C on the LHS of rules in the PCFG table) ............ Equation 5
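A small Python sketch of Equation 5, assuming the rules extracted from the hand-parsed training trees are listed as (LHS, RHS1, RHS2) triples; the triples below are illustrative.

from collections import Counter

rules = [("NP", "N", "E"), ("NP", "N", "ADJ"), ("NP", "N", "E"), ("VP", "ADV", "V")]

rule_counts = Counter(rules)                      # occurrences of each rule Rj
lhs_counts = Counter(lhs for lhs, _, _ in rules)  # occurrences of C on the LHS

def rule_prob(rule):
    # P(Rj|C) = count(Rj) / count(rules with category C on the LHS)
    return rule_counts[rule] / lhs_counts[rule[0]]

print(rule_prob(("NP", "N", "E")))   # 2/3 with the toy rules above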

4.7 Chomsky Normal Form (CNF) Representation

For simplicity, the PCFG is represented in Chomsky Normal Form (CNF). Once the probabilistic context-free grammar had been extracted, the next step was therefore to convert it into CNF. This constraint does not really limit expressiveness, because any CFG rule can easily be rewritten in CNF. The conversion followed the phrase-building rules of the language; in practice it was less a matter of converting the grammar than of displaying it in its original form, since all clauses considered were 4-word phrases and the majority of the rules were thus already in CNF.
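The binarization step behind the CNF restriction can be sketched as follows; the helper introduces intermediate symbols (R1, R2, ...) in the spirit of the R1 rule in Table 4.1, though it is an illustrative sketch rather than the procedure actually used.

from itertools import count

_aux = count(1)

def binarize(lhs, rhs):
    # rewrite lhs -> rhs (a list of symbols) as equivalent binary CNF rules
    rules = []
    while len(rhs) > 2:
        aux = f"R{next(_aux)}"
        rules.append((aux, rhs[0], rhs[1]))   # new intermediate rule
        rhs = [aux] + rhs[2:]
    rules.append((lhs, *rhs))
    return rules

print(binarize("NP", ["N", "ADJ", "ADJ"]))
# [('R1', 'N', 'ADJ'), ('NP', 'R1', 'ADJ')], cf. NP -> R1 ADJ and R1 -> N ADJ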

A sample of the extracted CNF rules and their corresponding probability values is shown in Table 4.1 below.

LHS RHS1 RHS2 Probability


CS MC SC 1
MC NP VP 1
SC NP VP 1
NP R1 ADJ 0.006
R1 N ADJ 1
NP NP ADP 0.005
NP N E 0.647
NP N N 0.004
NP N P 0.026
VP ADV V 0.5
VP V E 0.5

CHAPTER FIVE
PARSING ALGORITHM AND EXPERIMENTATION
5.1 Introduction
This chapter covers the tests conducted, the parsing algorithms used to develop the prototype Afan Oromo tree sentence parser, and the analysis and findings from the investigations. It also describes the parser's design, which includes the input/output interface, the probabilistic rule base, and the chart parsing module.
The next section discusses the Inside-Outside algorithm, which served as the foundation for this parser. The chapter's third section introduces and discusses the implementation of PCFG parsing with the Inside-Outside algorithm. The design of the parser is covered in the fourth section, and a report on the experiments, the results, and the solutions is given in the fifth.
5.2 The Parsing Algorithm
For computational models of natural languages, ambiguity resolution is a crucial problem [19]. The parse space of a sentence is the space of its feasible syntactic interpretations. Using a chart parsing algorithm, which calculates each constituent's probability from the probabilities of its sub-constituents and the rules used, it is possible to assess the likelihood of the several parse trees of a given sentence.
The parser developed for this study is based on this method and uses a modified parse chart (given by Yao and Lua [53]) to assist in parsing. The formula used to choose the best (or most likely) parse structure of a sentence is provided in the sections that follow.

5.3 PCFG Parsing

In this research, an Inside-Outside algorithm-based parser was used to implement PCFG parsing of Afan Oromo tree sentences. The settings, parse method, and parse chart are introduced in this section.
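As a concrete illustration, the grammar fragment shown in Appendix B can be run through NLTK's ViterbiParser, which performs exactly this kind of probabilistic chart parsing; the snippet assumes NLTK is installed, and the grammar is transcribed from the appendix.

import nltk

grammar = nltk.PCFG.fromstring("""
    CS -> MC SC [1.0]
    MC -> NP PP VP [1.0]
    SC -> NP VP [1.0]
    VP -> P PR V [0.4]
    VP -> PR V [0.4]
    VP -> ADV V [0.2]
    NP -> N [1.0]
    PP -> P [1.0]
    V -> 'jaalatee' [0.4]
    V -> 'demee' [0.12]
    V -> 'bektee' [0.48]
    N -> 'Toolaan' [0.45]
    N -> 'Haawwiin' [0.4]
    N -> 'dinagdee' [0.15]
    ADV -> ADV ADV [0.2]
    ADV -> ADV [0.3]
    ADV -> 'siritti' [0.5]
    PR -> PR [0.4]
    PR -> 'ishee' [0.6]
    P -> P [0.5]
    P -> 'akka' [0.5]
""")
parser = nltk.ViterbiParser(grammar)
sent = "Toolaan akka ishee jaalatee Haawwiin siritti bektee".split()
for tree in parser.parse(sent):
    print(tree.prob())   # 0.00041472, as in Appendix B
    tree.pretty_print()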
5.3.1 Parse Tree
The initial stage in parsing a sentence is identifying a plausible legal structure (or structures), which often results in a tree, also referred to as a parse tree. Because the PCFG rules used in this study are restricted to CNF, each potential parse tree of a sentence is a binary tree. Level 1 of a parse tree consists of a collection of terminal nodes, or words, whereas level 2 consists of a collection of non-terminal nodes, namely the POS tags of the associated words. In CNF, each node in level 2 has exactly one descendant in level 1, denoted by a rule Nj -> wj.
A non-terminal node Ni is supported by the grammar if a rule Ni -> wj exists or if a rule Ni -> Nj Nk exists.
Assume that an n-word sentence w1,n has one POS tag per word. Given that each non-terminal node of a tree must be supported by a grammatical rule, and letting T(n) denote the maximum number of structural trees that w1,n may have, T(n) can be determined as follows:

T(n) = 1, for n = 1 .................................... Equation 9

T(n) = Σ T(k) * T(n - k), for k = 1 to n - 1, when n > 1 .................................... Equation 10

Because the parse tree space includes trees with nodes that are not supported by grammar rules, the total number of valid parse trees of a sentence is less than T(n); note that the value of T(n) increases exponentially as the number of words in a sentence increases.
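A direct transcription of Equations 9 and 10 into Python shows this growth; the values produced are the Catalan numbers implied by the recurrence.

from functools import lru_cache

@lru_cache(maxsize=None)
def T(n):
    # Equations 9 and 10: number of binary tree shapes over n leaves
    if n == 1:
        return 1
    return sum(T(k) * T(n - k) for k in range(1, n))

print([T(n) for n in range(1, 9)])   # [1, 1, 2, 5, 14, 42, 132, 429]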

5.3.2 Parse Chart
As mentioned previously, this study used the Yao and Lua [53] parse chart to implement the Inside-Outside method; Equation 9 served as the basis for this parse chart. Since the matrix is symmetrical, only its top-right half (cf. Equation 10 in section 5.3.1) is used. The chart's size is determined by the number of words in the text being analyzed: for an n-word text the chart is an n-by-n matrix.
Note that the analysis of an Afan Oromo tree sentence begins before the input sentence enters the parse chart. As a result, a sentence with five words is assigned six POS tags, a sentence with six words is assigned seven, a sentence with seven words is assigned eight, and so on. This is due solely to the heuristics employed to handle verbal affix movements during the structural representation of Afan Oromo tree sentences. The sentence "Toolaan ishee akka jaalatee, Haawwiin siritti bektee" ("Hawi knew well that Tola truly loved her") serves as an illustration of this.

Figure 5.1:-Tree structure diagram of the tree sentence

As seen in the figure above, the highlighted verbal affix/adposition akka ("that") detaches from jaalatee ("that he loved her") and assumes the position before the preposition. All of the Afan Oromo tree sentences considered in this study were created by embedding clauses that contain such relativizers. These clauses were subsequently analyzed into verbs and verbal affixes (referred to as complements, COMP), which were designated by ADP. In the diagram, an element N(i, j) (i, j ∈ [1, 7]) signifies a non-terminal node; if each N(i, j) is supported by a grammatical rule, the element N(1, 7) is the start symbol. A non-terminal N(i, i) (i ∈ [1, 8]) on the diagonal, supported by a rule N(i, i) -> wi, marks the position of the word wi. An example of an 8-level chart that can parse 7-word sentences is shown in Figure 5.2 below.

Figure 5.2: An 8-level-chart


Since each non-terminal node in this study's example sentence w1,n is supported by a
grammar rule, there are n+1 terminal nodes in level 1 (the diagonal), n non-terminal nodes in
level 2, n-1 non-terminal nodes in level 3,..., and 1 non-terminal node (the starting symbol) in
level n.
5.4 The Design of the Parser
5.4.1 Preprocessing the Parser's Input
A sentence is input from a file for parsing. For example, take the sentence:

'Jabbilee adii Haawwiin kaleesa bittee argee#'

"I saw the white calf that Hawi bought yesterday."

Here the # symbol indicates the end of the sentence. When this sentence passes through the morphological analysis process, each word is analyzed into a stem and affix(es), and the output takes the following format, which the tagger uses as an input:

'Jabbi adii Haawwii kaleesa bittee argee'

Tagging each stem with the corresponding POS produces the following format:

Jabbi\N adii\Adj Haawwii\N kaleesa\Adv bittee\V argee\V

Each tagged stem will then be re-synthesized with its corresponding affix, using a hyphen (-) as a separator, as seen below.

Jabbi-lee\N adii\Adj Haawwii-n\N kaleesa\Adv bittee\V argee\V
Each word then undergoes a post-tagger morphological process in order to determine whether the category of a tagged stem changes when it is combined with its affix(es). This is a crucial milestone in the development of the parser, not just for the reasons described above but also because it simplifies the challenging task of parsing complex Afan Oromo phrases. In other words, at this point it is determined which Afan Oromo verbs, often referred to as relativizers, take affixes (such as -een, -wan, -(o)ota, -yyii, and -lee). Finally, the input sentence takes on the following format and is submitted to the sentence processing module:

Jabbi-lee\N adii\Adj Haawwii-n\N kaleesa\Adv bittee\V argee\V #\PUNCT
The input preprocessor module algorithm is provided below.

For each sentence in the document:
    Take one sentence at a time.
    For each word in the sentence:
        Identify the word's stem.
    Call the HMM POS tagger.
    Get the tagged sentence stems.
    Use the Morphological Synthesising Function to update the category output of the tagger.
    Send the final string of tagged words to the parser.

Figure 5.3: An algorithm for preprocessing an input sentence to the parser
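The same pipeline can be sketched in Python; stem(), hmm_tag() and synthesize() below are hypothetical stand-ins for the morphological analyser, the HMM POS tagger and the morphological synthesising function, which are not reproduced here.

def preprocess(sentence, stem, hmm_tag, synthesize):
    # split each word into (stem, affixes) using the morphological analyser
    analysed = [stem(word) for word in sentence.rstrip("#").split()]
    # tag the bare stems with the HMM POS tagger: [(stem, tag), ...]
    tagged = hmm_tag([s for s, _ in analysed])
    # re-attach each affix; the synthesiser may update the category
    words = [synthesize(stm, affixes, tag)
             for (stm, affixes), (_, tag) in zip(analysed, tagged)]
    # final parser input: word\TAG word\TAG ... #\PUNCT
    return " ".join(f"{w}\\{t}" for w, t in words) + " #\\PUNCT"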

5.4.2 The Input & Output Interface


A sample interface was created for the sentence parser application programme. The Afan Oromo sentence parser window appears when the application programme for the Afan Oromo decision sentence parser runs. This is the primary interface: it appears when the application programme is first launched and remains visible until the programme is closed. Just below the menu bar, this window features four main buttons. Each of these buttons is briefly described below.
The POS Tagger button
When this button is clicked, the input words' tagged stems are shown.
The Tagged Sentence button
By separating and tagging the complement separately from the other words in the input
sentence, this button shows the final labelled output sentence.

The Parse button
Using this button, a user can parse each sentence in a saved file individually. The produced prototype then functions as follows.
The morphological analysis component accepts the input sentence and outputs "Mootumaan, Biyyattin sadarkaa guddaa dinagdee irratti galmeesiftee ibsee" ("The government described the biggest economic victory the nation has had").
The tagger then takes the above string of stems as input, assigns the proper POS tag to each stem, and generates the following format for the parser to use as input:

Mootumaan\NOUN Biyyattin\NOUN sadarkaa\ADV guddaa\ADJ dinagdee\NOUN irratti\ADP galmeesiftee\VERB ibsee\VERB #\PUNCT

However, before the above output of the tagger is sent to the parser, each tagged stem is synthesised with its affixes (if any), and the category of the resulting word is looked up in a table that updates the categories of inflected stems. After going through this process, the complex sentence above takes on the following form:

Mootumaa-n\NOUN Biyya-ttin\NOUN sadarkaa\ADV guddaa\ADJ dinagdee\NOUN irratti\ADP galmeesiftee\VERB ibsee\VERB #\PUNCT

The words of the input sentence that are affected by these procedures are highlighted in the example above. This is the final product of the POS tagger with the help of the morphological analysis. The parser extracts each word and each POS tag from the tagged sentence and stores them in one-dimensional array variables. The parser's output includes the parse result, the grammatical rules employed, and the likelihood of the chosen parse structure. The output for the sample sentence considered so far includes:

The probability of the parse structure: 0.000034129851158584
The parse result:
(CS
  (SC (NP (N Mootumaa-n) (,)))
  (MC-VP
    (S
      (NP (N Biyya-ttin) (ADJP (ADV sadarkaa) (ADJ guddaa)))
      (VP (NP (N dinagdee) (ADP irratti)) (V galmeesiftee)))
    (VP (V ibsee))))

The rules applied in parsing the above sentence include:

CS -> SC MC-VP
SC -> NP
NP -> N
MC-VP -> S VP
S -> NP VP
NP -> N ADJP
ADJP -> ADV ADJ
VP -> NP V
NP -> N ADP
VP -> V E
The rules used in parsing are grammatical rules; rules involving terminal nodes (for example, N -> Mootumaa-n) are not displayed, since such lexical rules are not included in the PCFG table. This is because the parser processes sentences that have already been pre-tagged; lexical information is handled before parsing.

The Probabilistic Rule Base

The PCFG rules were induced from the manually parsed text using Python and transformed into CNF. The CNF rules extracted from the sample corpus are stored in the same database as the lexical probability table and the other tables, in the format listed below.

Table 5.4: The PCFG Representation

In the table above, E is a bare (empty) production. The left-hand sides of the rules are kept in the field LHS, while the first and second right-hand-side symbols are kept in RHS1 and RHS2, respectively. The probability field holds the probability value associated with each grammar rule (see Appendix 5 for the complete list of PCFG rules derived from the corpus). The Word Code and Lexical Probabilities tables, which were created by the tagger and kept in the same database as the PCFG table, provide the word information needed for parsing.
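Reading the stored table back into memory might look as follows; the file name and column layout are assumptions made for illustration, since the thesis keeps the rules in a database table.

import csv

def load_pcfg(path="pcfg_cnf.csv"):
    # rows with columns LHS, RHS1, RHS2, Probability, as in Table 4.1
    with open(path, newline="", encoding="utf-8") as f:
        return {(row["LHS"], row["RHS1"], row["RHS2"]): float(row["Probability"])
                for row in csv.DictReader(f)}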

The chart parsing module

The chart parsing module implements the PCFG Inside-Outside algorithm, supported by a parse chart created by Yao and Lua [53] on the basis of equations 7 and 8. Equation 11 is used to calculate the value of each non-terminal node N(i, j) in the parse chart, where i ≠ j and i, j ∈ [1, n].
The Inside-Outside algorithm used in this study to implement PCFG parsing is as follows.

Figure 5.5: The Parse Chart Procedure to Implement the Inside-Outside Algorithm

The inside probability of every parse in the parse tree space was determined using this procedure, which was coded during the prototype's construction. The algorithm calculates the probabilities using Equations 9 and 10, with an added step that constructs the parse from the bottom up. During parsing, the categories of the words in a sentence are fed one by one into the diagonal (i.e., the first level) of the chart. The probability of each parse is then determined while the parse tree space is constructed from the bottom to the top.
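This bottom-up filling of the chart can be sketched as a compact CKY-style procedure over a CNF rule table; the sketch below follows the general inside computation described above rather than the exact thesis code, and the toy grammar is illustrative.

def best_parse_probs(tags, grammar):
    # grammar: {(LHS, RHS1, RHS2): probability}; tags: POS tags of the sentence
    n = len(tags)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tag in enumerate(tags):            # the diagonal: one tag per word
        chart[i][i + 1][tag] = 1.0
    for span in range(2, n + 1):              # wider constituents, bottom up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):         # every split point
                for (lhs, r1, r2), p in grammar.items():
                    if r1 in chart[i][k] and r2 in chart[k][j]:
                        prob = p * chart[i][k][r1] * chart[k][j][r2]
                        if prob > chart[i][j].get(lhs, 0.0):
                            chart[i][j][lhs] = prob   # keep the best analysis
    return chart[0][n]   # best probability for each symbol spanning the sentence

toy = {("S", "NP", "VP"): 1.0, ("NP", "DT", "N"): 0.5, ("VP", "V", "NP"): 0.7}
print(best_parse_probs(["DT", "N", "V", "DT", "N"], toy))   # {'S': 0.175}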

The parsing of each sentence proceeds as follows:

1. Extract the words and tags from the output of the tagger.
2. Insert the POS tags into the diagonal of the chart matrix.
3. To determine the probability of each node in the chart matrix, call the parse-chart operation that uses the Inside-Outside algorithm.
4. From the likelihood array, select the value with the highest likelihood.
5. Look up the chart input table and find the rules that provided the highest likelihood.
Figure 5.6: The Overall Implementation of the Parsing Algorithm
The parser is represented diagrammatically in the figure below.

Figure 5.7: Diagrammatic Representation of the Parser

5.5 The Experiment

The sample text described in chapter 4 was used for the experiment. The researcher and an Afan Oromo language instructor from Metu University's department of Afan Oromo manually analyzed each word in the corpus and hand-tagged and hand-parsed each sentence. The linguistic advisor and other language experts at Metu University provided comments and suggestions. The 300 sample sentences were chosen by a random process, taking into account the distribution of the various phrase structures and the way the sentences obtain their tree features (i.e., being made up of a simple NP and a tree VP).
As was made clear from the outset, the primary goal of this study was to parse Afan Oromo sentences using the PCFG bottom-up chart parsing approach and the Inside-Outside algorithm, which receives its POS inputs from a POS tagger. The experiment therefore started by determining whether the POS tagger showed any improvement in accuracy.
The tagger's accuracy was 76.3% when it was first trained and then tested on the same data. The mistakes found were mostly human-caused (made during the manual morphological analysis and tagging), together with mistakes made when creating the lexical and transitional probability tables.

After reviewing the manually completed tasks, the lexical probability calculations, and the transitional probability calculations, and making modifications as needed, the tagger achieved 89.7% accuracy on the training set. This was greater than the tagger's original training set score of 84%.
On the Test Set, the tagger's accuracy increased in this study from 66.6% to 80%. One of the detected sources of inaccuracy was the occasional conflict between the category proposed by the morphological analysis and that proposed by the bi-gram for a particular word; due to time constraints, this source of error was left unresolved. Additionally, the small size of the sample corpus may have contributed to inaccuracies at this stage of testing the tagger. The main drivers of the improvement in the POS tagging module (and in the input preprocessing module in general) are generally considered to be the morphological preprocessing analysis applied before the HMM tagging, the category-checking mechanism applied after the tagged stems were synthesized with their affixes, the statistical category-guessing mechanism that relies fully on the transitional probabilities, and the slight increase in the corpus size.

5.5.1 Experiment on the Training Set

The 240 sentences that were randomly selected from the sample corpus and saved as the training set were used for all the manually performed tasks covered in chapter 4, including the morphological analysis, tagging, and parsing, as well as the probability calculations for the words in the sentences, the induction of grammar rules, and the assignment of probabilities to the grammar rules. The initial experiment was run on these 240 sentences.
A significant cause of inaccuracy was the availability of two comparatively probable rules with the same RHS: S -> NP VP and VP -> NP VP, with probabilities of 1.0 and 0.524, respectively. The former rule dominates, since it has the higher probability, so S frequently showed up at nodes where VP should have appeared.
After that, the parser was retrained and the test was run once more on the Training Set. The final results obtained both before and after these changes are presented in sub-section 5.5.3.1.

5.5.2 Experiment on the Test Set
The second experiment was conducted on the remaining 60 sentences of the original corpus, which were kept as the Test Set. The findings for this unseen portion of the corpus are presented in section 5.5.3.2.
5.5.3 Results of the Experiment
The findings of the experiments conducted on the Training Set and the Test Set are discussed in the sections that follow, including the training set results both with and without adjustments to the lexicon, grammar, and algorithm.

5.5.3.1 Result on the Training Set

The following table displays the outcome of training and testing the parser on the same dataset, the Training Set.

Data set        No. of sentences    No. of erroneously parsed sentences    Accuracy
Training Set    240                 80                                     66.6%

Table 5.8: Parsing result on the Training Set before error correction

Given that the parser was trained and evaluated on the same data (the Training Set), the accuracy attained should have been higher than 66.6%. As already mentioned, the primary cause of errors at this stage was the conflict between the rules S -> NP VP and VP -> NP VP, which was handled by treating the two rules as ones that employ distinct RHSs. Human error also contributed to the experiment's low accuracy, making it lower than expected. The final accuracy obtained on this set, after the human errors were discovered and corrected, is shown in the table below.

Data set        No. of sentences    No. of erroneously parsed sentences    Accuracy
Training Set    240                 48                                     80.0%

Table 5.9: Parsing result on the Training Set after error correction

5.5.3.2 Result on the Test Set

The test on the unseen part of the corpus provided the result shown in the following table.

Data set    No. of sentences    No. of erroneously parsed sentences    Accuracy
Test Set    60                  17                                     71.6%

Table 5.10: Parsing result on the Test Set


In the PCFG rules table that was generated from the grammar sentences, the likelihood of VP -> V E (0.332) is larger than that of VP -> N V (0.007), which is completely the opposite situation; thus practically every sentence considered in this test was incorrectly parsed. Using the earlier simple sentence parser for Afan Oromo text as a starting point, a third test was then created by simply integrating the manually annotated morphological analysis module and the HMM POS tagger. Ten sentences were used for this test, some of which the simple sentence parser had incorrectly parsed. In this test, nine out of the ten basic sentences that were left unparsed in the earlier trials were effectively parsed.
5.5.4 Solutions to Identified Problems
The numerous testing steps of this study ran into mistakes due to human error, including incorrect manual morphological analysis, tagging, and parsing of the data from the sample corpus. Through an iterative process, these flaws, as well as others found in the database, such as incorrectly calculated probabilities for the lexical, transitional, and PCFG rules, were fixed.
Furthermore, errors were made when the tagger incorrectly tagged, or left untagged, words whose stems (or any of their inflectional forms) were not in the sample corpus used. The tagger typically tags such words as UNC, so a module used in earlier studies to infer the categories of untagged words was applied to deal with them. As a result, in this study, new words or stems were disambiguated using bigram lexical co-occurrence probabilities, which somewhat resolved the issue.
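A sketch of that statistical guess, assuming only the bigram transition probabilities of chapter 4 are available; transition_prob() is the hypothetical estimator sketched there, not the thesis module.

def guess_tag(prev_tag, tagset, transition_prob):
    # pick the category most likely to follow the previous word's tag
    return max(tagset, key=lambda tag: transition_prob(prev_tag, tag))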
The accuracy report on the Test Set, however, was quite low, because there was no definitive method of managing terms that had been incorrectly tagged. The dominance of S -> NP VP over another rule with the same RHS but comparatively lower probability, VP -> NP VP, was the main cause of mis-parsing. This issue was resolved by placing an apostrophe on the VP on the RHS of the second rule, designating SC and MC for the subordinate and main clauses, and thereby treating the two conflicting rules as ones with different RHSs. Additional sources of error during the tests were mistakes in the extraction of rules and their probabilities and the inadequacy of the rules in the rule library (the PCFG table), frequently referred to as under-generation. Both of these issues were reduced by an iterative method.

CHAPTER SIX

6. CONCLUSION AND RECOMMENDATION

6.1 Conclusion
In order to address a somewhat more complex problem in the domain of developing a decision tree sentence parser for Afan Oromo, this thesis has attempted to present a method of combining the concepts and results of previously examined Afan Oromo NLP systems in a different manner. To achieve this, a POS tagger and a simple Afan Oromo phrase parser were used as the basis. Probabilistic Context-Free Grammar (PCFG) parsing was implemented using the Inside-Outside method and a chart parse module first introduced by Yao and Lua [53] to parse Chinese sentences. Efforts were also made to construct a prototype.
While constituents are frequently defined deductively in terms of the relationships that exist between their parts, parsing is the process of identifying analyses of sentences, that is, consistent sets of relationships between constituents that are determined to hold in a particular sentence. Parsing sentences of this type can be done in one of two ways: manually or automatically. The manual approach is time-consuming, costly, and error-prone, and the problem clearly only becomes worse as the amount of information increases. The second method, automatic sentence parsing, therefore eliminates such complexities and is crucial in natural language understanding systems.

The ultimate objective of this study was to create a decision tree sentence parser for Afan Oromo. To that end, key parsing concepts and terms were reviewed, and areas where the results of a sentence parser are relevant were indicated. In addition, the rule-based and stochastic techniques, the two main approaches to NLP in general and to tree sentence parsing in particular, as well as alternative strategies, were briefly introduced and reviewed. The knowledge base of a sentence parser and its required components, including the lexicon and the grammatical formalisms used to store the information that supports the parsing process, were also discussed in some detail.
The literature related to the Afan Oromo writing system, lexical categories, and grammatical constructions was then evaluated and discussed, because a fundamental aspect of building a parser is understanding the syntax of the language. Hence, it became evident which language characteristics were taken into account when constructing the different parser components. Almost all lexical and phrasal categories, sentence formalisms, typical tree sentence properties, and linguistic factors considered in constructing the different parts of the parser were also explained. Next, the sample corpus developed and used in this study, some of the major problems the researcher faced in obtaining the necessary sample, and the steps taken to deal with those problems were presented. In other words, because no corpus had been produced for studies on Afan Oromo tree sentence parsing up to this point, 300 tree sentences were gathered from two commonly used Afan Oromo grammar books, published articles, and newspapers.

Each word in the corpus was then manually morphologically analyzed, tagged, and parsed. For each word in the Training Set, the portion of the sample corpus used for training, the lexical and transitional probabilities were assessed. The grammatical rules were extracted, the probability associated with each rule in the training set was determined, and the resulting PCFG rules were simplified by being expressed in Chomsky Normal Form (CNF) and stored in a table dubbed PCFG-CNF. Later in the thesis, the algorithms and modules needed by the parser to access the knowledge base and parse incoming sentences with the proper lexical categories were provided. A prototype was built, using Python Tkinter to establish an interface that enables a user to communicate with the system. Two phases of experiments were carried out, the first on the Training Set and the second on the Test Set. The performance improvement of the original purely statistical part-of-speech tagger, a simple sentence parser, and the newly created decision tree Afan Oromo parser of this investigation was measured using a single parameter: the percentage of correctly tagged and parsed words and sentences in the sampled text.
The results obtained using the few samples were excellent, with a training set accuracy of 80.0% and a test set accuracy of roughly 71.6%. Before obtaining such accuracy, the experiment was run repeatedly on both the Training Set and the Test Set, finding mistakes and making corrections. The majority of the errors found were caused by human error in the preprocessing of the parser's input, conflicting PCFG rules, low likelihoods, and the absence of some rules. Potential causes of errors and their fixes were discussed before the thesis was concluded. Although the parser created for this study had somewhat above-average accuracy, it may not have immediate practical applications, because it was not trained on a large body of data covering all of the characteristics of Afan Oromo (tree) sentences. It is feasible to conclude that this thesis was an effort to highlight the potential of probabilistic approaches, specifically HMMs, for Afan Oromo NLP, and to use statistical approaches for decision sentence parsing in addition to rule-based or hybrid approaches.
The researcher believed that Ethiopian students and researchers would take up this kind of practice, paving the way for the eventual realization of higher-level and more difficult research projects such as conceptual parsing and machine translation, which are all NLP tasks.

6.2 Recommendations
This study has a number of limitations. They are listed below and are active research areas that should be addressed by those with an interest in the field; the work of such scholars may facilitate efforts to develop a powerful sentence parser for the Afan Oromo language. The following could be suggested as potential research areas.
 The bigram lexical co-occurrence method, built by assuming the input and output formats of a prior study and helped by a manually annotated morphological analyzer, was used in this investigation to guess unknown terms. This strategy responds nicely to the different inflections of a given stem in the database. Despite being an improvement over past research, some words that were completely novel to the database (i.e., those for which no inflectional forms exist in the database) were nonetheless incorrectly labeled. In order to achieve a better result, future studies may combine the integrated morphological analysis system studied here with both bigram and trigram lexical co-occurrence.

 Future researchers can create corpora with proportionate representations of both simple and complex Afan Oromo sentences, extract the PCFG rules, and test the effectiveness of the strategy and methodologies used in this study to parse both kinds of sentences.

 To expand the existing system's ability to parse different sentence kinds, replicate this work using a large dataset that includes all forms of sentences with all attributes, including case, number, gender, person, tense, and definiteness. On this basis, it would be possible to investigate how PCFG performs in NLP for Afan Oromo.
 Conduct comparable studies on additional regional languages, such as Tigrigna, relying on the methods employed in this study.
 Other potential future study areas worth exploring as a continuation toward a fully-fledged Afan Oromo sentence parser include noun phrase recognition, conceptual parsing, word sense disambiguation, and machine translation.
 In order to conduct experiments and studies on statistical natural language processing, it will be necessary to create processed Afan Oromo corpora (i.e., data that has been hand-tagged, parsed, and morphologically analyzed).
 The probabilities associated with the extracted grammatical rules are still static values. Both the grammar induction and the accompanying probability computations could be made dynamic, so that the probability values are updated and new grammatical rules are incorporated during sentence parsing.

7. REFERENCES
[1] "Parsing of part-of-speech tagged Assamese texts," International
Journal of Computer Science, vol. 6, no. 1, 2009, pp. 28–34.
Mirzanur Rahman, Sufal Das, and Utpal Sharma.

[2] Abebe Abeshu. "Analysis of Rule Based Approach for Afan Oromo
Automatic Morphological Synthesizer," STAR Journal, pp. 94 - 97,
2013.

[3] Danel Gochel Agonafer, "An Integrated Approach to AutomaticComplex


Sentence Parsing For Amharic Text,‖ Unpublished Msc Thesis, ADDIS
ABABA, 2003.

[4] Kwon, Yong-uk Park and Hyuk-chul, "KoreanSyntacticAnalysisusing


Dependency Rules and Segmentation," Proceedings of the Seventh
International Conference on Advanced LanguageProcessing and
Web Information Technology (ALPIT2008), vol. 7, no. 1, pp. 59- 63,
2008.

[5] Edward, Michael Liddy, Encyclopedia of Library and Informat ion


Science, 2nd ed., Marcel Decker.

[6] Warner, Amy J, "Natural Language Processing.," Annual Review of


Information Science and Technology, vol. 22, pp. 79-107, 1987.

[7] Diriba. MEGERSA, "An Automatic Sentence Parser for OromoLanguage


Using SupervisedLearning Technique,"Unpublis hed Masters Thesis,
Addis Abeba,2002.

[8] AberaN, "Long vowels in Afan Oromo:Agenericapproach," Master‘s


thesis, School of graduate Studies Addis Ababa Univers ity, Addis
Ababa, Ethiopia,1988.

[9] Grage G. & Kumsa T, "Oromo dictionary," African studies center.


Michigan State University, Michigan, USA, 1982.
[10] Girma Debele Dinegde ,Martha Yifiru Tachbelie, "Afan Oromo News
Text Summarizer," International Journal of Computer Applications,
vol. 103, no. 4, pp. 1-6,2014.

[11] M. Volk, "Parsing German with GPSG: The Problem of Separable-Prefix Verbs," MA Thesis, 1988.

[12] Kyongho Min and William H. Wilson, "Are Efficient Natural Language Parsers Robust?," Sydney, Australia, 2005.

[13] B. H. Dostert and F. B. Thompson, "Syntactic Analysis in REL English," in Papers in Computational Linguistics, Budapest, 1976.

[14] Win Win Thant, Tin Myat Htwe, and Ni Lar Thein, "Parsing of Myanmar sentences with function tagging," University of Computer Studies, Yangon, Myanmar, 2012.

[15] Getachew Mamo and Million Meshesha, "Parts of Speech Tagging for Afaan Oromo," International Journal of Advanced Computer Science and Applications, Special Issue on Artificial Intelligence, no. 53230, 2013.

[16] Abraham Tesso Nedjo, Degen Huang, and Xiaoxia Liu, "Automatic Part-of-speech Tagging for Oromo Language Using Maximum Entropy Markov Model (MEMM)," Journal of Information & Computational Science, vol. 11, no. 10, pp. 3319-3334, 1 July 2014.

[17] Debela Tesfaye, "Designing a Rule Based Stemmer for Afaan Oromoo Text," Masters Thesis, Addis Ababa University, 2013.

[18] Y. Mao, "Natural Language Processing Module (Part-of-Speech Tagging and Sentence Parsing) Laboratory Manual," Cognitive Science In Context (CSIC), 10 October 1997. [Online]. Available: http://www.csic.cornell.edu/201/natural_language. [Accessed 15 August 2015].

[19] Allen, J., Natural Language Understanding, 2nd ed., California: The Benjamin/Cummings Publishing Company, 1995.
[20] Abiyot Bayou, "Developing Automatic Word Parser for Amharic Verbs and Their Derivation," Master Thesis, School of Information Studies for Africa, Addis Ababa, 2000.

[21] Merlo, Paola, Parsing with Principles and Classes of Information, Boston: Kluwer Academic, 1996.

[22] Reyle, U. and C. Rohrer, Natural Language Parsing and Linguistic Theories, Boston: Reidel Publishing Company, 1988.

[23] Y. Mao, "Natural Language Processing Module (Part-of-Speech Tagging and Sentence Parsing) Laboratory Manual," 1997. [Online]. Available: http://www.csic.cornell.edu/201/natural_language/. [Accessed 27 September 2015].

[24] Tamas E. Doskocs, "Natural Language Processing in Information Retrieval," Journal of the American Society for Information Science, vol. 37, no. 4, pp. 191-196, 1986.

[25] B. Prichett, Grammatical Competence and Parsing Performance, Chicago: The University of Chicago, 1992.

[27] Stuart M. Shieber, Yves Schabes, and Fernando C. N. Pereira, "Principles and Implementation of Deductive Parsing," The Journal of Logic Programming, vol. 12, pp. 1-37, 1995.

[28] Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd ed., Prentice Hall, 2010.

[29] Mihai Lintean and Vasile Rus, "Naive Bayes and Decision Trees for Function Tagging," in Proceedings of the International Conference of the Florida Artificial Intelligence Research Society, Key West, FL, 2007.

[30] Daniel Schutzer, Artificial Intelligence: An Applications-Oriented Approach, New York: Van Nostrand Reinhold Company, 1987.

[31] Salton, G. and Michael J. McGill, Natural Language Processing, New York: McGraw-Hill, 1983.

[32] Spark Jones and Bonnie Lynn Webber, Readings in Natural Language Processing, Los Altos, USA: Morgan Kaufmann, 1986.

[33] Gazdar, Gerald and Mellish, Chris, "Natural Language Processing in Prolog," Sussex University, 1996. [Online]. Available: http://www.cogs.susx.ac.uk/local/books/nlp-in-prolog/ch01/chapter-01-sh-1.6.html#sh-1.6. [Accessed 14 May 2016].

[34] Noam Chomsky, Aspects of the Theory of Syntax, Cambridge, Massachusetts: MIT Press, 1965.

[35] William Woods, "Transition Network Grammars for Natural Language Analysis," in Communications of the ACM, 1970.

[36] Kay, Martin, "Functional grammar," in Proceedings of the Fifth Annual Meeting of the Berkeley Linguistics Society, 1979.

[37] Ralph Grishman, "Natural Language Processing," Journal of the American Society for Information Science, vol. 35, no. 5, pp. 291-296, 1984.

[38] Liang Huang, Yinan Peng, Huan Wang, and Zhenyu Wu, "PCFG Parsing for Restricted Classical Chinese Texts," National University of Singapore, Singapore, 1998.

[39] Blaheta, D. and Johnson, M., "Assigning function tags to parsed text," in Proceedings of the 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics, 2000.

[40] Berwick, Robert C. and Amy S. Weinberg, The Grammatical Basis of Linguistic Performance: Language Use and Acquisition, London: MIT Press, 1989.

[41] Biber, Douglas, Susan Conrad, and Randi Reppen, "Corpus Linguistics: Investigating Language Use," Cambridge University Press, New York, 1998.

[42] Abebe Keno, "Case Systems in Oromo," MA Thesis, Addis Ababa University, 2002.

[43] Baye Yimam, "Oromo Substantives: Some Aspects of Their Morphology and Syntax," MA Thesis, Addis Ababa University, Addis Ababa, 1981.

[44] Baye Yimam, Seerluga Afaan Oromoo, Addis Ababa: Addis Ababa University Press, 2003.

[45] Baye Yimam, "The Phrase Structure of Ethiopian Oromo," PhD Dissertation, University of London, London, 1986.

[46] Askale Lemma, "Seerluga Afaan Oromoo," Unpublished Handout for Oromo Syntax, AAU, Addis Ababa, 1997.

[47] Tilahun Gamta, Oromo-English Dictionary, Addis Ababa: Addis Ababa University Press, 1989.

[48] Sag, Ivan A., et al., Syntactic Theory: A Formal Introduction, Stanford: Center for the Study of Language and Information, 1999.

[49] Hamid Muudee, Hamid Muudee's English-Oromo Dictionary, Vol. 1, Atlanta: Sagalee Oromoo Publishing, Inc., 1995.

[50] Levine, R. D. and G. M. Green, Studies in Contemporary Phrase Structure Grammar, Cambridge: Cambridge University Press, 1999.

[51] G. Gragg, Oromo Dictionary, East Lansing, Michigan: Michigan State University, 1982.

[52] Adugna Berkesa, NATOO: Yaadrimee Caasluga Afaan Oromoo, Addis Ababa, 2010.

[53] Yao, Yuan and Kim Teng Lua, A Probabilistic Context-Free Grammar Parser for Chinese, Singapore: National University of Singapore, Department of Information Systems and Computer Science, 1998.

[54] Abraham Tesso Nedjo, "Automatic Part-of-speech Tagging for Oromo Language Using Maximum Entropy Markov Model (MEMM)," Journal of Information and Computational Science, vol. 11, no. 10, pp. 3319-3334, 1 July 2014.

[55] Owens, Jonathan, A Grammar of Harar Oromo (Northeastern Ethiopia), Hamburg: Helmut Buske, 1985.

[56] Mark-Jan Nederhof and Giorgio Satta, "Parsing Non-Recursive Context-Free Grammars," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002), Philadelphia, Pennsylvania, USA, 2002.
APPENDICES

APPENDIX A: THE SAMPLE TEXT

1. Biqiltoonnis\NOUNta‘an\CONJbineeldonni\NOUN
jiraachuuf\ADJ,\COMAnyaata\NOUN isaan\PRON\barbaachisa\VERB.\PUNCT
2. Xurii \NOUNqaama\NOUNkeessaa\baasuuf, bishaan\NOUNtumsa\VERBguddaa\ADV
godha\VERB.\PUNCT
3. Akaakuu\ADJ nyaataa \NOUNqaamaaf \NOUNbarbaachisan\ADJ filachuun\VERB\
fayyaa \NOUNkeenyaaf\PRONgaarii\ADVdha\ADP.\PUNCT
4. Guddinaa \NOUNfi\CONJ jabina\NOUN qaamaa \NOUNargachuuf\nyaata\NOUN
walmadaale\ADJargachuun\ADVdansaa\VERBdha\ADP.\PUNCT
5. Waan arganne hunda nyaachu osoo hin taane,nyaatamadaalamaa soorachuutu bu‘aa qaba.
6. Ani \PRONkaleesa\NOUN malee\CONJhaar‘a
\NOUNnyaata\NOUN\hin\nyaane\VERB.\PUNCT
7. Marartun\PROPN gara\ADJ gabaa\NOUN demtee\VERB,\COMA Midhaan\NOUN
bitte\VERB.\PUNCT
8. Bulchaan\PROPN\ dhengada\ADV sare\PROPN gamadaa\PROPN
ajjesse\VERB.\PUNCT
9. Yoo\ADPdheebotte\NOUN,\COMA bishaan\NOUN Amboo\PROPN
dhugi\VERB.\PUNCT
10. Barumsi\NOUN waan\CONJitti\ADP
cimeef\ADJ,\COMAaddaan\ADVkute\VERB.\PUNCT
11. Isheen\PRON hoojii\NOUN ishii\PRON waan\CONJ
beektuuf\NOUN,\COMAmana\NOUN barumsaa\NOUN iraa\ADP
haafte\VERB.\PUNCT
12. Yoo \ADPdhufuu\NOUN baattellee\ADJ,\COMA xalayaa\NOUN naaf\PRON
barreessi\VERB.\PUNCT
13. BoonaaN\PROPN biyyaa\NOUN alaatii\ADJ akka\ADPdhufeen\NOUN,\COMA
hiriyyoota\NOUN isaaf\PRONdubbii\NOUN godhe\VERB.\PUNCT
14. Yoo\ADP finfinnee \PROPNdeemteef\NOUN,\COMA meeshaa\NOUN naa\PRON
bitta\VERB.\PUNCT
15. Bokkaa\NOUN cimaa\ADJ waan\CONJ roobeef\NOUN,\COMA lagni\NOUN
guutee\ADJ riqicha\NOUNcabse\VERB.\PUNCT
16. Namni \NOUNkamiyyu\ADJ taanaan\NOUN ,\COM\maqaa\NOUN mataa\ADJ
isaa\PRONqaba\VERB.PUNCT
17. Gargaarsi \NOUNmootummaa\COMNOUNduubaan\ NOUNjiraannaan\NOUN umanni
\NOUNmisoomaaf\NOUN seexaa\NOUN cimaa \ADJqaba\VERB.\PUNCT

18. Namni\NOUN barate\NOUN tokko\ADV haalaa\NOUN fi \CONJbeekumsa\NOUN


isaa\PRON haawwii\NOUN ummataa\COMNwajjin\ADJdeemsisuu
\ADVqaba\VERB.\PUNCT
19. Hiyyeessi\NOUN jabaate\ADJ yoo \ADPhojjate\NOUN,\COMA hattuma\NOUN
keessaa\ADJ bahuu\ADV ni\ADPdanda‘a\VERB.\PUNCT
20. Kaayyoon\NOUN barataa \NOUNtokkoo\ADJ inni\ADP guddaan\ADJbarumsa\NOUN
isaatti \PRONjabaatee\ADJ,\COMAbiyya\ isaa
\PRONguddisuu\ADVta‘uu\ADP qaba\VERB.\PUNCT
21. Maqaa\NOUN moggaassuun\ADV nama\NOUN hin\ADPdhibu\VERB,\COMA
akka\ADP hawwan\NOUN sanatt\ADVi bakkaan\NOUN gahuutu\ADV nama\NOUN
dhiba\VERBmalee\ADP.\PUNCT
22. Gaafa\ miilkiin\NOUN sif \PRONhin\ADP tolle\ADJ har‘a \NOUNcaraa\NOUNgaarii
\ADJmiti\ ADPjetta\VERB.\PUNCT
23. Qorumsi\NOUN ga‘e \ADJjedhanii\NOUN batattisuu\NOUN irra\ADP laf-jala\NOUN
itti\ADPqophaa‘uutu\ADVgaarii\VERBdha\AUX.\PUNCT
24. Namni\NOUN abdii\NOUN fi\CONJ akeeka\NOUN qabu\ADVtattaaffii
\NOUNcimaa\ADV godhee\NOUN bakka\NOUN yaadee\ADV
hin\ADP hanqatu\VERB.\PUNCT
25. Mucaan \NOUNobbo\ADP magarsaa\ PROPNeega\CONJ kashlabbaayee\NOUN
manabarumsaa\NOUN hafee\NOUN,\COMA ganda\NOUN keessa\ADJ jooraa\NOUN
oola\ADVture\VERB.\PUNCT
26. Eda\NOUN robaa\NOUN waan\CONJ buleef\NOUN,\COMA lagi\NOUN
caanco\PROPN guteetu\ADVdanbali‘e\VERB.\PUNCT
27. Galataan\PROPN mana\NOUN ajjeeruu\NOUNdhaaf\ADPbisii\NOUN
maraa\ADVjira\VERB.\PUNCT
28. Gadaan\NOUN baranaa\NOUN horata\NOUN moo\CONJ
Birmajjii\NOUNdha\VERB?\QUESTIONMARK
29. Yemmu\ADPwaraabesi\NOUNyuusu\ADV,\COMAsareen\NOUN dute\VERB.\PUNCT
30. Gamadaan\PROPN kaleessa\ADV sangaa\NOUNgurgure\VERB.\PUNCT
31. Yemmun \ADP ani \PRONxule\NOUN gahu\NOUN,\COMABantiin\PROPN achii\ADV
hin\ADPjiru\VERB.\PUNCT
32. Margeen \PROPNyeroo\ADVhunda\ADV mana\NOUN barumsaa\NOUN
deemti\VERB.\PUNCT
33. Biyya\NOUN misoomsuuf\ADJ tokkummaan\ADV haa\ADPkaanu\VERB.\PUNCT
34. Yemmuu\ADP deemtu\NOUN,\COMA na\PRON dubbisii\ADVdarbii\VERB.\PUNCT
35. Dhagaan\PROPN gaara\NOUN sana\ADJ iraa\ADP kokolaatee\NOUN,\COMA
Farda\NOUNajeese\VERB.\PUNT
36. Situ\PRON na \PRONarge\VERB malee\CONJ ani\PRON\ si\PRONhin
\ADPagarre\VERB.\PUNCT
37. Jaartiin\NOUN mataa\NOUN arrii\ADJ,\COMA nama\NOUN walitti\ADV
naqaxe\VERB.\PUNCT
38. Huccuu \NOUNhaphii\ADJ uffatee\NOUN,\COMA qorra\NOUNkeessa
\ADJciisa\VERB.\PUNCT
39. Otto\ADP loon\NOUN hin \ADPgalin\NOUN,\COMA hattun\NOUN mooraa\ADV
guutte\VERB.\PUNCT
40. Erga\ biftuun\NOUN lixxee\AD,\COMAbokaan\NOUNrobe\VERB.\PUNCT
APPENDIX B: SAMPLE OUTPUT
Sent: Toolaan akka ishee jaalatee Haawwiin siritti bektee
Parser: <ViterbiParser for <Grammar with 21 productions>>
Grammar: Grammar with 21 productions (start state = CS)
CS -> MC SC [1.0]
MC -> NP PP VP [1.0]
SC -> NP VP [1.0]
VP -> P PR V [0.4]
VP -> PR V [0.4]
VP -> ADV V [0.2]
NP -> N [1.0]
PP -> P [1.0]
V -> 'jaalatee' [0.4]
V -> 'demee' [0.12]
V -> 'bektee' [0.48]
N -> 'Toolaan' [0.45]
N -> 'Haawwiin' [0.4]
N -> 'dinagdee' [0.15]
ADV -> ADV ADV [0.2]
ADV -> ADV [0.3]
ADV -> 'siritti' [0.5]
PR -> PR [0.4]
PR -> 'ishee' [0.6]
P -> P [0.5]
P -> 'akka' [0.5]
Inserting tokens into the most likely constituents table...
Insert: |=......| Toolaan
Insert: |.=.....| akka
Insert: |..=....| ishee
Insert: |...=...| jaalatee
Insert: |....=..| Haawwiin
Insert: |.....=.| siritti
Insert: |......=| bektee
Finding the most likely constituents spanning 1 text element...
Insert:  |=......| N -> 'Toolaan' [0.45]      0.4500000000
Insert:  |=......| NP -> N [1.0]              0.4500000000
Insert:  |.=.....| P -> 'akka' [0.5]          0.5000000000
Insert:  |.=.....| PP -> P [1.0]              0.5000000000
Discard: |.=.....| P -> P [0.5]               0.2500000000
Discard: |.=.....| P -> P [0.5]               0.2500000000
Insert:  |..=....| PR -> 'ishee' [0.6]        0.6000000000
Discard: |..=....| PR -> PR [0.4]             0.2400000000
Insert:  |...=...| V -> 'jaalatee' [0.4]      0.4000000000
Insert:  |....=..| N -> 'Haawwiin' [0.4]      0.4000000000
Insert:  |....=..| NP -> N [1.0]              0.4000000000
Insert:  |.....=.| ADV -> 'siritti' [0.5]     0.5000000000
Discard: |.....=.| ADV -> ADV [0.3]           0.1500000000
Insert:  |......=| V -> 'bektee' [0.48]       0.4800000000
Finding the most likely constituents spanning 2 text elements...
Insert:  |..==...| VP -> PR V [0.4]           0.0960000000
Insert:  |.....==| VP -> ADV V [0.2]          0.0480000000
Finding the most likely constituents spanning 3 text elements...
Insert:  |.===...| VP -> P PR V [0.4]         0.0480000000
Insert:  |....===| SC -> NP VP [1.0]          0.0192000000
Finding the most likely constituents spanning 4 text elements...
Insert:  |====...| MC -> NP PP VP [1.0]       0.0216000000
Insert:  |====...| SC -> NP VP [1.0]          0.0216000000
Finding the most likely constituents spanning 5 text elements...
Finding the most likely constituents spanning 6 text elements...
Finding the most likely constituents spanning 7 text elements...
Insert:  |=======| CS -> MC SC [1.0]          0.0004147200

Time (secs)    # Parses    Average P(parse)
0.5830         1           0.00041472000000
n/a            1           0.00041472000000

Draw parses (y/n)? y

TS = tree sentence
MC = Main Clause
SC = Sub Clause

Print parses (y/n)? y
(CS
  (MC
    (NP (N Toolaan))
    (PP (P akka))
    (VP (PR ishee) (V jaalatee)))
  (SC (NP (N Haawwiin)) (VP (ADV siritti) (V bektee))))
[0.00041472000000000004]
