Dinka Final Thesis
M.Sc. Thesis
BY: Dinka Getahun Mokonnen
June, 2023
Nekemte, Ethiopia
WALLAGA UNIVERSITY
SCHOOL OF GRADUATE STUDIES
P.O. Box: 395, Nekemte, Ethiopia.
APPROVAL SHEET FOR SUBMITTING FINAL THESIS
As members of the Board of Examiners of the final MSc thesis open defense, we certify
that we have read and evaluated the thesis prepared by Mr. Dinka Getahun Mokonnen
under the title "Decision Tree Top-Down Chart Parser for Afan Oromo Sentence Parsing" and
recommend that the thesis be accepted as fulfilling the thesis requirement for the Degree of
Master of Science in Computer Science.
Examining Committee Name Signature Date
Dedication
I dedicate this work to my dear father, Getahun, and my mother, Marame Debela, who
passed away unexpectedly three years ago.
Advisor's Approval
As thesis research advisor, I hereby certify that I have read and evaluated this thesis,
prepared under my guidance by Dinka Getahun, entitled "Decision Tree Top-Down
Chart Parser for Afan Oromo Sentence Parsing", and I accept it in partial fulfillment
of the thesis requirement for the award of the Degree of Master of Science in Computer
Science. I recommend that it be submitted as fulfilling the thesis requirement.
STATEMENT OF THE AUTHOR
I, Mr. Dinka Getahun Mokonnen, hereby declare and affirm that the thesis
entitled "Decision Tree Top-Down Chart Parser for Afan Oromo Sentence Parsing" is my own
work conducted under the supervision of Mr. Kemal Mohamed (MSc, Assistant Professor). I
have followed all the ethical principles of scholarship in the preparation, data
collection, data analysis and completion of this thesis. All scholarly matter that is
included in the thesis has been given recognition through citation. I have adequately
cited and referenced all the original sources. I also declare that I have adhered to all
principles of academic honesty and integrity and I have not misrepresented, fabricated,
or falsified any idea / data / fact / source in my submission. This thesis is submitted in
partial fulfillment of the requirement for a degree from the Post Graduate Studies at
Wallaga University. I further declare that this thesis has not been submitted to any other
institution anywhere for the award of any academic degree, diploma or certificate.
I understand that any violation of the above will be cause for disciplinary action by the
University and can also invoke penal action from sources that have not been properly
cited or from whom proper permission has not been taken when needed.
Dinka Getahun
ACKNOWLEDGMENT
First of all, I would like to thank Almighty God for giving me endurance for
completion of this thesis.
Next, I would like to express my heartfelt appreciation and gratitude to my thesis
advisor, Mr. Kemal Mohamed (MSc, Assistant Professor), for his valuable advice,
suggestions, and guidance throughout my study, and for giving me hope whenever I
faced serious problems in finishing this thesis.
I would also like to thank the Computer Science department head, Mr. Tariku B., and
the IT department head, Mr. Gemechu B., for the unreserved encouragement and support
they rendered to me during the entire period of study.
I would like to show my gratitude to all the Computer Science department staff for
their comments and suggestions.
Finally, I thank my family, especially my wife, Abaynesh Wana, for her unlimited
support and encouragement.
Abstract
Many sentence parsers have previously been developed for foreign languages such as
English and Arabic, as well as for Amharic among the local languages of Ethiopia. Parsing
Afan Oromo sentences is likewise a necessary mechanism for other natural language
processing applications such as machine translation, question answering, knowledge
extraction, and information retrieval.
The study of natural language processing is gaining popularity daily for both academic and
commercial purposes. Higher NLP systems, such as machine translation, can only be produced
once the lower-level ones, like part-of-speech taggers and syntactic parsers, have been
successfully developed; this functional reliance exists even among the more basic NLP
systems. This thesis can be seen as an effort to combine concepts and results from earlier
attempts at an Afan Oromo part-of-speech tagger in order to address the somewhat more
challenging problem of decision tree sentence parsing. In this thesis, an effort is made to
extract features such as Afan Oromo word and phrase classes, sentence formalisms, and
sentence parsers that can be implemented using decision trees for Afan Oromo. The study's
sample data came from sources that are often used in language instruction and language
learning. This data was manually examined, annotated, tagged, and processed before being
utilized as a corpus from which to extract the grammatical rules and assign probabilities.
We also developed a simple lexicon generator algorithm to generate the lexical rules for the
decision tree. The Python programming language and NLTK were used as implementation
tools for this study. Experiments were then conducted on a corpus of 300 sentences
comprising 3,029 words in total. In this study, 20% (60) of the corpus sentences were
employed as the test data set and 80% (240) as the training data. The integrated part-of-speech
tagger employed the 3,029 manually annotated words from this study and a tag set of 28
categories; of these, 27 are tag categories while the remaining one, "X", is used for
unidentified terms. The study's findings revealed that the tagger achieved accuracy levels of
89.7% on the training set and 84% on the test set. The decision tree sentence parsing
experiments produced accuracy results of 80.0% on the training set and 71.6% on the test
set created for this purpose.
Keywords: NLP, parser, decision tree grammar, top-down chart parser, lexicon
generator, lexicon.
ABBREVIATIONS
ADJ An adjective
AdjP Adjectival Phrase
ADV An adverb
ADVC An adverb not separated from a conjunction
AdvP Adverbial Phrase
AUX Auxiliary verbs and all their other forms
DT Complement
CONJ A conjunction
ITJ Interjections
JC A conjunction not separated from an adjective
JNU A numeral used as an adjective
JP An adjective not separated from a preposition
JPN A noun not separated from a preposition and that functions as an adjective
N Noun in all forms
NC A conjunction not separated from a noun
NP A preposition not separated from a noun
NP Noun Phrase
NUM Number
NV Verbal nouns
PP Prepositional Phrase
PREP A preposition
PUNCT Punctuation
REL Relative clause
V Verb in all forms except auxiliary
VC A verb prefixed or suffixed by a conjunction
VCO Compound verbs
VP Verb Phrase
Table of Contents
Declaration ........................................................................................................................ iii
STATEMENT OF THE AUTHOR ................................................................................... iv
ACKNOWLEDGMENT..................................................................................................... v
Abstract.............................................................................................................................. vi
ABBREVIATIONS .............................................................................................................. vii
CHAPTER ONE ..................................................................................................................... 1
INTRODUCTION .............................................................................................................. 1
1.1. Background ................................................................................................................ 1
1.2. Statement of the Problem ........................................................................................... 3
1.3. Objective of the Study ................................................................................................ 5
1.4. Methodology .............................................................................................................. 7
1.4.4. Parsing Techniques and Prototype Development ...................................................... 8
1.5. Application of Results and Beneficiaries .................................................................... 8
1.6. Scope of the Study ..................................................................................................... 8
1.7. Limitation of the Study .............................................................................................. 9
1.8. Organization of the Thesis ......................................................................... 9
CHAPTER TWO .............................................................................................................. 10
REVIEW OF LITERATURE ........................................................................................... 10
2.1 Introduction .............................................................................................................. 10
2.2 Decision tree grammar Sentence Parsing (DTP) ...................................................... 10
2.3 Approaches to Decision Tree Grammar Sentence Parsing ......................... 12
2.4 Knowledge Required by the Parser ......................................................................... 19
2.5 Related NLP in Parsing ............................................................................... 24
2.6 Related NLP in Afan Oromo ................................................................................... 24
2.7 Related NLP Component Systems ............................................................... 25
CHAPTER THREE .......................................................................................................... 27
THE STRUCTURE OF AFAAN OROMO ...................................................................... 27
3.1 Introduction............................................................................................................... 27
3.2 Afan Oromo Alphabet and Writing System ................................................................ 27
3.3 Punctuation Marks in Afan Oromo ............................................................................. 28
3.4 Word Categories in Afan Oromo ............................................................................... 28
3.5 Phrasal Categories .................................................................................................... 34
3.6 Sentences .................................................................................................................. 39
CHAPTER FOUR ............................................................................................................ 45
DATA PREPARATION AND PCFG EXTRACTION ................................................... 45
4.2 The Design Approach of the Parser ......................................................................... 45
4.3 The Sample Corpus................................................................................................... 46
4.4 The Morphological Pre-processing .......................................................................... 47
4.5 The Part-of-Speech Tagger ..................................................................... 48
4.6 Extraction of a Probabilistic Context Free Grammar .............................. 50
4.7 Chomsky Normal Form (CNF) Representation ....................................... 51
CHAPTER FIVE .............................................................................................................. 52
PARSING ALGORITHM AND EXPERIMENTATION................................................ 52
5.2 The Parsing Algorithm.............................................................................................. 52
5.3 PCFG Parsing ............................................................................................ 53
5.5 The Experiment .......................................................................................... 60
CHAPTER SIX ................................................................................................................ 65
6. CONCLUSION AND RECOMMENDATION ....................................... 65
6.1 Conclusion................................................................................................................ 65
6.2 Recommendations .................................................................................... 67
7. REFERENCES ................................................................................................................. 84
List of Tables
Table 3.1: personal pronouns........................................................................................... 30
Table 5.4: Parsing result on Training Set before error correction ................. 58
Table 5.8: Parsing result on Training Set before error correction ................ 62
Table 5.9: Parsing result on Training Set after error correction ................... 62
Table 5.10: Parsing result on Test Set ............................................................................. 63
List of Figures
CHAPTER ONE
INTRODUCTION
1.1. Background
As opposed to a formal language (such as a computer programming language or the
"languages" used in the study of formal logic), a natural language or ordinary language
is one that is spoken, written, or signed by humans for everyday communication [1].
One of the most fundamental parts of human conduct is language, which is also a very
important part of our daily life. In its written form, it serves as a method of long-term
information and knowledge recording and transmission from one generation to the next. It
helps us communicate with people and organize our daily lives in verbal form [2].
ambiguous input phrase into unambiguous forms is the main goal of this step.
Since the early 1960s, many parsing algorithms have been created for a variety of languages,
including English [13]. The first effective chart parser for the English language was invented
by Earley. Since then, numerous initiatives have been made to bring sentence parsing to other
languages around the globe [14]. Afan Oromo is one of the languages that should have a
decision tree sentence parser; however, to the best of the author's knowledge, only one system
of this kind has been created for the language (Diriba [7]). Therefore, the creation of such a
parsing mechanism is crucial. As a result, this study addresses this issue, attempts to close the
gap left by earlier research by creating a decision tree sentence parser for the language, and
sheds light on potential future research directions.
1.2. Statement of the Problem
Afan Oromo is one of the major languages widely spoken in Ethiopia. Currently, it is
the official language of the regional state of Oromia (the largest regional state in
Ethiopia), used as a working language in offices and as the medium of instruction for
primary and junior secondary schools, and it is also given as a subject in secondary
schools (grades 9-12). As Mandafro reports in his work [11], eight public universities
in Ethiopia offer degree programs majoring in Afan Oromo, and Addis Ababa
University offers Afan Oromo at the Master's degree level.
Unlike Amharic, another major language and working language of Ethiopia, which belongs
to the Semitic family of languages, Afan Oromo is part of the Lowland East Cushitic group
within the Cushitic family of the Afro-Asiatic phylum.
According to Abebe [9], the Afan Oromo language is not only spoken in Ethiopia; it is
also spoken in Somalia, Kenya, Uganda, Tanzania, and Djibouti. Although Afan Oromo
is today spoken by such a large number of people, few advances have been made in
computational linguistics or natural language processing for the language.
"Computational approaches to linguistic analysis of Afan Oromo so far have been
hindered due to non-availability of well-studied linguistic resources" [12]. Since the
Afan Oromo language is the official language of the Oromia National Regional State,
as mentioned above, and is used in offices, schools, colleges, universities, and the
media, various written materials are being published electronically and
non-electronically nowadays. This creates interest in NLP research on this language.
For instance,
morphological synthesizers [9], spell checkers [13], grammar checkers [14], part-of-speech
tagging [15][16][12], named entity recognition [1], news text summarization [17],
machine translation [8], word sense disambiguation [18], question answering [19], text
retrieval [20], and search engines [21] are some of the NLP applications that require a
sentence parser for successful and full-fledged implementation. Besides, a sentence
parser is a useful NLP application in the teaching and learning process, for phrase
identification and for understanding word relations in sentences of the Afan Oromo
language. It is also an important tool in NLP, serving as an intermediate component for
different higher-level applications like machine translation [4].
On the other hand, as mentioned in the section above, the Internet is one of the main
sources of information. The enormous amount of information on the Internet could be
used to enhance development by making it accessible to the public. To fully localize
and utilize these resources, translation of documents from one language to another may
be necessary. For example, many documents on the Internet are written in English;
because of this, English-to-Afan-Oromo translation and vice versa may be required in
syntax-based machine translation [22]. Besides, according to [23], parsers have become
efficient and accurate enough to be useful in many natural language processing systems,
most notably in machine translation. Therefore, machine translation, which takes Afan
Oromo sentences as input and uses a sentence parser as a component, plays a great role
in solving the translation problem. Thus, we propose to develop a sentence parser for
the Afan Oromo language.
To this end, the researcher has gone through different literature to find whether there is
any sentence parser that can parse Afan Oromo decision tree sentences. To the best of
the researcher's knowledge, no such Afan Oromo sentence parser exists. However, there
is one attempt by [5] at a decision tree sentence parser for the Afan Oromo language,
using a supervised learning technique for simple declarative Afan Oromo sentences. In
that study, the chart algorithm was used. In addition, an unsupervised learning algorithm
was designed to guide the parser in predicting unknown and ambiguous words in a
sentence, and an intelligent (rule-based learning module) approach was adopted to
develop a prototype. The result obtained was 80% on the training dataset; results on the
20% test dataset were not included. The system also lacked a component that could have
been used as a preprocessor to the parser, and it was developed only for simple
declarative sentences of the Afan Oromo language.
Due to this fact, the researcher is motivated to develop a parser for decision tree Afan
Oromo sentences. The focus of this study is, therefore, designing and developing a
sentence parser for Afan Oromo text that covers decision tree sentences. The parser will
obviously have major significance for users of the language.
Moreover, as the nature and structure of sentence parsing (syntactic parsing) in Afan
Oromo differs from English, Amharic, and other languages, a sentence parser developed
for those languages cannot function for Afan Oromo. This is due to the fact that the
language has its own syntactic and morphological nature and its own grammatical and
word-formation techniques that differ from other languages. As a result, a sentence
parser developed for another language cannot be used for Afan Oromo, which creates
the need for an independent sentence parser. We therefore decided to develop a sentence
parser for Afan Oromo simple and complex sentences using a top-down chart parsing
algorithm.
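As a rough illustration of what a top-down chart parser does with an SOV-ordered input, the sketch below runs NLTK's `TopDownChartParser` over a toy grammar. The rules and the three-word fragment are illustrative assumptions only, not the grammar induced in this thesis.

```python
import nltk

# Toy grammar reflecting a subject-adverb-verb order; illustrative only,
# not the PCFG extracted from the thesis corpus.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> N
VP -> ADV V
N -> 'Tolaan'
ADV -> 'kalessa'
V -> 'bitee'
""")

# Top-down chart parsing: hypotheses are expanded from S down to the words,
# and completed edges are recorded in the chart to avoid repeated work.
parser = nltk.TopDownChartParser(grammar)
trees = list(parser.parse(['Tolaan', 'kalessa', 'bitee']))
for t in trees:
    print(t)
```

The same grammar can be swapped for the corpus-induced rules once they have been extracted, since NLTK chart parsers accept any `CFG` object.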
Based on the above justification, this study attempts to answer the following questions:
What are the properties and word orders of the Afan Oromo language?
Is it possible to use sentence parsers of other languages for the Afan Oromo language?
Does the adoption of parsing algorithms from other languages work for the Afan Oromo
language?
1.3. Objective of the Study
1.3.2. Specific Objectives
In order to achieve the general objective of this research, the following specific
objectives are formulated:
To identify the properties of Afan Oromo sentences based on the knowledge base of
the language, namely the basic word order, word categories, morphological properties,
phrase structures, and sentence types that are useful for sentence parsing.
To select sample sentences that would potentially serve for the experiment
To extract an appropriate grammar rule to represent the structure of Afan Oromo
sentences.
To design a general architecture for the Afan Oromo parser
To develop a simple algorithm for lexical generator in order to automatically generate
lexical rules from sample corpus.
To select and customize an appropriate parsing algorithm for Afan Oromo sentence
parser.
To evaluate performance of the parser
Review the basic word categories, morphological properties, phrase structures, and the
various kinds of sentences of Afan Oromo with the aim of investigating patterns that
allow computer representation;
Collect sample simple and tree sentences to be used in the experiment;
Build the database of the part-of-speech tagger using the stems of words taken
from the sample corpus and calculate the lexical and transitional probabilities for
them.
Generate the grammar rules appropriate for the language;
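As a sketch of the lexicon-generator objective above, the following minimal Python function collects lexical rules of the form `TAG -> 'word'` from tagged tokens. The sample tokens and the output format are assumptions for illustration; the actual generator described in the thesis also attaches corpus-estimated probabilities to each rule.

```python
from collections import defaultdict

def generate_lexical_rules(tagged_words):
    """Collect lexical rules TAG -> 'word' from a list of (word, tag) pairs.

    A minimal sketch of the lexicon generator; the real generator also
    estimates a probability for each rule from corpus frequencies.
    """
    lexicon = defaultdict(set)
    for word, tag in tagged_words:
        lexicon[tag].add(word)
    rules = []
    for tag in sorted(lexicon):
        for word in sorted(lexicon[tag]):
            rules.append(f"{tag} -> '{word}'")
    return rules

# Hypothetical tagged tokens (tags drawn from the thesis tag set).
sample = [('Tolaan', 'N'), ('kalessa', 'ADV'), ('bitee', 'V')]
print(generate_lexical_rules(sample))
# ["ADV -> 'kalessa'", "N -> 'Tolaan'", "V -> 'bitee'"]
```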
1.4. Methodology
In order to develop a Sentence Parser for Afan Oromo language, exploring of the
Characteristics of the language and different approaches which can be used for the
Development should be needed. The followings are the methods that have been followed
to achieve the general and specific objectives of this thesis work.
1.4.1. Literature Review
A variety of relevant literature sources, including books, research reports, journal
articles, manuals, and other published and unpublished documents (including those from
the web), were reviewed for the purposes of this study. The review focused on
NLP-related issues, especially the parsing of tree sentences (approaches, methods,
strategies, etc.), and on the language issues under consideration (the basic word
categories, morphological properties, phrase structures, and different sentence types of
Afan Oromo).
Additionally, literature in the areas of parsing and general computational linguistics
(such as the algorithms and data structures used) was examined to gain a better
understanding of how the language works. This understanding made it possible to
implement the features of the language determined to be suitable for the study and to
employ parsing algorithms appropriately.
1.4.2. Discussion with Linguists
We have also had fruitful discussions with linguists and experts on Afan Oromo
sentences and their subcomponents, particularly the phrase structure of the language.
In-depth conversations with linguists, both native and non-native speakers, helped the
researcher understand the well-formedness of Afan Oromo sentences and the correctness
of the parser's output.
1.4.3. Data Collection
Two types of sentences, 300 decision tree sentences and 10 simple sentences of Afan
Oromo, were gathered from published books, periodicals, and newspapers. The
sentences were chosen so that decision tree verb phrases and simple noun phrases are
present in the tree sentences, while simple verb and noun phrases can both be found in
the simple sentences. As a result, adequate consideration was given to linguistic
structures and sentence types so that the entire chosen set of sentences satisfies the
necessary language structures, which is advantageous for the research process.
1.4.4. Parsing Techniques and Prototype Development
In data preprocessing, 3,029 words have been annotated with their respective categories
of tags. The stems, tagged using an HMM, have then been used to determine the
necessary lexical dictionary and the Probabilistic Context Free Grammar (PCFG) rules.
Based on these, a parsing algorithm has been adopted and modified in order to generate
the sentence parser. The prototype was developed using NLTK and Tkinter, and the
parsing algorithm was implemented in Python.
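The PCFG-extraction step described here can be sketched with NLTK's `induce_pcfg`, which estimates rule probabilities by relative frequency over the productions of hand-parsed trees. The two tiny trees below are stand-ins for the 240 manually parsed training sentences, so the probabilities are illustrative only.

```python
from nltk import Nonterminal, Tree, induce_pcfg

# Two hand-parsed sentences standing in for the annotated training corpus.
trees = [
    Tree.fromstring("(S (NP (N Tolaan)) (VP (ADV kalessa) (V bitee)))"),
    Tree.fromstring("(S (NP (N Gaangeen)) (VP (V duute)))"),
]

# Collect every production from every tree; induce_pcfg turns the
# frequency counts into relative-frequency rule probabilities.
productions = []
for t in trees:
    productions += t.productions()

pcfg = induce_pcfg(Nonterminal('S'), productions)
print(pcfg)   # e.g. VP -> ADV V and VP -> V each get probability 0.5
```

The resulting grammar can be handed to a probabilistic parser such as `nltk.ViterbiParser` to pick the most probable analysis of a sentence.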
1.4.5. Testing Techniques
The project used two kinds of data sets: a training set and a test set. Of the 300 tree
sentences, 80% of the sample corpus was randomly chosen as training data, and the
remaining 20% was used as test data. To verify that the algorithm also works for basic
sentences, more than 10 simple sentences were taken and evaluated. The experiment
was carried out in two stages, first on the training set and then on the test set, and the
outcomes were assessed.
The parse results were compared to manually parsed sentences, and the experiment was
repeated until no further progress was apparent.
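The 80/20 split described above can be sketched as follows. The fixed seed and placeholder sentences are assumptions made so the example is reproducible; they are not details from the thesis.

```python
import random

def split_corpus(sentences, train_ratio=0.8, seed=42):
    """Randomly split a corpus into training and test sets.

    With 300 sentences and an 80/20 ratio this yields 240 training and
    60 test sentences, matching the setup described in the thesis.
    """
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

corpus = [f"sentence_{i}" for i in range(300)]   # placeholder corpus
train, test = split_corpus(corpus)
print(len(train), len(test))                     # 240 60
```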
1.5. Application of Results and Beneficiaries
Syntactic parsers are one of the essential elements of higher-level NLP systems. Hence,
parsing systems would be crucial in many NLP applications for the Afan Oromo language. The
results of this study will be extremely valuable to academics working to improve the capacity
of computers to understand the Afan Oromo language; hence, undertaking research in the field
of NLP, especially sentence parsing, is of the utmost importance. The main beneficiaries will
be academics with an interest in conceptual parsing, machine translation, phrase recognition,
spell checkers, text summarization, and so on. Besides, linguists and students in the field of
the Afan Oromo language might employ the outcome of this research to parse phrases with the
language's decision tree. The result could also be applied in language instruction to identify
phrasal categories and understand the relationships between words in a phrase.
1.6. Scope of the Study
The main focus of this work is the design and prototype development of a top-down chart
parser for Afan Oromo sentences. The prototype will be designed by studying the word
classes of the Afan Oromo language and the types of sentences and their construction.
However, it is not within the scope of this work to incorporate the parser into higher-level
NLP applications, such as grammar checkers or question answering, as a component.
1.7. Limitation of the Study
This study has the following limitations:
1. The study did not incorporate all kinds of Afan Oromo sentences with
their attributes such as case, number, gender, person, and tense.
2. The prototype used a manually annotated morphological analysis
prepared for the purpose of this study. This is due to the lack of the source code
of the previously developed morphological analyzer for Afan Oromo.
3. The tree sentences included in the sample do not exhibit tree noun phrases or
interrogative sentences.
4. Moreover, the researcher of this study believes that the size of the corpus
used is still very small.
CHAPTER TWO
REVIEW OF LITERATURE
2.1 Introduction
The primary goal of the study is to build and implement an Afan Oromo sentence parser
for decision tree grammar sentences, as already mentioned in the first chapter. When
developing parsers to examine how the syntactic structure of sentences can be computed, it is
common practice to take into account both the grammar and the parsing technique. While the
parsing approach is a way of examining a sentence to ascertain its structure by using the
grammar as the source of syntactic knowledge, the grammar is a formal statement of the
structures permitted in the language. This chapter discusses several sentence parsing methods
and strategies. An outline of decision grammar sentence parsing and its assessment standards
is provided in the first section. The second section reviews the various methods and
procedures for the task of decision sentence parsing. The lexicon and grammatical rules, or
the knowledge needed by the parser, are covered in the third portion of this chapter. In the last
section, several prior Afan Oromo NLP research projects that are connected to this study are
summarized.
2.2 Decision tree grammar Sentence Parsing (DTP)
A natural language system must use knowledge of the grammatical structure of the language,
such as what words are, how they are put together to make sentences, what they mean, how
word meanings affect sentence meanings, and so forth [19].
The word "parsing" is derived from the Latin phrase "pars orationis" (part of speech), and it
describes the act of assigning each word in a sentence a part of speech (such as noun or
adjective) and organizing the words into phrases (Allen [19]).
Parsing can be done at the word or sentence level in natural language processing. Word
parsing tokenizes a word into its constituent parts, or individual morphemes. The word is
tokenized into morphologically sound parts, which are then examined further to determine
how they contribute to the classification and meaning of the entire word [2][20].
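Word-level parsing of this kind can be illustrated with a greedy suffix stripper. The suffix list here is purely hypothetical, chosen only to make the example run; it is not the thesis's Afan Oromo morphological rule set.

```python
def tokenize_morphemes(word, suffixes):
    """Greedy suffix stripping: split a word into a (stem, suffix) pair.

    A toy illustration of word-level parsing; real morphological analysis
    applies language-specific rules, which this sketch does not encode.
    """
    # Try longer suffixes first so 'ee' wins over 'e'.
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf):
            return word[:-len(suf)], suf
    return word, ''

# Hypothetical suffix inventory, for illustration only.
stem, suffix = tokenize_morphemes('bitee', ['ee', 'e'])
print(stem, suffix)   # bit ee
```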
In sentence parsing (also known as syntactic parsing), grammatical rules are combined in
various ways to produce a tree that represents the structure of the input sentence. In other
words, according to Allen [19], it is an NLP task in which a flat input sentence is transformed
into a hierarchical structure that is consistent with the units of meaning in the sentence. The
parser receives the input string token by token. For each token, the parser calls the
morphological analyzer, which breaks words down into their roots and affixes in accordance
with the morphological rules of the language (Afan Oromo). Roots and affixes are maintained
in a lexicon, which is made up of a collection of records relating various kinds of linguistic
data. The parse tree, a diagrammatic representation of the input text, keeps track of the
linguistic rules and how they are applied. According to Allen [19], each node of a parse tree
corresponds to either an input word or a non-terminal in the grammar. A different grammatical
rule is applied at each level of the parse tree, while the final terminal symbols are linked to the
input words via their lexical categories.
'Gaangeen, Tolaan kalessa bitee hara ganama duute'
("The mule that Tolaa bought yesterday died this morning")
(S
  (NP (NP (N Gaangeen))
      (ADVP (NP (N Tolaa)) (ADVP (ADV kaleesa) (V bitee))))
  (VP (ADVP (ADV hara) (ADP ganama))
      (VP (V duute))))
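The bracketed analysis above can be loaded directly as an NLTK tree, which is convenient for inspecting labels and leaves programmatically (the `ADP` tag is kept exactly as it appears in the source parse):

```python
import nltk

# The bracketed analysis from the text, as a single string.
s = """(S
  (NP (NP (N Gaangeen))
      (ADVP (NP (N Tolaa)) (ADVP (ADV kaleesa) (V bitee))))
  (VP (ADVP (ADV hara) (ADP ganama))
      (VP (V duute))))"""

tree = nltk.Tree.fromstring(s)
print(tree.label())    # root non-terminal
print(tree.leaves())   # the input words in order
```

Calling `tree.productions()` on such a tree is also how the grammar rules are harvested from a hand-parsed corpus.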
Sentence parsers employ various methods, which can be broadly divided into rule-based and
statistical methods. These strategies are covered in the sections that follow.
2.3 Approaches to Decision Tree Grammar Sentence Parsing
distinctive grammatical behavior. The lexicon is essential because the parser uses this
dictionary to parse sentences into syntactic tree structures as soon as it gets input tokens
(strings). The lexicon includes a list of every lexical category to which the word might be
assigned.
In a rule-based approach, morphological rules are also helpful. The morphological rules offer
information that can be used to handle words that are not in the parser's dictionary. In other
words, these rules can be used to reasonably infer the grammatical categories of unknown
words [7]. Two parsing methods can be used in the rule-based approach: top-down and
bottom-up parsing [7].
2.3.2 Stochastic Approach to Decision tree Sentence Parsing
Probability (sometimes known as statistics) is used by stochastic-based parsers to analyze the
parsing problem. The Markov assumption in sentence parsing, the Bayes (Network) theorem,
and independent events are the foundations of the stochastic approach, often known as the
corpus-based approach. These ideas are used in the approach to identify each word's most
likely lexical sequence within a given sentence [7]. The corpus-based technique can be
further divided into supervised and unsupervised approaches depending on the type of text
corpora used [19]. Unsupervised approaches use natural corpora, such as those found in
books and newspapers, while supervised approaches use annotated text corpora.
Systems for decision tree grammar syntactic analysis constructed using the supervised
technique are known as supervised parsers, and they use probability (i.e. statistics) to study
the parsing problem. In a supervised parser, the two key information sources are the lexicon,
which contains every word together with every potential lexical category and its estimated
lexical probabilities, and the list of contextual probabilities for each lexical category. The
list of contextual probabilities indicates the proper lexical category for a given
context [19].
Lack of manually parsed text (corpora) and the requirement for manual parsing each time
the parser is applied to a new text are the two main issues in developing supervised
parsers [19]. Manual parsing is very expensive and time-consuming, but if pre-tagged
corpora are widely accessible, stochastic parsers in general, and the Hidden Markov
Model (HMM) technique in particular, can be easily adapted to new languages.
The training process for parsers created utilizing unsupervised stochastic approaches does not
require any pre-tagged material. The syntactic analysis technique was developed using some
heuristics or probabilistic data obtained from the corpus [7] [3]. These parsers share a
characteristic with their supervised counterparts in that they both make the HMM
assumption. HMM is a set of states (in this case, lexical categories) with directed edges and
transition probabilities that show the likelihood of shifting to the state at the end of the
directed edge, assuming that one is currently in the state at the start of the edge. The states
are also labelled with a function which indicates the probabilities of outputting different
symbols if in that state (while in a state, one outputs a single symbol before moving to the
next state).
In this case, the symbol output from a state/lexical category is a word belonging to that
lexical category [3]. However, the unsupervised stochastic parser has such unique
features as training on unparsed or fresh text, use of the Baum-Welch algorithm
(which is different from the Viterbi algorithm), and so on.
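To make the HMM machinery above concrete, the sketch below decodes a most likely tag sequence with the Viterbi algorithm, as in the supervised case (an unsupervised parser would instead re-estimate these tables with Baum-Welch). The toy tag set and all probabilities are hypothetical illustrations, not estimates from any real Afan Oromo corpus.

```python
# A minimal Viterbi decoder over a toy two-state HMM (tags N and V).
# All probabilities below are hypothetical, for illustration only.
states = ["N", "V"]
start = {"N": 0.7, "V": 0.3}                      # P(first tag)
trans = {("N", "N"): 0.3, ("N", "V"): 0.7,        # P(next tag | current tag)
         ("V", "N"): 0.6, ("V", "V"): 0.4}
emit = {("N", "nama"): 0.5, ("N", "farda"): 0.5,  # P(word | tag)
        ("V", "deeme"): 0.6, ("V", "bite"): 0.4}

def viterbi(words):
    """Return the most likely tag sequence for `words`."""
    # best[t] = (probability, path) of the best path ending in tag t
    best = {t: (start[t] * emit.get((t, words[0]), 0.0), [t]) for t in states}
    for w in words[1:]:
        best = {t: max(((p * trans[(s, t)] * emit.get((t, w), 0.0), path + [t])
                        for s, (p, path) in best.items()), key=lambda x: x[0])
                for t in states}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["nama", "deeme"]))  # → ['N', 'V']
```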
2.3.3 Parsing Strategies
Several approaches have been put forward to address parsing-related issues such as where to begin, how to
examine a string or a rule's right-hand side (RHS), and how to consider alternatives. NLP researchers
offered many approaches as answers, including top-down, bottom-up, left-to-right, right-to-left,
depth-first, breadth-first, and chart parsing. The subsections that follow describe a few of
the most significant solutions offered at various points in time.
2.3.3.1 Top-down vs Bottom-up
Top-down and bottom-up are competing ideas that have been proposed as alternatives to
address the strategy issue regarding the course of the parsing procedure. Top-down parsing
starts with the start symbol, which is typically a sentence S, and applies the grammatical
rules forward until the symbols at the tree's terminals represent the sentence components
being parsed. As an illustration, if the rule S → NP VP is applied and the parser begins in
state (S), the symbol list will be (NP VP). The rule NP → ART N is then applied, giving the
symbol list (ART N VP), and so on. The parser might recursively proceed in this way until it
reaches the terminal symbols, at which point it may check the input sentence to see whether
the word classes within it correspond with the written sequence of terminal symbols [19]. Top-down parsers
are the term used to describe parsers created in this manner. To determine its next step, this
parser forms an assumption about what it is looking for. Thus, a top-down parser is
distinguished by a series of objectives to ascertain the remaining words.
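The goal-driven behavior described above can be sketched as a tiny recursive top-down recognizer. The grammar and lexicon below are illustrative toys, not the thesis's actual Afan Oromo grammar.

```python
# Minimal top-down recognizer: expand from S and try to match the input.
# GRAMMAR and LEXICON are small hypothetical samples for illustration.
GRAMMAR = {"S": [["NP", "VP"]], "NP": [["N"]], "VP": [["ADV", "V"], ["V"]]}
LEXICON = {"Tolaan": "N", "kaleessa": "ADV", "dhufe": "V"}

def parse(symbols, words):
    """True if `symbols` can derive exactly `words` (depth-first, top-down)."""
    if not symbols:
        return not words
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:                          # non-terminal: expand its rules
        return any(parse(rhs + rest, words) for rhs in GRAMMAR[first])
    # terminal category: match the next word's lexical category
    return bool(words) and LEXICON.get(words[0]) == first and parse(rest, words[1:])

print(parse(["S"], ["Tolaan", "kaleessa", "dhufe"]))  # → True
```

Note how the parser's "goals" are exactly the pending symbols it still expects to find.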
Contrarily, bottom-up parsing starts with the sentence to be parsed and applies the
grammar rules backward until a single tree has been formed [3], whose terminals are the
sentence's words and whose top node is the start symbol (often S, for sentence). To put
it another way, it begins with each word and assigns its grammatical category up until
the start symbol. The highest-level label sequence is used as the new string in this
process, which is repeated for each state. The task of the parser would now appear to be
that of attempting to group words into their respective categories together (e.g. take a
sequence ART ADJ N and identify it as an NP) in a manner permitted by the grammar.
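The data-driven strategy above corresponds to a shift-reduce sketch: word categories are shifted onto a stack, and the stack is reduced whenever its top matches a rule's right-hand side. Again, the rules and lexicon here are hypothetical samples, not the thesis's grammar.

```python
# Minimal bottom-up (shift-reduce) sketch. RULES and LEXICON are toy samples.
RULES = [("NP", ["ART", "ADJ", "N"]), ("NP", ["N"]),
         ("VP", ["V"]), ("S", ["NP", "VP"])]
LEXICON = {"Tolaan": "N", "dhufe": "V"}

def shift_reduce(words):
    stack = []
    for w in words:
        stack.append(LEXICON[w])                  # shift the word's category
        reduced = True
        while reduced:                            # reduce greedily while possible
            reduced = False
            for lhs, rhs in RULES:
                if stack[-len(rhs):] == rhs:
                    stack[-len(rhs):] = [lhs]     # replace RHS with LHS
                    reduced = True
                    break
    return stack

print(shift_reduce(["Tolaan", "dhufe"]))  # → ['S']
```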
Top-down methods have the advantage of being highly predictive. This means that a word
might be ambiguous when considered in isolation, but if some of those grammatical categories
cannot be used in a legal sentence, then these categories may never even be considered
[19].
Large constituents may need to be constructed repeatedly when they are used in other rules,
which is a severe redundancy-of-effort issue with this method. The bottom-up parser, in
contrast, only builds each element precisely once and examines the input phrase once. The
bottom-up parser operates from left to right, so the first thing to note about it is that it exhausts all
of its options with that item before moving on to the next, and so forth. In other words, the
parser builds successive layers of syntactic abstraction based on the data provided, and it is fully
driven by the data presented to it.
Sadly, Allen [19] claims that whether a top-down or bottom-up implementation is chosen, the
cost is unaffordable, because the parser would tend to repeatedly try the same matches,
duplicating a lot of its work. So, there should be a method that enables the parser to save the results
of the matching it has already performed in order to avoid such redundancy problems. This
method is known as chart-based parsing.
As a result, combining the two approaches may produce a better parser. A minor adjustment to
the bottom-up chart algorithm results in a method that is predictive like top-down approaches
while avoiding any work redundancy as in bottom-up approaches.
2.3.3.2 Left-to-right vs Right-to-left
These are the opposing answers that have been put up in response to the query about the
proper order to examine substrings of an RHS. In contrast to right-to-left (i.e., end-to-
beginning) parsing, left-to-right parsing processes the words of the sentence from left to right
(i.e., from beginning to end). In other words, it starts with the leftmost symbol and moves on
to the next symbol on its right. The parser will eventually function either way, so logically
it may not matter which direction the parsing process takes [25]. Compared to left-to-right
parsing, however, right-to-left parsing is perhaps less intuitive. Still, there are times when
employing both tactics is advantageous.
If the sentence is corrupted, for example by the presence of a misspelled word, using a parsing
technique that combines left-to-right and right-to-left strategies may be helpful. The text to
the right of the error can then still be parsed. The top-down method has
trouble with rules that exhibit left recursion when applied from left to right [25]. Left
recursion happens when the first category on the RHS of a rule is the same as the category
on the LHS (Left Hand Side). In this case, it is possible to transform a left-recursive grammar
into an equivalent grammar that does not employ left-recursive rules and yields the same set
of strings (although it will not assign the same structures).
2.3.3.3 Depth-first vs Breadth-first
The chart is a data structure for representing fragments of the parsing process in a way that they can
be utilized again in the future. An n-word sentence's chart is made up of n+1 vertices and a number of
edges that connect the vertices. A chart parser is a type of parsing algorithm that keeps a table of all
the well-formed substrings it has so far discovered in the text it is parsing. Although a variety of
parsing algorithms can utilize chart approaches, they have often been applied to a specific bottom-up
parsing algorithm [12].
This method's key premise is that increasing parsing efficiency is essential. There are three
considerations to keep in mind for chart-parsing efficiency, as stated by Russell and Norvig [28]: do not
do twice what can be done once, do not do once what can be avoided entirely, and do not
represent distinctions that are not needed.
Chart parsers keep track of every constituent that has been retrieved from the sentence so far.
In other words, it keeps track of rules that have matched but are not fully satisfied while storing
the intermediate results. More specifically, it is advised to record the results in a data structure
known as a chart once it is realized that "reenfi tiskee hoolota loolaan ajjesee," "'the body of the
shepherd that was killed by flood,'" is a tree NP as it is used in the sentence "reenfitis kee
hoolota loolaan ajjesee gara hospitalaatti ergame," "The body of the shepherd Dynamic
programming techniques that prevent duplicate work include recording interim outcomes.
A chart-based parser's fundamental process entails joining an active arc (also known as an edge) with
a finished constituent. Either a new finished constituent or a new arc that extends the initial active arc
are the outcomes. New complete constituents are kept on an agenda list until they are themselves
added to the chart. For instance, [0, 5, S → NP VP •] indicates that an S covering the string from
position 0 to position 5 is made up of an NP followed by a VP. The numbers here index positions in
the input string.
What has been discovered thus far and what remains to be discovered are distinguished by the symbol •
in an edge. The numbers before the symbol S indicate the span of the input the edge covers.
Edges that end in a • are referred to as complete edges. A parser would have an S if it could discover a VP to
follow the edge [0, 2, S → NP • VP], which states that an NP spans from 0 to 2. Active arcs are edges
like this with a dot before the end.
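The dotted-edge notation can be captured with a small record type. The sketch below is only a data-structure illustration, not a full chart parser; the field names are my own.

```python
# An edge [start, end, LHS -> found . todo]: `found` holds the RHS symbols
# already matched (left of the dot), `todo` the symbols still needed.
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    start: int      # left vertex of the span covered so far
    end: int        # right vertex of the span covered so far
    lhs: str        # category being built
    found: tuple    # RHS symbols to the left of the dot
    todo: tuple     # RHS symbols to the right of the dot

    def is_complete(self):
        return not self.todo    # complete edge: dot at the end

    def __str__(self):
        parts = list(self.found) + ["."] + list(self.todo)
        return f"[{self.start}, {self.end}, {self.lhs} -> {' '.join(parts)}]"

active = Edge(0, 2, "S", ("NP",), ("VP",))   # an active arc
full = Edge(0, 5, "S", ("NP", "VP"), ())     # a complete edge
print(active, active.is_complete())  # [0, 2, S -> NP . VP] False
print(full, full.is_complete())      # [0, 5, S -> NP VP .] True
```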
The same constituent is never produced more than once, making chart-based parsers more
effective than search-based ones. To parse a sentence of length n, a pure top-down or bottom-up
search strategy could need up to C^n operations, where C is a constant that depends on the
particular algorithm utilized. This exponential growth quickly renders the method useless, even if C
is relatively small [19][16][29].
A chart-based parser, however, is said to need at most K·n³ time and space, where n is the sentence's
length and K is a constant dependent on the algorithm, to build every constituent of every lexical category
between every pair of positions. It significantly decreases the number of parsing operations as a result. To parse an n-word
phrase using chart parsing, create a chart with n+1 vertices and add edges one at a time,
attempting to generate a full edge that spans from vertex 0 to n and is of category S. There is no
going back; whatever entered into the chart remains there. In general, there are two distinct
problems that need attention. The first covers strategies for increasing the effectiveness of
parsing approaches by decreasing the search but leaving the end result the same, and the second
involves methods for picking between many interpretations that a parser might be able to
identify. The following strategy is typically used to achieve the first. As already mentioned,
the plain bottom-up method fails to store any intermediate findings. That is the main reason
for its time-wasting behavior of repeatedly checking things that have previously been
checked and cannot change. This might qualify as amnesic!
Each new category is examined to determine whether it completes an RHS, and each new neighboring pair
of categories is examined likewise. The solution now needs to keep track of which categories
the parser has seen, which makes it slightly more complex to implement [19]
[28] [29].
Although any parser must store some states in order to remember what it is doing at any given
time, chart parsers in particular must remember the multiple hypothesis states that are currently
being considered. This issue of storing intermediate results is independent of the distinctions
already discussed. The secret to effective parsing also turns out to be the storage of interim
findings.
The intermediate findings discovered during a parse are encoded by chart-based parsers
using a chart-based data structure [19] [28][27].
Using strategies that express uncertainty can help parsers be more effective because they
won't have to make a hasty decision only to change their minds later. Instead, the
uncertainty is carried forward through the parse until all but one of the possibilities are
eliminated by the input. The effectiveness of the method presented here stems from the fact
that all potential outcomes are taken into account beforehand, and the data is saved in a
table that controls the parser, allowing for significantly faster parsing methods. It is clear
that chart parsers outperform all other parsers in terms of efficiency. To prevent effort
redundancy, they encode interim results. Moreover, the chart parser anticipates the word
category of unidentified terms (those that are not in the knowledge base) and encodes
uncertainty to handle ambiguity. As a result, Allen's chart parser [19] [30] [31] will be used in the
current investigation to predict the category of unknown words and to resolve uncertainty.
Typically, a lexicon is organized as a list of lexical entries, such as ("pig" N V ADJ). In
addition to its common usage as a noun ("Jane pigged herself on pizza"), "pig" can also be
used as a verb and an adjective ("pig iron"). A lexical entry will typically include more
details about the functions a word performs, such as feature information: whether a
verb is transitive, intransitive, or bi-transitive, etc., or what form the verb takes, such as
present participle or past tense, etc. [12]. Allen [19] contends that as long as a lexicon is
supplied, a grammar need not include any lexical rules of the kind N → flower. Abebe [2]
illustrates a straightforward decision tree grammar lexicon for Afan Oromo as follows.
N → Nama, V → Deeme
In this illustration, the words on the right side are classified by POS symbols on the left.
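In code, such a lexicon is naturally a mapping from word forms to their possible lexical categories, in the spirit of the ("pig" N V ADJ) entry above. The entries below are a small illustrative sample, not a full lexicon.

```python
# A lexicon as a mapping from word form to its possible lexical categories.
# Entries are illustrative only.
LEXICON = {
    "pig": ["N", "V", "ADJ"],
    "nama": ["N"],    # Afan Oromo: 'man/person'
    "deeme": ["V"],   # Afan Oromo: 'went'
}

def categories(word):
    """Return all lexical categories a word may take; [] if unknown."""
    return LEXICON.get(word.lower(), [])

print(categories("pig"))  # → ['N', 'V', 'ADJ']
```

A richer lexicon would attach feature information (transitivity, tense, number) to each category, as described above.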
2.4.2 The Grammar Rule
The formal description of the rules and syntax that a language can use is known as grammar. The
most typical way to portray grammars is as a collection of grammar rules that generalize to group
words into what are frequently referred to as "parts of speech" or grammatical categories. Several
linguistic theories are based on grammar rules, and many natural language comprehension systems
are built on top of these ideas [7]. Oromo grammar is organized in an LR (left-to-right) table. This
is a basic grammar example for the phrase 'Tola went to school,' 'Tolaan gara mana barumsaa demee':
VP => V
P => gara
N => mana barumsaa
V => demee
There are various grammar specifications, often known as grammatical formalisms. The most
popular and widely used formalisms are Probabilistic Context Free Grammars [19], Decision
Tree Grammar [34], Transformational Grammar [34] by Chomsky, Transition Network Grammars
[35] by Woods, and Unification Based Grammar [36] by Kay. As a result, the grammar rules will
alter based on the theoretical foundation of the particular grammar.
2.4.2.1 Context-Free Grammars
Context-free grammars (CFGs) are those that are made up solely of rules with a single symbol on the left-hand
side. A CFG is a formal system that delineates how every legal text can be derived from
a distinctive symbol known as an axiom, or sentence symbol, in order to represent a language. A
CFG rule can only be non-monotonic if its right-hand side is empty since CFG rules must be
monotonic. CFGs are crucial for two reasons: first, the formalism is strong enough to capture the
majority of natural language structure and, second, it is sufficiently constrained to enable the
development of effective parsers for sentence analysis [19]. This formalism is made up of a collection
of productions, each of which asserts that a certain symbol may be rewritten as a specific pattern of
symbols. One such production, S → NP VP, asserts that S can be replaced by the sequence NP VP.
NP and VP are in turn replaced by sequences of symbols (for example, NP → Adj N
and VP → V NP).
Non-terminals, also known as symbols that need to be replaced, are always represented by identifiers,
which are collections of letters and digits. In at least one production, every non-terminal must come
before a colon. The axiom is a non-terminal that only ever appears before the colon and never
between the colon and the period in any production. There must only be one non-terminal that
satisfies the requirements of the axiom. Terminals are symbols that cannot be changed; they can be
expressed by identifiers or literals (which are a sequence of characters bounded by apostrophes).
2.4.2.2 Transition Network Grammars
One of the most often used formalisms for developing natural language
grammars is the transition network [37], which is a presentation of regular (or finite-state)
grammar. The network is a directed graph with terminal symbols for arc
labels (words or word categories). The graph's start state is represented by one node, while
the final state is represented by one or more nodes. The assumption is that if there is a path
from the start state to some final state such that the labels on the arcs of the path match the words
of the sentence, a sentence is in the language defined by the network.
2.4.2.3 Context-Sensitive Grammars
Context-sensitive grammars have rules of the form x → y, where x and y are strings of alphabet
symbols, with the restriction that length(x) <= length(y).
Equivalently, their rules can be written in the form A → y / x _ z, where A is a non-terminal
symbol, y is a sequence of one or more terminal and non-terminal symbols, and x and z are
sequences of zero or more terminal and non-terminal symbols.
The meaning of the latter rule (or production) is that A can be rewritten as y if it appears in the
context 'x _ z', i.e. immediately preceded by the symbols x and immediately followed by the
symbols z [37]. Context-sensitive grammars are more powerful than CFGs though the former
kinds of grammars are much harder to work with than the latter [12].
2.4.2.4 Unification-based Grammars
The term "unification-based grammars" refers to a grammar formalism that extensively uses feature
structures (such as case, gender, and tense), including the values reflected in the lexical entries of
words. The process of unification operates on these feature structures (i.e. the entire grammar can be
specified as a set of constraints between feature structures). The unification-based grammars, of
which DTGs are the most prevalent, can be supported by CFGs or any of the grammar formalisms
mentioned above. According to Joshi (NY), mentioned in Daniel [3], recursion can be embedded in
the feature structures, which is the major cause of the unification-based grammars' overwhelming
power.
2.4.2.5 Probabilistic Decision Free Grammars (DTFG)
DTGs can be generalized, just as Finite State Machines could be, to the probabilistic case. This
can be done by gathering some usage statistics for rules. That is, by simply counting the instances
of each rule in a corpus of parsed phrases and estimating the chance of each rule being utilized
using this statistical data. The likelihood of utilizing rule Rj to derive a category C from m
grammar rules with left-hand side C can be calculated given the category C and the m rules.
Pr(Rj | C) = count(#times Rj used) / Σ_{i=1..m} count(#times Ri used)
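The estimation above can be sketched directly: count each rule in a parsed corpus and normalize by the total count of rules sharing its left-hand side. The counts below are hypothetical stand-ins for a real treebank.

```python
# Estimate rule probabilities from (hypothetical) rule counts:
# Pr(Rj | C) = count(Rj) / sum of counts of all rules with LHS C.
from collections import Counter, defaultdict

rule_counts = Counter({
    ("S", ("NP", "VP")): 100,
    ("NP", ("N",)): 70,
    ("NP", ("ADJ", "N")): 30,
    ("VP", ("V",)): 60,
    ("VP", ("NP", "V")): 40,
})

def rule_probs(counts):
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n                      # total uses of each LHS
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

probs = rule_probs(rule_counts)
print(probs[("NP", ("N",))])  # → 0.7, i.e. 70 / (70 + 30)
```

The probabilities of all rules sharing one left-hand side sum to 1, as the formula requires.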
Such Decision Tree Grammars together with their associated rule probabilities are referred to as the
Probabilistic Decision Free Grammar (DTFG) formalism. So, a typical DTFG grammar based on a parsed version
of a given corpus contains counts for each LHS, counts for each rule, and a probability for each rule.
A DTFG is hence typically defined as a four-tuple (W, N, S, R), where W is a set of terminal symbols
(words), N is a set of non-terminal symbols, S (a member of N) is the start symbol, and R is a set of
rules, each with an associated probability.
2.5 Related NLP in Parsing
According to Win Win Thant, Tin Myat Htwe, and Ni Lar Thein [14], the challenge of assigning
function tags and context free grammar (CFG) to parse Myanmar phrases was addressed using
Naive Bayes. Due to Myanmar's free-phrase-order and grammatical morphological system,
statistical function labeling for Myanmar sentences can be difficult. Function tagging was utilized
as a pre-processing step before parsing. Assigning function tags to nodes in a syntactic parse tree is
a task that Mihai Lintean and Vasile Rus [29] outlined using two machine learning techniques,
naive Bayes and decision trees. They made use of a number of features inspired by Blaheta and
Johnson [39]. The set of classes they utilized in their model is identical to the collection of
function tags in the Penn Treebank.
By using numerous dependence rules and segmentation, Yong-uk Park and Hyuk-chul Kwon [4]
attempted to disambiguate for a syntactic analysis system. Parsing involves segmentation. If there
are no syntactic connections between two adjacent morphemes, the syntactic analyzer creates a
new segment between the two morphemes, finds all potential partial parse trees of that
segmentation, and combines them into full parse trees.
The study used an intelligent (Rule-Based+ learning module) technique to create a prototype for the
language, an easy-to-use Oromo parser. It briefly explains the steps involved in the automated
sentence parsing of free texts. In other words, the goal was to create a prototype and use it for an
experiment. On the training set, the accuracy was 95%, and on the test set, it was 88.5%.
2.7 Related NLP Component Systems
2.7.1 Morphological Analyzer
Recognizing and distinguishing specific word forms from the input text is the first stage in every
NLP task [31]. A lexicon that simply lists all word forms along with their part of speech and
inflectional information, such as number and tense, can provide this information in some languages,
such as English. The number of forms that must be listed in such a lexicon is manageable because
such languages have an inflectional system that is relatively straightforward. However, for many
other highly inflectional languages, such as Afan Oromo, where each noun or verb has a number of
inflected forms, a full lexical listing is just not possible.
This is due to the fact that each lexical word may have literally thousands of unique surface forms, each
with different inflectional characteristics but identical vocabulary parts overall [40]. As a result, NLP for
these languages would only be useful if it included a morphological analyzer that could compute the
parts-of-speech (POS) and inflectional categories of words using the morphological information of the
language [41][39].
Hence, a morphological analyzer is a key component required to break down words into their
morphemic components as well as to identify the word classes (such as noun, verb, etc.) into which a
specific word may belong before the work of parsing is completed. It involves the rules required to treat
words that are not in the parser's vocabulary and to produce information that is useful.
In other words, these rules can be used to make educated guesses about the
grammatical categories of unknown words. Moreover, a morphological analyzer
might be beneficial to a part-of-speech tagger (POST), a key element of a syntactic
parsing system. The next part, which discusses this second crucial element of a sentence
parser, provides a basis for including a morphological analyzer in this study.
To this aim, Abebe created a prototype morphological analyzer for Afan Oromo [2]. By
removing prefixes, stems, and suffixes from a given corpus, he created a morphological
dictionary (also known as a signature) using a rule-based method for an Afan Oromo decision
tree morphological synthesizer.
In this study, it is assumed that the results from the prototype morphological analyzer for Afan Oromo created
by Abebe [2] won't have a big impact on how the input text is preprocessed before it is sent to the parser
together with the other NLP component system. Thus, this study will use manually processed words.
2.7.2 Part-of-Speech Tagger
A POS tagger, an NLP system that automatically assigns the potential parts-of-speech categories to a
given word in a sentence, is the other important and fundamental portion of a sentence analysis
system. Since a POST entails recognizing the syntactic categories of words in a text, one of the main
reasons for implementing POST into a given automatic sentence parser is to eliminate improbable
parses (false analyses of a sentence). That is, if we can correctly assign the POS tags, a given
statement, such as 'Gaangeen Tolaan kalessa bitee hara ganama du'e,' 'The mule that Tola bought
yesterday died this morning,' will become clear.
CHAPTER THREE
3.1 Introduction
The word classes, phrases, and sentences in Afan Oromo are discussed in this chapter since
each of these units has an effect on the current topic. Nonetheless, the paper starts with a
basic and condensed explanation of the lexical categories of the language before delving into
the grammatical categories of the language. In traditional grammar, the lexical categories
fall under what are known as parts of speech. However, the paper tends to use the term
grammatical categories, since it is more comprehensive. We employ lexical categories to describe
individual words and non-lexical or phrasal categories to describe different kinds of phrases
in order to distinguish between words and phrases.
The lexical categories covered in this chapter include conjunctions, adverbs, adjectives,
verbs, and verb tenses. Although they are treated independently, pronouns are nevertheless
classified as nouns. This chapter also covers other Afan Oromo words, such as
interjections and numerals. The chapter opens with a quick overview of Afan Oromo's
writing system and punctuation marks in order to aid in comprehending this portion.
The analyses and discussions in this chapter are based on information culled from Diriba [7],
Abebe [42], Baye [43] [44] [45], Askale [46], Tilahun [47], and Girma [10]. These sources can
be consulted for more information about the topic.
3.2 Afan Oromo Alphabet and Writing System
The Afan Oromo writing system is a modification of the Latin writing system. Thus, the
language shares many features with English writing, with some modifications. Fortunately,
the study benefits from the Afan Oromo alphabet, commonly called 'qubee
Afan Oromo,' which has been designed and used by language experts in the area.
The writing system of the language is straightforward and is designed based on the Latin
script. Thus, the letters of the English language also exist in Afan Oromo, though some are
written differently. Any literature pertaining to the language will provide a full description
of the Afan Oromo writing system; however, readers are advised to consult Diriba [7] and
Girma [10] for a more in-depth analysis of the writing system.
27
3.3 Punctuation Marks in Afan Oromo
Analysis of Afan Oromo literature shows that Afan Oromo punctuation marks follow the same
pattern as English and other languages that use the Roman writing system. A statement ends
with a period (.), an interrogative sentence with a question mark (?), and a command or
exclamatory sentence with an exclamation mark (!). A comma (,) separates the items of a
list of concepts, names, things, etc. within a sentence, and a semicolon (;) separates
clauses.
3.4 Word Categories in Afan Oromo
The grammatical categories of Afan Oromo have improved over time in terms of word
categories and other syntactic aspects, much like the grammatical categories of other languages,
like English, for instance. As a result, the language has now categorised and summarised the
eight conventional grammars into five groups. The following eight categories are used to
classify Afan Oromo words in conventional word categories (or Grammatical Categories). They
include the pronoun, conjunction, interjection, and the noun, verb, adjective, adverb, and ad
position.
Afan Oromo words are divided into five groups by contemporary syntacticians like Baye
[43] [44] [45], who place pronouns and adjectives under the noun category, conjunctions
under the adposition (pre- and postposition) category, and adverbs together with verbs.
Adjectives and adverbs are classified under the same lexical category by some, such
as Askale [46]. In any instance, there are five syntactic subcategories that operate as the
phrase's heads. The language has five grammatical categories, each of which is headed by
five word categories, according to the aforementioned classification. Interjections, which are
"words" without syntactic functions, are not taken into account as grammatical categories in
this classification.
The classification system created by Baye [43] [44] [45] is used in the current investigation.
This is so that the parser that will be created by this study doesn't become redundant due to
the typical classification scheme's repetition. Instead, a subcategorization system is employed
to make the grammar rule more condensed and expressive. As a prelude to the tasks to be
completed in chapters four and five, which are the core and primary contributions of this
thesis in the field, the following portions of this chapter dig into the discussion of the
grammatical categories of Afan Oromo.
3.4.1 Categories of Nouns
Afan Oromo's definition of a noun is comparable to that of other languages like Amharic. Nouns
in the Afan Oromo language are used to name or identify specific instances of things,
persons, places, or concepts. The Afan Oromo noun categories used in this study are
nouns, adjectives, and pronouns. In the following sentence, the position held by the word
'Fardi' ('horse') is regarded as a noun position: 'Fardi marga dheeda' ('The horse
grazed grass'). Moreover, two numbers, singular and plural, are recognised in Afan
Oromo nouns. A plural noun is marked by a variety of forms, while a singular noun is marked by
a zero morpheme. The instances that follow serve as illustrations.
Singular Plural plural marker
3.4.2 Categories of Verbs
The discussion of this section is based on the information collected from Baye [43] [44] and
Askale [46] and Abebe [42]. These works consist of all the information required by the
current study. Verbs are forms which occur in clause final positions and belong to a distinct
category from that of nouns. For example, in the following sentences:
Caalaan farda bite. "Chala bought a horse"
Leensaan dhufte. "Lensa has come"
Tulluun dheeraadha. "Tullu is tall"
The italicized parts are all verbs. Baye [45] divides verbs into anumber of sub categories
based on the type of constituents they are associated with. These are intransitive, transitive,
modals and auxiliaries verbs. The intransitive verbs are those verbs which do not take any
phrase as their complement. For example in the sentence ‗Abbabaan furdate‘ (Abebe got fat),
‗furdate‘ ―got fat‖ is an intransitive verb which has no complement. There is also what Sag
and Wasow [48] call strictly transitive verb. These types of verbs are those which take one
complement in Afan Oromo. Fore xample,
Inni [teechuma] NP cabse ―he broke the chair‖
Caalaan [mana] NP bite ―Chala bought a house‖
The NP in these two examples are complement to the verbs ‘cabse’broke and ‗bite’ bought.
For the detailed treatment of these sub categorizationssee Baye [45].
3.4.3 Categories of Adverbs
Afan Oromo adverbs are words used to modify verbs. Adverbs usually precede the verbs they modify or describe. Example:
Tolaan kaleessa dhufe. "Tola came yesterday"
In this example, the adverb 'kaleessa' "yesterday" precedes the verb 'dhufe' "came" that it modifies. However, it should be noted that not every word that comes before a verb is an adverb. For instance, in 'muka cabse' "broke wood", the word 'muka' "wood" precedes the verb 'cabse' "broke". In this case the word 'muka' is a noun and is in turn modified by the verb 'cabse'. Hence, the verb functionally shares the feature of an adjective (modifier). There are different types of adverbs: adverbs of time, place, manner, frequency, degree, etc. In general, adverbs are treated as a subclass of verbs. Days of the week in the Afan Oromo language may also be used either as nouns or as adverbs.
3.4.4 Adpositions in Afan Oromo
The term adposition refers to words which have meaning only when they are attached to, or used together with, other words such as nouns, verbs, pronouns and adjectives. Adpositions are characterized by having no inflectional or derivational morphology and belong to a closed system.
Adpositions can appear:
As simple adpositions that stand alone as separate words.
Examples: Toleraa walin "with Tolera"
Gara mana "to house"
As simple adpositions prefixed or attached to other words (e.g. nouns and verbs).
Examples: harka-an "by hand"
Ummata-f "to/for the public"
As compound adpositions consisting of two parts, an adpositional prefix and a postposition placed after the noun. The postpositions can either be single adpositions that stand on their own or adpositions not separated from a noun.
Example: sanduqa gubba-rra "on top of the box"
3.4.5 Numerals
These are words representing numbers. They can be cardinal or ordinal numbers. A list of the Afan Oromo cardinal numbers is found in Hamiid [49]. In Afan Oromo, the ordinal numbers are formed from the cardinal numbers by attaching the suffix {-ffaa}.
Dhiba lammaffaa "two hundredth"
Dhiba lammaa-fi shan "two hundred and five"
In Afan Oromo, there are also numerals that indicate distribution. These numerals are called distributive numerals.
Example: 'sadisadi' "three three (three each)"
There are also special numerals in Afan Oromo that correspond to the English "half", "quarter", etc. Examples of these include 'walakkaa' "half" and 'siisoo' "one third".
3.4.6 Interjections
Like English, Afan Oromo has many words or phrases used to express emotions such as sudden surprise, pleasure, annoyance and so on. Such Afan Oromo words are called interjections. These interjections can stand alone by themselves outside a sentence or can appear anywhere in a sentence.
Examples: ashuu! "wonderful!"
wayyoo "my goodness"
ani bade! "my goodness"
A long list of Afan Oromo interjections is found in Hamiid [49]. Based on the above lexical categories, the next section explores the types of phrases found in Afan Oromo. The idea of headedness discussed in this chapter indicates that the types of phrases found in the language depend on its lexical categories. Moreover, Baye [45] and Sag and Wasow [48] divide the types of phrases according to the lexical categories. This paper therefore follows that classification for the problem under consideration, while keeping the idea of headedness in mind, and depends entirely on Baye [45] and Sag and Wasow [48] for the analysis of Afan Oromo phrasal categories.
five phrase types in the language. They will be reviewed in the following
subsections.
3.5.2 Noun phrases
A noun phrase is made up of a noun and optionally one or more other lexical categories, including another noun. For example, in the phrase 'mana citaa' "thatched house", two nouns make up the noun phrase: 'mana' "house" and 'citaa' "thatched".
Thus, a noun phrase, and phrases in general, must meet the above criteria to be called a phrase. In the sentence 'Tolaan mana citaa qaba' "Tola has a thatched house", 'mana citaa' "thatched house" is a noun phrase. To check whether it is really a phrase or not, we can apply the above criteria. The following arrangement is impossible for the above reasons.
As indicated above, nouns can appear in a number of positions, such as the positions of the three nouns in 'Tolaan kitaaba Haawwiif bite' "Tola bought Hawi a book". These same positions allow sequences of a noun followed by an article, as in 'Tolaan kitaabicha Haawwiif kenne' "Tola gave Hawi the book". Since the position of the article can also be filled by demonstratives ('kun', 'sun', etc.), possessives ('koo', 'kee', 'keessan', etc.), or quantifiers (e.g. 'xiqqoo'), the more general term determiner, abbreviated DET, is used.
Moreover, each constituent in a phrase has its own position and function. For example, 'mana' and 'citaa' are both constituents of the phrase 'mana citaa'. A phrase is usually headed by one word. The head word is the core component of a phrase: without a head a phrase cannot be built, whereas a head can stand alone by itself. A head word determines not only the phrase type but also its lexical category. If the head is a noun, then the phrase is a noun phrase, and so on (Sag and Wasow [48]; Levine and Green [50]).
An NP has many possible constituents in Afan Oromo. As indicated above, one of these constituents is the determiner. Consider the following examples:
A) [Namni tokko]NP [saree ajjeese]VP "A man killed a dog"
B) [Namichi]NP [saree ajjeese]VP "The man killed a dog"
We can see that the NPs contain determiners of the article type: 'tokko' "a" in (A) and '-ichi' "the" in (B). However, the position of these determiners in Afan Oromo differs from English in that determiners come after the nouns they modify. In fact, not only determiners but all modifiers of nouns come after them in the language. An NP may also consist of two nouns, as in 'mana citaa' "thatched house". In Afan Oromo, the order of the head word, its modifiers and its specifiers differs from English. For example, an NP in Afan Oromo may consist of one noun as head word plus another noun and an adjective as modifiers and specifiers, as in the following example:
'Mana citaa bareedaa' "a beautiful thatched house"
In addition to the above, an Afan Oromo NP may appear in accusative, nominative, genitive, dative or instrumental form. This existence of nouns in different forms for different functions is called case. It is reviewed briefly in the following subsection; for a detailed treatment of case in Afan Oromo, readers are referred to Abebe [42].
3.5.2.1 Accusative and Nominative Case
The accusative case form is the basic form of nouns and pronouns in Afan Oromo (Abebe [46]; Baye [43]; Gragg [51]). This means that nouns and pronouns in direct object position have no overt case marker, as shown in the following sentences, while nominative-case words (in subject position) are inflected for case.
A. [Tulluu-n]NP [mana]NP ijaar-e "Tulluu built (a) house."
B. Tulluu-n [farda adii]NP yaabbat-e "Tulluu rode (a) white horse."
C. Tulluu-n [intala-tii]NP beellam-e "Tulluu dated the girl."
D. Tulluu-n [intala-tii diimttuu]NP beellam-e "Tulluu dated the white girl."
E. nam-ni of jaalat-a "Man loves himself."
F. fard-i marga dheed-e "A horse grazed grass."
In the above examples, the phrase 'Tulluu-n' is an NP in the nominative (subject) case and 'mana' "house" is an NP in the accusative case. Thus one can see that an NP as subject carries a case marker, i.e. '-n', '-ni', or '-i', while an NP as object has no case marker in a sentence.
The object NPs 'mana' "house" (a head noun) in (A), 'farda adii' "white horse" (a head noun and modifying adjective) in (B), 'intala-tii' "the girl" (a head noun) in (C), and 'intala-tii diimttuu' "the white girl" (a head noun along with a singulative marker and modifying adjective) in (D) are all not overtly marked for accusative case.
Similarly, the personal pronouns 'ana' "me", 'nuu' "us", 'sii' "you" (second person singular), 'isin' "you" (2pl), 'isa' "him", 'ishii' "her", and 'isaan' "them" take no accusative case marker in object position.
It can be noted from (A–F) above that nominative case on Afan Oromo nouns is marked by '-ni', '-i', '-n' or 'φ' (a zero marker). Abebe [46] generalized these subject markers as follows:
i. '-ni' occurs after a noun which ends in a short vowel that is dropped (e.g. E above);
ii. '-i' occurs after a noun that ends in a short vowel preceded by a consonant cluster; the short vowel of the stem is again deleted (e.g. F);
iii. '-n' occurs after a noun that ends in a long vowel (e.g. A–D); and
iv. 'φ' occurs after a noun that ends in a consonant (e.g. G).
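The four allomorph rules above amount to a small decision procedure over a noun's final segments. The following Python sketch is my own illustration, not code from the thesis; it assumes the standard Qubee convention that long vowels are written as doubled letters, and it ignores digraph consonants (ch, dh, ny, sh), which would need extra handling:

```python
VOWELS = set("aeiou")

def nominative_form(noun):
    """Pick the nominative (subject) marker by rules i-iv above (sketch)."""
    noun = noun.lower()
    if noun[-1] not in VOWELS:
        return noun                      # rule iv: consonant-final -> zero marker
    if len(noun) >= 2 and noun[-2] == noun[-1]:
        return noun + "n"                # rule iii: long (doubled) vowel -> -n
    stem = noun[:-1]                     # final short vowel is dropped
    if len(stem) >= 2 and stem[-1] not in VOWELS and stem[-2] not in VOWELS:
        return stem + "i"                # rule ii: consonant cluster -> -i
    return stem + "ni"                   # rule i: otherwise -> -ni
```

Applied to the examples above, this yields 'nam-ni' from 'nama', 'fard-i' from 'farda', and 'Tulluu-n' from 'Tulluu'.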
An adjective modifying a head noun in external argument (subject) position attaches the same suffixes as the nouns in the above examples:
A. nam-ni furdaa-n dhibee hin-danda'-u "A fat man cannot resist disease"
B. [fard-i gurraach-i]NP collee-dha "A black horse is smart"
As can be observed in (A) and (B), the subject marker of the nominative case is copied onto the modifying adjective. The forms of the nominative-marking suffixes on the adjectives are phonologically conditioned in the same way as those on nouns. A detailed treatment of case in Afan Oromo is found in Abebe [46]. The same is true for personal pronouns in Afan Oromo.
There are personal pronouns such as 'nu' "us" and 'sii' "you" which do not seem to fit the rules above. 'Sii' "you" (accusative or object case) and 'ati' "you" (nominative) are different forms from one another and may be considered suppletive. On the other hand, 'nu' "us" and 'nuhi/nuti' "we" share a common phonetic form that follows the rules summarized above. In general, NP constituents are:
1. A noun as head word
2. Specifiers like adjectives, adpositions, etc.
3. Quantifiers like numbers
Furthermore, the Afan Oromo simple noun phrase is head-final. A more detailed discussion of noun phrases is presented in the subtopic "Sentences in Afan Oromo". A last point to make about the Afan Oromo sentence is that it uses a discontinuous morpheme to mark negation, for example: 'Abbabaan hin dhufne' "Abebe did not come".
3.5.3 Verb Phrases
Before moving on to the discussion of verb phrases (VP), it is important to establish the term complement. A complement, in simple terms, is a word or phrase that a head word takes as a component to make it grammatical. Some verbs require no complement, as in 'Abbabaan dhufe' "Abebe came"; others require exactly one complement, as in 'Abbabaan muka cabse'; and yet others require two complements, as in 'Abbabaan konkoolataa naaf bite'. Using this concept as a guide, Afan Oromo verb phrases can be classified into three categories, illustrated by the following sentences:
'Abbabaan dhufe' "Abebe came"
'Abbabaan teechuma cabse' "Abebe broke a chair"
'Abbabaan konkoolataa naaf bite' "Abebe bought me a car"
All varieties of adverbs, adpositional phrases, and noun phrases can be found as components of a VP. The subtopic "Sentences in Afan Oromo" provides a more thorough explanation.
Adjective Phrases
Adjectives serve as noun phrase specifiers. They typically follow the noun (typically the head word) that they refer to, for instance 'mana guddaa' "big house". Nouns can function as adjectives, as in 'mana citaa' "thatched house", and so can verbs, as in 'mana gubate' "burnt house". The subtopic "Sentences in Afan Oromo" contains further detail.
3.5.4 Adverb Phrase
Adverb phrases in Afan Oromo consist of one or more lexical categories, including adverbs themselves as modifiers and specifiers. It is possible, for instance, to have two adverbs in one adverb phrase, as in 'kaleessa galgala', which means "yesterday night". As already mentioned, adverbs and adverb phrases are employed to modify verbs; they therefore come before verbs in a phrase. In general, an adverb phrase can be made up of an adverb as the head word together with a noun phrase, another adverb, etc. See Baye [45] for a thorough explanation.
3.5.5 Adpositional Phrases
Adpositional phrases are combinations of nouns and adpositions. They usually specify a verb phrase. This phrasal category is sometimes called adpositional objects (Baye 1986).
A. 'Inni gara mana deeme' "He went to the house"
B. 'Lammaan kophee Caaltuu-f bite' "Lemma bought shoes for Chaltu"
3.6 Sentences
3.6.1 Afan Oromo Simple Sentences
A simple Afan Oromo sentence consists of a noun phrase (NP), which is the subject, followed by a verb phrase (VP) that comprises the predicate.
'Namichi saree jaalata' "The man loves a dog."
Baye [45] classifies simple sentences into four types: declarative, interrogative, negative and imperative sentences. Declarative sentences are used to convey ideas and feelings that the speaker has about things, happenings, feelings, etc., which may be physical, mental, real or imaginary.
Example: 'Haawwin abokatoo taate.' "Hawi became a lawyer"
A sentence that asks about the subject, the complement, or the action the verb specifies is called an interrogative sentence.
Example: 'Haawwin yoom dhuftee?' "When did Hawi come?"
Afan Oromo interrogative sentences are frequently constructed using interrogative pronouns like 'eenyu' "who", 'maal' "what", 'essa' "where", 'meeqa' "how many/how much", and 'yoom' "when". Other interrogative prepositional phrases can then be created by combining these interrogatives with prepositions, such as 'eenyu irra' "from whom", 'maalif' "why", etc.
Negative sentences contradict a declarative assertion that has been made.
In Afan Oromo, a sentence is made up of zero or more noun phrases and one or more verb phrases. A sentence, in other words, is regarded as a special category of phrase made up of a noun phrase (NP) and a verb phrase (VP); thus, while discussing sentences, we can use the vocabulary of phrases.
Before moving on, we need to consider the standard nomenclature of parts of speech (POS). Some feature structures are appropriate for certain POS (lexical categories) but not for others. For instance, in Afan Oromo, CASE is only appropriate for nouns, adjectives, and pronouns, while the features PER(SON) and NUM(BER) are employed for nouns, verbs, and determiners. Therefore, in the parser to be constructed, we must ensure that the appropriate features correspond to the appropriate lexical categories. Furthermore, it should be mentioned that the lexical categories covered in this chapter are significant because they function as the heads of phrases in an Afan Oromo sentence.
As has been mentioned thus far, the head always indicates the lexical category of a phrase in Afan Oromo. The head does the same work as the POS, but it adds further value by giving us a way to take the features that each POS requires into consideration. Furthermore,
HEAD permits us to introduce decision features (features inside features), according to Sag and Wasow [48]. The ability to express simply the relationship between a headed phrase and its head daughter will be of immediate service. To clarify this terminology, let us define the tree structure that is common to practically all such analyses. Every sentence may be expressed with an upside-down tree diagram, as for the sentence "Namichi saree jaalata", which means "The man loves a dog".
In a tree, nodes are joined by branches. A node (S in the example above) is said to dominate another node (NP or VP in the example above) when it is located above it. Nodes at the base of the tree that do not dominate anything else are known as terminal (or leaf) nodes. A node is said to be the mother of, and to immediately dominate, the node directly beneath it; conversely, the node directly beneath a node is its daughter. Two daughters of the same mother are sisters.
Let us return to the concepts of headed phrase and head daughter while keeping in mind the above straightforward definition of "daughter". According to Afan Oromo grammar rules (and those of any other language, for that matter), the mother and one of the daughters must have the same (unified) values for both POS and features. The head daughter is the constituent on the right-hand side of the rule that has the unifying (matching) feature value.
According to Sag and Wasow's general principle [48], which applies to all trees constructed using headed rules, the HEAD values of the mother and the head daughter in every headed phrase must be unified (have the same value). Furthermore, they claim that the rules governing phrase structure are no different in kind from those governing word structure, other than the fact that they are governed by grammar rules rather than lexical entries. So, based on grammar rules, we can state whether a sentence (or a phrase) is grammatically correct (well-formed). For instance, according to Sag and Wasow [48], a sentence is well-formed simply in case each local subtree within it is also well-formed.
According to this structure, a sentence is made up of a noun phrase and a verb phrase, with a head, united by their agreement in the sentence (see Sag and Wasow [48]). It should be mentioned that this is merely a broad illustration of sentences in general. The goal of this study, however, is to create a parser for basic statements that express real or ideal, concrete or abstract thoughts, feelings, or behaviors; in Afan Oromo, sentences of this type end with a period. Additionally, for the same reason, the paper incorporates the feature of agreement for all grammatical categories and past tense for verbs in the development of the parser.
3.6.2 Afan Oromo Decision Sentences
In Afan Oromo, decision sentences are those that are made up of decision phrases like the
main clause (MC) and subordinate clause (SC). Each MC should consist of NP, VP, or AdjP,
intern. The pattern of combination may consist of simple VP and decisions, simple NP and
decisions, or both simple NP and decisions. Before examining how choice phrases combine
to form decision sentences, it is worthwhile to look at the structure of decision phrases.
One that has a sentence embedded within it is known as a decision MC. "Huccuun adii
Tolaan kan bitee," for example. A grammar MC/NP with "cloth" as its head is "The white
cloth that Tola bought." To create the straightforward NP "huccuu adii" "a white cloth," this
head was coupled with the complement "adii" "white." The dependent clause/subordinate
phrase "that Tola bought" and the simple NP "Tolaan kan bitee" were combined to create the
complex NP mentioned above. The clause is a subordinate clause and cannot stand alone
since it contains the relativizer "that," "kan," in it. A parse structure tree showing the
structure of this complex NP looks like this.
Figure 3.2: The structure of a complex noun phrase
As can be seen in the tree diagram, the relative clause 'Tolaan kan bitee', which means "that Tola bought", modifies the noun phrase 'Huccuun adii', "white cloth". The third person singular masculine marking on the verb of 'kan bitee' "that (he) bought" matches the person of 'Huccuun adii'. 'Tolaan kan bitee' "that Tola bought" is therefore known as a relative phrase in Adugna [52].
Similarly, a sentence is a decision sentence if it includes more than one verb or clause. In other words, a tree VP/SC contains an embedded sentence that serves as a complement or modifier, much like a tree NP/MC does.
'Toolaan akka ishee jaalatee Haawwiin siritti bektee.'
"Hawi knew well that Tola loves her"
'Toolaan akka ishee jaalatee' is the dependent clause in this sentence; the word 'akka' "that/as" is what makes the clause dependent. The clause explains what Hawi knew. The structure of this sentence can be shown with the following tree diagram:
(CS
  (MC
    (NP (N Toolaan))
    (PP (P akka))
    (VP (PR ishee) (V jaalatee)))
  (SC
    (NP (N Haawwiin))
    (VP (ADV siritti) (V bektee))))
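Bracketed parses like the one above are plain S-expressions, so they can be read back into nested lists with a few lines of Python. The following is an illustrative sketch (not code from the thesis); the helper names `tokenize` and `parse_sexpr` are my own:

```python
def tokenize(text):
    """Split a bracketed parse into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse_sexpr(tokens):
    """Recursively build nested lists from the token stream."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse_sexpr(tokens))
        tokens.pop(0)  # discard the closing ')'
        return node
    return token

tree = parse_sexpr(tokenize(
    "(CS (MC (NP (N Toolaan)) (PP (P akka)) (VP (PR ishee) (V jaalatee)))"
    " (SC (NP (N Haawwiin)) (VP (ADV siritti) (V bektee))))"))
# tree[0] is the root label 'CS'; tree[1] and tree[2] are the MC and SC subtrees
```

Such a nested-list representation is what the grammar-extraction step in chapter four operates over.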
CHAPTER FOUR
A simple manual part-of-speech tagger and a morphological analyzer are covered in the third and fourth sections, respectively. The subject of the fifth section is the extraction of a probabilistic context-free grammar from the tagged corpus.
4.2 The Design Approach of the Parser
The use of statistical methods has greatly accelerated development in a number of language-processing domains. Disambiguation, document classification, speech recognition, and grammar learning are a few areas where statistics has been helpful. Chomsky found that statistics was an effective tool for examining the regularities of some linguistic phenomena [34].
Researchers in the field of Afan Oromo natural language processing (NLP) currently cannot extract the statistical data that could help them understand the language, since little to no effort has been made to make large Afan Oromo corpora accessible online. Because of this, this study depends on materials that were manually annotated and labelled.
As mentioned in chapter two, the Afan Oromo tree sentence parsing system was designed around the PCFG bottom-up chart parsing technique. As was also mentioned there, most natural language structure may be described by CFGs. They are essential because they are sufficiently constrained to permit the creation of efficient parsers for sentence analysis (Allen [19]). PCFGs, which define a language as a probability distribution over strings and are used in many applications [40], are the probabilistic equivalent of CFGs.
PCFGs tend to be more advantageous for sentence parsing than plain CFGs because they can deal with frequent parsing concerns such as structural ambiguity (which becomes more of a difficulty as the grammar grows more complex), pruning the parse space, and the analysis of ungrammatical phrases.
Yao and Lua [53] give the probability of a parse tree of a sentence w1,n (where w1,n is a sequence of words w1, ..., wn) as the product of the probabilities of all rules used in that parse. If the sentence has T(n) possible parse trees (possible structures) ti, 0 ≤ i ≤ T(n), then

P(ti) = ∏ P(r), over all rules r used in ti … Equation 1

and the probability of the sentence w1,n is the sum of the probabilities of all its possible parse trees:

P(w1,n) = ∑ P(ti), i = 0, ..., T(n) … Equation 2

P(w1,n) shows the potential grammaticality of w1,n in the language, while P(ti) shows the plausibility of the i-th parse tree among all feasible parses: the larger P(ti), the more plausible the parse. It follows that, to get the best possible parse of a sentence, one only needs to find the tree ti that maximizes P(ti). P(w1,n) and this most probable tree are the two important quantities in syntax analysis: the first shows how well a sentence is justified by the PCFG, and the second gives its most probable parse structure. The P(ti) values are therefore the major focus in finding the most probable parse structure.
4.3 The Sample Corpus
In order to conduct this study, 300 Afan Oromo simple and compound sentences selected from newspapers and widely used grammar books were used. There was no annotated text available for the grammar induction and training of the sample corpus, so the manual morphological analysis of each word, the hand tagging, and the sentence parsing procedure took a long time. The researcher admits that the sample size still seems somewhat small.
The sentences were taken from the books "Seerluga Afan Oromo" by Professor Baye Yimam [44] and "NATOO: Yaadrimee Caasluga Afan Oromo" by Berkesa Adugna [52], which were written with the intention of serving as references for teaching the Afan Oromo language at the tertiary and secondary levels, respectively. Language consultants were consulted before
the references were chosen. In addition, articles of human rights law were chosen since they are used as model texts for natural language processing in the NLTK corpora.
The sentences were chosen so that, for decision sentences, they represent two or more phrase classes and embed one of the various clause types covered in chapter three (such as a relative clause, reason clause, result clause, or time clause), and, for simple sentences, they contain one noun phrase and one verb phrase.
For the purposes of this study, the sentences were then manually annotated. Using the phrase structure rules of the Afan Oromo language, the researcher and an Afan Oromo lecturer from a Teachers Training College manually tagged and processed the phrases. Some of the example sentence parses were given in Baye [45].
The linguistic advisor for the thesis as well as another authority on the Afan Oromo language were then contacted for feedback and suggestions. The probability calculations for the words in the sentences, the induction of grammar rules, and the assignment of probabilities to the grammar rules were performed on 240 sentences (approximately 80% of the sample corpus), randomly selected to serve as the training set. The remaining 60 sentences (20% of the corpus) served as the test set.
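The 80/20 split described above can be reproduced with a short, illustrative Python snippet (the function name `split_corpus` and the fixed seed are my own choices, not details from the thesis):

```python
import random

def split_corpus(sentences, train_fraction=0.8, seed=42):
    """Shuffle the corpus and split it into training and test sets."""
    rng = random.Random(seed)        # fixed seed for a reproducible split
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

corpus = [f"sentence-{i}" for i in range(300)]   # stand-in for the 300 sentences
train, test = split_corpus(corpus)               # 240 training, 60 test
```

Fixing the random seed keeps the split stable across runs, which matters when the induced grammar probabilities must be compared between experiments.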
While it is possible to list every word that the system accepts in straightforward cases and small systems, doing so for sentence parsers that support a large vocabulary would be quite difficult. Not only are there numerous words, but each word can also be joined with related affixes to form new words. One way to deal with this is to preprocess the input sentence into a string of morphemes, as Allen [19] suggests. In the Afan Oromo language, a word may contain only one stem but multiple morphemes.
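Since the corpus annotates words with hyphens between stem and affixes (e.g. 'barat-a', 'barat-oota'), the preprocessing step Allen suggests reduces, for this annotated data, to splitting on those hyphens. The sketch below is my own illustration; the segmentation itself is assumed to be supplied by the manual annotation, not computed:

```python
def morphemes(sentence):
    """Split each hyphen-annotated word into its stem and affix morphemes."""
    return [part for word in sentence.split() for part in word.split("-")]

# 'barat-oota' = stem 'barat' + plural affix 'oota' ("students");
# 'bit-an' here is a hypothetical annotated verb form used for illustration.
print(morphemes("barat-oota kitaaba bit-an"))
```

The parser then works over the resulting morpheme string rather than over whole inflected words, which keeps the lexicon small.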
Afan Oromo is an inflectional language: the bulk of its words are made up of a stem and one or more affixes (for instance, 'barat-a' "student" in the singular and 'barat-oota' "the students" in the plural). As was noted in chapter two, a morphological analyzer is one of the most important NLP components in the development of part-of-speech tagging and sentence parsing systems. Although Abeshu [42] made one attempt to develop an Afan Oromo morphological analyzer, the researcher found it difficult to incorporate the prototype since the requisite materials were not found in any of the archives.
Therefore, a manually annotated stem and affix (or affixes) were included specifically for the purpose of this study. In addition, as discussed in chapter two, Nedjo [54] developed a rudimentary tree POS tagger prototype for the Afan Oromo language using the Maximum Entropy Markov Model, whereas this study applies the Viterbi approach with an HMM. The researcher incorporated an HMM POS tagger solely for the purpose of this experiment in order to further the research.
Word Code Table:
This table contains the words from the sample text together with their corresponding word codes, listed in order for each word. The 3,029 words came from the 300 sentences of the corpus.
Category Code Table:
The word categories discovered using the universal part-of-speech tag set are kept in this table. These are the tag sets used:
ADJ, J — An adjective
AdjP — Adjectival phrase
ADV — An adverb
ADVC — An adverb not separated from a conjunction
AdvP — Adverbial phrase
AUX — Auxiliary verbs and all their other forms
DT — Determiner
CONJ — A conjunction
ITJ — Interjections
JC — A conjunction not separated from an adjective
JNU — A numeral used as an adjective
JP — An adjective not separated from a preposition
JPN — A noun not separated from a preposition and that functions as an adjective
N — Noun in all forms
NC — A conjunction not separated from a noun
NP — A preposition not separated from a noun
NP — Noun phrase
NUM — Number
NV — Verbal nouns
PP — Prepositional phrase
PREP — A preposition
PUNCT — Punctuation
REL — Relative clause
VC — A verb prefixed or suffixed by a conjunction
VCO — Compound verbs
VP — Verb phrase
X — Unknown
This table contains the 28 word categories that were identified, each assigned a corresponding category code: 27 regular categories plus one more (X) for all uncertain terms, with the 27th category covering punctuation. There are a few minor notational differences between components, such as the use of "J" in the POS tagger instead of "Adj" in the parser; both refer to the class of adjectives.
Lexical Probabilities Table:
The lexical probabilities table contains the likelihood of each word in the corpus having one of the supplied categories (tags), written P(wi|Ci), p(word|category), or p(word|tag). In a pre-tagged corpus, p(Haalli|N) gives the (lexical) likelihood that the word 'Haalli', a common word in the language, is used as a noun, whereas p(Haalli|ADV) gives the (lexical) likelihood that 'Haalli' is used as an adverb. The lexical probability is estimated from the frequency of each word within a category. Mathematically, it is given as:

P(Wi|Ci) = (number of times Wi appears in category Ci) / (total number of words with category Ci) … Equation 3

The probability values of the lexical table were recalculated using the new data entered in the word category code table.
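Equation 3 amounts to counting tagged tokens. A minimal illustrative implementation (my own sketch, not the thesis's code), given a corpus of (word, tag) pairs:

```python
from collections import Counter

def lexical_probabilities(tagged_words):
    """Estimate P(word|tag) from (word, tag) pairs, per Equation 3."""
    pair_counts = Counter(tagged_words)                 # count (word, tag) pairs
    tag_counts = Counter(tag for _, tag in tagged_words)  # total words per tag
    return {(w, t): c / tag_counts[t] for (w, t), c in pair_counts.items()}

# A toy pre-tagged corpus fragment (invented for illustration).
corpus = [("haalli", "N"), ("haalli", "N"), ("kaleessa", "ADV"),
          ("dhufe", "V"), ("haalli", "ADV")]
probs = lexical_probabilities(corpus)
# probs[("haalli", "N")] -> 1.0, since both N tokens in this toy corpus are "haalli"
```

Dividing by the per-tag total (rather than the per-word total) is what makes this P(word|tag), matching the direction of Equation 3.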
Transition Probabilities Table:
The likelihood of a tag given one or more preceding tags is known as a transition probability and is written p(Ci|Ci-1, ..., Ci-n+1). For instance, p(Ci = Noun | Ci-1 = Adjective) gives the likelihood that a noun follows an adjective.
Depending on the value of n (the number of categories being examined), we can have bigram (n = 2), trigram (n = 3), or, in general, n-gram transition probabilities. A bigram model is written p(Ci|Ci-1); a trigram model is written p(Ci|Ci-1, Ci-2). These models assume that the probability of a specific category occurring depends only on the one or more categories that come right before it.
Given a database of texts tagged with parts of speech, the bigram (transition) probabilities can be estimated simply by counting the number of times each pair of categories occurs, compared to the individual category counts. Mathematically this is written as:

P(Ci = x | Ci-1 = y) = (number of times y is followed by x in the corpus) / (number of times y occurs in the corpus) … Equation 4

where x and y are part-of-speech codes. The probability values of the transition probability table were recalculated using the categorical sequence of the newly entered data in the word code table.
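Equation 4 can likewise be sketched in a few lines of illustrative Python (my own example, not the thesis's code) over a tag sequence:

```python
from collections import Counter

def transition_probabilities(tags):
    """Estimate bigram P(tag_i | tag_{i-1}) from a tag sequence (Equation 4)."""
    bigram_counts = Counter(zip(tags, tags[1:]))   # (previous, current) pairs
    prev_counts = Counter(tags[:-1])               # occurrences as a predecessor
    return {(prev, cur): c / prev_counts[prev]
            for (prev, cur), c in bigram_counts.items()}

# A toy tag sequence (invented for illustration).
tags = ["N", "ADJ", "N", "ADJ", "V"]
probs = transition_probabilities(tags)
```

With this toy sequence, "N" is always followed by "ADJ", so probs[("N", "ADJ")] is 1.0, while "ADJ" is followed by "N" and "V" equally often, giving 0.5 each.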
4.6 Extraction of A Probabilistic Context Free Grammar
In the process of extracting the PCFGs, the sentences used as a training set were first
hand-parsed manually and represented in the following manner.
„Ani huccuu adii Tolaan bitee kaleesa argee.‟
(CS
(MC (NP (N Ani ))(VP(NP (N huccuu) (Adj adii) )(NP(N Tolaan)(V
bitee)))) (SC (VP (Adv kaleesa) (V argee))))
These manual parsings of the training set led to the development of the CFG rules. Then, in
order to gather statistical information about the observed grammatical rules, the number of
instances of each rule in the manually parsed training sentences was counted (which was
found to be the simplest way for this purpose). The probability of adopting each rule was
then determined using the statistical data (See also Allen, [19]). Finally, each retrieved rule's
probability values were assigned using the formula below.
P(Rj|C) = (count of the number of times Rj occurs in the PCFG table) / (count of the total occurrences of category C on the LHS in the PCFG table) ..... Equation 5
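The rule probabilities of Equation 5 can be computed by normalizing each rule's count by the total count of rules sharing its LHS category; the counts below are hypothetical, not the thesis data.

```python
from collections import Counter

def pcfg_probabilities(rule_counts):
    """P(Rj | C): each rule's count divided by the total count of all rules
    whose LHS is the same category C (Equation 5)."""
    lhs_totals = Counter()
    for (lhs, rhs), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}

# Hypothetical counts gathered from hand-parsed training sentences.
counts = {
    ("S", ("NP", "VP")): 8,
    ("NP", ("N",)): 6,
    ("NP", ("N", "ADJP")): 2,
}
p = pcfg_probabilities(counts)
print(p[("NP", ("N",))])  # 6 / (6 + 2) = 0.75
```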
To keep things simple, the PCFG is represented in Chomsky Normal Form (CNF). Once the
probabilistic context-free grammar had been extracted, the next step was therefore to transform it
into CNF. This conversion was made for ease of use, and the constraint does not genuinely limit
expressive capacity, because any CFG rule can be rewritten in CNF with ease. The conversion
followed the phrase-building rules of the language; in practice, however, it was more a matter of
displaying the grammar as it was in its original form than of converting it, since all clauses
considered were 4-word phrases and the majority of the rules were therefore already in CNF.
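For the few rules whose right-hand side is longer than two symbols, the CNF conversion can be sketched as below; the intermediate non-terminal names X1, X2, ... are invented for the example and are not the thesis's naming scheme.

```python
def binarize(lhs, rhs, prefix="X"):
    """Rewrite LHS -> A B C ... as a chain of binary (CNF-style) rules by
    introducing fresh intermediate non-terminals X1, X2, ..."""
    rules, i = [], 0
    while len(rhs) > 2:
        i += 1
        new_nt = f"{prefix}{i}"
        rules.append((lhs, (rhs[0], new_nt)))   # peel off the first symbol
        lhs, rhs = new_nt, rhs[1:]
    rules.append((lhs, tuple(rhs)))             # final binary (or shorter) rule
    return rules

print(binarize("VP", ["ADV", "NP", "V"]))
# [('VP', ('ADV', 'X1')), ('X1', ('NP', 'V'))]
```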
A sample of the extracted CNF rules and their corresponding probability values are shown in Table
4.1 below.
CHAPTER FIVE
PARSING ALGORITHM AND EXPERIMENTATION
5.1 Introduction
The tests conducted, the parsing algorithm used to develop the prototype Afan Oromo
tree sentence parser, and the analysis and findings from the investigations are covered
in this chapter. This chapter also describes the parser's design, which includes the
input/output interface, the probabilistic rule base, and the chart parsing module.
The next section provides a discussion of the Inside-Outside algorithm, which served as the
foundation for this parser. This chapter's third section introduces and discusses the Inside-
Outside algorithm's implementation of PCFG parsing. The design of the parser is covered in
the fourth section, and a report on the trials, the results, and the solutions is given in the fifth.
5.2 The Parsing Algorithm
For computational models of natural languages, ambiguity resolution is a crucial problem
[19]. The parse space of a sentence, for instance, is the space of feasible syntactic
interpretations. When using a chart parsing algorithm, which calculates each constituent's
probability based on the probabilities of its sub constituents and the rules utilized, it is
possible to assess the likelihood of several parse trees for a given text.
The parser developed for this study is based on this method and uses a modified parse chart
(given by Yao and Lua [53]) to assist in parsing. The formula used to choose the best (or most
likely) parse structure of a sentence is provided in a later section.
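Under the chart-parsing view just described, the probability of a complete parse is the product of the probabilities of the rules it uses. The sketch below illustrates this; the toy rules and tree are assumptions, and lexical (word-level) rules are given probability 1 because, per the thesis, lexical information is handled by the tagger rather than the PCFG.

```python
def tree_probability(tree, rule_probs):
    """Probability of a parse = product of the probabilities of the grammar
    rules used. A node is (label, children); children is either a list of
    nodes or, at a pre-terminal, the word string itself."""
    label, children = tree
    if isinstance(children, str):   # pre-terminal over a word: lexical rule,
        return 1.0                  # handled by the tagger, not the PCFG
    rhs = tuple(child[0] for child in children)
    p = rule_probs[(label, rhs)]
    for child in children:
        p *= tree_probability(child, rule_probs)
    return p

# Toy grammar and parse (assumed for illustration).
rules = {("S", ("NP", "VP")): 1.0, ("NP", ("N",)): 0.5, ("VP", ("V",)): 0.8}
tree = ("S", [("NP", [("N", "Toolaan")]), ("VP", [("V", "dhufe")])])
print(tree_probability(tree, rules))  # 1.0 * 0.5 * 0.8 = 0.4
```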
5.3 PCFG Parsing
5.3.2 Parse Chart
As was mentioned previously, this study used the Yao and Lua [53] parse chart to develop
the Inside-Outside method. Equation 9 served as the basis for this parse chart.
Since the chart is symmetrical, only the top-right half of the matrix (10) in section 5.3.1 is used.
The chart's size is determined by the number of words in the sentence being analyzed; that is,
the chart is an n-by-n matrix for an n-word sentence.
Note that Afan Oromo tree sentence analysis begins before the input sentence enters the parse
chart. As a result, a sentence with five words would be assigned six POS tags, a sentence with
six words seven, a sentence with seven words eight, and so on. This is due only to the heuristics
employed to cope with verbal affix movements during the structural representation of Afaan
Oromoo tree sentences. The sentence "Toolaan ishee akka jaalatee, Haawwiin siritti bektee"
(Hawi was aware of Tolla's genuine love for her) serves as an illustration of this.
As seen in the figure above, the highlighted verbal affix/adposition ("that") detaches from
"jaalatee" (that he loved her) and assumes the position before the preposition. All of the
Afaan Oromoo tree sentences considered in this study were created by embedding clauses
that contain such relativizers. These clauses were subsequently analyzed into verbs and
verbal affixes (referred to as complements, COMP), which were designated by ADP. The
element N(1,7) is the starting symbol, provided each N(i,j) is supported by a grammatical
rule. In the diagram, an element N(i,j) (i, j in [1,7]) signifies a non-terminal node. A non-
terminal N(i,i) (i in [1,8]) on the diagonal, supported by a rule N(i,i) → wi, marks the
position of the word wi. An example of an 8-level chart that can parse 7-word sentences is
shown in Figure 5.3 below.
Here the # symbol indicates the end of the sentence. When this sentence passes through the
morphological analysis process, each word is analyzed into a stem and affix(es), and the output
takes the following format, which the tagger uses as an input:
'Jabbi adii Haawwii kaleesa bittee argee'
The following format is produced by tagging each stem with the corresponding POS:
Jabbi\N adii\Adj Haawwii\N kaleesa\Adv bittee\V argee\V
Each tagged stem will then be re-synthesized with its corresponding affix, using a
hyphen (-) as a separator, as seen below.
Jabbi-lee\N adii\Adj Haawwii-n\N kaleesa\Adv bittee\V argee\V
Each word will then undergo a post-tagging morphological process in order to determine
whether the category of each tagged stem changes when combined with its affix(es). This is a
crucial milestone in the development of the parser, not just for the reasons described above but
also because it simplifies the challenging task of parsing complicated Afaan Oromo phrases. In
other words, at this point it is determined which Afan Oromo verbs, often referred to as
relativizers, take affixes (such as -een, -wan, -(o)ota, -yyii, and -lee). Finally, the input
sentence takes on the following format and is submitted to the sentence processing module:
Jabbi-lee\N adii\Adj Haawwii-n\N kaleesa\Adv bittee\V argee\V #\PUNCT
The input preprocessor module algorithm is provided below.

For each sentence in the document:
    Take one sentence at a time.
    For each word in the sentence:
        Identify the word's stem.
    Call the HMM POS tagger.
    Get the tagged sentence stems.
    Use the morphological synthesising function to update the category output of the tagger.
    Send the final string of tagged words to the parser.
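The preprocessor steps above might be sketched in Python as follows. The helpers `stem`, `tag_stems`, and `synthesize` are toy stand-ins for the thesis's morphological analyzer, HMM tagger, and synthesis function; they are assumptions, not the actual components.

```python
def preprocess(document, stem, tag_stems, synthesize):
    """Input preprocessor sketch: stem each word, POS-tag the stem sequence,
    then re-attach affixes and update categories where the affix changes them."""
    tagged_sentences = []
    for sentence in document:                       # one sentence at a time
        stems = [stem(word) for word in sentence]   # (stem, affix) pairs
        tags = tag_stems([s for s, _ in stems])     # HMM tagger over the stems
        tagged = [synthesize(st, af, t)             # re-synthesize with affixes
                  for (st, af), t in zip(stems, tags)]
        tagged_sentences.append(tagged)
    return tagged_sentences

# Toy stand-ins (assumptions, not the thesis components):
stem = lambda w: (w.split("-")[0], w.split("-")[1] if "-" in w else "")
tag_stems = lambda stems: ["N" if s[0].isupper() else "V" for s in stems]
synthesize = lambda s, a, t: (f"{s}-{a}" if a else s, t)

print(preprocess([["Jabbi-lee", "argee"]], stem, tag_stems, synthesize))
# [[('Jabbi-lee', 'N'), ('argee', 'V')]]
```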
The Parse button
Using this button, a user can parse each sentence in a saved file individually. The produced
prototype functions as follows. The morphological analysis component accepts the input
sentence and outputs "Mootumaan, Biyyattin sadarkaa guddaa dinagdee irratti galmeesiftee
ibsee" (The government described the biggest economic victory the nation has had).
The tagger then takes the previously mentioned string of stems as input, assigns the proper
POS tag to each stem, and generates the following format for the parser to utilise as input.
'Mootumaan\NOUN Biyyattin\NOUN sadarkaa\ADV guddaa\ADJ dinagdee\NOUN irratti\ADP galmeesiftee\VERB ibsee\VERB .\PUNCT'
However, before the aforementioned output of the tagger is sent to the parser, each tagged
stem is synthesised with its affixes (if any), and the resulting word's category is looked up in
a table that updates the categories of inflected stems. After going through this process, the
complex sentence above takes on the following structure:
'Mootumaa-n\NOUN Biyya-ttin\NOUN sadarkaa\ADV guddaa\ADJ dinagdee\NOUN irratti\ADP galmeesiftee\VERB ibsee\VERB .\PUNCT'
The terms in the input sentence that are affected by the aforementioned procedures are
highlighted in the example above. This is the final product of the POS tagger, aided by
morphological analysis. The parser extracts each word and each POS tag from the tagged
sentence and stores them in one-dimensional array variables. The parser's output includes the
parse result, the grammatical rules employed, and the likelihood of the chosen parse structure.
The outcomes for the sample sentence shown thus far include:
The probability of the parse structure: 0.000034129851158584
The parse result:
(CS (SC (NP (N Mootumaa-n) (,)))
    (MC-VP (S (NP (N Biyya-ttin) (ADJP (ADV sadarkaa) (ADJ guddaa)))
              (VP (NP (N dinagdee) (ADP irratti)) (V galmeesiftee)))
           (VP (V ibsee))))
The grammar rules used:
CS → SC MC-VP; SC → NP; NP → N; MC-VP → S VP; S → NP VP; NP → N ADJP;
ADJP → ADV ADJ; VP → NP V; NP → N ADP; VP → V E
The rules used in parsing are grammatical rules; rules involving terminal nodes (for example,
N → Mootumaa-n) are not displayed, since such lexical rules are not included in the PCFG
table. This is because the parser processes sentences that have already been pre-tagged, so
lexical information is handled before parsing.
Table 5.4: Parsing result on the Training Set before error correction
PCFG Representation
In the table above, E is an empty production. The left-hand sides of the rules are kept in the
field LHS, while the first and second right-hand-side symbols are kept in RHS1 and RHS2,
respectively. The probability field displays the probability value associated with each grammar
rule. (See Appendix 5 for a complete list of the PCFG rules derived from the corpus.) The
word information needed for parsing is provided by the Word Code and Lexical Probabilities
tables, which were created by the tagger and kept in the same database as the PCFG table.
The chart parsing module
The chart parsing module implements the PCFG Inside-Outside algorithm, supported by a
parse chart created by Yao and Lua [53] based on equations 7 and 8. Equation 11 is used to
calculate the value of each non-terminal node N(i,j) in the parse chart, where i is different
from j and i, j ∈ [1, n].
The Inside-Outside algorithm utilised in this study to implement PCFG parsing is as follows.
Figure 5.5: The Parse Chart Procedure to Implement the Inside-Outside Algorithm
The inside probability of every parse in the parse tree space was determined using this
procedure, which was coded during the prototype's construction. The algorithm calculates the
probability using equations 9 and 10, which add a step to construct the parse from the bottom
up. During parsing, the categories of the words in a sentence are fed one by one into the
diagonal (i.e., the first level) of the chart. The probability of each parse is then determined as
the parse tree space is constructed from bottom to top.
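The bottom-up chart computation described here can be sketched as below, assuming a CNF grammar keyed by (LHS, (RHS1, RHS2)). Only the upper-right half of the n-by-n chart is filled, as in the Yao-and-Lua-style parse chart. Note one simplification: this sketch keeps the best score per constituent (a Viterbi-style variant for picking the most likely parse), whereas the true inside probability would sum over all analyses. The toy grammar is invented for the example.

```python
from collections import defaultdict

def inside_chart(tags, grammar):
    """Bottom-up PCFG chart over a POS-tag sequence. chart[i][j] maps a
    non-terminal to the best probability of deriving tags[i..j] from it."""
    n = len(tags)
    chart = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, tag in enumerate(tags):              # diagonal: N(i,i) -> w_i
        chart[i][i][tag] = 1.0
    for span in range(2, n + 1):                # longer constituents, bottom up
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):               # split point between the halves
                for (lhs, (r1, r2)), p in grammar.items():
                    left = chart[i][k].get(r1, 0.0)
                    right = chart[k + 1][j].get(r2, 0.0)
                    cand = p * left * right     # rule prob x sub-constituents
                    if cand > chart[i][j][lhs]:
                        chart[i][j][lhs] = cand
    return chart

# Toy CNF grammar (assumed for illustration).
grammar = {("S", ("NP", "VP")): 1.0, ("NP", ("N", "ADJ")): 0.4, ("VP", ("ADV", "V")): 0.5}
chart = inside_chart(["N", "ADJ", "ADV", "V"], grammar)
print(chart[0][3]["S"])  # 1.0 * 0.4 * 0.5 = 0.2
```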
5.5 The Experiment
The experiment used the sample text described in chapter 4. The researcher and an Afan
Oromo language instructor from Metu University's department of Afan Oromo manually
analyzed each word in the corpus and hand-tagged and hand-parsed each sentence. The
linguistic advisor and other language experts at Metu University provided comments and
suggestions. A sample of 300 sentences was chosen by a random process.
The distribution of the various phrase structures and the way they obtain their decision
features (i.e., being made up of a simple NP and a tree VP) was also considered.
As made abundantly clear from the outset, the primary goal of this study was to parse Afan
Oromo phrases using the PCFG bottom-up chart parsing approach and the Inside-Outside
algorithm, which receives its POS inputs from a POS tagger. The experiment therefore started
by determining whether the POS tagger showed any improvement in accuracy.
When the tagger was first trained and then tested on the same data, its accuracy was 76.3%.
The errors found were mostly human-caused (made during the manual morphological analysis
and tagging), along with mistakes made when creating the lexical and transitional probability
tables.
After reviewing the manually completed tasks, the lexical probability calculations, and the
transitional probability calculations, and making modifications as needed, the tagger achieved
89.7% accuracy on the training set. This was greater than the tagger's original training set
score of 84%.
In this study, the tagger's accuracy on the Test Set increased from 66.6% to 80%. One of the
detected sources of inaccuracy was the occasional conflict between the category proposed by
the morphological analysis and the one proposed by the bi-gram for a particular word. Due to
time constraints, this source of error was left unresolved. The small size of the sample corpus
may also have contributed to inaccuracies at this stage of testing the tagger.
The use of morphological preprocessing before the HMM tagging, the category-checking
mechanism applied after the tagged stems were synthesized with their affixes, the statistical
category-guessing mechanism that relies fully on the transitional probabilities, and the slight
increase in corpus size are generally considered the main drivers of the improvement in the
POS tagging module (or the input preprocessing module in general).
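The accuracy figures reported throughout these experiments follow the same simple measure, the percentage of correctly tagged words; a minimal sketch of that computation:

```python
def tagging_accuracy(predicted, gold):
    """Percentage of positions where the predicted tag matches the gold tag."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)

# Toy example: 3 of 4 tags match.
print(tagging_accuracy(["N", "V", "ADJ", "N"], ["N", "V", "ADV", "N"]))  # 75.0
```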
All the manually performed tasks covered in chapter 4, including the morphological analysis,
tagging, and parsing, the probability calculations for the words in the sentences, the induction
of grammar rules, and the assignment of probabilities to the grammar rules, used the 240
sentences that were randomly selected from the sample corpus and saved as the training set.
The initial experiment was run on these 240 sentences.
A significant cause of inaccuracy was the availability of two comparatively probable rules
with the same RHS: S → NP VP, with probability 1.0, and VP → NP VP, with probability
0.524. Because the former rule has the higher probability, it dominates the latter, so S
frequently showed up at nodes where VP should have appeared.
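A small illustration of why the higher-probability rule dominated: when two rules share an RHS and a constituent must be labeled, comparing the rule probabilities alone always favors S here. This selection function is a simplification of what the full chart parser does (which also multiplies in sub-constituent probabilities), shown only to make the failure mode concrete.

```python
# The two conflicting rules from the text, with their reported probabilities.
rules = {("S", ("NP", "VP")): 1.0, ("VP", ("NP", "VP")): 0.524}

def best_lhs(rhs, rules):
    """Among rules sharing the given RHS, pick the most probable LHS label."""
    candidates = {lhs: p for (lhs, r), p in rules.items() if r == rhs}
    return max(candidates, key=candidates.get)

print(best_lhs(("NP", "VP"), rules))  # 'S' -- S always beats VP here
```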
After that, the parser was retrained, and the test was run once more on the Training Set.
Under the experiment results sub-section 5.5.3.1, the final results obtained both before and
after these changes were made were subsequently presented.
5.5.2 Experiment on the Test Set
The second experiment was conducted on the remaining 60 sentences of the original corpus,
which were kept as the Test Set. The findings for this held-out portion of the corpus are
presented in Section 5.5.3.2.
5.5.3 Results of the Experiment
The findings of the experiments conducted on the Training Set and Test Set are discussed in
the sections that follow. Additionally, it provides the training set results (both with and
without adjustments to the lexicon, grammar, and algorithm).
Table 5.8: Parsing result on the Training Set before error correction
Given that the parser was trained and evaluated on the same data (the Training Set), the
accuracy attained should have been higher than 66.6%. As already mentioned, the primary
cause of errors at this point was the conflict between the rules S → NP VP and VP → NP VP,
which was handled by treating the two rules as ones that employ distinct RHSs. Human error
also contributed to the experiment's low accuracy and made it less accurate than expected.
The final accuracy obtained on this set, after human errors were discovered and corrected, is
shown in the table below.
Data set       No. of sentences    No. of erroneously parsed sentences    Accuracy
Training Set   240                 48                                     80.0%
Table 5.9: Parsing result on the Training Set after error correction
Data set   No. of sentences    No. of erroneously parsed sentences    Accuracy
Test Set   60                  17                                     71.6%
By treating the two conflicting rules as ones with different RHSs, this issue was resolved.
Additional sources of error during the tests were mistakes in the extraction of rules and their
probabilities, and the inadequacy of the rules library (the PCFG table), frequently referred to
as under-generation. Both of these issues were reduced by an iterative method.
CHAPTER SIX
6.1 Conclusion
In order to address a slightly more complex issue in the domain, namely the development of a
decision tree sentence parser for Afan Oromo, this thesis has attempted to combine the
concepts and results of previously examined Afan Oromo NLP systems in a different manner.
To achieve this, a POS tagger and a straightforward Afan Oromo phrase parser were used as
the basis. Probabilistic Context-Free Grammar (PCFG) parsing was implemented using the
Inside-Outside method and a chart parse module first introduced by Yao and Lua [53] to parse
Chinese sentences. Efforts have also been made to construct a prototype.
While constituents are frequently defined deductively in terms of the relationships that exist
between their pieces, parsing is the process of identifying analyses of sentences, that is,
consistent sets of relationships between constituents that are determined to hold in a
particular sentence. Parsing sentences of this type can be done in one of two ways: manually
or automatically. The manual approach is time-consuming, costly, and error-prone, and the
problem will only worsen as the amount of information increases. The second method,
decision tree sentence parsing, eliminates such difficulties and is crucial in natural language
understanding systems.
The ultimate objective of this study was to create a decision tree sentence parser for Afan
Oromo. To that end, key parsing concepts and terms were reviewed, and areas where the
results of a sentence parser are relevant were indicated. In addition, the rule-based and
stochastic techniques, which are the two main approaches to NLP in general and tree sentence
parsing in particular, as well as alternative strategies, were briefly introduced and reviewed.
We also went into some detail on the knowledge base of a sentence parser and the components
required to store information that helps the parsing process, including the lexicon and
grammatical formalisms.
The literature related to the Afan Oromo writing system, lexical categories, and grammatical
constructions was then evaluated and discussed. This was because a fundamental aspect of
building a parser is understanding the syntax of the language. Hence, it became evident which
language characteristics were taken into account when constructing the different parser
components. Almost all lexical and phrasal categories, sentence formalisms, typical tree
sentence properties, and linguistic factors considered in constructing the different parts of the
parser were also explained. Next, the sample corpus developed for this study, some of the
major problems the researcher faced in obtaining the necessary sample, and the steps taken to
deal with those problems were presented. In short, 300 tree sentences were gathered from two
commonly used Afan Oromo grammar books, published articles, and newspapers, because no
corpus had previously been produced for studies on Afan Oromo tree sentence parsing.
Each word in the corpus was then manually morphologically examined, tagged, and parsed.
For each word in the Training Set, the portion of the sample corpus used for training, the
lexical and transitional probabilities were assessed. The grammatical rules were extracted, the
probability associated with each rule in the training set was determined, and the resulting
PCFG rules were simplified by being expressed in Chomsky Normal Form (CNF) and shown
in a table dubbed PCFG-CNF. Later in the thesis, the algorithms and modules needed by the
parser to access the knowledge base and parse incoming phrases with the proper lexical
categories were provided. A prototype was made using Python Tkinter to establish an interface
that enables a user to communicate with the system. Two phases of experiments were carried
out, the first on the Training Set and the second on the Test Set. The performance improvement
of the original, purely statistical part-of-speech tagger, a straightforward sentence parser, and
the newly created decision tree Afan Oromo sentence parser was measured using a single
parameter: the percentage of correctly tagged and parsed words and sentences in the sampled
text.
The results obtained using the few samples were excellent, with a training set accuracy of
80.0% and a test set accuracy of roughly 71.6%. Before obtaining such accuracy, the
experiment was run repeatedly on both the Training Set and the Test Set, finding mistakes and
making corrections. The majority of the errors found were caused by human error in the
preprocessing of the parser's input, conflicting PCFG rules, low probabilities, and the absence
of some rules. Before concluding the thesis, potential causes of errors and their fixes were
discussed. Although the parser created for this study had somewhat above-average accuracy, it
might not have immediate practical applications, because it was not trained on a large amount
of data covering all the characteristics of Afan Oromo (tree) phrases. It is feasible to conclude
that this thesis was an effort to highlight the potential of probabilistic approaches, specifically
HMMs, in Afan Oromo NLP, and to use statistical approaches for decision sentence parsing in
addition to rule-based or hybrid approaches.
The researcher believed that Ethiopian students and researchers would develop this kind of
practice, paving the way for the eventual realization of higher-level and more difficult
research projects like conceptual parsing and machine translations, which are all NLP tasks.
6.2 Recommendations
This study has several limitations. They are listed below and are active research areas that
should be addressed by those with an interest in the field. The work of such scholars may
facilitate efforts to develop a powerful sentence parser for the Afan Oromo language. The
following could be suggested as potential research areas.
The bigram lexical co-occurrence method, built by assuming the input and output formats of a
prior study, was used in this investigation to guess the categories of unknown words, aided by
a manually annotated morphological analyzer. This strategy handles the different inflections of
a given stem in the database well. Although it is an improvement over past research, some
words that were completely novel to the database (i.e., those with no inflectional forms in the
database) were nonetheless incorrectly labeled. To achieve a better result, future studies may
combine the studied integrated morphological analysis system with both bigrammatic and
trigrammatic lexical co-occurrence.
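One way to realize the described fallback for completely unknown words, assuming only the bigram transition table is available, is to pick the category most likely to follow the previous tag; the transition values below are invented for illustration.

```python
def guess_category(prev_tag, transition_probs, tagset):
    """Guess the tag of an out-of-lexicon word from transition probabilities
    alone: the category most likely to follow the previous tag."""
    return max(tagset, key=lambda t: transition_probs.get((prev_tag, t), 0.0))

# Toy transition table (assumed values, not from the thesis).
trans = {("ADJ", "NOUN"): 0.7, ("ADJ", "VERB"): 0.2, ("ADJ", "ADV"): 0.1}
print(guess_category("ADJ", trans, ["NOUN", "VERB", "ADV"]))  # NOUN
```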
To expand the existing system's ability to parse different sentence types, this work should be
replicated with a large data set including all forms of sentences with all attributes, such as
case, number, gender, person, tense, and definiteness. On that basis, it would be possible to
investigate how PCFG performs in NLP for Afan Oromo.
7. REFERENCES
[1] Mirzanur Rahman, Sufal Das, and Utpal Sharma, "Parsing of part-of-speech tagged Assamese texts," International Journal of Computer Science, vol. 6, no. 1, pp. 28–34, 2009.
[2] Abebe Abeshu, "Analysis of Rule Based Approach for Afan Oromo Automatic Morphological Synthesizer," STAR Journal, pp. 94–97, 2013.
[12] Kyongho Min and William H. Wilson, "Are Efficient Natural Language Parsers Robust?," School, Sydney, Australia, 2005.
[14] Win Win Thant, Tin Myat Htwe, and Ni Lar Thein, "Parsing of Myanmar sentences with function tagging," University of Computer Studies, Yangon, Myanmar, 2012.
[16] Abraham Tesso Nedjo, Degen Huang, and Xiaoxia Liu, "Automatic Part-of-speech Tagging for Oromo Language Using Maximum Entropy Markov Model (MEMM)," Journal of Information & Computational Science, vol. 11, no. 10, pp. 3319–3334, 1 July 2014.
[21] Paola Merlo, Parsing with Principles and Classes of Information, Boston: Kluwer Academic, 1996.
[27] Stuart M. Shieber, Yves Schabes, and Fernando C. N. Pereira, "Principles and Implementation of Deductive Parsing," The Journal of Logic Programming, vol. 12, pp. 1–37, 1995.
[29] Mihai Lintean and Vasile Rus, "Naive Bayes and Decision Trees for Function Tagging," in Proceedings of the International Conference of the Florida Artificial Intelligence Research Society, Key West, FL, 2007.
[40] Robert C. Berwick and Amy S. Weinberg, The Grammatical Basis of Linguistic Performance: Language Use and Acquisition, London: MIT Press, 1989.
[44] Baye Yimam, Seerluga Afaan Oromoo, Addis Ababa: Addis Ababa University Press, 2003.
[46] Askale Lemma, "Seerluga Afaan Oromoo," Unpublished handout for Oromo Syntax, AAU, Addis Ababa, 1997.
1. Biqiltoonnis\NOUN ta‘an\CONJ bineeldonni\NOUN jiraachuuf\ADJ ,\COMA nyaata\NOUN isaan\PRON barbaachisa\VERB .\PUNCT
2. Xurii\NOUN qaama\NOUN keessaa baasuuf, bishaan\NOUN tumsa\VERB guddaa\ADV godha\VERB .\PUNCT
3. Akaakuu\ADJ nyaataa\NOUN qaamaaf\NOUN barbaachisan\ADJ filachuun\VERB fayyaa\NOUN keenyaaf\PRON gaarii\ADV dha\ADP .\PUNCT
4. Guddinaa\NOUN fi\CONJ jabina\NOUN qaamaa\NOUN argachuuf nyaata\NOUN walmadaale\ADJ argachuun\ADV dansaa\VERB dha\ADP .\PUNCT
5. Waan arganne hunda nyaachu osoo hin taane, nyaata madaalamaa soorachuutu bu‘aa qaba.
6. Ani\PRON kaleesa\NOUN malee\CONJ haar‘a\NOUN nyaata\NOUN hin nyaane\VERB .\PUNCT
7. Marartun\PROPN gara\ADJ gabaa\NOUN demtee\VERB ,\COMA midhaan\NOUN bitte\VERB .\PUNCT
8. Bulchaan\PROPN dhengada\ADV sare\PROPN gamadaa\PROPN ajjesse\VERB .\PUNCT
9. Yoo\ADP dheebotte\NOUN ,\COMA bishaan\NOUN Amboo\PROPN dhugi\VERB .\PUNCT
10. Barumsi\NOUN waan\CONJ itti\ADP cimeef\ADJ ,\COMA addaan\ADV kute\VERB .\PUNCT
11. Isheen\PRON hoojii\NOUN ishii\PRON waan\CONJ beektuuf\NOUN ,\COMA mana\NOUN barumsaa\NOUN iraa\ADP haafte\VERB .\PUNCT
12. Yoo\ADP dhufuu\NOUN baattellee\ADJ ,\COMA xalayaa\NOUN naaf\PRON barreessi\VERB .\PUNCT
13. BoonaaN\PROPN biyyaa\NOUN alaatii\ADJ akka\ADP dhufeen\NOUN ,\COMA hiriyyoota\NOUN isaaf\PRON dubbii\NOUN godhe\VERB .\PUNCT
14. Yoo\ADP finfinnee\PROPN deemteef\NOUN ,\COMA meeshaa\NOUN naa\PRON bitta\VERB .\PUNCT
15. Bokkaa\NOUN cimaa\ADJ waan\CONJ roobeef\NOUN ,\COMA lagni\NOUN guutee\ADJ riqicha\NOUN cabse\VERB .\PUNCT
16. Namni\NOUN kamiyyu\ADJ taanaan\NOUN ,\COMA maqaa\NOUN mataa\ADJ isaa\PRON qaba\VERB .\PUNCT
17. Gargaarsi\NOUN mootummaa\NOUN duubaan\NOUN jiraannaan\NOUN umanni\NOUN misoomaaf\NOUN seexaa\NOUN cimaa\ADJ qaba\VERB .\PUNCT
MC = Main Clause; SC = Sub Clause
Print parses (y/n)? y
(CS
  (MC (NP (N Toolaan)) (PP (P akka)) (VP (PR ishee) (V jaalatee)))
  (SC (NP (N Haawwiin)) (VP (ADV siritti) (V bektee))))
[0.00041472000000000004]