
Ethiopia Information Communication Technology Annual Conference 2014


EICTAC 2014

CONFERENCE BOOKLET

JUNE 6, 2014

Website: www.ictcoe.org.et    Email: melkamu.hunegnaw@ictcoe.org.et
Website: www.ictet.org        Email: info@ictet.org

Organizer: Ministry of Communication and Information Technology (MCIT)

Address:
Tel: +251 115 53 36 23
Mob.: +251 911 48 25 73
Fax: +251 115 51 37 20
Website: www.mcit.gov.et
Email: seyoumme@yahoo.com
P.O. Box: 1028

Table of Contents

1. Message from the Conference Organization Committee
2. The Status and the Way Forward of ICT Curriculum in Ethiopian Secondary Schools
3. The Development of Amharic Morphological Analyzer Using Memory Based Learning
4. Amharic Anaphora Resolution Using Knowledge Poor Approach
5. Web Based Amharic Question Answering System for Factoid Questions Using Machine Learning Approach
6. A Mathematical Algorithm for Robust Color Retinal Image Analysis: A Preliminary Study
7. Automatic Recognition of Ethiopian License Plates
8. Designing Amharic Definitive Question Answering
The Status and the Way Forward of ICT Curriculum in Ethiopian Secondary Schools

Abebe Basazinew Bekele and Mulugeta Libsie


Department of Computer Science, College of Natural Sciences
Addis Ababa University
{abebeb2000, mlibsie}@yahoo.com

Abstract - This paper presents the results of a study of ICT education in Ethiopian secondary schools and its status of implementation, and suggests the way forward for future ICT education. The aim was to obtain an indication of the drawbacks of the ICT curriculum. The research, with a convenience sample, was carried out using a questionnaire, Focus Group Discussion (FGD) and observation (using a check list). The research revealed that ICT education in Ethiopian secondary schools has serious problems such as inappropriate contents, lack of resources and insufficient teachers' knowledge and skills. The research concluded that the teachers in most secondary schools are not qualified for the level. The study also revealed that schools have no sufficient resources such as hardware, software and Internet connection to teach ICT. We also found that teaching/learning materials such as textbooks and teachers' guides were not developed in a suitable way. This paper outlines the problems of teaching ICT as a subject in Ethiopian high schools and recommends actions for improvement.

Keywords: ICT, ICT Education in Secondary Schools, ICT Curriculum, ICT Teaching Resources, ICT Teachers' Qualification.

I. INTRODUCTION

Information and Communications Technology (ICT) is a broad term used to refer to the ways and means by which information is processed and communicated using automatic systems, such as computers and other telecommunication and office systems. ICT has become a vital part of the operation of most businesses and industries. The expanded use of computers in society has increased the number of employment opportunities available to individuals with some computer-related experience or training. In addition, the increasing use of information technology in contemporary society shows that ICT is absolutely necessary for development [1].

The United Nations Educational, Scientific and Cultural Organization (UNESCO) uses the term ICTs, or information and communication technologies, to describe "…the tools and the processes to access, retrieve, store, organize, manipulate, produce, present and exchange information by electronic and other automated means. These include hardware, software and telecommunications in the forms of personal computers, scanners, digital cameras, phones, faxes, modems, CD and DVD players and recorders, digitized video, radio and TV programmes, database programmes and multimedia programmes" [2].

The world is dynamic and is changing fast. Some of these changes are social and political, others are ecological. Some are evolutionary, others revolutionary [3]. No matter where we plan to live or what we plan to do, we can expect that constant and rapid change will be a normal part of life.

Technology, especially information and communication technology, is playing a large part in these changes. Advances in computer systems have facilitated advances in entertainment, in medicine, in military capacities, in the collection and dissemination of global news reports, and in the management of organizations. Continually evolving computer hardware and software have allowed organizations to significantly change their work processes; new products and services have been developed, new industries have emerged, and old companies and industries have failed. Moreover, the technologies continue to expand in their capabilities and applications [1].

Increasingly in the modern world, the acquisition of computer skills is becoming necessary for employment, educational development, and leisure. Computer studies intend to furnish students with a broad knowledge of the nature of information processing and of how ICT is used today [4]. Tomorrow's world is an ICT one, where information handling skills will be needed to improve the standard of learning and living. The world is becoming connected electronically by the Internet, the world-wide network through which we can all share information.

According to Sanders [5], to be an educated citizen today, we must be acquainted not only with computers themselves but also with what computers do, how they are put to work, and how they affect individuals and organizations in a society.
In the report titled "The National ICT for Development (ICT4D) Five Years Action Plan for Ethiopia", Dzidonu [6] mentioned that Ethiopia, recognizing the developmental potentials and opportunities of the information and technological revolution, has embarked on a process of economic transformation through the modernization of the key sectors of the economy, including agriculture, services and industry, through the deployment and exploitation of ICTs. Acknowledging the key role that the development, deployment and exploitation of ICTs can play in Ethiopia's economic and social development, the Government recognizes the need for the rapid development of the nation's information and communications infrastructure, as well as the development of the nation's educational, human and other development-focused resources, to bring about the change and transformation required to achieve improvements in the determinants of the nation's socio-economic performance.

In line with this national ICT strategy, the Ethiopian ICT syllabi are geared towards secondary school students and are designed to equip them with knowledge of computer skills, their applied use in the world of work, and a background for further training. They include theoretical and practical content in hardware, software, communication and data processing. The Ethiopian ICT curriculum has been in implementation for about fourteen years, but its status has not yet been explored. Therefore, from this crucial perspective, this research was conducted in Addis Ababa to identify the major problems of implementation and solutions to those problems, to compare the existing secondary school ICT curriculum with other countries' experience, and to suggest the way forward for the secondary school ICT curriculum of Ethiopia.

The rest of the paper is organized as follows. Section II briefly surveys the literature and some of the important works related to the work in this paper. Section III outlines the research approach and the research questions. Section IV presents the results of the study. Finally, Section V concludes the paper by outlining recommendations and future work.

II. LITERATURE REVIEW

We live in a technological world where ICTs are fundamental to most activities [10]. The importance of ICT in society is emphasized in enabling our future, which identifies ICT-literate citizens as central to a country's economic and social goals, to improving productivity and efficiency, and to building innovative capacity and competitiveness.

The ability of individuals to use ICT appropriately to access, manage and evaluate information, develop new understandings, and communicate with others is very important in order to participate effectively in society. These Statements of Learning and the professional elaborations view ICT as an integral tool in the learning process [10]. ICT has the potential to extend students' learning capabilities, engaging them in understanding concepts and processes in areas of learning and facilitating change in learning, thinking and teaching. New technologies are introduced continually, and existing ones become obsolete almost as soon as they appear [10]. The rapid evolution of the discipline has a profound effect on ICT education, affecting both content and pedagogy. The general aim of ICT education is to introduce the knowledge, skills, and culture needed to use and work with information and knowledge for citizens living in the knowledge-based society.

ICT education cannot cover all the rapid developments in modern Computer Science. Therefore, the goal is to emphasize general education by presenting the fundamentals of Computer and Information Science, providing students with the basic knowledge and skills for exploring different fields of science, and preparing them for their future life and role in the knowledge-based society.

The impact of educational level on economic development is more pronounced with the recent growth of ICT and its increasing importance in social and economic development. This has profound implications for education, both in how ICTs can be used to strengthen education and in how education can be more effective in promoting the growth of ICT. However, education systems have changed very little in response. Without improved efficiencies in their education systems, developing nations will not likely be able to provide the additional human capital required to achieve economic self-sufficiency in the context of a highly competitive global economy that is increasingly based on the electronic transfer and manipulation of information [10].

Lim [15] showed that a socio-cultural approach towards the study of ICT in schools rejects the view that ICT can be studied in isolation, or as a single variable in the learning environment holding all other things constant. The research was carried out in two phases. Phase one comprised a self-reporting questionnaire sent out to all schools in Singapore. One of the main objectives of the questionnaire was to assess the level of ICT integration in schools by identifying the various socio-cultural elements that influence the successful integration of ICT in schools. The other objectives were to serve as a screening phase to identify the case studies for phase two of the study, and to refine and guide the direction of phase two. The questionnaire explored different aspects of ICT integration in schools, including school ICT culture (leadership support, exchange of ideas and experiences, and extent of staff involvement in review of the school ICT program), pupil use of ICT (proficiency of pupils in the use of ICT and pupils' usage for learning), teacher use of ICT (teachers' proficiency in the use of ICT and integration of ICT by teachers in the classroom), management of ICT resources (accessibility of ICT resources and monitoring of ICT resources to optimize usage), and staff development (opportunities for staff development in the area of ICT integration and review of staff development to meet the needs of ICT integration).
The findings showed that a sociocultural approach towards the study of ICT in schools rejects the view that ICT can be studied in isolation; it must be studied within the learning environment and the broader context in which it is situated. The paper argued for a more holistic approach to studying ICT in schools by adopting a sociocultural perspective. It proposes a theoretical framework based on the activity system as a unit of analysis that is surrounded by different levels of ecological circles.

National Policy for ICT in Education

Information and communication technology is a powerful tool for the development of quality teaching and learning; it is a catalyst for radical change in existing school practices and an absolute vehicle for preparing students for the future [11]. Success in the implementation of an ICT policy will depend on the recognition of the importance of its strategic application to education and on sustainable implementation. Maximizing ICT potentials will involve a quality ICT policy, greater involvement of private and public sectors in funding the implementation, and proper implementation and monitoring.

Different countries in the world have national ICT policies. Looking specifically at the "Human Resource Development" strategy in the ICT policy, the Ethiopian Government is offering ICT education and training in secondary and tertiary educational institutions with the aim of creating ICT literacy and the basis for the proliferation of ICT professionals in the country [6]. As the ICT development sector strengthens, the need for ICT professionals will continue to grow alongside it. On top of this, in order to make the community benefit from ICT, it will be appropriate to equip it with basic knowledge and awareness of computers and related technology.

Dimensions of ICT Learning

According to Paas [7], there are three separate aspects of ICT in school education: a) using ICT as a tool to support teaching and learning processes, for example using a word processor, spreadsheet or database in other subject areas such as mathematics or science; b) learning through ICT, where the ICT facility becomes the whole learning environment by providing learning materials, such as an LMS (Learning Management System) or Web-based learning; and c) learning ICT as a subject, that is to say learning the knowledge, concepts, skills, and processes of ICT. Using ICT as a tool and learning through ICT offer very little opportunity for students to learn the knowledge, concepts, and skills needed to master ICT as a subject. Learning ICT is more than the ability to operate and use a computer system. Hence, the acquisition of technical skills addresses only part of the problems encountered in teaching and learning ICT as a subject. Indeed, ICT education includes a sophisticated set of higher-order skills and cognitive abilities, such as analyzing, designing, implementing, collecting and retrieving, organizing and managing, interpreting and representing, evaluating and creating information.

Comparison of ICT Curricula among Different Countries

Kargiban and Kaffash [8] described that ICT now plays an important role in the curricula of England, Malaysia, America, Canada, India, and China. It is not only taught as a discrete subject but is also a useful tool for other subjects in the curriculum. However, ICT as a subject discipline in Malaysia and China is more or less different from its characteristics in the National Curriculum of England, America, India, and Canada.

The ICT curriculum structures of England and America have a constructivist theory background. In England, the curriculum structure of each subject includes the key stage, and the programmes of study set out what pupils should be taught, while the attainment targets set out the expected standards of pupils' performance [9]. It is up to schools to choose how they organize their school curriculum to include the programmes of study. ICT tools are used to find, use, analyze, interpret, evaluate and present information for a range of purposes. The key skills in the National Curriculum include: 1) information-processing skill; 2) reasoning skill; 3) enquiry skill; and 4) creative thinking and evaluation skill.

The Jordan ICT curriculum [12] is split into three levels: introductory, intermediate and advanced. At each level there is the specification itself, the objectives which a student is expected to attain, and some suggestions as to how these objectives might be met.

As stated in [13], the Education Policy of Botswana aims to prepare Botswana for a transition from a traditional agro-based to an industrial economy.

A study shows that the implementation of the ICT content is not satisfactory in secondary schools in Mongolia [14]. The study shows an implementation gap between urban and rural schools.

III. RESEARCH APPROACH/METHODS

In this research, we first thoroughly studied related literature on the ICT curriculum experience of other countries. A set of contents relevant to our ICT education was identified and selected. Moreover, we reviewed related works on dimensions of ICT learning at secondary schools.

A survey was conducted to collect both quantitative and qualitative data about the status of ICT education in Ethiopian secondary schools.
The instruments of the study are questionnaires (for university instructors and secondary school teachers), a structured Focus Group Discussion for students, and a laboratory observation (with a checklist). The instruments were pilot tested to check content validity and readability, and to estimate the time needed by the participants.

Since the study was descriptive in nature, convenience sampling was used. The sample consisted of 4 departments from 3 private and public universities/colleges that have ICT-related education in place and 10 secondary schools comprising grades 9 to 12.

The data collected through the questionnaire and observation checklist were tabulated and analyzed manually. Data obtained through the FGDs, which were in Amharic, were translated into English. After the data gathered through the FGDs were examined, they were categorized according to their similarities, which served to group related issues so that the data were organized on the basis of the research questions. The validity of the data was strengthened using triangulation methods, comparing students' perspectives, instructors' perspectives and the researcher's observation of events. Finally, conclusions were drawn based on the results of the analysis.

Research Questions

This study attempted to address the following basic questions:

1. Are the current ICT syllabi relevant and appropriate to Ethiopian secondary schools?
2. To what extent are our secondary school ICT curriculum contents in line with those of other developed and developing countries?
3. Which knowledge and skills of ICT education do students gain at secondary school that help them in the world of work and in further study in higher education?
4. What is the status of teachers' qualification to teach ICT at secondary schools?
5. What is the status of resource availability (computers, computer laboratories, Internet connection, software, etc.) to teach ICT at secondary schools?
6. To what extent are relevant and appropriate ICT textbooks, teachers' guides and reference materials available?

IV. RESULTS AND DISCUSSION

The main focus of this research is to explore the implementation status of the current ICT curriculum and the availability of all the required inputs, and to forward recommendations for the proper implementation of the ICT curriculum in Ethiopian secondary schools by assessing the needs of students, teachers and educators. Accordingly, based on the analysis, the following major results were drawn.

• ICT teachers' background and training to teach the current ICT curriculum

The educational background and training of ICT teachers is acceptable (98%) in Addis Ababa but, according to the annual educational abstract of 2011/12 [16], it is very low in other regions, where qualified teachers at secondary schools are less than 60%. This shows that most schools do not have adequate and qualified ICT teachers. Similarly, most of the ICT teachers did not obtain any training related to new contents, such as control and learning with Logo and image processing and multimedia systems, before implementing the curriculum at secondary schools. This shows that our schools do not have qualified ICT teachers in every region.

The ICT teaching/learning process also lacks teachers' pedagogical science knowledge as well as other subject matter knowledge and skills needed to implement the current curriculum contents. The research shows that most (95%) teachers are doing their work with their instinctive knowledge of teaching (lecturing), which has a negative implication for the teaching/learning process, and teachers do not get in-service training to improve their teaching profession.

• Availability of ICT resources at school ICT laboratories

A related challenge is that all public schools have limited resources and low access to the Internet. Those schools that have Internet connectivity provide e-mail service only, and it is available only to the administration and teachers. Access to ICT resources by teachers is also limited, especially to computers, which makes it difficult to assume that teachers can effectively transfer ICT knowledge and skills in their teaching.

Observation of ICT laboratories at schools showed that teachers could not help students to develop practical skills due to lack of resources. Mostly, more than five students sit at one computer just to watch, not to practice, the ICT lessons. ICT laboratories are not sufficient at schools: 50-75 students enter one laboratory session in public schools. This indicates that teachers face big challenges in conducting ICT practical sessions in laboratories.

Looking at the computer-to-student ratio, 85% of all the surveyed schools have a ratio between 1:3 and 1:6, and in private schools the ratio is 1:2. In 5% of the public schools, the computer-to-student ratio is above 1:6.

Every ICT lesson should be supported by practice in computer laboratories so that better knowledge and skills are acquired and developed by students. This implies that due attention is not given to the field.
Therefore, supervision and support are expected from the responsible bodies to organize additional ICT laboratories, reduce the large class sizes in one laboratory session, and bring better change in the education system. In addition, teachers' practical ICT skills need to be upgraded with short in-service training on new and difficult contents.

• Appropriateness of the name ICT for the subject

As shown in Table 1, most of the participating university instructors (80%) and secondary school ICT teachers (81%) are in favor of the name ICT for the subject, but a considerable number of university instructors (20%) and secondary school ICT teachers (19%) disagree with the name ICT for the subject.

Table 1: Participants' opinion regarding the name of the subject ICT

Do you agree with the name ICT?    Yes    No
University instructors             80%    20%
Secondary school teachers          81%    19%

Various reasons were given by the participants, among which the major ones are the following.
- ICT has a broad meaning, so it is better to use a more specific name appropriate to the subject, such as "Introduction to Computers".
- ICT is a broad term that encompasses all sorts of communications and devices such as computers and computer networks. So, it would be better to use "Computer and its Applications".
- Since ICT holds information and communication together, it is not an appropriate name for the subject.

• Participants' perception of the major changes made in versions of the ICT curriculum contents

The ICT curriculum has been revised three times since 1994, every five years. As per the study, 89% of the participants indicated that the previous first and second versions were generic and limited only to application software, whereas the current version is good and includes many fundamental ICT concepts. However, participants said that the time allotted to cover the content is in question and that some contents need proper scoping and feasibility checks.

• Contents of the current ICT curriculum that are difficult to teach/learn

As shown in Table 2, the research identified contents of the current ICT curriculum that are difficult for different reasons. These are, for secondary school teachers and university instructors respectively, Exploring the Internet (84%, -), Control and Learning with Logo (46%, 89%), and Image Processing and Multimedia Systems (86%, 82%). Based on the views of the study participants and considering the experiences of other countries, it will be advisable to replace these difficult contents with appropriate and feasible ones in subsequent revisions.

Table 2: Difficult contents from the current ICT curriculum (percentage of participants who claimed the content is difficult to teach)

Content                                   Sec. school teachers   University instructors
Exploring the Internet                    84%                    -
Basic troubleshooting                     15%                    44%
Image processing and multimedia system    86%                    82%
Control and learning with Logo            46%                    89%
Information and computer security         -                      6%

The reasons given and the percentage of secondary school teachers are shown in Table 3.

Table 3: Reasons given for the contents considered difficult to teach

Reason                    Percentage of teachers
Lack of training          44%
Lack of computers         73%
Lack of Internet access   92%
Lack of software          53%

• Students' interest in learning ICT

The research finding showed that 31% of ICT teachers responded that most students have no interest in learning ICT for different reasons. The major reasons given are: a) students are forced to give attention to other subjects which appear in the grade 10 and 12 national examinations, b) there are not sufficient resources in the computer laboratories to acquire the expected knowledge and skills, and c) students in private schools already know the contents from earlier grade levels, which teachers are only able to teach with certain constraints.

• Quality of ICT textbooks and availability of reference materials

The role of quality textbooks and reference materials is very significant for both students and teachers in teaching ICT. Most of the teachers (62%) said that they do not get any reference materials in their school libraries, and 29% of them said that they have access to some reference materials about old application software. This evidence indicates that there are no relevant reference materials in school libraries that support ICT education in secondary schools.
Teachers and students do not have the opportunity to enrich their ICT knowledge and skills using additional reference materials, beyond the knowledge they acquire by using the basic application software like MS Word, MS Excel, MS Access and MS Publisher.

It was also indicated that ICT textbooks are of very poor quality. Study participants said that the textbooks are full of irrelevant paragraphs, recommend unavailable software, do not have details on the use of new software, etc.

In addition, there is no Internet access to browse for additional materials to consolidate and deepen their knowledge.

• Knowledge and skills of ICT that students should gain and develop at secondary schools

Most of the university instructors (92%) said that students at this level should develop basic computer skills to operate computers, acquire some theoretical knowledge about ICT, and become competent in using the Internet, which are enough for secondary school students. They indicated that going beyond these and upgrading the content to follow other countries' experience is a very difficult task, since our limited resources do not allow the inclusion of other contents.

Some respondents (46%) have a different opinion on what students should learn. They said that students should learn basic computer knowledge in addition to some important knowledge and skills, like data handling, to help them in the world of work and to continue with their studies in higher education. Therefore, contents should go in line with other developing and developed countries' experience.

V. CONCLUSIONS AND RECOMMENDATIONS

Conclusions

Computer studies intend to furnish students with a broad knowledge of the nature of information processing and how ICT is used today. In Ethiopia, ICT education has been provided for secondary school students since 2000. Considering the findings of this study, it is concluded that the provision of ICT education at secondary schools in Ethiopia is very poor, for the following reasons. The current ICT curriculum did not consider students' background knowledge of computers, feasibility in our school situation, or other countries' experience. Most schools do not have qualified ICT teachers; teachers do not get training on new contents and lack pedagogical science knowledge. Moreover, secondary schools do not have sufficient ICT equipment and facilities and suffer from a scarcity of all types of ICT resources. More than five students sit at one computer just to watch, but not to practice, the ICT lessons. So, most students have no interest in learning ICT. In addition, the textbooks and teachers' guides are of very poor quality, and ICT reference materials are not available at schools except books on old versions of application software. These problems affect the ICT teaching/learning process, and none of the concerned bodies seem to make any interventions. Therefore, the curriculum needs revision, reconsidering those contents deemed not appropriate to students' background and schools' concrete situation, and then prioritizing, excluding and replacing them with appropriate contents to keep the curriculum standard.

As regards the ICT knowledge and skills that students should acquire and develop in secondary schools, university instructors and the literature review similarly indicated that the curriculum should include topics like basic computer manipulation with hardware and software issues, computer applications, basic computer troubleshooting, assembly and upgrading, problem solving and some basic programming concepts like basic web page design, basics of networking and communication, and exploring the Internet with Internet safety.

Recommendations

The study focuses on the status of the ICT education curriculum that has been in implementation for the past fifteen years. Most of the research findings need consideration and supervision by the concerned government sector and non-government bodies. Therefore, based on the findings of the study, the following recommendations are given as possible ways to improve the curriculum and the teaching/learning process.

• Recruit qualified teachers from the market and, in addition, provide in-service training on pedagogy and new contents for non-qualified teachers.
• Equip school ICT laboratories with the required resources like hardware, software and Internet connection.
• The naming of the subject ICT was controversial among different educators. The research finding indicated that the ICT contents lack communication. Therefore, the future ICT curriculum should include some basic knowledge of networking and communication media and devices in more detail.
• Quality teaching/learning materials (textbooks and teachers' guides) and the availability of reference materials have a great role in the provision of quality education. They also broaden students' learning experiences to meet different learning needs. However, ICT textbooks are not developed properly and ICT reference materials are not available at schools. Therefore, concerned bodies should be involved in and support the development of quality textbooks and teachers' guides. In addition, schools should be supported with reference materials and Internet infrastructure for students and teachers to browse ICT-related information to broaden their knowledge.
• The findings indicated that students have no interest in following the ICT subject at secondary schools for different reasons. To motivate students, concerned bodies should fulfill the resources, revise the curriculum, and make the subject appear in the national examinations.
• Based on the opinion of respondents and the experience of other countries, the study suggests that the curriculum include the following topics at the various grade levels with varying complexity: 1) Computer system, 2) Computer and its applications, 3) Computer care and safety, 4) Basic computer troubleshooting (assembly and upgrading), 5) Information systems and social media tools, 6) Networks and communication, 7) Exploring the Internet and Internet safety, 8) Basic application of ICT in real life, and 9) Problem solving and some basic programming concepts.

Future Work

Due to time and financial constraints, the research was done on a sample of ten urban private and public secondary schools and three public and private universities/colleges in Addis Ababa. As future work, the status of ICT education in Ethiopian secondary schools will be studied in a wider and deeper scope, taking into consideration sample urban and rural schools from all over the regions of the country.

REFERENCES
[1] Fikre Sitot and Belye Tedla, “Fundamentals of Information
Technology”, Mega Publishing Enterprise, 2002.
[2] Said Hadjerrouit, “Didactics of ICT in Secondary Education:
Conceptual Issues and Practical Perspectives”, 2009.
[3] Terence Driscoll and Bob Dolden, “Computer Studies and
Information Technology”, Macmillan, 1997.
[4] ICT in School Education (Primary and Secondary) 2010,
Selected Educational Statistics, Government of India, Ministry
of Human Resource Development, New Delhi, 2006 – 07.
[5] Donald H. Sanders, “Inside Computers Today”, McGraw-Hill
Book Company, United States of America, 1985.
[6] Clement Dzidonu, “The National ICT for Development (ICT4D)
Five Years Action Plan for Ethiopia [2006 – 2010] ”, A United
Nations Development Programme (UNDP) Initiative, 2006.
[7] Leslie Paas, “How Information and Communications
Technologies Can Support Education for Sustainable
Development”, 2008.
[8] Zohreh Abedi Kargiban and Hamid Reza Kaffash, "ICT Curriculum in Secondary School: A Comparison of Information and Communication Technology in the Curriculum among England, America, Canada, China, India, and Malaysia", 2012.
[9] Patti Swarts, “ICT as Core and as Elective Subject: Issues to
Consider”, 2004.
[10] Curriculum Corporation on behalf of its members, “Statements
of Learning for Information and Communication Technologies
(ICT)”, Australia, 2006.
[11] Mudasiru Olalere Yusuf, "Information and Communication Technology and Education: Analysing the Nigerian National Policy for Information Technology", International Education Journal, 2005.
[12] Ministry of Education, "A Model Standard ICT Curriculum of Middle and High Schools: General Guidelines for Comprehensive Secondary Education", Hashemite Kingdom of Jordan (Amman), 1997.
[13] P. T. Ramatsui, "Botswana Senior Secondary ICT Syllabus: Revised National Policy on Education", Gabarone, 1994.
[14] Uyanga Sambuu and Munkhtuyu Lxagvasuren, "Evaluation on ICT Education Contents Linkages in General and Tertiary Education of Mongolia", National University of Mongolia and Mongolian State University of Education, 2008-2009.
[15] Cher Ping Lim, "A theoretical framework for the study of ICT in schools: a proposal", British Journal of Educational Technology, Vol. 33, No. 4, 2002.
[16] Ministry of Education, EMIS, "Education Statistics Annual Abstract 2004 E.C (2011/12 G.C)", Addis Ababa, Ethiopia, 2012.

The Development of Amharic Morphological Analyzer Using Memory Based Learning

Mesfin Abate and Yaregal Assabie
Department of Computer Science, Addis Ababa University
Addis Ababa, Ethiopia
messyimam@gmail.com, yaregal.assabie@aau.edu.et

Abstract— Morphological analysis of highly inflected languages like Amharic is a non-trivial task because of the complexity of the morphology. In this paper, we propose a supervised, data-driven experimental approach to developing an Amharic morphological analyzer. We used a memory-based supervised machine learning method which extrapolates new unseen classes based on previous examples in memory. We treated morphological analysis as a classification task which retrieves the grammatical functions and properties of morphologically inflected words. As the task was geared towards analyzing vowelled inflected Amharic words with the grammatical functions of their morphemes, the morphological structure of words and the way they are represented in memory-based learning were exhaustively investigated. The performance of the model was evaluated using 10-fold cross-validation with the IB1 and IGtree algorithms, resulting in overall accuracies of 93.6% and 82.3%, respectively.

Keywords— Amharic Morphology; Memory-Based Learning; Morphological Analysis

Introduction

Morphological analysis helps to find the minimal units of a word which hold linguistic information for further processing. Morphological analysis plays a critical role in the development of natural language processing (NLP) applications. In most practical language technology applications, morphological analysis is used to perform lemmatization, in which words are segmented into their minimal meaningful units [11]. In morphologically complex languages, morphological analysis is also a core component in information retrieval, text summarization, question answering, machine translation, etc. There are two broad categories of approaches in computational morphology: rule-based and corpus-based. Currently, the most widely applied rule-based approach to computational morphology uses the two-level formalism. In the rule-based approach, the formulation of rules for languages makes the development of a morphological analysis system costly and time consuming [4], [11]. Because of the need for hand-crafted rules for the morphology of languages and the intensive involvement of linguistic experts required by rule-based approaches, there is considerable interest in robust machine learning approaches to morphology which extract linguistic knowledge automatically from an annotated or unannotated corpus. Machine learning approaches have two learning paradigms: unsupervised and supervised learning. A supervised approach learns by example, whereas an unsupervised approach learns by patterns. Machine learning approaches that use the supervised learning paradigm include inductive logic programming (ILP), support vector machines (SVM), hidden Markov models (HMM) and memory-based learning (MBL). These paradigms have been used to implement low-level linguistic analysis such as morphological analysis [2], [3], [7]. Among the various alternatives, the choice of approach depends on the problem at hand. In our work, we employed MBL to develop a morphological analyzer for Amharic, partly motivated by the limitations of previous attempts using rule-based [7] and ILP [10] approaches. Memory-based learning has promising features for analyzing NLP tasks like POS tagging, text translation, chunking and morpho-phonology due to its capability of incremental learning from examples. Among the MBL algorithms, IB1 and IGtree are known to be popular. Both algorithms rely on the k-nearest neighbor classifier, which uses a distance metric to measure the distance between neighboring feature vectors [4], [5], [9].
The remaining part of the paper is organized as follows. Section 2 presents the characteristics of the Amharic language, with special emphasis on its morphology. In Section 3, we present the proposed system for morphological analysis. Section 4 presents experimental results, and conclusions and future work are highlighted in Section 5. References are provided at the end.

Characteristics of the Amharic Language

The Amharic Language

Amharic is an official working language of Ethiopia and is widely spoken throughout the country as a first and a second language. It belongs to the Semitic branch of the Afro-Asiatic super-family of languages and is thus related to Hebrew, Arabic and Aramaic. Amharic is the second most widely spoken Semitic language, next to Arabic. It uses a unique script called "fidel" which is conveniently written in a tabular format of seven columns. The first column represents the basic form, and the other orders are derived from it by more or less regular modifications indicating the different vowels. The base vowel in the script is the character "አ" along with its six derived forms. Amharic has 33 base characters, which leads to a total of 231 (=33*7) Amharic characters. In addition, there are about two score labialized characters such as ሏ (lwa), ሟ (mwa), ሯ (rwa), ሷ (swa), etc. used by the language for writing.

Amharic Morphology

Like other Semitic languages, Amharic is one of the most morphologically complex languages. It exhibits a root-pattern morphological phenomenon [1]. A root is a set of consonants (also called radicals) which has a basic lexical meaning [12]. A pattern consists of a set of vowels which are inserted among the consonants of a root to form a stem. In Semitic languages, and particularly in Amharic, verbal stems consist of a "root + vowels + template" merger. For instance, the root verb 'sbr' + 'ee' + CVCVC leads to the stem 'seber' ("ሰበር"/'broke'). In addition to such non-concatenative morphological features, Amharic uses different affixes to create inflectional and derivational morphemes. Affixation can be prefix, infix, suffix and circumfix. The morphological complexity of the language is better understood by looking at the word formation process through inflection and derivation.

Inflection

Amharic nouns are inflected for number, definiteness, case (accusative/objective, possessive/genitive) and gender. Amharic adjectives, in a similar affixation process to that of nouns, can be marked for number, definiteness, case and gender. The affixation of morphemes to express number is similar to that of nouns, except for some plural formations. On the other hand, Amharic verbs are inflected for any combination of person, gender, number, case, tense/aspect and mood. As a result, tens of thousands of verbs (in surface forms) are generated from a single verbal root. As verbs are marked for various grammatical units, a single verb can form a complete sentence, as shown in the example yisebreñal ('he will break me'). This verb (sentence) is analyzed as follows.

verbal root: sbr ('to break')
verbal stem: sebr ('will break')
subject: yi…al (he)
object: eñ (me)

Derivation

Amharic nouns can be derived from adjectives, verbal roots (by inserting vowels between consonants), stems, stem-like verbs and nouns themselves. Few primary adjectives (which are not derived) exist in the language. However, many adjectives can be derived from nouns, stems, compound words and verbal roots. Adjectives can also be derived from roots either by intercalation of vocalic elements or by attaching a suffix to bound stems. Amharic verbs can also be derived from different verbal stems in many ways.

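To make the root-and-pattern formation described above concrete, the following minimal Python sketch interdigitates the consonantal root 'sbr' with the vowel pattern 'ee' over the CVCVC template; the function name and the use of romanized forms are illustrative choices, not part of the paper.

# Minimal sketch of the "root + vowels + template" merger described above.
# Romanized forms only; the helper name is illustrative.

def interdigitate(root, vowels, template):
    """Fill a CV template: C slots take root consonants, V slots take pattern vowels."""
    consonants, vowel_pattern = list(root), list(vowels)
    return "".join(consonants.pop(0) if slot == "C" else vowel_pattern.pop(0)
                   for slot in template)

print(interdigitate("sbr", "ee", "CVCVC"))   # -> "seber" ('broke')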
The Proposed Amharic Morphological Analyzer

System Architecture

As memory-based learning is a machine learning approach, our morphological analyzer contains a training phase which consists of morpheme annotation to manually annotate inflected Amharic words, feature extraction to create instances within fixed-length windows, and parameter optimization and algorithm selection to tune and select some of the parameters and algorithms. On the other hand, the morphological analysis component contains feature extraction to de-construct a given text, morpheme identification to classify and extrapolate, and stem and root extraction to label segmented inflected words with their morpheme functions. The architecture of the proposed Amharic morphological analyzer is depicted in Figure 1.

Training Phase

The training process requires sample patterns of words showing the changes in the internal structures of words. Amharic morphemes may predominantly be expressed by internal phonological changes in the root. These internal irregular changes of phonemes make morphological analysis cumbersome. Finding the roots of Amharic verbs is not a trivial task. Hence, we investigated the morphological formation of the Amharic language, particularly nouns and verbs. Adjectives have a similar derivation and inflection process to that of nouns. A morphological database is built after identifying the common properties of all morphological formations of Amharic nouns (and adjectives) and the grammatical features of all the morphemes. As to Amharic verbs, it is too difficult to find a single representation or pattern of verbs, as they differ in type due to a number of morphological and phonological processes. Therefore, we consider the most significant part of the word, the stem, which bears meaning next to the root. In Amharic grammar, the stem of a word is the main part which remains unchanged when the ending changes. Thus, we manually annotated sample words with their patterns, and these data are used as training data.

Morpheme Annotation

Amharic nouns have more than 2 and 7 affixes in the prefix and suffix positions, respectively. The affixation is not arbitrary; rather, affixes attach in an ordered manner. An Amharic noun consists of a lexical part, or stem, and one or more grammatical parts. This is easy to see with a noun, for example the Amharic noun ቤቶቻቸውን (bEtocacewn/'their houses'). The lexical part is the stem ቤት (bEt/'house'); this conveys most of the important content in the noun. Since the stem cannot be broken into smaller meaningful units [8], it is a morpheme (a primitive unit of meaning). The word contains three grammatical suffixes, each of which provides information that is more abstract and less crucial to the understanding of the word than the information provided by the stem: -oc, -acew, and -n. Each of these suffixes can be seen as providing a value for a particular grammatical feature (or dimension along which Amharic nouns can vary): -oc (plural marker), -acew (third person plural neuter), and -n (accusative). Since each of these suffixes cannot be broken down further, each can be considered a morpheme. Generally, these grammatical morphemes have a great role in understanding the semantics of the whole word [7], [12].
Fig. 1. Architecture of the proposed Amharic morphological analyzer

Table 1. Example showing annotation of nouns


The following tasks were identified and performed to prepare the annotated datasets used for training: identifying inflected words; segmenting the word into prefix, stem and suffix; putting a boundary marker between each segment; and describing the representation of each marker. Morphemes that are attached next to the stem (as suffixes) may have seven purposes: plurality/possession, derivation, relativization, definiteness, negation, causative and conjunction. The annotation follows the prefix-stem-suffix ([P]-[S]-[S]) structure as shown in Table 1. The brackets ([ ]) can be filled with the appropriate grammatical features for each segmentation, where S, M, 1, K, D, and O indicate end-of-stem, plural, possession, preposition, derivative and object markers, respectively. Lexicons were prepared manually in such a way as to be suitable for extraction purposes.

Amharic verbs have four slots for prefixes and four slots for suffixes [1], [7], [10]. The positions of the affixes are shown as follows, where prep|conj stands for preposition or conjunction; rel for relativization; neg for negation; subj for subject; appl for applicative; obj|def for objective (definiteness) or object; and acc for accusative.

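Since the bodies of Tables 1 and 2 are not reproduced in this text-only version, the following Python sketch shows how one annotated noun entry from the description above might be stored before instance extraction. The field layout and marker assignments are assumptions consistent with the [P]-[S]-[S] scheme, not the paper's exact notation.

# Hypothetical representation of one annotated noun in the training lexicon.
# The layout and marker assignments below are illustrative assumptions based
# on the scheme described above (S = end of stem, M = plural, 1 = possession,
# O = object marker).

annotated_noun = {
    "surface": "bEtocacewn",         # ቤቶቻቸውን, 'their houses' (romanized)
    "prefixes": [],                   # no prefix on this word
    "stem": ("bEt", "S"),            # lexical stem 'house', end-of-stem marker
    "suffixes": [("oc", "M"),         # plural marker
                 ("acew", "1"),       # third person plural possessive
                 ("n", "O")],         # accusative/object marker
}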
In addition to analyzing all these affixes, the root-template pattern of Amharic verbs makes their morphological analysis complex. It is a challenging task to represent their features in a form suitable for the memory-based learning approach. Generally, Amharic verb stems are broken into verb roots and grammatical templates. A given root can be combined with more than 40 templates [1]. The stem is the lexical part of the verb and also the source of most of its complexity. To consider all morphologically productive verb types, we need a morphologically annotated word list with its possible inflection forms. The tokens are then manually annotated in a similar fashion to what we did for nouns and adjectives, using the prefix [ ], stem [ ] and suffix [ ] pattern. The brackets can be filled with the appropriate grammatical features for each segmentation. The sample annotation for verbs is shown in Table 2.

Table 2. Examples of annotated verbs

Feature Extraction

Once the annotated words are stored in a database, instances are extracted automatically from the morphological database based on the concept of the windowing method [2], with a fixed length of left and right context. Each instance is associated with a class. The class represents the morphological category which the given word possesses. An instance usually consists of a vector of fixed length. The vector is built up of n feature-value pairs depending on the length of the vector. Each example focuses on one letter and includes a fixed number of left and right neighboring letters, using an 8-1-8 window of characters; together with the class label, this yields eighteen fields per instance. The largest word length in the manually annotated database was chosen to determine the window size. The input character in focus, plus the eight preceding and eight following characters, are placed in the window. Character-based analysis allows each character or letter to be considered individually. From the basic annotation, instances were automatically extracted, in a form suitable for memory-based learning, by sliding a window over each word in the lexicon.

For instance, the character-based representation of the word ስለሰበረችው (sleseberecw) is shown in Table 3. The '=' sign is used as a filler symbol which shows that there is no character at that position. The construction yields the 11 instances derived from the Amharic word, together with their associated classes. The class of the third instance is "K", representing the preposition morpheme 'sle' ending with the prefix 'e'. Therefore, the character-based representation of words exhaustively transcribes their deep structure of phonological processes and segments each character one at a time.

Table 3. Character based feature extraction of the word 'sleseberecw'

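Because Table 3 itself is not reproduced here, the following Python sketch is an illustrative reimplementation (not the authors' code) of the windowed instance extraction it depicts: the word is padded with the '=' filler, and for every character an instance of eight left neighbours, the focus character and eight right neighbours is emitted, with the class label appended during training. Romanized input and the per-character labels are assumptions.

# Minimal sketch of windowed instance extraction as described above. Words are
# assumed to be romanized; `labels` (one class per character) would come from
# the manually annotated lexicon.

FILLER = "="     # marks positions outside the word, as in Table 3
CONTEXT = 8      # eight preceding and eight following characters

def extract_instances(word, labels=None):
    """Slide a window over `word`, yielding one fixed-length instance per character."""
    padded = [FILLER] * CONTEXT + list(word) + [FILLER] * CONTEXT
    instances = []
    for i in range(len(word)):
        window = padded[i:i + 2 * CONTEXT + 1]   # L8..L1, focus, R1..R8
        if labels is not None:
            window = window + [labels[i]]        # class label as the last field
        instances.append(window)
    return instances

# The romanized word from the paper yields 11 instances (one per character).
for instance in extract_instances("sleseberecw"):
    print(" ".join(instance))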
Memory-Based Learning

Memory-based approaches borrow some of the advantages of both probabilistic and knowledge-based methods to successfully apply them to NLP tasks [5]. Classification is performed by analogy. In order to learn any NLP classification problem, different algorithms and concepts are implemented by reusing data structures. We used TiMBL as the learning tool for our task [2]. There are a number of parameters to be tuned in memory-based learning using TiMBL. Therefore, to get an optimal accuracy of the model, we used the default settings and also tuned some of the parameters. The optimized parameters are the MVDM (modified value difference metric) and chi-square from the distance metrics, IG (information gain) from the weighting metrics, ID (inverse distance) from the class voting weights, and k for the nearest neighbors. These optimized parameters are used together with the different classifiers. The classifier engines we used are IGtree and IB1, which construct databases of instances in memory during the learning process. The procedure for building an IGtree is described in [6]. Instances are classified by IGTree or by IB1 by matching them to all instances in the instance base. As a result of this process, we get a memory-based learning model which is used later during the morphological analysis phase.

Morphological Analysis

The training phase is the backbone of the morphological analysis module for successfully implementing the system. The morphological analysis is implemented using the memory-based learning model. Therefore, in this phase, feature extraction is used to make the input words suitable for memory-based learning classification, morpheme identification is applied to classify and extrapolate the class of new instances, the stem extraction process reconstructs and inserts identified morphemes, and finally root extraction is used to get root forms and stems with their grammatical functions.

Feature Extraction

Memory-based learning learns new instances by storing previous training data in memory. When a new word is given to be analyzed by the system, it is accepted and de-constructed into instances so as to have a representation similar to the one stored in memory. Feature extraction in this phase is different from the one described in the training phase: the word is de-constructed into fixed-length instances without listing (identifying) the class labels at the last index. For example, when a new, previously unseen word (which is not found in the memory) needs to be segmented, the word is similarly deconstructed and represented as instances using the same information. Each instance is compared to each and every instance in the training set recorded by the memory-based learner. In doing so, the classifier will try to find the training instance in memory that most closely resembles it. For instance, the word በጎች (begoc) is segmented and its features are extracted as shown in Figure 2.

Fig. 2. Feature extraction for morphological analysis

Morpheme Identification

When new or unknown inflected words are de-constructed as instances and given to the system to be analyzed, an extrapolation is performed to assign the most likely neighborhood class with its morphemes based on their boundaries. The extrapolation is based on the similarity metric applied to the training data. If there is an exact match in the memory, the classifier returns (extrapolates) the class of that instance to the new instance. Otherwise, the new instance is classified by analogy with instances in memory that have similar feature vectors, extrapolating a decision from their classes. For example, the features of the unknown verb ለነገረችው (lenegerecw) can be segmented as shown in Figure 3. Each instance is compared to each and every instance in the training set recorded by the memory-based learner. In doing so, the classifier tries to find the training instance in memory that most closely resembles it. Taking the features of ለነገረችው (lenegerecw) as shown in Figure 3, this might be instance 10 in Table 3, as they share almost all features (L8, L7, L5, L3-L1, F, R1-R8), except L6 and L4. In this case, the memory-based learner extrapolates the classes of this training instance and predicts them to be the classes of the new instance.

Fig. 3. Instances for the unknown token ለነገረችው (lenegerecw)

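The paper relies on TiMBL for this classification-by-analogy step; since no TiMBL invocation is shown, the sketch below is a simplified, self-contained stand-in for an IB1-style classifier (information-gain feature weighting, overlap distance, k nearest neighbours) rather than the authors' actual configuration. The function names are mine.

# Simplified IB1-style memory-based classifier: overlap distance weighted by
# information gain, k nearest neighbours, majority vote. A stand-in sketch for
# the TiMBL setup described above, not the authors' implementation.
import math
from collections import Counter, defaultdict

def information_gain_weights(instances, classes):
    """Weight each feature position by how much knowing it reduces class entropy."""
    def entropy(labels):
        total = len(labels)
        return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())
    base = entropy(classes)
    weights = []
    for f in range(len(instances[0])):
        groups = defaultdict(list)
        for inst, cls in zip(instances, classes):
            groups[inst[f]].append(cls)
        remainder = sum(len(g) / len(classes) * entropy(g) for g in groups.values())
        weights.append(base - remainder)
    return weights

def classify(new_instance, instances, classes, weights, k=1):
    """Extrapolate the class of `new_instance` from its k most similar stored instances."""
    def distance(a, b):
        # Weighted overlap metric: every mismatching feature adds its weight.
        return sum(w for x, y, w in zip(a, b, weights) if x != y)
    ranked = sorted(zip(instances, classes),
                    key=lambda pair: distance(new_instance, pair[0]))
    votes = Counter(cls for _, cls in ranked[:k])
    return votes.most_common(1)[0][0]

In such a setup, the windowed instances from the earlier sketch would form the instance base, and an unseen word would be classified character by character before its morphemes are reassembled.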
Stem Extraction

After the appropriate morphemes are identified, the next step is the stem extraction process. In stem extraction, reconstruction of the individual instances into meaningful units (their original word form) and insertion of the identified morphemes at their segmentation points are performed. After stem extraction, the system searches for resembling instances among the previously stored patterns in memory. If there is no similar instance in memory, it uses a distance similarity metric to find the nearest neighbor. The modified value difference metric (MVDM), which looks at the co-occurrence of feature values with the target classes, is used to determine the similarity of feature values. For example, the reconstruction of the whole set of instances of the word 'slenegerecw' is shown in Figure 4. In the example, four non-null classes are predicted in the classification step. In the second step, the letters of the morphemic segments are concatenated and morphemes are inserted. Then, root extraction can be performed in the third step.

Fig. 4. Reconstruction of the word 'slenegerecw'

Root Extraction

The smallest unit morpheme for nouns and adjectives is the stem. Thus, the root extraction process is not applied to nouns and adjectives. Root extraction from verbal stems is not a complex task in Amharic, as roots are the consonants of verbal stems. In order to extract the root from a verbal stem, we simply remove the vowels from the verbal stem. However, there are exceptions, as vowels in some verbal stems (e.g. when the verbal stem starts with a vowel) serve as consonants. In addition, vowels should not be removed from mono- and bi-radical verb types, since these have valid meanings when they end with vowels.

Experiment

The Corpus

In order to evaluate the performance of the model and the learnability of the dataset, we conducted the experiment by combining nouns and verbs. To get an unbiased estimate of the accuracy of a model learned through machine learning, it should be tested on unseen data which is not present in the training set. Therefore, we split our data set into training and testing sets. Our corpus contains a total of 1022 words, of which 181 are verbs and 841 are nouns (adjectives are considered as nouns as they have a similar analysis). The number of instances extracted from nouns and adjectives is 1356 and from verbs 6719, which makes a total of 8075 instances. A total of 26 different class labels occur within these instances.

Test Results

Simply splitting the corpus into a single training and testing set may not give the best estimate of the system's performance. Thus, we used the 10-fold cross-validation technique to test the performance of the system with the IB1 and IGtree algorithms. This means that the data is split into ten equal partitions, and each of these is used once as the test set, with the other nine as the corresponding training set. This way, all examples are used at least once as a test item, while keeping training and test data carefully separated, and the memory-based classifier is trained each time on 90% of the available training data. We also used leave-one-out (LOO) cross-validation for the IB1 algorithm, which uses all available data except one (n-1) example as training material; it tests the classifier on the one held-out example, repeating this for all examples. However, we found it too time consuming to use leave-one-out cross-validation for the IGtree algorithm. Table 4 shows the performance of the system for the optimized parameters.

Table 4. Test result for Amharic morphological analysis

Evaluation method   Type of algorithm   Time taken (in seconds)   Space requirement (in bytes)   Accuracy (%)
LOO                 IB1                 30.99000                  1,327,460                      96.40
10-fold             IB1                 0.82077                   1,213,656                      93.59
10-fold             IGtree              0.03711                   1,136,582                      82.26
IGtree 0.03711 1,136,582 82.26
with vowels.
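To make the classification and evaluation procedure concrete, the sketch below illustrates the same idea with off-the-shelf tools: words are deconstructed into fixed-length instances of left/right context characters, a nearest-neighbour classifier stands in for TiMBL's IB1 algorithm, and k-fold cross-validation scores the result. The toy training words, the label set and the window width are assumptions for illustration only, not the authors' actual data or settings.

```python
# A minimal sketch of memory-based morpheme-boundary classification,
# assuming romanized input and a toy training set; scikit-learn's
# k-nearest-neighbour classifier stands in for TiMBL's IB1 algorithm.
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

WINDOW = 4  # characters of left/right context (the paper uses wider windows)

def to_instances(word):
    """Deconstruct a word into one fixed-length instance per character:
    left context, the focus character, and right context."""
    padded = "_" * WINDOW + word + "_" * WINDOW
    instances = []
    for i, _ in enumerate(word):
        j = i + WINDOW
        feats = {f"L{k}": padded[j - k] for k in range(1, WINDOW + 1)}
        feats["F"] = padded[j]
        feats.update({f"R{k}": padded[j + k] for k in range(1, WINDOW + 1)})
        instances.append(feats)
    return instances

# Toy annotated words: one class label per character ("0" = no boundary,
# "B"/"S" mark illustrative prefix/suffix morpheme boundaries).
training = [
    ("lenegerecw", ["B", "0", "0", "0", "0", "0", "0", "0", "S", "S"]),
    ("silenegerecw", ["B", "B", "B", "0", "0", "0", "0", "0", "0", "0", "S", "S"]),
]

X, y = [], []
for word, labels in training:
    X.extend(to_instances(word))
    y.extend(labels)

model = make_pipeline(DictVectorizer(sparse=False),
                      KNeighborsClassifier(n_neighbors=1, metric="hamming"))
# The paper uses 10-fold cross-validation; the toy data only supports 2 folds.
print(cross_val_score(model, X, y, cv=2).mean())

# Analyzing an unseen word: classify each character position.
model.fit(X, y)
print(model.predict(to_instances("negerecw")))
```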

Conclusion

Many high-level NLP applications heavily rely on a good morphological analyzer. Few attempts have been made so far to develop an efficient morphological analyzer for Amharic. However, due to the complexity of the inherent characteristics of the language, it has been found difficult to develop a full-fledged Amharic morphological analyzer. This research work is aimed at developing an Amharic morphological analyzer using a memory-based approach. Compared to previous works, our system performed well and provided promising results. For example, in the work of Gasser [7], the system (which is rule-based) does not consider unseen or unknown words. To overcome this problem, Mulugeta and Gasser [10] developed an Amharic morphological analyzer using ILP. However, our system still performs better in terms of accuracy. Thus, our work adds value to the overall effort of dealing with the complex problem of developing an Amharic morphological analyzer. The performance of our system can be further enhanced by increasing the training data. Future work is recommended to be directed at looking beyond morpheme segmentation on individual instances: segmentation of full words and insertion of grammatical features into each segmented morpheme are expected to boost the performance of the system.

REFERENCES
[1] S. Amsalu and D. Gibbon. Finite state morphology of Amharic. In Proc.
of Inter. Conf. on Recent Advances in Natural Language Processing.
Borovets, pp 47–51, 2005.
[2] A. Bosch and W. Daelemans. Memory-based morphological analysis. In
Proc. of the 37th Annual Meeting of the Association for Computational
Linguistics. Stroudsburg, 1999.
[3] V. Bosch, et al. An efficient memory-based morpho-syntactic tagger and
parser for Dutch, In Proc. of the 17th Meeting Comp. Ling. in the
Netherlands. Leuven, Belgium, 2007.
[4] A. Clark. Memory-Based Learning of Morphology with Stochastic
Transducers. In Proc. of the 40th Annual Meeting of the Assoc. for
Comp. Ling., Philadelphia, 2002.
[5] W. Daelemans and A. Bosch. Memory-Based Language Processing.
Cambridge University Press, Cambridge, 2009.
[6] W. Daelemans, A. Bosch and Ton Weijters. IGTree: Using Trees for
Compression and Classification in Lazy Learning Algorithms. Artificial
Intell. Review 11:407-423, 1997.
[7] M. Gasser. HornMorpho: a system for morphological processing of Amharic, Oromo, and Tigrinya. In Proc. of Conf. on Human Lang. Tech. for Dev., Egypt, 2011.
[8] H. Hammarstrom and L. Borin. Survey Article Unsupervised Learning
of Morphology, Computational Linguistics, 37(2), 309-350, 2011.
[9] E. Marsi, A. Bosch and A. Soudi. Memory-based morphological analysis
generation and part-of-speech tagging of Arabic. In Proc. of the ACL
Workshop on Computational Approaches to Semitic Languages, pp 1–8,
2005.
[10] W. Mulugeta and M. Gasser. Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming. In Proc. of SALTMIL8/AfLaT, 2012.
[11] G. Pauw and G. Schryver. Improving the Computational Morphological Analysis of a Swahili Corpus for Lexicographic Purposes. Lexikos, 18:303-318, 2008.
[12] B. Yimam. የአማርኛ ሰዋስው (Amharic Grammar), 2nd ed. Eleni Printing Press, Addis Ababa, Ethiopia, 2000.

Amharic Anaphora Resolution Using
Knowledge-Poor Approach

Temesgen Dawit
Department of Computer Science, Addis Ababa University
Addis Ababa, Ethiopia
temesgen.dawit.er@gmail.com

Yaregal Assabie
Department of Computer Science, Addis Ababa University
Addis Ababa, Ethiopia
yaregal.assabie@aau.edu.et

Abstract— Building complete anaphora resolution systems that incorporate all linguistic information is difficult because of the complexities of languages. In the case of Amharic, it is even more difficult because of its complex morphology. In addition to independent anaphors, Amharic has anaphors embedded inside words (hidden anaphors). In this paper, we propose an Amharic anaphora resolution system that also treats hidden anaphors, in addition to independent ones. Hidden anaphors are extracted using an Amharic morphological analyzer. After anaphoric terms are identified, their relationships with antecedents are built by making use of the grammatical structure of the language along with constraint and preference rules. The system is developed based on a knowledge-poor approach in the sense that we use low levels of linguistic knowledge like morphology, avoiding the need for complex knowledge like semantics, world knowledge and others. The performance of the system is evaluated using the 10-fold cross-validation technique and experimental results are reported.

Keywords— Amharic anaphora resolution; knowledge-poor algorithm; hidden anaphors

Introduction

Anaphora resolution is the phenomenon of pointing back to an entity in a given text that has been introduced with a more descriptive phrase than the entity or expression which is referring back. The entity referred to in the text can be anything like an object, concept, individual, process, or any other thing [12]. The expression which refers back is called the anaphor, whereas the previous expression being referred to is called the antecedent [2]. Anaphora can be categorized as pronominal anaphora, lexical noun phrase anaphora, noun anaphora, verb anaphora, adverb anaphora or zero anaphora based on the form or syntactic category of the anaphor. Based on the location of the antecedents in the text, anaphors can also be classified as intrasentential or intersentential [9], [16], [18]. Anaphora resolution is a complex task which needs various types of linguistic knowledge such as morphology, syntax, semantics, discourse and pragmatics. Furthermore, different disciplines such as logic, philosophy, psychology, communication theory and other related fields are also involved in the process of anaphora resolution [8]. Due to such complexities and difficulties, most of the research performed in the area of anaphora resolution was aimed at the resolution of pronouns or one type of anaphora. Though the resolution of anaphora is complex, it needs to be addressed strongly as it is used as an important component in many natural language processing (NLP) applications like information extraction, question answering, machine translation, text summarization, opinion mining, dialogue systems and many others. Depending on the inherent characteristics of languages, the process of anaphora resolution may pass through several steps. Among the most important steps are identification of anaphors, identification of the location of candidates, and selection of an antecedent from the set of candidates [8]. Selection of an antecedent from the set of candidates is made by making use of constraint and preference rules [9], [14]. Constraint rules must be satisfied for any antecedent to be considered as a candidate antecedent for anaphora resolution. Gender, number and person agreement [6], [11], [18] and selectional restrictions [9], [18] are common constraint rules applied in most languages. Preference rules give more preference to antecedents when constraint rules are satisfied. They are applied on antecedents that passed the constraint rules and when a certain anaphor has more than one antecedent as a choice [9], [14]. Some of the preference rules are definiteness [7], givenness [13], recency [5], frequency of mention [9], etc.

Based on the level of linguistic information required to resolve anaphora, two broad categories have emerged: knowledge-rich and knowledge-poor approaches. Knowledge-rich anaphora resolution approaches employ linguistic and domain knowledge in great detail, starting from low-level morphological knowledge up to very high level knowledge like world knowledge. As a result, morphological analyzers, syntactic analyzers, semantic analyzers, discourse processors and domain knowledge extractors are needed to deal with anaphora resolution using knowledge-rich approaches [15]. On the other hand, knowledge-poor approaches are formulated to avoid the complex linguistic knowledge used in the anaphora resolution process. They are a result of looking for inexpensive solutions to satisfy the needs of NLP systems in a practical way. The development of low-level NLP tools like part-of-speech (POS) taggers and morphological analyzers attracted the use of the knowledge-poor approach. In general, the knowledge-poor anaphora resolution approach depends on a set of constraint and preference rules [13], [14]. Depending on the characteristics of languages, several of the aforementioned rules, techniques and approaches have been used to develop anaphora resolution systems for various languages across the world such as English [7], Turkish [9], Norwegian [6], Estonian [15], Czech [17], etc. However, to the best of our knowledge, there has been no attempt to develop an anaphora resolution system for Amharic. Thus, this work is aimed at developing an anaphora resolution system for Amharic using a knowledge-poor approach, as the language is under-resourced with only a limited number of NLP tools available.

The remaining part of this paper is organized as follows. Section 2 presents the characteristics of the Amharic language with special emphasis on its morphology and personal pronouns. In Section 3, we present the proposed system for anaphora resolution. Experimental results are presented in Section 4, and conclusion and future works are highlighted in Section 5. References are provided at the end.

Linguistic Characteristics of Amharic

The Amharic Language

Amharic is the working language of Ethiopia, which has a population of over 90 million at present. Even though many languages are spoken in Ethiopia, Amharic is dominant and is spoken as a mother tongue by a large segment of the population, and it is the most commonly learned second language throughout the country [10]. The language is categorized under the Semitic language family and it is the second most spoken language among Semitic language families in the world, next to Arabic. Though the language is widely spoken and used as a working language in the country, it is still considered to be "under-resourced" since linguistic resources are not widely available. Amharic is written using its own script having 34 consonants (base characters), out of which six other characters representing combinations of vowels and consonants are derived for each character. The base characters have the vowel ኧ(ä) and the other derived characters have vowels in the order of ኡ(u), ኢ(i), ኣ(a), ኤ(e), እ(ĭ), and ኦ(o). For example, for the base character ረ(rä), the following six characters are derived: ሩ(ru), ሪ(ri), ራ(ra), ሬ(re), ር(r[ĭ]), and ሮ(ro). This leads to a total of 34x7 (=231) Amharic characters. In addition, there are about two scores of labialized characters such as ሏ(lwa), ሟ(mwa), ሯ(rwa), ሷ(swa), etc. used by the language for writing.

Amharic Morphology

Due to its Semitic characteristics, Amharic is one of the most morphologically complex languages. Amharic nouns and adjectives are marked for any combination of number, definiteness, gender and case. Moreover, they are affixed with prepositions [1], [19]. For example, from the noun ተማሪ (tämari/student), the following words are generated through inflection and affixation: ተማሪዎች (tämariwoč/students), ተማሪው (tämariw/the student {masculine}/his student), ተማሪየ (tämariyä/my student), ተማሪየን (tämariyän/my student {objective case}), ተማሪሽ (tämariš/your {feminine} student), ለተማሪ (lätämari/for student), ከተማሪ (kätämari/from student), etc. Similarly, we can generate the following words from the adjective ፈጣን (fäţan/fast): ፈጣኑ (fäţanu/fast {definite} {masculine} {singular}), ፈጣኖች (fäţanoč/fast {plural}), ፈጣኖቹ (fäţanoču/fast {definite} {plural}), etc. Amharic verb inflections and derivations are even more complex than those of nouns and adjectives. Several verbs in surface forms are derived from a single verbal stem, and several stems in turn are derived from a single verbal root. For example, from the verbal root ስብር (sbr/to break), we can derive verbal stems such as säbr, säbär, sabr, säbabr, täsäbabr, etc. From each of these verbal stems we can derive many verbs in their surface forms. If we take the stem säbär, the following verbs can be derived: ሰበረ (säbärä/he broke), ሰበረች (säbäräč/she broke), ሰበርኩ (säbärku/I broke), ሰበርኩት
(säbärkut/I broke [it/him]), አልሰበርኩም (alsäbärkum/I didn't break), አልሰበረችም (alsäbäräčĭm/she didn't break), አልሰበረም (alsäbäräm/he didn't break), አልሰበረኝም (alsäbäräňĭm/he didn't break me), ተሰበረ (täsäbärä/[it/he] was broken), ስለተሰበረ (sĭlätäsäbärä/as [it/he] was broken), ከተሰበረ (kätäsäbärä/if [it/he] is broken), እስኪሰበር (ĭskisäbär/until [it/he] is broken), ሲሰበር (sĭsäbär/when [it/he] is broken), etc.

Amharic verbs are marked for any combination of person, gender, number, case, tense/aspect, and mood, resulting in thousands of words from a single verbal root [1], [19]. As a result, a single word may represent a complete sentence constructed with subject, verb and object. For example, ይሰብረኛል (yĭsäbräňal/[he/it] will break me) is a sentence where the verbal stem ሰብር (säbr/will break) is marked for various grammatical functions as shown in Fig. 1.

Fig. 1. Morphology of the word ይሰብረኛል (yĭsäbräňal): the verbal stem ሰብር (säbr/will break) is combined with ይ....ል (yĭ....l), the marker for the subject "he/it", and -ኧኛ- (-äňa-), the marker for the objective case "me".

Amharic Personal Pronouns

The personal pronouns in the Amharic language are እኔ (ĭne/I), እኛ (ĭňa/we), አንተ (antä/you [male, singular]), አንቺ (anči/you [female, singular]), እናንተ (ĭnantä/you [gender neuter, plural]), እርስዎ (ĭrswo/you [gender neuter, polite]), እርሱ/እሱ (ĭrsu/ĭsu/he or it), እርሷ/እሷ (ĭrswa/ĭswa/she), እርሳቸው/እሳቸው (ĭrsačäw/ĭsačäw/he or she [polite]), and እነርሱ/እነሱ (ĭnärsu/ĭnäsu/they). In addition, possessive and objective forms can be derived through prefixing and suffixing with የ- (yä) and -ን (n), respectively. For example, የእኔ (yäĭne/my) and እኔን (ĭnen/me) are derived from እኔ (ĭne/I) through such affixation processes. Amharic personal pronouns can exist in sentences in two forms: independent and hidden. Independent pronouns are those which appear independently in the text whereas hidden pronouns are embedded in other words. For example, in Fig. 1, the subjective pronoun እሱ (ĭsu/he or it) and the objective pronoun እኔን (ĭnen/me) do not appear independently but are embedded in the word ይሰብረኛል (yĭsäbräňal/[he/it] will break me). Moreover, both independent and hidden personal pronouns are subject to complex affixation processes in Amharic.

Amharic Anaphora Resolution System

System Architecture

A complete anaphora resolution system is a very complex process involving multidisciplinary concepts and requires different levels of linguistic analysis as discussed in Section I. Such requirements hinder the development of a complete anaphora resolution system for under-resourced languages like Amharic. Thus, the proposed anaphora resolution system focuses on the resolution of pronominal anaphoric terms, which in many languages form the core of anaphora resolution systems. Accordingly, the proposed system has five major components: low-level knowledge extraction, identification of independent anaphors, identification of hidden anaphors, identification of candidate antecedents and anaphora resolution. In addition to the Amharic text to be processed, the system requires permanently stored lists of Amharic personal pronouns, constraint rules and preference rules as inputs, and produces anaphors along with their antecedent relations as output. The architecture of the system is shown in Fig. 2. In the figure, dashed lines represent temporary storages created during runtime.

Low-Level Knowledge Extraction

Independent anaphors can be easily identified by searching for personal pronouns in the text, whereas hidden anaphors can be identified by analyzing the morphology of verbs. On the other hand, nouns and head nouns of noun phrases form antecedents. Thus, to identify independent anaphors, hidden anaphors and candidate antecedents, the Amharic text should be processed with the aim of labeling the text with POS tags and chunks, which are considered to be low-level linguistic knowledge. POS tagging is a classification task with the goal of assigning word classes to the words in a text [3], and it helps identify antecedents and independent anaphors. Chunking is a task that consists of dividing a text into non-overlapping syntactically correlated words, and it helps to get noun phrases from which candidate antecedents are identified. Furthermore, POS tagging and chunking are used to identify verbs from which we extract hidden personal pronouns. Because of the unavailability of a fully functional POS tagger for Amharic, we used a manually tagged Amharic text. The POS tagged Amharic text is further processed to get chunks in the text. An Amharic chunker developed by Ibrahim and
Assabie [7] was used to chunk POS tagged Amharic
texts.
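As a rough sketch of how the POS-tagged and chunked output could be mined for independent anaphors, candidate antecedents and verbs, consider the illustration below. The tag names (PERPRO, N, NP, V) follow the chunked examples shown later in this section; the code itself is illustrative and not the authors' implementation.

```python
# A sketch of mining POS-tagged, chunked text for independent anaphors
# (PERPRO), candidate antecedents (nouns / head nouns) and verbs.
# The chunk format follows the examples given in the paper.

def split_chunk(chunk_text):
    """A chunk like 'ከፍተኛ ADJ ውጤት N' alternates word and POS tag;
    a bare chunk like 'ተማሪዎች' carries its tag in the chunk label."""
    tokens = chunk_text.split()
    if len(tokens) == 1:
        return [(tokens[0], None)]
    return list(zip(tokens[0::2], tokens[1::2]))

def extract_low_level_knowledge(chunked_sentence):
    anaphors, antecedents, verbs = [], [], []
    for chunk_text, chunk_tag in chunked_sentence:
        for word, tag in split_chunk(chunk_text):
            tag = tag or chunk_tag            # bare chunk: use the chunk label
            if tag == "PERPRO":
                anaphors.append(word)         # independent anaphor
            elif tag.startswith("V"):
                verbs.append(word)            # candidate carrier of hidden anaphors
            elif tag.startswith("N"):
                antecedents.append(word)      # noun / head noun candidate
    return anaphors, antecedents, verbs

sentence = [("መስተዳድሩ", "N"), ("ከፍተኛ ADJ ውጤት N", "NP"),
            ("ላስመዘገቡ", "VPREP"), ("ተማሪዎች", "N"), ("ሽልማት N ሰጠ V", "VP")]
print(extract_low_level_knowledge(sentence))
```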

Fig. 2. Amharic anaphora resolution system architecture (low-level knowledge extraction through POS tagging, chunking and morphological analysis; identification of independent anaphors, hidden anaphors and candidate antecedents; and anaphora resolution through constraint and preference rules).

The examples below show the results of the low-level knowledge extraction process.

መስተዳድሩ ከፍተኛ ውጤት ላስመዘገቡ ተማሪዎች ሽልማት ሰጠ።

mästädadĭru käfĭtäña wuţet lasmäzägäbu tämariwoč šĭlĭmat säţä።

The administration gave awards to students who scored high grades.

('መስተዳድሩ', 'N'), ('ከፍተኛ ADJ ውጤት N', 'NP'), ('ላስመዘገቡ', 'VPREP'), ('ተማሪዎች', 'N'), ('ሽልማት N ሰጠ V', 'VP')

ልጅቷ ትልቅ ወንድሟ እርሷ ክፍል ውስጥ እንዲያስጠናት እናቷን ጠየቀቻት።

lĭjĭtwa tĭlĭq wendĭmwa ĭrswa kĭfĭl wusţ ĭndiyasţänat ĭnatwan ţäyäqäčat።

The girl requested her mother so that her older brother would tutor her in her room.

('ልጅቷ', 'N'), ('ትልቅ ADJ ወንድሟ N', 'NP'), ('እርሷ PERPRO ክፍል N ውስጥ PREP', 'PP'), ('እንዲያስጠናት', 'V'), ('እናቷን NPREP ጠየቀቻት V', 'VP')

Identification of Independent Anaphors

Personal pronouns in a given Amharic text are considered to be independent anaphors. Thus, to identify independent anaphors, we search for words tagged as personal
pronouns in a POS tagged Amharic text. However, it is to be noted that Amharic personal pronouns are subject to various affixation processes. For example, many words can be derived through affixation processes from the personal pronoun እርሷ (ĭrswa), some of which are: የእርሷ (yäĭrswa), ከእርሷ (käĭrswa), በእርሷ (bäĭrswa), እርሷን (ĭrswan), የእርሷን (yäĭrswan), እርሷንም (ĭrswanĭm), የእርሷንም (yäĭrswanĭm), እንዯእርሷ (ĭndäĭrswa), etc. Thus, identification of independent anaphors is achieved by having a database of Amharic personal pronouns (in their root forms) and matching them against the sub-patterns of personal pronouns in the given POS tagged text. Identified independent anaphors are stored in a temporary storage which is used later in the process of anaphora resolution.

Identification of Hidden Anaphors

The unique complexity of Amharic anaphora resolution lies in the identification of hidden anaphors. As discussed in Section II.B, verbs are marked for combinations of several grammatical functions such as subject and object, resulting in references to subjects and objects embedded in the verbs. Consequently, it is very common to construct Amharic sentences without the explicit use of independent anaphors. For instance, consider the following examples:

ካሳ ጠንካራ ነው። መቶ ኪሎ ይሸከማል። የብረት ዱላም በእጁ ይሰብራል።

kasa ţänkara näw። mäto kilo yĭšäkämal። yäbĭrät dulam bäĭju yĭsäbral።

Kasa is strong. [He] can carry a hundred kilograms. [He] can also break an iron stick with his hand.

ካሳ ጠንካራ ነው። እሱ መቶ ኪሎ ይሸከማል። እሱ የብረት ዱላም በእጁ ይሰብራል።

kasa ţänkara näw። ĭsu mäto kilo yĭšäkämal። ĭsu yäbĭrät dulam bäĭju yĭsäbral።

Kasa is strong. He can carry a hundred kilograms. He can also break an iron stick with his hand.

The verb ይሰብራል (yĭsäbral/[he] can break) is marked for the subject እሱ (ĭsu/he) and implicitly shows the presence of the subject (hidden anaphor). The explicit use of እሱ (ĭsu/he) as a subject results in redundancy of the subject as shown in the second example. Therefore, the use of Amharic independent anaphors is discouraged in such cases, and the first example is more common and natural than the second. These linguistic phenomena add complexity to the whole process of anaphora resolution for Amharic text. Thus, a successful identification of hidden anaphors is an important step in Amharic anaphora resolution. These hidden anaphors are identified from verbs in the chunked text by making use of a morphological analyzer that indicates the grammatical functions for which verbs are marked. Identification of hidden anaphors becomes more complex when verbs are marked for a larger number of grammatical functions. We used the HornMorpho morphological analyzer developed by Gasser [4] to extract such information from verbs. For example, ይሰብራል (yĭsäbral/[he] can break) is analyzed as "ይ (yĭ) + ሰብር (säbr) + ኣል (al)" where "ይ (yĭ) ... ኣል (al)" would represent "he can" and "ሰብር (säbr)" would be equivalent to "break". Accordingly, the hidden anaphor እሱ (ĭsu/he) can be identified from the verb ይሰብራል (yĭsäbral/[he] can break) through morphological analysis. Identified hidden anaphors are stored in a temporary storage along with the independent anaphors.

Identification of Candidate Antecedents

Identification of candidate antecedents is a process whereby a list of words is prepared from which the correct antecedents referred to by anaphors are selected. All nouns and head nouns of noun phrases are used as possible candidate antecedents. Thus, we used chunked texts to identify nouns and head nouns in a given text, and the Amharic chunker is used to find possible antecedents from which the antecedents of the anaphors to be resolved are best selected. For instance, let us take the chunked sentence: ('መስተዳድሩ', 'N'), ('ከፍተኛ ADJ ውጤት N', 'NP'), ('ላስመዘገቡ', 'VPREP'), ('ተማሪዎች', 'N'), ('ሽልማት N ሰጠ V', 'VP'). From the sentence, መስተዳድሩ (mästädadĭru/the administration), ውጤት (wuţet/grade), ተማሪዎች (tämariwoč/students) and ሽልማት (šĭlĭmat/award) are nouns or head nouns of noun phrases. As a result, they are candidate antecedents for the anaphor ("he" or "it") hidden inside the verb ሰጠ (säţä/[he/it] gave).

Anaphora Resolution

Now that anaphors and candidate antecedents are identified, the final step is to establish relevant relations between them. This relation is built by applying a stored knowledge of constraint and preference rules which are developed by taking the inherent characteristics of the Amharic language into consideration. The constraint rules we applied in this work are gender, number and person agreements, where we use the morphological analyzer to extract the gender, number, and person properties of nouns and head nouns of noun phrases. Constraint rules are applied both on anaphors and candidate antecedents to select correctly related antecedents that agree
with the anaphors. The antecedent-anaphor relationship is built and stored in a database if only one plausible antecedent is left after the constraint rules are applied. If more than one candidate antecedent vies for a single anaphor, then preference rules are applied. Preference rules give more preference to antecedents that satisfy the rule and give nothing or a negative value to those that do not satisfy the rule. The criteria used to establish preference rules in this work are subject place, definiteness, recency, mention frequency, and boost pronoun. Nouns at the subject position of a sentence are more preferred than nouns not placed at subject positions. Indefiniteness is unmarked in Amharic nouns, but definiteness is always marked by a suffix -ኡ (-u) or -ው (w) as in በጉ (bägu/the sheep) and በሬው (bärew/the ox), respectively. Definite nouns are more preferred as relevant antecedents than nouns which are not definite. On the other hand, nouns which are most recent to the anaphor are given more weight than others which are not recent. The antecedent mentioned repetitively is also given more priority. Boost pronouns are pronouns used as antecedents for anaphors, and the pronouns themselves can be considered as possible candidates. The preference of one criterion over the other is determined through values calculated using the probability distribution they have in the Amharic language. The probability distribution is calculated by statistically analyzing an Amharic text. Table 1 shows the probability distribution of preference rules (with percentage from 5).

Table 1. Values of preference rules for Amharic text.

Preference criteria   Percent from 5
Subject place         +2.88
Definiteness          +0.94
Recency               +0.89
Mention frequency     +0.02
Boost pronoun         +0.12

Each candidate antecedent receives a value from each preference criterion. If a candidate antecedent fulfils a preference criterion, it receives the corresponding value for that criterion. Otherwise, it receives a value of zero. Finally, the candidate antecedent that has the highest aggregate score is selected as the antecedent for a given anaphor. For example, let us take the following sentence.

መስተዳድሩ ከፍተኛ ውጤት ላስመዘገቡ ተማሪዎች ሽልማት ሰጠ።

mästädadĭru käfĭtäña wuţet lasmäzägäbu tämariwoč šĭlĭmat säţä።

The administration gave awards to students who scored high grades.

In the example, the verb ሰጠ (säţä/[he] gave) has the hidden personal pronoun እሱ (ĭsu/he). መስተዳድሩ (mästädadĭru/the administration), ውጤት (wuţet/grade), ተማሪዎች (tämariwoč/students) and ሽልማት (šĭlĭmat/award) are nouns or head nouns of noun phrases which can be candidate antecedents. The gender, number and person information of the verb ሰጠ (säţä/[he] gave) is male, singular and third person, respectively. On the other hand, the gender, number and person information of the candidate antecedents is shown in Table 2.

Table 2. Gender, number and person information of candidate antecedents.

Candidate antecedent   Gender   Number     Person
መስተዳድሩ               [male]   singular   third
ውጤት                   [male]   singular   third
ተማሪዎች                neuter   plural     third
ሽልማት                  [male]   singular   third

Among the candidates, ተማሪዎች (tämariwoč/students) will be discarded due to the constraint on its number, which is plural as opposed to that of the hidden anaphor. The remaining three candidates pass to the next step where preference rules are applied. Table 3 shows the respective values (along with their aggregate scores) that the remaining candidate antecedents receive from each preference criterion. From the three candidate antecedents, the candidate having the largest aggregate score is መስተዳድሩ (mästädadĭru/the administration). As a result, it is selected as the antecedent being referred to by the personal pronoun እሱ (ĭsu/he) hidden inside the verb ሰጠ (säţä/[he] gave).

Table 3. Preference values given to candidate antecedents.

Candidate     Subject   Definiteness   Recency   Mention     Boost     Aggregate
antecedent    place                              frequency   pronoun   score
መስተዳድሩ      +2.88     +0.94          0         0           0         +3.82
ውጤት          0         0              0         0           0         0
ሽልማት         0         0              +0.89     0           0         +0.89
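The selection logic just described can be summarized in a short sketch: candidates that violate the gender/number/person constraints are discarded, and the survivors are ranked with the preference weights of Table 1. The dictionary-based representation below is an assumption made for illustration only; it reproduces the outcome of Table 3 for the worked example.

```python
# A sketch of antecedent selection: filter candidates by gender/number/person
# agreement, then rank survivors with the preference weights of Table 1.
WEIGHTS = {"subject_place": 2.88, "definiteness": 0.94, "recency": 0.89,
           "mention_frequency": 0.02, "boost_pronoun": 0.12}

def agrees(anaphor, candidate):
    """Constraint rules: gender, number and person must match."""
    return all(anaphor[f] == candidate[f] for f in ("gender", "number", "person"))

def resolve(anaphor, candidates):
    survivors = [c for c in candidates if agrees(anaphor, c)]
    if len(survivors) <= 1:
        return survivors[0] if survivors else None
    # Preference rules: each satisfied criterion contributes its weight.
    def score(c):
        return sum(WEIGHTS[k] for k in WEIGHTS if c.get(k, False))
    return max(survivors, key=score)

# The worked example: hidden anaphor እሱ inside ሰጠ, with the four candidates.
anaphor = {"gender": "male", "number": "singular", "person": "third"}
candidates = [
    {"word": "መስተዳድሩ", "gender": "male", "number": "singular", "person": "third",
     "subject_place": True, "definiteness": True},
    {"word": "ውጤት", "gender": "male", "number": "singular", "person": "third"},
    {"word": "ተማሪዎች", "gender": "neuter", "number": "plural", "person": "third"},
    {"word": "ሽልማት", "gender": "male", "number": "singular", "person": "third",
     "recency": True},
]
print(resolve(anaphor, candidates)["word"])   # መስተዳድሩ (aggregate +3.82)
```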

Experiment

The Corpus

To evaluate the performance of the system, two sets of Amharic text corpora were collected from various sources. The first set of the corpus was used to establish the preference rules: a total of 495 sentences were statistically analyzed to compute the probability distribution of the preference criteria. Another set of the corpus was used to test the performance of the system. We used 311 sentences that were generated by the Amharic chunker. The text was collected from the Walta Information Center and an Amharic grammar book [1]. From these sentences, we identified a total of 315 verbs which contain hidden personal pronouns. It was observed that the number of independent anaphors in the sentences was very minimal. Thus, we collected additional text from the Amharic Bible, where we selected a total of 163 sentences having 110 independent personal pronouns.

Test Results

We used the 10-fold cross-validation technique to test the performance of the system. The original sample is randomly partitioned into 10 equal-size subsamples and the cross-validation process is then repeated 10 times. Accordingly, we obtain 10 results from the folds which can be averaged to produce a single estimate of the overall performance of the system. The evaluation methods we used to test the performance of the system are the Success Rate (SR) and the Critical Success Rate (CSR), which are computed as follows.

SR = Number of correctly resolved anaphors / Number of all anaphors   (1)

CSR = Number of correctly resolved anaphors / Number of anaphors with more than one antecedent   (2)

Equation 2 is used when an anaphor has more than one antecedent after the application of the gender, number and person constraint rules. The overall performance of the proposed system also depends on the performances of the Amharic POS tagger, chunker, and morphological analyzer. The accuracy of the Amharic chunker is 93.75% [7] and the accuracy of the morphological analyzer, using our tests, is 92.7%. With these limitations, the overall performance of our proposed anaphora resolution system is shown in Table 4.

Table 4. Test result for Amharic anaphora resolution system.

Type of anaphors   Success rate   Critical success rate
Independent        81.79%         76.07%
Hidden             70.91%         57.58%

Conclusion

Amharic is one of the most morphologically complex and less-resourced languages. Despite the efforts being undertaken to develop various Amharic NLP applications, only a few usable tools are publicly available at present. Researchers and developers frequently mention that the morphological complexity of the language is a feature that hinders the development of natural language processing applications for the language. Amharic anaphora resolution also suffers from this problem since anaphoric terms are subject to affixation processes and are also embedded in other words. In this work, we tried to overcome this problem by POS tagging, chunking, and analyzing the morphology of words. Due to the unavailability of high-level linguistic tools for the language, we followed a knowledge-poor anaphora resolution approach that uses the results of low-level linguistic analyses. Even with the limitations of the Amharic chunker and morphological analyzer used as components in the system, tests have shown that the proposed Amharic anaphora resolution system yields promising results. The performance of the system can be further enhanced by improving the effectiveness of the chunking and morphological analysis modules. Thus, future work is recommended to be directed at improving the chunking and morphological analysis components of the system. Moreover, the preference rules can be refined further by statistically analyzing a larger corpus.

References

[1] G. Amare. ዘመናዊ የአማርኛ ሰዋስው በቀላል አቀራረብ (Modern Amharic Grammar in a Simple Approach), Addis Ababa, Ethiopia, 2010.
[2] T. Deoskar. Techniques for Anaphora Resolution: A Survey, Unpublished, Cornell University, USA, 2004.
[3] S. Fissaha. Part of Speech Tagging for Amharic using Conditional Random Fields. In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages: 47-54, 2005.
[4] M. Gasser. HornMorpho: a system for morphological processing of Amharic, Oromo, and Tigrinya. In Proceedings of the Conference on Human Language Technology for Development, Alexandria, Egypt, 2011.
[5] H. Ho, K. Min and W. Yeap. Pronominal Anaphora Resolution using a shallow meaning representation of sentences. In Proceedings of the 8th Pacific Rim International Conference on AI: 862-871, 2004.
[6] G. Holen. Automatic Anaphora Resolution for Norwegian (ARN), PhD Thesis, University of Oslo, 2006.
[7] A. Ibrahim and Y. Assabie. Hierarchical Amharic Base Phrase Chunking Using HMM with Error Pruning. In Proceedings of the 6th Language & Technology Conference, December 7-9, Poznan, Poland, 2013.
[8] M. Kabadjov. A Comprehensive Evaluation of Anaphora Resolution and Discourse-new Classification. PhD Thesis, University of Essex, UK, 2007.
[9] D. Küçük. A knowledge-poor pronoun resolution system for Turkish. PhD Thesis, Middle East Technical University, 2005.
[10] P. Lewis, F. Simons and D. Fennig. Ethnologue: Languages of the World, Seventeenth edition. Dallas, Texas: SIL International, 2013.
[11] T. Liang and D. Wu. Automatic Pronominal Anaphora Resolution in English Texts. Computational Linguistics and Chinese Language Processing (9)(1): 21-40, 2004.

Web Based Amharic Question Answering System for Factoid Questions Using
Machine Learning Approach
Desalegn Abebaw (ledessu@gmail.com), Mulugeta Libsie (PhD) (mlibsie@yahoo.com)

Abstract

When users need a certain fact and request it from search engines, they get back a bunch of addresses and snippets which are 'related' to their need, and it is up to the users to decide which address to choose, expecting that the requested fact could be found there. Opening the address could present the user with lots of pages of information, and it is the user's duty to go through the information and extract the actual fact. But given a collection of documents, a Question Answering system attempts to retrieve correct answers to questions posed in natural language. Hence question answering relieves users from the task of digging the information out of related pages.

Prior implementations of language processing tasks typically involved the direct hand coding of large sets of rules. Modern approaches to NLP are grounded in Machine Learning (ML). This approach uses learning algorithms, often grounded in statistical inference, to automatically learn such rules through the analysis of large corpora of typical real world examples as a training set.

Seid Muhie's [2] TETEYEQ is a pioneering question answering work designed to answer Amharic factoid questions ('Person', 'Place', 'Time', and 'Quantity' type questions) using a rule based approach. We have designed a system for answering the four kinds of Amharic factoid questions using a machine learning approach by employing the well-known machine learning based classification algorithm, the support vector machine (SVM). By doing so, we attained an accuracy of 94.2% in question classification, which outperforms the rule based question classification in TETEYEQ [2].

We have also integrated a web crawler by customizing the open source JSpider crawler to automatically gather Amharic documents from the web in preparing the search space. The downloaded Amharic documents are then indexed by the open source tool, the Lucene indexer, to facilitate the document retrieval process. Hence, our system has two major parts, the search space preparation part (crawler and indexer) and the question answering part. Besides, our system is designed to be a web based system for interacting with the end users on the web.

By employing the machine learning algorithm in question classification and adopting the answer extraction techniques used in TETEYEQ, we have achieved 77% overall system performance, which is better than that of TETEYEQ.

In the absence of basic natural language processing (NLP) tools like a part of speech (POS) tagger and a named entity recognizer (NER) for the Amharic language, both TETEYEQ and our system have achieved considerable performance, which could be boosted by the addition of such NLP tools in the future.

Key Words: Amharic Factoid Question Answering, SVM based question classification, Answer Extraction, Web based Amharic Question Answering

1. Introduction

When users need a certain fact and request it from search engines, they get back a bunch of addresses and snippets which are 'related' to their need, and it is up to the users to decide which address to choose, expecting that the requested fact could be found there. Opening the address could present the user with lots of pages of information, and it is the user's duty to go through the information and extract the actual fact. But given a collection of documents, a Question Answering system attempts to retrieve correct answers to questions posed in natural language. Hence question answering
relieves users from the task of digging the information out of related pages.

There are different types of questions like definition, list, acronym, true/false, and factoid types. Most question answering systems have three major components: question analysis, document (passage) retrieval, and answer extraction.

Efforts have been made to design Question Answering tools in many languages. A popular web based Question Answering System for the English language is available online [1]. A similar effort has been made to design an Amharic Question Answering System for factoid questions by Seid Muhie [2]. This work was done using a rule based approach to analyze natural language questions.

2. Related Work

For languages like English, many question answering systems designed with different approaches are available. But in the case of Amharic, Seid Muhie's [2] TETEYEQ is a pioneering work designed to answer Amharic factoid questions. It aims to answer four kinds of Amharic factoid questions, namely the 'Person', 'Place', 'Time', and 'Quantity' question types. It was designed by employing a rule based approach in the question analysis component, manually writing rules to classify questions into one of these four question types and resulting in an accuracy of classifying 86.9% of the questions correctly. It also used manually collected Amharic documents as a search space, and the reported overall system performance was 72%.

Figure 1 shows a snapshot of TETEYEQ's interface replying to the user's question "የኢትዮጵያ ጠቅላይ ሚኒስትር ማን ይባላሉ?". In this research, a large number of Amharic documents (approximately 15600 news articles) were collected from the Web and Ethiopian newspapers. The corpus needs to pass through a document pre-processing phase to give the documents a predetermined format. This pre-processing could be done for documents already collected and expected to be used for answering all the natural language questions to be raised by users. This could be suitable in designing domain specific question answering systems having a predetermined knowledgebase.

But this does not make sense when thinking about the entire web as the knowledgebase of the question answering system. Therefore, in our research, as we are aiming to use the web as our knowledgebase, we do not need to undertake a pre-processing task.

The preprocessed documents were then indexed using the Lucene indexer after being analyzed by the Amharic analyzer (for tokenization, stop word removal, character normalization, and stemming purposes).

User generated questions are received by the system and a rule based question type identification is carried out to determine what type of question the user is asking and what type of answer is being expected from that question.

Figure 1: A snapshot of TETEYEQ's interface.

But both the question type and expected answer type identification were done with the help of manually crafted regular expressions (rules), which are not easy to develop and are most of the time too rigid to entertain other types of questions if a need arises, in the future, for the system to deal with question types other than the four factoid types ('person', 'time', 'place', and 'quantity').

3. Design

The web based Amharic question answering system, as its name indicates, is a natural language question answering system expected to respond to end users' Amharic factoid questions posed to the system the way end users query human beings with natural language questions.
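For contrast with the machine learning classifier used in this work, a hand-crafted rule of the kind used in TETEYEQ might look like the sketch below; the interrogative-word patterns are illustrative guesses, not TETEYEQ's actual rules.

```python
# An illustrative sketch of rule-based question type identification using
# hand-crafted patterns keyed on Amharic interrogative words. These patterns
# are examples only; the actual TETEYEQ rules are more elaborate.
import re

RULES = [
    ("person",   re.compile("ማን|ማንን|በማን")),          # who / whom / by whom
    ("place",    re.compile("የት|ከየት|ወዴት")),           # where / from where / to where
    ("time",     re.compile("መቼ|በስንት ሰዓት")),          # when / at what time
    ("quantity", re.compile("ስንት|ምን ያህል")),           # how many / how much
]

def classify_question(question):
    for qtype, pattern in RULES:
        if pattern.search(question):
            return qtype
    return "unknown"

print(classify_question("የኢትዮጵያ ጠቅላይ ሚኒስትር ማን ይባላሉ?"))   # person
```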

Therefore, in the first place, the system needs to understand what the end user is looking for. Analyzing the user's request by itself is not enough. The request (question) needs to be formulated into an equivalent query that can be understood by the system when retrieving documents related to what is being asked.

Passage retrieval is an intermediate process in some question answering systems between document retrieval and answer extraction. Potential passages are selected from the returned documents so that the answer extraction effort is minimized. But it does not always guarantee a better result than extracting answers directly from retrieved documents. This basically depends on the distribution of answer terms in the document. If it is believed that answer terms are commonly found in a single passage, passage retrieval becomes efficient. But for open ended factoid questions, the answer terms might be dispersed over the entire document. Hence, passage retrieval does not work in our case [5].

Knowing the key words with which to retrieve documents from the document repository is the next step. But the richness of the document repository has a major effect on the likelihood of finding the documents in the repository. The repository can be a small (or constant) collection of documents collected from the web or offline materials. This results in two situations. First, if a matching document is not found in the repository, given a particular set of query terms, it can always be determined that the same response will be generated in the future for similar queries. Second, the repository size has a direct relationship with the likelihood of finding related documents for a given query; the larger the number of documents in the repository, the higher the likelihood of finding related documents for a given query term(s).

Figure 2 shows the entire architecture of the web based Amharic question answering system. The detailed explanation of each component in the system comes next.

Figure 2: Architecture of the web based Amharic question answering system (web crawling, script identification and Lucene indexing prepare the search space; question analysis with question classification and query generation, document retrieval, and answer extraction produce the answers returned to the user).
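The search space preparation part of the architecture (script identification over crawled pages followed by indexing) could be approximated as in the sketch below. It is an illustration only: the 0.5 "majority Amharic" threshold is an assumption, and a TF-IDF index from scikit-learn stands in for the Lucene indexer and Amharic analyzer used by the actual system.

```python
# A sketch of the search-space preparation part: keep crawled pages whose
# letters are mostly Ethiopic (U+1200-U+137F), then build a ranked-retrieval
# index over them. TF-IDF/cosine similarity stands in for Lucene; the
# threshold and the toy pages are assumptions for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def is_majority_amharic(text, threshold=0.5):
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and \
        sum("\u1200" <= ch <= "\u137f" for ch in letters) / len(letters) >= threshold

crawled = [
    "አዲስ አበባ የኢትዮጵያ ዋና ከተማ ናት",
    "This page is entirely in English and should be filtered out.",
    "የኢትዮጵያ ጠቅላይ ሚኒስትር ስለ ትምህርት ተናገሩ",
]
amharic_docs = [p for p in crawled if is_majority_amharic(p)]

vectorizer = TfidfVectorizer()          # no Amharic stemming or stop words here
index = vectorizer.fit_transform(amharic_docs)

def retrieve(query, top_k=5):
    """Return up to top_k documents ranked by similarity to the generated query."""
    scores = cosine_similarity(vectorizer.transform([query]), index)[0]
    order = sorted(range(len(amharic_docs)), key=lambda i: scores[i], reverse=True)
    return [(amharic_docs[i], float(scores[i])) for i in order[:top_k]]

print(retrieve("የኢትዮጵያ ዋና ከተማ የት ነው"))
```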
3.1. Question Analysis

This is the first component of the system, which directly interacts with the end user. When the user poses an Amharic factoid question to the system, the question analysis component takes in the user query and passes it to its sub-components. The two major sub-components are automatic question classification and query generation.

3.1.1 Automatic Question Classification

One of the major differences of our system from that of Seid Muhie [2] is the question classification approach. In [2], a series of hand-crafted rules were used to classify users' questions into the factoid question types of 'person', 'time', 'place', and 'quantity'. In our case, the classification is done using a statistical approach by making use of the well-known and efficient text classification algorithm called the Support Vector Machine (SVM). Support Vector Machines were initially designed for binary classification (classification into one of two classes). How to use them in multi-class classification is still a research issue. Different approaches have been proposed on how to use the binary classifier for multi-class classification tasks. A "one-against-one" way of combining binary classifiers is a more suitable way for practical use [3].

Hence, in our classification, we have generated a total of six classification models in the "one-against-one" scheme (i.e. n(n-1)/2 models where n is the number of classes; n=4 in our case).

SVM, as a statistical approach, needs a large amount of training data to train the machine so that it makes accurate classifications for unseen test data. In this case, we have prepared 1200 questions from the four classes of factoid questions ('person', 'time', 'place', and 'quantity') for training the classifier.

The Windows version of the SVM classifier, SVM_Light, developed by Thorsten Joachims, is used for our question classification task. It is freely available for research purposes and can be smoothly integrated with our system.

3.1.2 Query Generation

This is the component which transforms the natural language question into one suitable for searching the terms in the question in the Lucene index file. The generated query is hence the sole input to the forthcoming document retrieval component.

During the generation phase the user question goes through language analysis, tokenization, short word expansion, stop word removal, and stemming phases. The generated query is then passed to the document retrieval component for searching documents related to the generated query and believed to contain candidate answers.

3.2. Document Retrieval

The document retrieval component is responsible for fetching documents which are related to the query generated in the previous component. The quality of this component has a direct relationship with the overall Question Answering System's performance [4].

Hits returned from the document retrieval component are not guaranteed to hold candidate answers for the user question, because Lucene's index searching only cares about the match (similarity) between query terms and indexed documents. It is likely that a top ranked document, i.e. a document containing the maximum number of query terms when compared to other documents, might not contain candidate answers in it.

Therefore, the ranked list of relevant documents returned from Lucene's search method is subjected to further inspection by the subsequent component, the answer extraction component, to find the ranked list of candidate answers which are to be returned to the user.

3.3. Answer Extraction

This is a fundamental component in which a question answering system differs from ordinary search engines. Its main objective is to extract terms, which are
believed to be answers to the question posed by the user, from the hits returned by Lucene's index searcher.

Different factoid question answering systems employ different methods of extracting an answer from candidate documents or passages [6]. The answer extraction methods based on pattern matching form the simplest and most commonly used group of answer extraction methods. The most complex part of these methods lies in the way the patterns are created or generated, whereas the patterns themselves are simple. Answer extraction methods based on pattern matching follow the tradition of information extraction methods.

Answer extraction patterns (AEPs) may consist of several types of units, such as punctuation marks, capitalization patterns, plain words, part-of-speech (POS) tags, tags describing syntactic function, named entities (NEs), and temporal and numeric expressions.

One of the other methods is the very simple proximity based method. It is based on the proximity between the query terms and the candidate answer terms. These candidate answer terms are identified from the documents based on some pattern matching technique.

In our system, we follow the pattern matching approach together with the proximity method.

Finally, the list of likely answers is returned back to the user as a reply to the question posed at the very first step.

Figure 3: A snapshot of the system while answering at the first rank.

It is not a surprise that, after all this sequential interaction of the question answering components one after the other, the final result may end up returning "no answer" to the user. This happens because of the absence of relevant documents in the repository, because the query terms are found very dispersed in the 'relevant' documents, or because of the absence of candidate answer terms in the 'relevant' documents.

4. Performance Evaluation

While evaluating the performance of the question classification, we have followed a 10-fold cross validation technique. In this technique, ten separate experiments are done by reserving one tenth of the data (120 questions) for testing and using the rest of the data (1080 questions) for training the classifier in each experiment. We have used 6 classifier models in a single experiment, resulting in a total of 60 models in the ten experiments.

By supplying 30 test questions from each class, we have recorded the performance of the classifier in each experiment and then taken the average of the ten experiments to report the actual performance of the question classifier in each of the four question types. From this, we can say that we have acquired a classification performance from the SVM classifier of 94.2% accuracy on average.
giving 94.2% accuracy on average.
step.

We have also performed 10-fold cross validation to check the performance of Seid Muhie's [2] rule based question classifier by employing the test questions we used to test our classifier. On average, the rule based question classifier classified 86.9% of the questions correctly.

Besides the advantages of employing a machine learning based classifier over the rule based one, we observed that the classification performance of the rule based classifier is lower than that of the SVM based classifier.

5. Conclusion

The web based Amharic factoid question answering system has two major parts, the search engine part and the question answering part.

As the corpus used is gathered from Amharic web sites, we used a crawler to download pages from these sites. Then, the downloaded pages go through an identification phase to filter out non-Amharic content pages from our search space. This is the language identification phase. The Amharic documents are then indexed in a way that facilitates information retrieval later in the system. When the search space is ready, the question answering part starts playing its role in serving the users.

The question answering part has three main components, namely question analysis, document retrieval and answer extraction.

For natural language processing applications such as question answering, the importance of other fundamental language processing modules like a part of speech (POS) tagger, named entity recognizer (NER), and stemmer is unquestionable. In the absence of such modules for the Amharic language, Seid Muhie [2] designed the first Amharic factoid question answering system using a rule based approach employed in the question analysis component, resulting in 89% correct question type identification, as reported in [2]. The system also had a performance of 97% in retrieving related documents. Using a gazetteer in place of the named entity recognizer and regular expressions, the researcher reported a 72% overall system performance in correctly answering user questions.

This work of ours aims to apply a machine learning approach to the question analysis component, hoping to improve the precision of the question classification task in question analysis, whose accuracy impacts the accuracy of the entire system.

With the application of a known machine learning based classification algorithm, the support vector machine (SVM), we attained an accuracy of 94.2% in question classification.

Besides, we incorporated a web crawler in our system to help automatically gather web documents and prepare the search space. With the integration of the JSpider based web crawler, we are able to download Amharic content pages from known Amharic news sites, and keeping the crawling process active while periodically running the language identification and indexing processes would enrich our search space, increasing the likelihood of answering Amharic factoid questions.

The top 5 relevant documents are used in our system. That is, among the top hits returned by the document retrieval component, the top 5 are believed to contain most of the query terms and to have the potential to contain the correct answer.

Extracting correct answers, given relevant documents with the potential to contain correct answers, requires extracting candidate answer terms from the documents and picking the one best answer among the candidates which is closer to the query terms in the document than the other candidate answer terms. Our system's precision in extracting candidate answer terms from relevant documents is 62.3% on average for the four question types. This is mainly due to the absence of a part of speech tagger and named entity recognizer, which could have helped a lot in identifying the part of speech with which a term is used in a sentence and the named entities found in the document. The selection of the best answer out of the candidate answer terms is done by computing the character distance between candidate answers and query terms in the document. In this regard, our system is able to pick out 77% of answers correctly.
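The character-distance answer selection mentioned above might be sketched as below; the exact distance computation used by the system is not specified here, so the scoring in this example is an illustrative assumption.

```python
# A sketch of proximity-based answer selection: among candidate answer terms
# found in a retrieved document, pick the one whose character position is
# closest to the query terms. The distance measure is illustrative only.
def pick_answer(document, query_terms, candidates):
    query_positions = [document.find(t) for t in query_terms if t in document]
    if not query_positions:
        return None
    def distance(candidate):
        pos = document.find(candidate)
        if pos < 0:
            return float("inf")
        return min(abs(pos - q) for q in query_positions)
    return min(candidates, key=distance)

doc = "የኢትዮጵያ ዋና ከተማ አዲስ አበባ ናት። ባህር ዳር የአማራ ክልል ዋና ከተማ ናት።"
print(pick_answer(doc, ["የኢትዮጵያ", "ዋና", "ከተማ"], ["አዲስ አበባ", "ባህር ዳር"]))
```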

6. References

[1] http://start.csail.mit.edu/ - START (SynTactic Analysis using Reversible Transformations) - Last accessed on Sept 29, 2011.
[2] Seid Muhie, Automatic Amharic Question Answering for Factoid Question Types, Master's Thesis, Addis Ababa University, 2009, Unpublished.
[3] Chih-Wei Hsu and Chih-Jen Lin, A Comparison of Methods for Multi-Class Support Vector Machines, Department of Computer Science and Information Engineering, National Taiwan University.
[4] Kevyn Collins-Thompson, Jamie Callan, Egidio Terra, and Charles L.A. Clarke, The Effect of Document Retrieval Quality on Factoid Question Answering Performance, School of Computer Science, Carnegie Mellon University, and University of Waterloo, Canada, ACM, July 25-29, 2004.
[5] Christof Monz, From Document Retrieval to Question Answering, ILLC Dissertation Series DS-2003-4, Institute for Logic, Language and Computation, Universiteit van Amsterdam, 11 December, 2003.
[6] Lili Aunimo, Methods for Answer Extraction in Textual Question Answering, Series of Publications A, Report A-2007-3, Department of Computer Science, University of Helsinki, Finland, 2007.

A Mathematical Algorithm for Robust Color
Retinal Image Analysis - a preliminary study
Daniel Moges, Birhanu Assefa, and Dawit Assefa*
Center of Biomedical Engineering, Division of Bioinformatics
Addis Ababa Institute of Technology, Addis Ababa University, Addis Ababa, Ethiopia
* dawit.assefa@aau.edu.et

Abstract— This study presents a novel abnormalities, the MA, are discrete localized distension
mathematical scheme for analysis of color retinal of the weakened capillary walls and are presented as
images acquired through digital fundus cameras small, circular red dots on the retina. The MA cause intra
from patients treated for diabetic retinopathy. The retinal hemorrhage when it is ruptured. Hemorrhages
proposed scheme uses a holistic representation of the appear either as red dots or flame-shaped. The disease
color images in the three (trinion) space and applies severity is classified as mild non-proliferative DR. HEs
trinion based Fourier transforms to extract useful are caused by proteins and lipids leaking from the blood
imaging features for the purpose of classification and into the retina via damaged blood vessels and they appear
in retinal images as yellow white areas and sometimes in
segmentation of retinal images. A suitable color space
a ring-like structure around leaking capillaries. This state
transformation and a way of extracting robust higher of the retinopathy is called moderate non-proliferative
order features are included in the method. The DR. When the severity of DR advances the blood vessels
scheme has been applied in analyzing images become obstructed causing microinfarcts in the retina
acquired from standard retinal image databases and (the SE) which appear as pale yellow area. When
results were very promising showing its suitability to significant number of SEs, intraretinal hemorrhages or
complement the screening of diabetic retinopathy intraretinal micro vascular abnormalities are encountered
and its potential usefulness to ophthalmologists in the state of the retinopathy is defined as severe non
their daily practices. proliferative DR, which can quickly turn into
proliferative DR when extensive lack of oxygen causes
Index Terms— Diabetic Retinopathy, Optic Disk, the development of new fragile vessels
Hard Exudates, Soft Exudates, Glaucoma, Color (neovascularization) which is a serious eye sight
Image Processing, Quaternion, Trinion. threatening state [2]. Glaucoma is an optic neuropathy,
involving gradual damage to the retinal ganglion cells
I. Introduction and their axons that lead into the optic nerve and further
Diabetic retinopathy (DR), glaucoma, age-related into the visual areas of the brain. Glaucoma causes the
macular degeneration (AMD), and cataract are the optic disk (OD) to have sharply defined borders and a
leading causes of blindness and visual impairment in shallow physiological cup [1]. The OD is pale red with a
human eye. Cataract affects the front part of the eye yellowish tint in the center of the retina and is the spot
while the remaining three affect the retina and/or optic where the optic nerve and blood vessels enter the eye.
nerve head at the back of the eye. Cataract is usually In clinical ophthalmology, examination of the optic
noticed by the patient early enough for adequate fundus enables early detection and diagnosis of eye
treatment while the early forms of DR, glaucoma and diseases due to DR, AMD and glaucoma. Various
AMD are usually not noticed by the patient, causing ophthalmoscope techniques such as digital and stereo
substantial damage unless diagnosed early [1]. fundus photography, hyper spectral imaging, scanning
laser ophthalmoscopy (SLO), fluorescein angiography,
DR is a complication in the retina due to diabetics and optical coherence tomography (OCT) are used for
causing abnormalities in the retina, and in the worst case, diagnosis and management of eye diseases [2]. However,
blindness. Its characteristic features are Microaneurysms because of its simplicity, fundus camera is used most
(MA), Hemorrhages (H), Hard exudates (HE), and Soft often. It allows to view the fundus and the structures
exudates (SE). The DR typically begins as small changes leading to it by 2D representation of the 3D retinal semi-
in the retinal capillaries. The first detectable transparent tissues. It is mostly used to detect and

evaluate symptoms of retinal detachment or eye diseases preprocessing, candidate abnormality finding and
due to DR and monitor their change over time which is classification/segmentation as briefly reviewed below.
needed due to the progressive nature of the disease. The
growing incidence of diabetes increases the number of A. Overview of preprocessing methods
images that need to be reviewed by ophthalmologists, Patient movement, poor focus, reflections, inadequate
and the high cost of examination and lack of specialists illumination and the like can cause images to be of such
hinder broad screening and thereby preventing many poor quality as to interfere with analysis and impede
patients from receiving effective treatment. This calls for human grading. Different preprocessing methods have
the development of automatic image processing methods. been suggested in the literature to overcome these
Thus the success of the mass screening depends not only problems. Image normalization techniques have been
on accurate fundus image capture but also on the used to reduce the effect of image variations between
accuracy and reliability of the image processing subjects by using histogram matching against a reference
algorithms in detecting abnormalities. image [4, 10, 11]. However those histogram matching
Automated computer aided detection of retinal lesions techniques may also lead to a loss of important
associated with DR offers many potential benefits. In a information in the original image (might suppress less
screening setting, it allows the examination of large bright HEs for instance). Others used non uniform
number of images in less time and more objectively than background illumination correction using mean/median
traditional observer driven techniques. In a clinical filters [10]. In other studies images were divided into non
setting, it can be an important diagnostic aid by reducing overlapping rectangular regions and mean and standard
the workload of trained graders and other costs. The deviation of intensities over the windows were used to
current study focuses specially on exudates (EXs) estimate the background [9, 11]. Global and local
detection and their differentiation from other bright areas, contrast enhancement techniques such as contrast limited
specifically cotton wools (CWs) and the OD which has adaptive histogram equalization (CLAHE) [12] and ways
similar features in color and brightness with HEs. As of background noise suppression and mask estimation
EXs are among common early clinical signs of DR, their have also been proposed.
detection would be an asset to the mass screening tasks Selection of best color space that have maximum
and could serve as a first step towards a complete interclass separability for abnormality types is also a
monitoring and grading of the disease. major preprocessing task. For example, some methods
use only the green channel [8] and others use the
Generally, most attempts in fundus color image intensity component [4] of the original fundus images to
processing focus on analyzing each separate color carry out their analysis. The LUV color space [8] and a
components serially and combining the outputs from the combination of green (from RGB), lightness (from
different channels. However, such a serial method masks LUV), and magenta (from CMYK) space [3] have also
any existing inter-correlations between the color channels been used.
and the associated computational cost is often high. In
this regard, a more holistic representation and analysis B. Image classification and segmentation
approach should have tremendous advantages. Recent Accurate segmentation of fundus images provides a
application of higher dimensional algebra in color image better source of information for studying the various
analysis uses vectorial representations of the three color ocular morphologies which is helpful to improve patient
components in quaternion and trinion spaces and the diagnosis, prognosis, treatment planning, and monitoring.
respective integral transforms permit the analysis as one However, manual segmentation is often a time-
entity/object keeping the inter correlation information [5, consuming and subjective process. Additional challenge
6]. Based on such a vectorial principle, a holistic, is the fact that the retinal images acquired using fundus
automated, and unsupervised image processing scheme is cameras are prone to various artifacts such as noise, poor
suggested in the current work to extract robust and image contrast, presence of anatomical structures with
relevant textural features for major abnormality markers highly correlated pixels with that of lesion, illumination
in fundus images. The efficacy of image based higher variability, color variation of similar subjects due to
order features in enhancing major abnormality markers difference in background, size and shape variation of
and their effectiveness in attenuating intra and inter lesions, and eye movement during image acquisition.
image variations due to unwanted artifacts is thoroughly Various methods have been suggested in the literature for
investigated. automatic segmentation of pathological and anatomical
structures of color retinal images. Due to retinal
II. image analysis of fundus images pigmentation and the acquisition process, the retinal
Most image analysis techniques for finding abnormalities images vary largely in luminosity and contrast, making it
in fundus images are comprised of three major stages: harder to distinguish retinal features and lesions in some

areas thus hindering the automatic segmentation of generative Gaussian mixture model and discriminative
abnormalities. Some researchers attempted to detect EXs SVM classifiers. An active contour has been used based
based on their luminosity by simple thresholding [3]. on Zhang's method to extract vessel networks from color
wide variability among images makes the method retina images [21]. This analysis facilitates calculation of
dependent on user intervention to select appropriate arterial index which is required to diagnose certain
thresholds. In addition, since the color and brightness retinopathies. Other methods include a hybrid method
attributes of the OD are similar to that of the HE, such a comprised of Gabor filters and entropic thresholding based
method may detect large number of false positives. A on gray level co-occurrence matrix (GLCM) [20], vessel
similar approach was suggested where a combination of enhancement technique comprised of Wavelet transform,
global and local thresholds were applied to segment EXs curvature estimation, matched filtering, and Fast
[3]. Another scheme creates a model of the retina to Marching (FM) [19].
automatically identify the presence of lesions in the Unlike most of the above mentioned approaches, which
macular region [8]. Edge and brightness information rely on extraction of abnormalities in fundus images
were also combined to identify EXs and detect the based on features observed on a single color channel, a
severity of DR [9]. An automated statistical classifier has scheme that combines three representative channels to
also been used together with Fisher's linear discriminant form novel features has also been suggested [3]. Based
analysis (LDA) to segment EXs [10]. The limitation with on the multichannel features, the scheme employs
the approach was that it uses the isolated EXs color to boosted soft segmentation (BSS) algorithm to obtain a
characterize yellowish regions, which cannot represent confidence map and then applies a background
the color of all the EXs found in the images due to their subtraction method to get the coarse segmentation result.
diversity in brightness. Other methods include mixture The selected three channels were green (from RGB),
models with edge detection based post processing [4], luminosity (from LUV), and magenta (from CMYK).
mathematical morphology operations [7], radial basis The method was applied only for detection of HEs and
function network [13], support vector machine (SVM) the challenge of selecting the effective channels that best
[14], Neural networks (NN) [15], Bayesian classifiers, a suit the BSS algorithm are some of the limitations. To
combined Fuzzy C-means clustering with SVM [12], and circumvent this problem, a more holistic mathematical
atlas based approaches [6]. approach making use of vectorial representations of the
An efficient detection of OD in color retinal images is retinal color images and a robust feature extraction
also a significant task in an automated retinal image scheme based on local Fourier transforms computed in
analysis [17]. The position of OD can be used as a higher dimensional spaces is proposed in the current
reference length for measuring distances in retinal study.
images, especially for the location of the macula. Using
features of the fovea and their geometric relations with III. materials and methods
other retinal structures, a method for the fovea A. Data sets
localization was proposed [4]. Grading of HEs was Five publicly available standard color fundus image data
performed using a polar coordinate system centered at sets have been used in the current work: DIARETDB0
the fovea. (130 images, 10 normal and 120 contain mild
Automated segmentation of blood vessels is proved nonproliferative DR), DIARETDB1 (89 images, 5 normal
useful in various clinical applications such as in vessel and 84 contain non-proliferative DR), High-Resolution
diameter measurement to diagnose hypertension and Fundus (HRF) Image Database (15 normal, 15 patients
cardiovascular diseases and computer-assisted laser with DR, and 15 glaucomatous patients), STARE (400
surgery [18, 19, 20]. However, blood vessel segmentation images), DRIVE (40 images). Images were 24 bit RGB
involves a huge challenge as images have various and were captured with varying settings (flash intensity,
artifacts which affect retinal background texture and the shutter speed, aperture, gain). The images contain a
vessels structure. Furthermore the blood vessels varying amount of imaging noise, but the optical
particular features make them complex structures to aberrations (dispersion, transverse and lateral chromatic,
detect as the color of vascular structures is not constant spherical, field curvature, coma, astigmatism, distortion)
even along the same vessel. Their complex tree-like and photometric accuracy (color or intensity) were the
geometry includes bifurcations and overlaps that may same. Therefore, the system induced photometric
mix up the detection system. variance over the visual appearance of the different
Various methods have been developed in the literature retinopathy findings can be considered small. The data
for automatic blood vessel segmentation. For example, correspond to a good practical situation where the images
spatially localized feature vectors were used utilizing are comparable and can be used to evaluate the general
properties of scale and orientation selective Gabor filters performance of diagnostic methods. All image sets came
[18]. The extracted features were then classified using

accompanied by annotations carried out by experts and those are used as gold standards in our study.

B. Color space selection
In the current work, a set of color spaces including RGB, HSV, Lab, Luv, CMYK, and LcrYcrR were tested and compared for their robustness to extract meaningful features. Based on previous research results [3], two color spaces that have enhanced contrast and uniformity for EXs and other retinal structures were selected for further analysis. These were HSV (hue, saturation, value) and GLM' (G component from RGB, L component from Lab, and Inverse Magenta component from CMYK color spaces). The HSV color space was selected since it has similarity with human color perception and it is less affected by inter and intra image variations due to artifacts. The GLM' color space was chosen based on the scatter matrix result of maximum interclass separability [3]. A holistic image processing is applied that doesn't require separation of color bands in both cases.

C. Holistic image processing in the trinion space
We are interested in robust texture feature based analysis of color retinal images for use in effective classification and segmentation. Visual analysis of textures is often a difficult task especially in medical images where, in general, the texture patterns may be very subtle and their assessment very subjective. Quantitative texture analysis attempts to provide objective metrics and determines neighborhoods around pixels (texture elements) for which the voxel intensities and spatial relationship between them is used to extract certain quantitative features that characterize the texture of this neighborhood. Statistical approaches are widely employed in texture recognition and description of images [22]. Such methods include, for example, measurement of local image variations to help in contrasting image elements, calculation of auto correlation functions based on spatial frequencies, calculation of correlations to measure image linearity and the like.

The traditional way of analyzing color images through separating the monochromatic components misses out the inter-correlation information that is embedded among the monochromes. The task can be accomplished more effectively and efficiently by employing more holistic techniques. In that regard, methods that rely on vectorial representation of color images have recently been proposed and have shown very great promises. One such successful approach uses quaternions for efficient representation of color images and quaternion based integral transforms to effectively analyze color images as a whole by treating the color voxels as vectors [5, 6]. A quaternion is defined with one real and three imaginary components. Quaternions have been used to represent three component color images by setting one of the components to zero. However, most color images, including our fundus camera images, have three components and the extra fourth dimension in quaternions creates redundancy in representing such images with additional computational cost. Color images can be best represented using the three component trinions [5, 6].

A trinion t is defined as

t = a + ib + jc

where a, b, and c are real numbers, and i, j are operators satisfying the following rules:

i^2 = j, \quad ij = ji = 1, \quad j^2 = i.

Any trinion t = a + ib + jc can be written in the form t = |t|(\cos\theta + \mu\sin\theta), where |t| = \sqrt{a^2 + b^2 + c^2}, \mu = V(t)/|V(t)|, and \theta = \tan^{-1}(|V(t)|/S(t)), 0 \le \theta \le \pi, are the amplitude, eigen axis, and eigen angle (phase) respectively, where S and V are the real and vector parts of t. When |t| = 1, t is a unit trinion, and when a = 0 it is a pure trinion. More interesting properties of trinions are discussed in the literature [6].

Two working definitions for the trinion Fourier transform (TFT) have been suggested [6]. The TFT of type I and its inverse (ITFT) are given by:

T(u, v) = \int\int h(x, y)\left[\cos(2\pi(ux + vy)) - \mu_1 \sin(2\pi(ux + vy))\right] dx\, dy

h(x, y) = \int\int T(u, v)\left[\cos(2\pi(ux + vy)) + \mu_2 \sin(2\pi(ux + vy))\right] du\, dv

where h(x, y) is a trinion valued image function, \mu_1 is a unit, pure trinion, and \mu_2 is a trinion such that \mu_1\mu_2 = -1. The choice of \mu_1 and \mu_2 is arbitrary. In our study we used \mu_1 = (i + j)/\sqrt{2} and \mu_2 = (1 - i - j)/\sqrt{2}. There exists a type II TFT in the literature [6]. However, many operations of interest including convolutions and correlations are shown to be easier using the type I TFT than type II TFT [6], and hence the former has been adopted in the current study.

As a next step, the converted RGB color retinal image is mapped to a trinion as h(x, y) = H + iS + jV (in the HSV color space) or as h(x, y) = G + iL + jM' (in the GLM' color space). It has already been shown previously that the order of the mapping to a trinion doesn't affect the image analysis [6]. Spatially localized analysis in the selected color space was then performed by computing the TFT over a translating window of size 3x3. Note that after discretization, the TFT was computed using the fast Fourier transform.

Figure 2: OD localization results: original color fundus images (1st & 2nd) and feature maps (3rd & 4th) showing compact signatures for the OD. The first image was a normal case while the second one was a Glaucoma case.

Higher order 'Haralick' texture features [22] were extracted from the TFT transformed image matrix computed over the translating window. Before computing these features Principal Component Analysis (PCA) was applied for feature dimension reduction and each value of the resulting 3x3 matrix was normalized between 0 and 1; these are the probability density functions used to compute the texture features. The computed feature is assigned to the central voxel value in that window.
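To make the trinion algebra concrete, the short Python sketch below (an illustration, not code from the paper) implements trinion addition and multiplication directly from the stated rules i^2 = j, ij = ji = 1, j^2 = i; as a check, it multiplies the mu_1, mu_2 pair used above and obtains -1.

import math

class Trinion:
    """t = a + i*b + j*c with i*i = j, i*j = j*i = 1, j*j = i."""
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c

    def __add__(self, other):
        return Trinion(self.a + other.a, self.b + other.b, self.c + other.c)

    def __mul__(self, other):
        a1, b1, c1 = self.a, self.b, self.c
        a2, b2, c2 = other.a, other.b, other.c
        # Expand (a1 + i b1 + j c1)(a2 + i b2 + j c2) using i*i = j, i*j = j*i = 1, j*j = i
        real = a1 * a2 + b1 * c2 + c1 * b2       # ij and ji terms contribute to the real part
        i_part = a1 * b2 + b1 * a2 + c1 * c2     # j*j = i contributes the c1*c2 term
        j_part = a1 * c2 + c1 * a2 + b1 * b2     # i*i = j contributes the b1*b2 term
        return Trinion(real, i_part, j_part)

    def norm(self):
        return math.sqrt(self.a ** 2 + self.b ** 2 + self.c ** 2)

    def __repr__(self):
        return f"{self.a:+.3f} {self.b:+.3f}i {self.c:+.3f}j"

# Example: the pair of trinions used as TFT axes above multiply to approximately -1
mu1 = Trinion(0.0, 1 / math.sqrt(2), 1 / math.sqrt(2))          # a unit, pure trinion
mu2 = Trinion(1 / math.sqrt(2), -1 / math.sqrt(2), -1 / math.sqrt(2))
print(mu1 * mu2)                                                # -1.000 +0.000i +0.000j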

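As a rough illustration of the windowed feature computation just described (not the authors' implementation; the window handling, normalization, and feature set are simplified assumptions), the following sketch normalizes a 3x3 matrix of local transform magnitudes into a probability-like distribution and computes a few Haralick-style statistics, assigning the result to the central pixel.

import numpy as np

def haralick_features(window):
    """Compute Haralick-style statistics from a 3x3 matrix of non-negative values."""
    p = np.abs(window).astype(float)
    total = p.sum()
    p = p / total if total > 0 else np.full_like(p, 1.0 / p.size)   # normalize to [0, 1]

    rows, cols = np.indices(p.shape)
    mean_r, mean_c = (rows * p).sum(), (cols * p).sum()

    energy = (p ** 2).sum()
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
    contrast = (((rows - cols) ** 2) * p).sum()
    cluster_prominence = (((rows + cols - mean_r - mean_c) ** 4) * p).sum()
    return {"energy": energy, "entropy": entropy,
            "contrast": contrast, "cluster_prominence": cluster_prominence}

def feature_map(magnitude_image, feature="cluster_prominence"):
    """Slide a 3x3 window over an image of local transform magnitudes (slow, for clarity)."""
    h, w = magnitude_image.shape
    out = np.zeros((h, w))
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = magnitude_image[y - 1:y + 2, x - 1:x + 2]
            out[y, x] = haralick_features(window)[feature]   # value assigned to the central voxel
    return out

Strictly, Haralick features are defined on gray-level co-occurrence matrices; here the normalized window of transform magnitudes plays that role, following the description above.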
Figure 3: HE detection results: original images with moderate nonproliferative DR (1st & 2nd) and the respective signature maps generated using the proposed scheme (3rd & 4th). White lines are the ground truth that denote HEs.

Several Haralick second order features were accordingly computed including Sum-mean, Variance, Energy, Homogeneity, Contrast, Entropy, Cluster shade, and Cluster prominence. These features were used to generate the final signature maps, which allow to differentiate between areas corresponding to HE, H, OD, and other regions that have clinical significance related to DR. The performance of signature maps generated using different features was compared through qualitative comparison of the signature maps against the available ground truths.

IV. Results and discussion
Signature maps computed from cluster prominence on the GLM' color space showed improved results over the other features. In Fig. 2, results are presented to demonstrate the efficiency of the proposed scheme in OD localization. In both the normal as well as the Glaucoma cases, very compact signature maps were generated for the ODs (cyan color), well distinct from the background.
Figure 3 presents computed feature maps for representative fundus images acquired from two patients treated for moderate nonproliferative DR. In both cases HEs (white bluish) were correctly identified by the proposed scheme. The ODs (cyan color) were also correctly detected as well.
Effective segmentation of HE and OD can be done by using iterative thresholding applied on the color signature maps. Our experiments showed that, for images taken under normal illumination conditions, segmentation of HE can easily be done by applying thresholding on the third channel of the signature map. Similarly OD segmentation can be done by using thresholding on the second channel. Since blood vessels appear as dark red color with sharp edges on the signature maps, they can be segmented by applying edge operators on the first channel. However, instead of applying such a simple thresholding in each color channel, a holistic method that applies a vectorial scheme can be sought for a more effective segmentation.
Generally, there is a good agreement between the signature maps generated and the available ground truth, while quantitative performance evaluation of the proposed scheme is still a work in progress. However, there are still rooms for improvement particularly regarding the application of the proposed scheme on images captured under very poor illumination and higher noise conditions. One solution can be excluding the fundus areas with poor luminance accompanied by better noise suppression mechanisms before computing the features.

V. Conclusion and future work
In the current study automated signature maps generated based on robust texture features computed in higher order Fourier spaces were tested for their potential in OD localization and detection of HEs and other abnormalities in fundus retinal images. Preliminary results showed that the texture signature maps computed from the cluster prominence feature in particular were able to correctly classify the structures. The same feature showed good results in detecting the optic cup and the major vertically emerging blood vessels, which are essential for glaucoma detection. Quantitative analysis of the results is still ongoing. Such an unsupervised image analysis technique is helpful for mass screening and monitoring of diseases due to DR and glaucoma. It awaits further investigation to discover more clinical implications of the outcomes from this study.

References
[1] Mir, M. A., Atlas of Clinical Diagnosis, 2nd Ed., 2003.
[2] Kade, K. M., "Fundus Image Acquisition Techniques with Database in Diabetic Retinopathy," IJERA, 3(2), pp. 1350-1362, 2013.
[3] H.-C. Lu and G.-L. Fang, "An effective framework for automatic segmentation of hard exudates in fundus images," J Circuit Syst Comp., 22(1), 2013.
[4] C. I. Sánchez, M. García, A. Mayo, M. I. López, and R. Hornero, "Retinal image analysis based on mixture models to detect hard exudates," Med Image Anal., 13(4), pp. 650-658, 2009.
[5] D. Assefa, L. Mansinha, K. F. Tiampo, H. Rasmussen, and K. Abdella, "Local quaternion Fourier transform and color image texture analysis," Sig Proc., 90(6), pp. 1825-1835, 2010.
[6] D. Assefa, L. Mansinha, K. F. Tiampo, H. Rasmussen, and K. Abdella, "The trinion Fourier transform of color images," Sig Proc., 91(8), pp. 1887-1900, 2011.
[7] M. G. F. Eadgahi and H. Pourreza, "Localization of hard exudates in retinal fundus image by mathematical morphology operations," in Int Conf Comput Knowledge Eng. (ICCKE), 2012.
[8] G. B. Kande, P. V. Subbaiah, and T. S. Savithri, "Feature extraction in digital fundus images," J Med Bio Eng., 29(3), pp. 122-130, 2009.
[9] C. I. Sánchez, R. Hornero, M. I. López, and J. Poza, "Retinal image analysis to detect and quantify lesions associated with diabetic retinopathy," in Proc Conf IEEE Eng Med Bio Soc., 2004.
[10] C. I. Sánchez, R. Hornero, M. López, M. Aboy, J. Poza, and D. Abásolo, "A novel automatic image processing algorithm for detection of hard exudates based on retinal image analysis," Med Eng Phy., 30(3), pp. 350-357, 2008.
[11] I. Jamal, M. U. Akram, and A. Tariq, "Retinal image preprocessing: background and noise segmentation," TELKOMNIKA Indo J Elect Eng., 10(3), 2012.
[12] S. Akara and U. Bunyarit, "Automatic exudates detection from diabetic retinopathy retinal image using fuzzy c-means and morphological methods," in Proc IASTED Int Conf Adv Compute Sci Techno., Phuket, Thailand, Apr. 2-4, 2007.
[13] R. Vijayamadheswaran, M. Arthanari, and M. Sivakumar, "Detection of diabetic retinopathy using radial basis function," Int J Innov Techno Creat Eng., 1(1), 2011.
[14] W. Kittipol, H. Nualsawat, and P. Ekkarat, "Automatic detection of retinal exudates using a support vector machine," Appl Med Inform., 32(1), 2013.
[15] M. García, C. I. Sánchez, M. I. López, D. Abásolo, and R. Hornero, "Neural network based detection of hard exudates in retinal images," Comput Meth Prog Biomed., 93(1), pp. 9-19, 2009.
[16] S. Ali, K. M. Adal, D. Sidibe, T. P. Karnowski, E. Chaum, and F. Meriaudeau, "Exudates segmentation on retinal atlas space," in Int Symp on Image & Sig Proc & Anal (ISPA), Sep. 4-6, 2013.
[17] D. A. Godse and D. S. Bormane, "Automated localization of optic disc in retinal images," Int J Adv Comput Sci & Appl., 4(2), 2013.
[18] A. Osareh and B. Shadgar, "Automatic blood vessel segmentation in color images of retina," Iran J Sci & Techno Trans B Eng., 33(B2), pp. 191-206, 2009.
[19] C. Liu, H. Lu, and J. Zhang, "Using fast marching in automatic segmentation of retinal blood vessels," in 7th Asian-Pacific Conf Med Bio Eng., 2008.
[20] P. Siddalingaswamy and K. Prabhu, "Automatic detection of multiple oriented blood vessels in retinal images," J Biomed Sci Eng., 3(1), pp. 101-107, 2010.
[21] S. Ehsan and Z. Somayeh, "Automatic segmentation of retina vessels by using Zhang method," World Acad Sci Eng Techno., 6(12), 2012.
[22] D. Assefa, H. Keller, C. Menard, N. Laperriere, R. J. Ferrari, and I. Yeung, "Robust texture features for response monitoring of glioblastoma multiforme on T1-weighted and T2-FLAIR MR images: A preliminary investigation in terms of identification and segmentation," Med Phys, pp. 1722-1736, 2010.

Automatic Recognition of Ethiopian License Plates
Seble Nigussie Yaregal Assabie
Department of Computer Science Department of Computer Science
Addis Ababa University Addis Ababa University
Addis Ababa, Ethiopia Addis Ababa, Ethiopia
seble.nigussie@aau.edu.et yaregal.assabie@aau.edu.et

surveillance system that captures images of vehicles


Abstract—This paper presents an automatic recognition and recognizes their license numbers [3]. The main
system for Ethiopian license plates. The proposed system has
three major components: plate detection, character
objective of LPR
segmentation and character recognition. We used Gabor filters systems is to detect and recognize license plates of
for plate detection, connected component analysis for vehicles from their images, and present the identified
character segmentation, and correlation based template information about the plate to machines in a way that
matching method for character recognition. In addition to the further applications can understand it. Some of these
correlation value, the recognition process is supported by color
analysis techniques and location information of characters. To
applications, where LPR systems are gaining
test the performance of the system, a dataset of 350 car images popularity include: access control in restricted areas,
of RGB color format were collected from moving cars under law enforcement, stolen car detection, automatic toll
different angle, distance, motion and illumination conditions. collection, car parking automation and border crossing
Test results on the dataset showed plate detection rate of control.
88.9%, character segmentation accuracy of 83.9% among
detected plates, and character recognition rate of 84.7% The process of LPR begins with acquiring vehicle
among correctly segmented characters. We have also achieved images that are going to be fed to the LPR system. The
an overall plate recognition rate of 63.1%, and the whole input images could be static pictures of cars or they
recognition processes takes 2-5 seconds depending on whether could be extracted from video sources. Both of the
post processing operations are needed or not.
techniques have their own pros and cons on the whole
Keywords— Intelligent Transport System; License Plate recognition process. The quality of images from video
Recognition; Ethiopian License Plate; Character Segmentation sequences is usually not clear as static images [3].
Accordingly, LPR systems that detect plate regions
Introduction from a video source passes through additional
preprocessing tasks than other LPR systems that work
on still images [1], [2]. Unlike related applications like
With the ever-increasing auto population, the document character recognition, LPR systems face a lot
transportation system has been struggling to serve of challenges, including noisy and low quality images,
users efficiently. In order to solve this problem, many various illumination conditions and a high degree of
countries in the world still depend on building more plate‟s orientation. LPR systems are usually dependant
and more road infrastructures to reduce traffic stress. on country‟s plate style, format, character and signs
But, for different reasons this solution is still not included in the plates of vehicles. Currently, there are
capable enough to solve the existing problem. As a many countries with their own LPR systems. Although,
result, various Intelligent Transportation Systems there are LPR systems that are designed with an
(ITS) emerge as viable solution to the problem of ambition to work on various scenes [3], [4], [5], they
managing vehicles [1], [2]. ITSs apply technologies of are still heavily affected by country-specific plate
information processing and communication on features [2]. In Ethiopia, the development of intelligent
transport infrastructures to improve the transportation transport system is in its early stage and, to our best
outcome. In these days, numerous intelligent knowledge, no software product is available to
transportation systems are being used by users in automatically recognize Ethiopian license plates.
various applications. One of these systems is License The remaining part of this paper is organized as
Plate Recognition (LPR). LPR, also known as Car follows. Section 2 presents Ethiopian license plate
Plate Recognition (CPR), is one of the most important types and structures. The design of the proposed
and popular elements of ITS. LPR is a mass system recognizing Ethiopian license plates is

discussed in Section 3. Experimental results are presented in Section 4, and conclusion and future works are highlighted in Section 5.

Ethiopian License Plates

Ethiopian license plates contain numeric and character codes printed in rectangular plates. Character codes are written in Amharic and (in most cases) English characters. Moreover, Ethiopian plates differ in basic features, such as dimension and plate format. The plates exist in five primary dimensions measured in millimeters: 460x110, 280x200, 305x155, 340x200 and 520x110. The plates also have two basic formats: single and double row plates. Figure 1 shows sample Ethiopian license plates in various formats. In broad terms, the features in Ethiopian plates reflect three categories of information: the regions issuing plates, the intended service of the car, and alphanumeric code.

Fig. 4. Some of the various Ethiopian plate styles

Service Code

Ethiopian license plates are classified into 13 types based on the vehicle's intended service (see Table 1). Each class can be identified by its code, which is either numeric or character. Besides, most of the classes vary in foreground color, even though they have common white background color except Police vehicles which have yellow background.

Table 4. Ethiopian license plate classes based on services

No   English Code   Amharic Code   Service Type       Foreground Color
1    -              -              Taxi               Red
2    -              -              Private            Blue
3    -              -              Trade              Green
4    -              -              Governmental       Black
5    -              -              Non-governmental   Orange
-    CD             ኮዲ             Diplomatic         Black
-    AU             አህ             Africa Union       Light Green
-    AO             ዕድ             Aid Organization   Orange
-    UN             የተመ            United Nation      Light Blue
-    -              ፖሊስ            Police             Black
-    -              የዕለት           Temporary          Red
-    -              ተላላፊ           Transient          Light Blue
-    -              ልዩ ተ           Special vehicle    Red

Regional Codes

License plates can be issued by the federal government or state governments. The license plate issuer provides the code of the region (see Table 2) except for the Diplomatic, Africa Union, Aid Organization and United Nation plate classes, where the federal government issues the plates but the code of the federal government is not written on the plate.

Table 5. Regional codes used in plates

Issuing Region                               English Code   Amharic Code
Ethiopia                                     ET             ኢት
Addis Ababa                                  AA             አአ
Amhara                                       AM             አማ
Afar                                         AF             አፋ
Benshangul Gumuz                             BG             ቤጉ
Dire Dawa                                    DR             ድሬ
Gambella                                     GM             ጋም
Harar                                        HR             ሐረ
Oromiya                                      OR             ኦሮ
Somali                                       SM             ሶማ
Southern Nations, Nationalities & People     SP             ዯሕ
Tigray                                       TG             ትግ
features of plates (i.e., texture), to locate potential plate
regions. Plate regions have rich texture information as
a result of high intensity variations that exist between
plate characters and the plate background. Accordingly,
we used Gabor filter to highlight regions with high
texture information [6]. The parameters for the Gabor
Unique Plate Codes filter are selected empirically by analyzing the filter
Each region issuing plates is responsible to provide responses for different parameters. Figure 3 shows the
a unique alphanumeric code under each service type. By Gabor filter responses.
doing so (along with region and service codes) the
information contained in the plate becomes unique in
the country. In the case of diplomatic plates, the digits
to the left of the Amharic code “ኮዲ” represent the
code of the diplomatic mission.

The Proposed License Plate


Recognition System
The proposed system has three major modules:
Plate Detection, Character Segmentation and Character
Recognition. The system takes car image as an input (a) (b)
and then it applies a series of various methods and Fig. 6. (a) Grayscale image, (b) Gabor filter response of a
techniques of plate detection, character segmentation
and recognition algorithms on the input image. Finally, In the Gabor filter response, high values indicate
it stores the license identification number, regions that have high texture differences, one of which
region/category and code (if exists) in a flat file in a is the plate region. Thus, binarizing the filter response
way that any further application can read and separates out higher intensity regions like the plate,
understand it. Figure 2 shows the block diagram of the from other regions that have low intensity value. We
proposed system. apply morphological closing operation on the binarized
image so as to merge individual character objects of the
Car Recognized plate and form a solid rectangular plate region.
Plate Morphological closing operation is followed by small
Image objects removal the purpose of which is to eliminate a
number of small connected objects that do not need to
Plate Character Character be detected as plate regions. Figure 4 shows the result
Segmentatio Recognition of the binarization, closing and unwanted small object
Detection n removal operations applied on the Gabor filter
Fig. 5. Architecture of the proposed system
response.
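A minimal OpenCV-based sketch of the plate detection steps described in this section (Gabor filtering, binarization, morphological closing, and connected-component filtering by aspect ratio) is given below; the filter parameters, thresholds, and aspect-ratio limits are illustrative assumptions rather than the values used in this work.

import cv2
import numpy as np

def detect_plate_candidates(bgr_image):
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)

    # Gabor filter to highlight high-texture regions (parameters are illustrative)
    kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=0.0,
                                lambd=10.0, gamma=0.5, psi=0)
    response = cv2.filter2D(gray, cv2.CV_32F, kernel)
    response = cv2.normalize(response, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # Binarize the response and merge character blobs into one plate-shaped region
    _, binary = cv2.threshold(response, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE,
                              cv2.getStructuringElement(cv2.MORPH_RECT, (25, 7)))

    # Label connected components and keep those with plate-like size and aspect ratio
    num, labels, stats, _ = cv2.connectedComponentsWithStats(closed, connectivity=8)
    candidates = []
    for k in range(1, num):                      # label 0 is the background
        x, y, w, h, area = stats[k]
        aspect = w / float(h)
        if 2.0 < aspect < 6.0 and area > 1000:   # assumed limits for plate-shaped regions
            candidates.append((x, y, w, h))
    return candidates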
Plate Detection We use connected component labeling to detect
This module of LPR involves detection and connected components/objects (one of which is a plate
extraction of plate region from a given car image. This region) found in the resultant binary image. For this
process is a critical phase of the whole system, as the case, 8-neighborhood pixels connectivity is selected as
execution of the other modules of the LPR (i.e., a criterion to label objects. Once objects are labeled,
character segmentation and character recognition) acceptable aspect-ratio and density values of plate
depend on its outcome. Thus, a lot of effort is taken regions are used to minimize set of candidate plate
into account in selecting plate detection techniques and regions. Finally, the region with the acceptable size
algorithms that compromise the tradeoff between corresponding to the expected number of characters is
accuracy and computation time. In the proposed selected as the genuine plate object (see Fig. 5).
system, a series of vital techniques are used to detect
plate regions: Gabor Filtering, Morphological Closing
Operation and Connected Component Analysis (CCL).
The RGB image of cars is converted into grayscale
image where we take advantage of one of the popular

1  slope
(3)
m  0 1 
0 0 
(a) (b)
Fig. 7. (a) Binarization, (b) morphological closing followed by When spatial transformation is performed, there are
small objects removal often pixels in the output image that are not part of the
original input image, which are called a fill value,
usually 0. So, to complete the orientation adjustment
process the final step is extracting the actual plate from
the spatially transformed image. Thus, to achieve this
task, we perform morphological closing operation on
the spatially transformed binary image, followed by
connected component labeling (CCL). Figure 6 shows
resultant images of individual steps in the orientation
adjustment process. Size normalization is finally
performed to minimize variations of the dimensions of
plates occured as a result of capturing images from
different orientation and distance. We have set 160-by-
100 pixel dimension which is found to be sufficient for
(a) (b) character segmentation.
Fig. 8. (a) Candidate plate regions, (b) detected plate region

Segmentation
Given a plate region, segmentation is dedicated to
detecting and segmenting each alpha-numeric
characters found on the extracted plate image. (a) (b) (c)
Although plates have rectangular shapes, detected plate (d)
regions are quite often rotated and skewed while
images are captured. Thus, the first procedure in this
phase is orientation and size normalization.
Segmentation is then performed on the normalized [
(e) (f) (g)
image where we also have post segmentation process to
Fig. 9. Orientation adjustment of plates: (a) detected plate, (b)
separate connected characters and numbers with a prior
edges of a, (c) vertical shear-transform applied on b, (d) vertical
knowledge of plate structure information. shear-transform applied on a, (e) morphological closing followed
Orientation Adjustment and Size Normalization by CCL on c, (f) mapping detected plate region of e onto d, (g)
Orientation adjustment is performed by finding out extracting plate region from f
the slope of the plate and correcting the plate
orientation according to the identified slope The Segmentation Process
information. First, edges of the plate region are Segmentation is achieved by making use of
highlighted using Canny edge detection algorithm connected component analysis (CCA) which works
resulting in a binary image. Then, linear Hough efficiently irrespective of the orientation and location
transform is applied to detect lines. Plate borders of the plates and characters. The CCA method used in
usually form the longest lines from which we compute this process is similar to the method we used in the
the slope of the plate. A spatial transformation that uses plate detection process, except that the pixels
the slope of the longest detected line as a shear factor is connectivity used here is 4-neighborhood, which is
used to correct orientation of plates. We perform adequate to form a character. The CCL operation
vertical shear transform based spatial transformation detects every group of connected pixels found on the
that uses the slope of the longest detected line as shear input image that might be character or not. However,
factor along the y-axis resulting in horizontally existence of plate‟s frame has a huge impact on the
orientation corrected plate region. The shear matrix m character segmentation process, especially, if plate
used in the shear transform is presented as: characters are connected with the plate‟s frame. Thus, a
morphological reconstruction method is used to remove
plate borders resulting in a plate with character objects.
However, there could be noise (such as rivets) in the

plate where we used plate characters‟ features such as to improve the recognition accuracy of characters.
acceptable width, height and area to filter the genuine Finally, the unique foreground color of plates for
character objects. Figure 7 shows the results of the circle-coded plates combined with the code numbers is
segmentation process. used to easily recognize the service type of the car.
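The correlation-based template matching used in the Recognition module can be sketched roughly as follows; the 24x24 normalization size for characters follows the description in this paper, while the template dictionary layout and function names are assumptions for illustration.

import cv2
import numpy as np

def recognize_character(char_binary, templates):
    """templates: dict mapping a character label to a 24x24 binary template image."""
    sample = cv2.resize(char_binary, (24, 24), interpolation=cv2.INTER_NEAREST).astype(np.float32)

    best_label, best_score = None, -1.0
    for label, template in templates.items():
        tmpl = template.astype(np.float32)
        # Normalized cross-correlation between the segmented object and the template
        score = cv2.matchTemplate(sample, tmpl, cv2.TM_CCORR_NORMED)[0, 0]
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

Number objects would be handled analogously with 24x34 templates, and the best label can then be cross-checked against the equivalent Amharic or English character as described above.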

Experiment
(a) (b) Dataset Collection
The dataset is built from moving and stationary cars
Fig. 10. (a) Binarized plate region, (b) segmented characters of a where the plate region is visible for human beings and
Post Segmentation contains a total of 350 images taken using a 12-
The layout of Ethiopian license plates can be single megapixel digital camera from different views,
or double row. In addition, numeric codes representing orientations, and distance representing the real-world
service types are circled and located at the left-top application of the system. Images contain cars along
corner of the plate. This knowledge of plate structure with the surrounding environment. A summary of the
information is used to separate physically connected description of the images is given in Table 3.
characters and numbers in the post segmentation phase. Table 6. Dataset description
Connected characters and numbers are easily detected
Sample Illumination Mobility Camera Camera Clarity of plate No of
since they have larger width or height than the average
set condition of cars angle distance background samples
width or height, respectively of characters and
numbers. After identification, the connected object is Set 1 Day Moving 15o-30o 3-5m Average, High 153
segmented into two or more separate parts by Moving,
Set 2 Day 0o-15o 1-3m Average 124
identifying whether they are connected side to side or Stationary
top to bottom using the plate structure information. Moving,
Set 3 Day 0o-15o <1m Low 35
Stationary
Figure 8 shows the result of separating connected
objects. Set 4 Night Moving 0o-15o 1-3m Average 16
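A simple way to realize the splitting of merged character objects described above is sketched below; it is purely illustrative, cuts only side-to-side merges at the midpoint, and the width threshold is an assumption, whereas the actual rule also uses plate-structure information.

def split_wide_components(boxes):
    """boxes: list of (x, y, w, h) character candidates from connected-component labeling."""
    if not boxes:
        return boxes
    avg_w = sum(w for _, _, w, _ in boxes) / len(boxes)
    result = []
    for x, y, w, h in boxes:
        if w > 1.6 * avg_w:              # assumed threshold for a side-to-side merge
            half = w // 2                # cut the merged object into two parts
            result.append((x, y, half, h))
            result.append((x + half, y, w - half, h))
        else:
            result.append((x, y, w, h))
    return result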

Set 5 Shadow Moving 0o-15o <3m Low, Average 14

Set 6 Day, Night Moving 0o-15o <3m Low 8

(a) (b)

Fig. 11. (a) Binarized plate with connected objects, (b) separated
characters and numbers of a
Test Results
The system is developed using Matlab
Recognition programming tool tested on Intel Core i3 2.4GHz
processing unit. Some of the images in the dataset
Characters and numbers are treated differently in contain plates where human observers were unable to
the recognition phase. To avoid numbers mistakely detect and recognize. Excuding those images, we
recognized for characters or vice versa, segmented achieved an overall plate recognition rate of 71.06%,
objects are identified either as numbers or characters and the plate recognition process takes 2-5 seconds
based on their location in the plate. For the recognition depending on whether post processing operations are
purpose we use correlation-based template matching needed or not. The success rate of sub-systems is
method supported by correlation of characters in the shown in Table 4.
plate. Template image sets that constitute plate
character and number appearances on real environment Table 7. Test results
are collected from real world data. The template o
images are binarized and normalized to 24x24 and Sample N of Success rate (%)
24x34 pixel sizes for characters and numbers set samples
Plate Character Character
templates, respectively. During recognition, segmented
character and number objects are normalized to detection segmentation recognition
respective pixel sizes. In the case of character
recognition in circle-coded plates, we take advantage of Set 1 153 95.4 84.9 82.3
the fact that plates have both Amharic and equivalent
Set 2 124 92.7 88.7 87.3
English characters. This correlation information is used

Set 3 35 88.5 93.5 86.2 module), Using local binarization techniques than
global threshold values, and applying vertical angle
Set 4 16 56.3 33.3 66.6 adjustment on the detected plate region before the
character segmentation process.
Set 5 14 42.8 50.0 100.0

Set 6 8 50.0 - -

Average 88.9 83.9 84.7

REFERENCES
From the testing process, we observed that the
character recognition module misclassifies numbers
like 0 for 8, characters ኢ for አ, and አ for ኦ. In [1] T. Monahan. "War Rooms" of the Street: Surveillance
Practices in Transportation Control Centers, The
addition, characters and numbers affected by the rivets Communication Review, 10 (4): 367-389, 2007.
of plates have got higher percentage of [2] S. Du, M. Ibrahim, M. Shehata and W. Badawy.
misclassification. Automatic License Plate Recognition (ALPR): A State-
of-the-Art Review, IEEE Transactions on Circuits and
Systems for Video Technology, 23(2): 311-325, 2013.
Conclusion and Future Works [3] S. Kranthi, K. Pranathi and A. Srisaila. Automatic
We presented a system that recognizes Ethiopian Number Plate Recognition, International Journal of
Advancements in Technology, 2(3), 2011.
license plates in real-world scenario. The system has [4] M. Ashoori-Lalimi, and S. Ghofrani, An Efficient
many application areas such as law enforcement, Method for Vehicle License Plate Detection in Complex
parking lot, security etc. The fact that we have tested Scenes, Scientific Research Circuits and Systems,
the system in real scenario (by taking images while 2:320-325, 2011.
cars are moving or standing) shows that the system [5] J. Matas, K. Zimmermann. Unconstrained Licence Plate
Detection, In: Proc. 8th Int. IEEE Conf. on Intelligent
developed can be applied in real time. Better Transportation Systems, pp 572-577, Wien, Austria,
recognition accuracy even at further distances can be 2005.
achieved by capturing images using a high resolution [6] M. Sivalingamaiah and B. Reddy. Texture
Segmentation Using Multichannel Gabor Filtering,
camera with optical and digital zooming capabilities. Journal of Electronics and Communication
Future work includes applying contrast enhancement Engineering, 2(6):22-26, 2012.
techniques (as a preprocess in the plate detection

Designing Amharic Definitive Question Answering
Wondwossen Teshome Tassew Martha Yifiru
Ministry of Communication and Information Technology School of Information science
Addis Ababa, Ethiopia Addis Ababa University
wondwossen.teshome@mcit.gov.et Addis Ababa, Ethiopia
martha.yifiru@aau.edu.et
internet users using search engines and search services
Abstract - The amount of available information written in attempt to find specific information [1]. The survey
Amharic is increasing on the Web proliferation. The problem
faced by the user is not only the lack of documents or
also indicated that users are not satisfied with the
information but also the lack of time to find a short and precise performance of the current search services since
answer among the variety of available documents. Search majority of search engines are able to return ranked
engines offer a lot of links toward web pages, but are not able
to provide an exact answer; instead return ranked documents lists of related documents only. That means they don‟t
based on relevance measure with the posed query from users. deliver exact answers, so users have to extract the
The main goal of Question Answering system is to obtain a
brief and concise answer. Though there are studies towards demanded answer from the large volumes of
developing question – answering system for Amharic factoid information [1]. The virtue of question answering
questions, there is no research conducted to develop a (QA) systems is that they allow the user to state his or
definition question answering system for Amharic. In this
study, an attempt is made to design Amharic question her information need in a more specific natural
answering for definitive questions. The proposed Question language question, and that they do not return full
Answering deals with Amharic definition question by applying
surface text pattern method. This method considers two main documents which have to be skimmed by the user to
steps. First, it applies a pattern to discover a set of definition- determine whether they contain an answer. Instead
related text patterns from the Amharic legal corpus. Then, question answering systems return short text extract,
using these patterns, it extracts a collection of concept-
description pairs from a target document file, and applies the or phrases or even precise answers.
definition extraction to return answer to a given question. The
research achieved nugget precision of 85.6 %, nugget recall of II. QUESTION ANSWERING
73% and F-measure of 78.8%. Usage of surface patterns is
effective to answer Amharic definition questions. Definiendum There are an infinite number of questions users
extraction from users question and extracting concept-
descriptions from corpora are the major challenges of this may ask. The questions can be categorized as Factoid,
study. Further sequence mining algorithm can be List, Definition, Reason, Purpose, and Procedure
experimented to extract concept-description relationship from questions. We give emphasis to definition question
the corpus.
answering since this research deals with such kind of
questions.
Index Terms – Amharic Question Answering,
Question Analysis, Definition Extraction Factoid questions require a single fact as answer [2].
These include question such as “Who is the President
of Ethiopia?/የኢትዮጵያ ፕሬዝዯንት ማን ነው?”
I. INTRODUCTION and “What is the capital city of Ethiopia?/የኢትዮጵያ
Document retrieval systems have become part ዋና ከተማ ማን ነው?”
of our daily lives, mostly in using Internet search
List questions were originally a simple extension
engines and tools. Although document retrieval
of the factoid questions [3]. Instead of requiring a
systems do a great job in finding relevant documents,
single exact answer, list questions simply required a
given a set of keywords, there are circumstances where
fixed sized, unordered set of answers, for example
we want more specific information. The conventional
“Name 5 schools of Addis Ababa University”.
Information Retrieval systems do not really perform
the task of information retrieval; they in fact aim at
only document retrieval. A survey showed 85% of

Definition Question Answering performance. The Web is apparently an ideal source of
answers to a large variety of questions due to the
Certain types of question cannot be answered by a tremendous amount of information available online for
single exact answer. For example questions such as technologically favored languages like English.
“What is data?/ዳታ ምንድነው?” and “Who is
Abebech Gobena?/አበበች ጎበና ማን ናቸው?” do not Attempts are already done to develop QA
have a single short answer. This type of question is system for different languages worldwide including
usually referred to as a definition question. While English, Chinese, Indian and Amharic.
definition questions can be natural language questions
The research of Chinese QA systems using natural
there are only a few ways in which they can be
language was developed later than that of western
phrased and most are of the form “Who/What is/was
countries and Japan.
X?”. In these questions X is the definiendum often
referred to as the „target‟ of the question and is the Techniques of statistical retrieval and shallow
thing, (be it person, organization, object or event), language analysis are chiefly used in QA systems. On
which the user is asking the system to define. the web-based QA, [5] attempted to represent Chinese
text by its characteristics, and tried to convert the
The currently accepted evaluation methodology Chinese text into ERE (E: entity, R: relation) relation
for definitional questions focuses on the inclusion of a data lists and then to answer the question through ERE
given nuggets of information in the definition. A relation model.
nugget is a single atomic piece of information about
the current target [3]. For example, when asked to An Arabic definitional Question Answering
define “ኃይሌ ገብረስላሴ” (Haile Gebresilasie) system was developed by [6]. This QA system is based
systems should include in a definition the following on a pattern approach to identify exact and accurate
facts: የሩጫ ጀግና (A hero athlete)፣ የረጅም ርቀት definitions about organization using Web resources.
አትሌት (long distance runner)፣ እና ባለ ሃብት The system is experimented using 2000 snippets
(investor). Notice that this does not include a lot of returned by Google search engine and Arabic version
facts that is expected in a full definition of a person. of Wikipedia as well as a set of 50 organization
There is no mention of when or where he was born; definition questions. The system does not use any
given that he is a hero to whom he runs for; and on sophisticated syntactic or semantic techniques but
which type of investment he is engaged. Whilst it may provide effective and exact answers to definition
be preferable to update the answer keys so as to more questions. The system identifies candidate definitions
accurately reflect the information usually contained in by using a set of lexical patterns, filters these
encyclopedia or biographical dictionary entries. candidate definitions by using heuristic rules and ranks
them by using a statistical approach.
While definitional question answering systems
tend to adopt a similar structure to systems designed to An innovative research on Amharic Question
answer factoid questions, there is often no equivalence Answer for factoid questions was conducted in [7] and
to the question analysis component due to the little tried to solve problem in finding precise answers for
information contained in the question [3]. A process of Amharic factoid questions. Novel techniques are used
extraction of important part of the definiendum is to determine the question types, possible question
called Nugget Extraction which is not found in any focuses, and expected answer types as well as to
other types of QA system [4]. generate proper Information Retrieval query, based on
Amharic language specific issue investigations. As to
III. RELATED WORK the performance evaluation of the system, the rule
based question classification module classifies about
Many QA systems have been researched
89% of the question correctly. The document retrieval
extensively and obtained significant improvement in

The document retrieval component shows the greatest coverage of relevant documents (97%), while the sentence based retrieval has the least (93%). The gazetteer based answer selection using a paragraph answer selection technique answers 72% of the questions correctly, and the file based answer selection technique exhibits a recall of 90.9%. The pattern based answer selection technique has better accuracy for person names when paragraph based answer selection is used, while the sentence based answer selection technique performs best on numeric and date question types. [7] recommended extending the research to other question types.

IV. THE AMHARIC LANGUAGE

Amharic (አማርኛ) is a dominant language spoken as a mother tongue by a considerable part of the population, and it is the most commonly learned second language in Ethiopia. Amharic is a Semitic language. It is written using a writing system called fidel (ፊደል) or abugida (አቡጊዳ), adapted from the one used for the Ge'ez language [8].

Most Amharic definitive questions start with the definiendum. The following are examples of Amharic definitive questions:

- የህግ ትርጉም ምንድነው? (What is the definition of law?)
- ኃይሌ ገብረስላሴ ማን ነው? (Who is Haile Gebresilasie?)
- ሪኮንስትራክቲቭ ህግ ማለት ምን ማለት ነው? (What does reconstructive law mean?)

V. DESIGN OF DEFINITIVE AMHARIC QA

The Amharic definitive question answering system DefAmharicQA has two main components: indexing and definition searching. The indexing component consists of two subcomponents: pre-processing of the corpus and extraction of all potential definitions from the corpus using patterns. The definition searching component comprises question analysis and retrieval of the appropriate definition. The overall architecture is shown in Figure 1.

Figure 1: Architecture of Amharic QA

A) Indexing and Definition Identification

The Amharic legal corpus was thoroughly studied, and advantage is taken of the stylistic way authors introduce a new concept. There are some words that usually appear right after the definiendum (post-definiendum words): ማለት, የሚባለው, ትርጉም, ትርጉሙ, ሲተሮገም, ሲባል, and ፡-.

The first step in the implementation of DefAmharicQA is document pre-processing. After pre-processing, each file is read from the Amharic legal corpus and tokenized, and all the tokens are saved in a file as a list of tokens. Each token is read, and whenever there is a match between the read token and one of the post-definiendum words, the word or phrase before the post-definiendum word is considered a concept (definiendum).

The potential definiendum is then pre-processed and indexed with its description (the sentence), separated by a dollar ($) sign, i.e., <concept>$<description>. This index is kept in a file called the definition catalogue. Therefore, two separate indices are created: the first consists of tokens, and the second consists of candidate definitions (sentences) based on the already identified signal words. To be relevant, a definition instance must contain the concept and its description in one single sentence.
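To make the indexing step concrete, the following is a minimal Python sketch of building the definition catalogue as described above. The file names, the sentence splitting on "።", and the shortened list of post-definiendum words are illustrative assumptions rather than the exact DefAmharicQA implementation.

# Minimal sketch of definition catalogue construction (assumed file names).
POST_DEFINIENDUM = ["ማለት", "የሚባለው", "ትርጉም", "ትርጉሙ", "ሲባል"]

def build_catalogue(corpus_file, catalogue_file):
    # Scan every sentence; when a post-definiendum signal word is found, the
    # words before it become the concept and the whole sentence its description.
    with open(corpus_file, encoding="utf-8") as src, \
         open(catalogue_file, "w", encoding="utf-8") as out:
        for sentence in src.read().split("።"):   # "።" is the Ethiopic full stop
            tokens = sentence.split()
            for i, token in enumerate(tokens):
                if token in POST_DEFINIENDUM and i > 0:
                    concept = " ".join(tokens[:i])
                    out.write(concept + "$" + sentence.strip() + "\n")
                    break

# Example usage (assumed file names):
build_catalogue("amharic_legal_corpus.txt", "definition_catalogue.txt")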
1) Document Pre-processing

Document pre-processing is a procedure which may include many text operations: lexical analysis of the text with the objective of treating digits, hyphens, and punctuation marks, and stemming of the remaining words with the objective of removing affixes (i.e., prefixes and suffixes) and allowing the retrieval of documents containing syntactic variations of query terms.

Cognizant of the above, the Amharic legal corpus requires pre-processing, which includes normalization and stemming, to make it ready for the DefAmharicQA system; these are the pre-processing tasks performed on the text before it is indexed. In this study, normalization refers to character and document level normalization.

Character normalization: the presence of several Amharic letters (fidels) with the same sound in the Amharic writing system creates a problem in text retrieval, because the same word written with different symbols of the same sound is treated as different words by the retrieval system. In this research, choosing one letter from each group of letters with the same sound and replacing the remaining ones is taken as the solution to this problem. Therefore, if a character is any of ኀ, ሐ, ሃ, ኃ, and ሓ (all with the sound 'h'), it is normalized to ሀ, and the different orders of ሐ and ኃ are changed to their corresponding equivalent orders of ሀ. Similarly, all orders of ሠ (with the sound 's') are changed to their corresponding equivalent orders of ሰ, all orders of ዐ (with the sound 'a') are changed to their corresponding equivalent orders of አ, and all orders of ፀ (with the sound 'tse') are changed to their corresponding equivalent orders of ጸ.
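As a rough illustration of this character level normalization, a translation table can be used in Python. The mapping below covers only the base letters named in the text and is an assumed, partial table; a full implementation would also map every order (vowel form) of each variant letter.

# Assumed, partial normalization table (base orders only).
NORMALIZATION_MAP = {
    "ኀ": "ሀ", "ሐ": "ሀ", "ሃ": "ሀ", "ኃ": "ሀ", "ሓ": "ሀ",  # 'h' variants -> ሀ
    "ሠ": "ሰ",   # 's' variant -> ሰ
    "ዐ": "አ",   # 'a' variant -> አ
    "ፀ": "ጸ",   # 'tse' variant -> ጸ
}
NORMALIZATION_TABLE = str.maketrans(NORMALIZATION_MAP)

def normalize_characters(text):
    # Replace every variant letter by the representative letter chosen above.
    return text.translate(NORMALIZATION_TABLE)

print(normalize_characters("ዐይን ሃገር"))   # -> "አይን ሀገር"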
The stemming algorithm of [10] and punctuation mark removal have also been applied to the corpus.

2) Definition Identification

The definition extraction module identifies and extracts definition sentences from the relevant document set. Since it is important to provide enough context to ensure the readability of the final answer, definition QA systems prefer sentences to nuggets. Definition type questions ask for a word meaning, a term definition, a description of a term, and so on. So as to give ample information to the user, we extract a full sentence as the answer to the user's natural language query.

Extracting definitions is a very important process in definitional QA systems, and there are a number of definition extraction approaches; here, hard matching alignment between a snippet and a lexical pattern is done slot by slot.

During pre-processing, the maximum possible number of concept-definition pairs is extracted from the Amharic legal corpus and saved in a file (the definition catalogue). All definitions in the documents are searched for as concept-description pairs by applying definition patterns prepared manually by analysing the stylistic usage authors follow when introducing a new concept.

The following patterns are prepared:

‹definiendum› (ማለት|የሚባለው)
‹definiendum› (ወይም) * (በመባል የሚታወቀው|የሚባለው)
‹definiendum› (ማለት|የሚባለው|በተለምዶ|በልምድ) * (በመባል የሚታወቀው|ይባላል) * (ነው|ናቸው)
‹definiendum› (የሚገለፀው|ተብሎ የሚገለፀው) * (ነው)
‹definiendum› (ትርጉሙ|ትርጉማቸው) * (ነው|ናቸው)
የ‹definiendum› (ትርጉም) * (ነው)
‹definiendum› (ማን|እነማን|ምንድን) * (ነው|ናቸው)

In addition, definition signal words at the end of a sentence, such as ነው, ናቸው, and ይተሮገማል, are used to identify the concept-description pairs.
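A hedged sketch of how such manually prepared patterns could be applied with Python regular expressions is given below. The two compiled patterns are simplified stand-ins for the full list above (with the "*" wildcard rendered as ".*"), and the function name is an illustrative assumption.

import re

# Simplified stand-ins for two of the manually prepared definition patterns.
DEFINITION_PATTERNS = [
    re.compile(r"^(?P<definiendum>.+?)\s+(ማለት|የሚባለው)\s+.+$"),
    re.compile(r"^(?P<definiendum>.+?)\s+(ትርጉሙ|ትርጉማቸው)\s+.*(ነው|ናቸው)\s*$"),
]

def extract_definitions(sentence):
    # Return (concept, sentence) pairs for every pattern that matches.
    pairs = []
    for pattern in DEFINITION_PATTERNS:
        match = pattern.search(sentence)
        if match:
            pairs.append((match.group("definiendum").strip(), sentence.strip()))
    return pairs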
B) Searching Definition

The searching definition module starts by accepting a natural language question from the user. Then any punctuation mark is removed from the question, stemming is applied, and character normalization is performed. There are two components in the searching definition module: identifying the definiendum (query analysis) and answering (returning the definition).

Definiendum identification is the vital component of the whole system, as it is a decisive factor in answering the question correctly. Although most Amharic definition questions place the definiendum at the beginning, we formulated an algorithm to extract the definiendum from the question, taking advantage of the stylistic way users present their questions. The identified definiendum is captured in a variable, and the last module returns the definition.

1) Question Analysis

Given a natural language question as input, the overall function of the question processing module is to process and analyse the question and to create some representation of the definiendum. Queries for search engines and QA systems differ in that queries in QA should be more specific in order to locate relevant documents.

The question answering process in DefAmharicQA, like that of most other question answering systems, begins with a question analysis phase that attempts to determine what the question is asking for and how to approach answering it. Question analysis receives an unstructured natural language question as input and identifies syntactic and semantic elements of the question, which are encoded as structured information that is later used by the other components of DefAmharicQA. Nearly all of DefAmharicQA's components depend in some way on the information produced by question analysis.

Even though definition questions are structurally simpler than the more general class of fact-seeking questions, they have a few properties which complicate their processing, as there are only a few available keywords in the question.

a) Question particle identification: Despite dealing with a wide variety of natural language inputs, there are some distinctive structures utilized by users to indicate definition queries. The question particles are used for different types of question formulation. Table 1 lists a few of the most common definition question particles.

TABLE 1: DEFINITION QUESTION PARTICLES

Question particles                                         Question type
<definiendum> ማለት ምን ማለት ነው?                          Term/expression
  (What does <definiendum> mean?)
<definiendum> ማን ነው? / <definiendum> ማን ናቸው?           Person name, place, organization
  (Who is/are <definiendum>?)
<definiendum> ምንድነው? / <definiendum> ምንድን ናቸው?       Term/expression
  (What is/are <definiendum>?)
የ<definiendum> ትርጉም ምንድነው?                            Term/expression
  (What is the meaning of <definiendum>?)
ተርጉም <definiendum>                                        Term/expression
  (Define <definiendum>)

Questions containing none of the question particles are classified as unanswerable, and no answer is returned. Otherwise, the system first checks whether the question has one of the specific question particles that clearly indicate the question type, such as ማለት, የሚባለው, ምንድነው, ምንድናቸው, ትርጉም, and ተርጉም, which always indicate that a certain description is being looked for. If the question contains ማን, ማናቸው, or ማን ናቸው, the question focus is a person, place, or organization type of question.
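The particle based classification of Table 1 can be sketched as a simple rule check in Python. The particle lists below are taken from the table and the preceding paragraph; the function name, the return labels, and the substring matching are illustrative assumptions.

PERSON_PLACE_ORG_PARTICLES = ["ማን ነው", "ማን ናቸው", "ማናቸው"]
TERM_PARTICLES = ["ማለት ምን ማለት ነው", "ምንድነው", "ምንድን ናቸው", "ትርጉም", "ተርጉም"]

def classify_question(question):
    # Return the question type implied by the particles, or flag the question
    # as unanswerable when no known particle is present.
    if any(particle in question for particle in PERSON_PLACE_ORG_PARTICLES):
        return "person/place/organization"
    if any(particle in question for particle in TERM_PARTICLES):
        return "term/expression"
    return "cannot be answered"

print(classify_question("ኃይሌ ገብረስላሴ ማን ነው?"))   # person/place/organization
print(classify_question("የህግ ትርጉም ምንድነው?"))      # term/expression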
b) Query Generation / Definiendum Extraction: The goal of query generation is to construct one or more queries in the query syntax of the document retrieval engine. Most, if not all, question analysis components create an IR query by starting with a bag-of-words approach: all the words in the question, ignoring case and order, are used to construct a query [7].

The major problem we face when dealing with definition questions is that little or no information is provided other than the target itself, which makes it difficult to construct rich IR queries that locate relevant documents. For instance, submitting the target ሃይሌ ገብረስላሴ as two separate terms to an IR engine not only retrieves relevant documents but may also retrieve documents about the famous film producer ሃይሌ ገሪማ and other people with the name ሃይሌ.

As question answering systems attempt to define a given target, it is likely that relevant documents contain the definiendum in the text, so one possible approach to improving the retrieval, and hence the quality of the resulting passages, is to use the target as a phrase when searching the text collection (i.e., in a relevant document both terms must appear next to each other, ignoring stopwords, and in the same order as in the query). For a simple example such as ሃይሌ ገብረስላሴ, this approach has the desired effect of dramatically reducing the number of documents retrieved from the collection while retaining those which are actually about the target.

Users' natural language questions are not submitted to the indexed file as they are. They require some modifications that enhance the possibility of matching the relevant concept in the definition catalogue. The query generation sub-module is therefore very critical: incorrect queries might return irrelevant descriptions, which leads to extracting incorrect definition answers. This in turn reduces recall significantly, as there could be documents that do not match the query. The query generation sub-module includes removal of punctuation marks, stemming, character normalization, number normalization for Ge'ez numerals, stopword removal, and synonym expansion.

Amharic documents use both types of numbering, so the different variations of numbers are included in the query (query expansion).
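The number normalization step can be illustrated with a small, assumed mapping between the basic Ge'ez numerals and Arabic numbers. Ge'ez numbering is not positional, so a complete converter would need more logic than this sketch; only the query expansion idea is shown.

# Minimal, assumed mapping between single Ge'ez numerals and Arabic numbers.
GEEZ_TO_ARABIC = {"፩": "1", "፪": "2", "፫": "3", "፬": "4", "፭": "5",
                  "፮": "6", "፯": "7", "፰": "8", "፱": "9", "፲": "10"}
ARABIC_TO_GEEZ = {value: key for key, value in GEEZ_TO_ARABIC.items()}

def expand_numbers(terms):
    # Add the alternative spelling of each numeric term so that documents
    # written with either numbering system can be matched.
    expanded = list(terms)
    for term in terms:
        if term in GEEZ_TO_ARABIC:
            expanded.append(GEEZ_TO_ARABIC[term])
        elif term in ARABIC_TO_GEEZ:
            expanded.append(ARABIC_TO_GEEZ[term])
    return expanded

print(expand_numbers(["አንቀጽ", "፪"]))   # ['አንቀጽ', '፪', '2']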
The query term may contain regular expressions such as "([definiendum] | ማለት | ምንድነው)", "(የ[definiendum] ትርጉም)", "(ተርጉም + [definiendum])", etc. Documents are therefore matched based on these regular expressions in addition to the normal query terms. We have identified a number of words that signal the position of the definiendum in the construction of a natural language query, and we classify them based on their position in the query as shown below:

1: ማለት የሚባለው ትርጉም

2: ተርጉም ፍታ አብራራ ዐብራራ ዓብራራ ኣብራራ ግለፅ ግለጽ ማለት የሚባለው በተለምዶ በልምድ ምን ምንድን ማን እነማን ትርጉሙ ትርጉማቸው የሚገልፀው

Our algorithm works based on the number of tokens in the user question and on the lists we constructed.
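A minimal sketch of this idea is shown below, assuming the word lists above are used to strip signal words from the beginning and end of the question so that whatever remains is taken as the definiendum. The function name, the abridged word lists, and the punctuation handling are assumptions for illustration only.

import re

# Abridged signal-word lists (assumed subsets of the lists above).
TRAILING_SIGNALS = ["ማለት", "የሚባለው", "ትርጉም", "ትርጉሙ", "ምንድነው", "ምንድን",
                    "ማን", "እነማን", "ነው", "ናቸው", "ምን"]
LEADING_SIGNALS = ["ተርጉም", "ፍታ", "አብራራ", "ግለጽ"]

def extract_definiendum(question):
    # Remove punctuation, then drop leading/trailing signal words; whatever
    # remains is taken as the definiendum phrase.
    tokens = re.sub(r"[?፧።፡!,]", " ", question).split()
    while tokens and tokens[0] in LEADING_SIGNALS:
        tokens.pop(0)
    while tokens and tokens[-1] in TRAILING_SIGNALS:
        tokens.pop()
    return " ".join(tokens)

print(extract_definiendum("ኃይሌ ገብረስላሴ ማን ነው?"))   # ኃይሌ ገብረስላሴ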
2) Retrieving Definition

To return the answer to a definition question, the indexed file which contains the concept-description pairs is opened and searched for a match between a concept and the definiendum. If the definiendum is found in the definition catalogue, the system displays the answer; otherwise it displays the message "No answer".
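A minimal sketch of this lookup follows, assuming the "$"-separated definition catalogue built earlier and an exact match on the (normalized and stemmed) concept; the file name and the matching strategy are assumptions, not the exact implementation.

def retrieve_definition(definiendum, catalogue_file="definition_catalogue.txt"):
    # Scan the concept$description pairs and return the description whose
    # concept matches the definiendum; otherwise report "No answer".
    with open(catalogue_file, encoding="utf-8") as catalogue:
        for line in catalogue:
            if "$" not in line:
                continue
            concept, description = line.rstrip("\n").split("$", 1)
            if concept == definiendum:
                return description
    return "No answer"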
VI. EXPERIMENT

To develop our prototype, the Python programming language is used, as it is an ideal language for processing text and supports Unicode. Amharic corpora, most of them related to law, are collected from http://www.abyssinialaw.com/ and the concerned organization, and definition questions are prepared to evaluate the prototype.

The evaluation of DefAmharicQA focuses mainly on the accuracy and recall of answers. Precision for definition QA is calculated as the number of correct concepts retrieved divided by the total number of concepts retrieved.

The assessment takes into consideration the number of non-whitespace characters in the entire answer string, the number of vital nuggets retrieved, and the number of non-vital nuggets retrieved. The length-based measure gives a system an allowance of 100 non-whitespace characters for each correct concept it retrieves.

Equation 1: Official definition of the F-measure

Let:
  r = number of vital nuggets returned in a response
  a = number of non-vital nuggets returned in a response
  R = total number of vital nuggets in the assessors' list
  l = number of non-whitespace characters in the entire answer string

Then:
  Nugget recall:     NR = r / R
  Allowance:         α = 100 × (r + a)
  Nugget precision:  NP = 1 if l < α, otherwise NP = 1 - (l - α) / l

Finally:
  F(β) = ((β² + 1) × NP × NR) / (β² × NP + NR)

In this work, the assumption is that nugget precision and nugget recall are equally important (i.e., β = 1).
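For completeness, the score in Equation 1 can be computed with a few lines of Python; the function below simply restates the formula (with β = 1 by default) and is not part of DefAmharicQA itself.

def nugget_f_measure(r, a, R, l, beta=1.0):
    # r: vital nuggets returned, a: non-vital nuggets returned,
    # R: vital nuggets in the assessors' list (assumed R > 0),
    # l: answer length in non-whitespace characters.
    recall = r / R
    allowance = 100 * (r + a)
    precision = 1.0 if l < allowance else 1.0 - (l - allowance) / l
    denominator = beta ** 2 * precision + recall
    return (beta ** 2 + 1) * precision * recall / denominator if denominator else 0.0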
1) Evaluation of Definiendum Identification

DefAmharicQA's question processing module has been evaluated for correct identification of the definiendum, as improper identification of the definiendum leads to incorrect results and affects the performance of the system. A hard pattern, rule based technique is used to identify the definiendum. In addition to the twenty performance testing questions, seventeen questions are used to evaluate the query processing module; the module identifies fifteen (88.24%) definiendums properly, while it wrongly identifies two (11.76%).

2) Normalization and Performance

The performance of DefAmharicQA has been evaluated before and after punctuation mark removal and normalization. This evaluation shows the significant effect of these steps on performance. Out of the 20 questions, the system does not retrieve a single vital nugget for seven questions before normalization.

TABLE 2: SUMMARY OF NORMALIZATION VS PERFORMANCE

                        Precision   Recall   F-measure
Before normalization    0.78        0.44     0.49
After normalization     0.83        0.52     0.56

From the above table it can be inferred that document normalization has a significant effect on the precision and recall of DefAmharicQA. For some questions, although the question processing module identifies the definiendum properly from the natural language question, it cannot be matched with the appropriate target word in the document because of orthographic differences. For instance, the question "ህግ ማለት ምን ማለት ነው?" did not get the appropriate definition: even though the question processing module identified the definiendum, the definition extraction module could not extract a definition because of the spelling difference between "ህግ" and "ሕግ", and the system responded "No answer". As a result, recall is penalized and precision is reduced compared to the normalized version.

3) Effect of Stemming on Effectiveness

The performance of DefAmharicQA has also been evaluated before and after stemming. This evaluation shows the impact of stemming on performance.

TABLE 3: EFFECT OF STEMMING ON PERFORMANCE

                    Precision   Recall   F-measure
Before stemming     0.83        0.52     0.56
After stemming      0.856       0.730    0.788

Stemming increases the performance of DefAmharicQA.

4) Evaluation of DefAmharicQA

The final system performance experiment is conducted under the assumption that nugget precision and nugget recall are equally important (i.e., β = 1).
The human assessor prepared all vital and non-vital nuggets of the answers for each question manually. The system performance is tested based on the number of retrieved and non-retrieved nuggets for each question and cross-checked against the assessor's list of ground truths.

The calculated average evaluation metrics are: nugget precision = 85.6%, nugget recall = 73%, and F-measure = 78.8%.

The normalized and stemmed Amharic definition question answering technique outperforms the others (i.e., without normalization and stemming).

TABLE 4: SUMMARY OF PERFORMANCE OF DefAmharicQA

Correct answer   Wrong answer   No answer   Precision   Recall   F-measure (β = 1)
(85%)            (15%)          (0%)        0.856       0.730    0.788

VII. CONCLUSION AND FUTURE WORK

In this research we proposed a definitional question answering system for the Amharic language. This system answers legal definition questions expressed in Amharic over a closed legal corpus. It is based on an approach which employs little linguistic analysis and language understanding capability. DefAmharicQA identifies candidate definitions by using a set of lexical patterns. For this study, hard pattern matching is used to extract concept-description relations from the Amharic corpus and to identify the definiendum from the user's natural language question.

The DefAmharicQA system is evaluated using Amharic documents and 20 definitive questions, and the results obtained are very encouraging. DefAmharicQA succeeded in generating pertinent definitions for seventeen out of the 20 definition questions, i.e., it provided answer sets for 17 of the 20 questions (85%). For these 20 questions, 10 (50%) contained all the vital nuggets of the assessor list. The system achieved a nugget recall of 73.0%, a nugget precision of 85.6%, and an F-measure of 78.8%. We found that extracting the concept-description relationship from the corpora is the major language dependent component of the system, in addition to query analysis.

The application of surface text patterns makes the system effective. From the research we infer that normalization and stemming increase the performance of Amharic question answering for definitive questions.

However, the system's effectiveness is highly affected by the identification of the definiendum word or phrase and by the extraction of definitions from the corpus. The major reasons are that the definition and answer patterns are prepared manually, that inappropriate descriptions are sometimes extracted for a concept from the corpus, and that the question analysis may misidentify the definiendum.

Although this research attempted to develop a closed domain definition question answering system with a promising F-measure value, the following issues must be dealt with to increase the effectiveness and efficiency of the system.

-Sequence mining algorithm: an automatic sequence mining algorithm can be used to extract concept-description relationships from large corpora.

-Amharic spell checker: integrating an Amharic spelling checker helps to improve performance, as users usually make mistakes while writing the query.

-Standard corpus: having a standard corpus to measure the performance of definition question answering systems is essential.

-Other Ethiopic languages: develop a definition QA system for other Ethiopic languages.

-Amharic search engine: integrate the system with an Amharic search engine.

-Amharic WordNet: the development of an Amharic WordNet that includes word synonyms, hyponyms, and antonyms is very essential so as to increase the recall ability of the system.
-Named entity extractor: target extraction is a key, non-trivial capability critical to the success of a definition question answering system. Similarly, database lookup works only if the relevant definiendums are identified and indexed while pre-processing the corpus. Both of these issues point to the need for a more robust named entity extractor, capable of handling specialized names (e.g., "እሳቱ ተሰማ", "የሸዋወርቅ", "አትጠገብ", …).

-NLP tools: the improvement of the NLP tools used in this research is another area that needs much work, in particular the syntactic parser and, especially, the semantic analyzer (which is a quite open problem in the NLP community).

-Stemmer: to enhance the performance of the Amharic definitive question answering system, further studies should be conducted to design an effective stemmer that can handle infixes of Amharic words, since the language is morphologically rich.

-Query expansion: future research can also improve the performance of the system by exploring and integrating query expansion techniques.

-Domain experts: to reduce the effect of judgment on the categorization of vital and non-vital nuggets, a team of experts has to participate in the evaluation. In this study only one legal expert participated.

-Open domain: this research is conducted on a closed domain corpus; hence the same research can be extended to open domain Amharic definition questions.

Amharic question answering can be regarded as being at its infant stage; therefore researchers can work on other types of QA (how, why, list, etc.) to assist users in getting precise and concise answers to their queries.

REFERENCES

[1] Liu, J. (2007). An Intelligent FAQ Answering System Using a Combination of Statistic and Semantic IR Techniques. Department of Computer Science and Electronics, Mälardalen University.
[2] Saggion, M. A. (2004). A Pattern Based Approach to Answering Factoid, List and Definition Questions. University of Sheffield, Department of Computer Science, Sheffield.
[3] Andrew, G. M. (2005). Open-Domain Question Answering. PhD Thesis, University of Sheffield, Sheffield.
[4] Figueroa, A. G. (2010). Finding Answers to Definition Questions on the Web. PhD dissertation, Saarland University, Department of Computational Linguistics and Phonetics.
[5] Huang, G. T. (2004). Chinese Question-Answering System. Journal of Computer Science and Technology, 19.
[6] Omar Trigui, L. H. (2008). DefArabicQA: Arabic Definition Question Answering System. University of Sfax.
[7] Seid Yimam. (2009). Amharic Question Answering System for Factoid Questions. MSc Thesis, Addis Ababa University, Department of Computer Science, Addis Ababa.
[8] Tessema Mindaye, M. S. (2010). The Need for Amharic WordNet.
[9] Greenwood, A. (2005). Open-Domain Question Answering. PhD Thesis, University of Sheffield, Sheffield.
[10] Nega Alemayehu, Willett, P. (2002). Stemming of Amharic Words for Information Retrieval. Literary and Linguistic Computing, 17(1), pp. 1-17. Oxford: Oxford University Press.