
Rules for Ontology Population from Text of Malaysia Medicinal Herbs Domain

Zaharudin Ibrahim1,2, Shahrul Azman Noah2, and Mahanem Mat Noor3

1 Department of Information System Management,
Faculty of Information Management, Universiti Teknologi MARA,
40450 Shah Alam, Selangor, Malaysia
2 Department of Information Science,
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia,
43600 Selangor, Malaysia
3 School of Biosciences and Biotechnology,
Faculty of Science and Technology, Universiti Kebangsaan Malaysia,
43600 Selangor, Malaysia
zahar347@salam.uitm.edu.my, samn@ftsm.ukm.my, mahanem@ukm.my

Abstract. The primary goal of ontology development is to share and
reuse domain knowledge among people or machines. This study focuses
on extracting semantic relationships from unstructured textual documents
about medicinal herbs collected from websites, and proposes a lexical
pattern technique to acquire semantic relationships such as synonym,
hyponym, and part-of relationships. The results show that nine object
properties (or relations) and 105 lexico-syntactic patterns have been
identified manually, including one from the Hearst hyponym rules. The
lexical patterns have linked 7252 terms that have potential as ontological
terms. Based on this study, it is believed that determining the lexical
patterns at an early stage is helpful in selecting relevant terms from a
wide collection of terms in the corpus. However, the relations and
lexico-syntactic patterns or rules have to be verified by domain experts
before the rules are applied to the wider collection in an attempt to find
more possible rules.

Keywords: Knowledge management and extraction, medicinal herb, semantic
web, Natural Language Processing, knowledge engineering.

1 Introduction

Ontology is a kind of knowledge that was historically introduced by Aristotle
and has recently become a topic of interest in computer science. Ontology
provides a shared understanding of the domain of interest to support communication
among human and computer agents. It is typically represented in a
machine-processable representation language [1] and is also an explicit formal
specification of terms, which represents the intended meaning of concepts in the
domain and the relations among them, and is considered a crucial factor for the
success of many knowledge-based applications [2]. With the overwhelming increase
of biomedical literature in digital form, there is a need to extract knowledge
from the literature [3]. Ontology may also be helpful in fulfilling the need to
uncover information present in large and unstructured bodies of text, commonly
referred to as non-interactive literature [4].

J. Yu et al. (Eds.): RSKT 2010, LNAI 6401, pp. 386–394, 2010.
© Springer-Verlag Berlin Heidelberg 2010

Ontology is considered the backbone of many current applications, such
as knowledge-based systems, knowledge management systems and semantic web
applications. One of the important tasks in the development of such systems
is knowledge acquisition. Conventional approaches to knowledge acquisition
mainly involve interviewing domain experts and subsequently modeling and
transforming the acquired knowledge into some form of knowledge representation.
However, a huge amount of knowledge is currently embedded in the academic
literature and has the potential of being exploited for knowledge
construction. The main inherent issue is that such knowledge is highly
unstructured and difficult to transform into a meaningful model. Although a
number of automated approaches for acquiring such knowledge have been proposed
[5], [6], their success has yet to be seen. Such approaches have only been
tested on general domains, whereas scientific domains such as the medicinal
herbs domain have yet to be explored. While automated approaches seem to offer
promising solutions, humans still play an important role in validating the
correctness of the acquired knowledge, particularly in scientific domains.
Human experts are still required to construct the TBox of the ontology, while
the ABox can be supported by some form of automated approach. The creation of
the ABox can be considered ontology population, whereby the TBox is populated
with relevant instances.
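
To make the TBox/ABox distinction concrete, the following is a minimal Python sketch and not the ontology used in this study. The concept names (Herb, PlantPart), the relation name (hasPart) and the individuals are hypothetical placeholders chosen only for illustration. The TBox holds the concepts and relation signatures; populating the ABox means adding individuals and facts that respect them.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # TBox: concept names and relation signatures (hypothetical examples).
    concepts: set = field(default_factory=set)
    relations: dict = field(default_factory=dict)  # name -> (domain, range)
    # ABox: assertions about individuals.
    instances: dict = field(default_factory=dict)  # individual -> concept
    facts: set = field(default_factory=set)        # (subject, relation, object)

    def add_instance(self, individual, concept):
        """Assert that an individual belongs to a TBox concept."""
        if concept not in self.concepts:
            raise ValueError(f"unknown concept: {concept}")
        self.instances[individual] = concept

    def add_fact(self, subject, relation, obj):
        """Assert a relation between two individuals, checked against the TBox."""
        domain, range_ = self.relations[relation]
        if self.instances.get(subject) != domain or self.instances.get(obj) != range_:
            raise ValueError("assertion violates the TBox signature")
        self.facts.add((subject, relation, obj))


def build_example():
    # The TBox is fixed up front (in the paper, by domain experts).
    onto = Ontology(
        concepts={"Herb", "PlantPart"},
        relations={"hasPart": ("Herb", "PlantPart")},
    )
    # Ontology population: fill the ABox with instances and facts.
    onto.add_instance("Eurycoma longifolia", "Herb")
    onto.add_instance("root", "PlantPart")
    onto.add_fact("Eurycoma longifolia", "hasPart", "root")
    return onto
```

In this sketch, an assertion that refers to a concept missing from the TBox is rejected, which mirrors the division of labour described above: experts define the schema, and automated steps only add instances to it.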
This study, therefore, proposes a set of rules for populating a medicinal herbs
domain ontology from unstructured text. The TBox ontology for this domain was
constructed from a series of interviews with domain experts as well as analysis
of available reputable literature. Our proposed approach is based on pattern
matching and Named Entity Recognition (NER), whereby semantic relations are
identified by analyzing given sentences and identified entities are subsequently
asserted as instances of concepts of the TBox ontology.
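
As a rough illustration of pattern-based population, the sketch below applies a single Hearst-style "such as" pattern and asserts the enumerated terms as instances of a TBox concept. It is a simplified assumption-laden example, not the rule set proposed in this paper: the concept table, the example sentences and the function name extract_instances are hypothetical, and naive string splitting stands in for a real NER or noun-phrase chunking step.

```python
import re

# Hypothetical TBox concepts, keyed by the plural surface form seen in text.
TBOX_CONCEPTS = {"herbs": "Herb", "diseases": "Disease"}

# Hearst-style pattern "NP such as NP, NP and NP" (one illustrative rule).
SUCH_AS = re.compile(r"(?P<concept>\w+) such as (?P<items>[^.;]+)", re.IGNORECASE)


def extract_instances(sentence):
    """Return (instance, concept) pairs asserted by a 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        concept = TBOX_CONCEPTS.get(match.group("concept").lower())
        if concept is None:
            continue  # head noun is not a TBox concept, so nothing is asserted
        # Split the enumeration on commas and on the conjunctions "and" / "or".
        items = re.split(
            r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+", match.group("items")
        )
        pairs.extend((item.strip(), concept) for item in items if item.strip())
    return pairs
```

For the made-up sentence "The market sells herbs such as tongkat ali, misai kucing and kacip fatimah.", the function yields three (term, "Herb") pairs. The sketch only works when the enumeration ends the clause; trailing words would be absorbed into the last term, which is one reason a practical system needs proper entity recognition and expert verification of the extracted candidates.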

2 Related Research

Due to limited space, we briefly review representative approaches relevant to
building an ontology for the herbs domain. There are several studies on ontology
building from unstructured text. Hearst's technique [7] has been employed to
extract concept terms from the literature and to discover new patterns through
corpus exploration. The technique acquires hyponym relations automatically by
identifying a set of frequently used, unambiguous lexico-syntactic patterns in the
form of regular expressions. Moreover, [8] used these techniques to extract
hyponym, meronym and synonym relations from an agricultural domain corpus. Two
processes are involved in this corpus-based ontology extraction: (1) the process for
finding lexico-syntactic patterns and (2) the process for extracting corpus-based
