Zaharudin Ibrahim1,2, Shahrul Azman Noah2, and Mahanem Mat Noor3
1 Department of Information System Management, Faculty of Information Management, Universiti Teknologi MARA, 40450 Shah Alam, Selangor, Malaysia
2 Department of Information Science, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 Selangor, Malaysia
3 School of Biosciences and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 Selangor, Malaysia
zahar347@salam.uitm.edu.my, samn@ftsm.ukm.my, mahanem@ukm.my
Abstract. The primary goal of ontology development is to share and reuse domain knowledge among people or machines. This study focuses on extracting semantic relationships from unstructured textual documents about medicinal herbs collected from websites, and proposes a lexical pattern technique to acquire semantic relationships such as synonym, hyponym, and part-of relationships. The results show that nine object properties (or relations) and 105 lexico-syntactic patterns were identified manually, including one from the Hearst hyponym rules. The lexical patterns linked 7252 terms that are potential ontological terms. Based on this study, it is believed that determining the lexical patterns at an early stage helps in selecting relevant terms from the wide collection of terms in the corpus. However, the relations and lexico-syntactic patterns or rules have to be verified by a domain expert before the rules are applied to a wider collection in an attempt to find more possible rules.
Keywords: Knowledge management and extraction, medicinal herb, semantic web, Natural Language Processing, knowledge engineering.
1 Introduction
Ontology is a kind of knowledge that was historically introduced by Aristotle and has recently become a topic of interest in computer science. An ontology provides a shared understanding of a domain of interest to support communication among human and computer agents. It is typically represented in a machine-processable representation language [1]. It is also an explicit formal specification of the terms in a domain and the relations among them, representing the intended meaning of concepts, and is considered a crucial factor for the success of many knowledge-based applications [2]. With the overwhelming increase of biomedical literature in digital form, there is a need to extract knowledge from the literature [3]. Ontology may also be helpful in uncovering information present in large and unstructured bodies of text, commonly referred to as non-interactive literature [4].
J. Yu et al. (Eds.): RSKT 2010, LNAI 6401, pp. 386–394, 2010.
© Springer-Verlag Berlin Heidelberg 2010
Ontology is considered the backbone of many current applications, such as knowledge-based systems, knowledge management systems and semantic web applications. One of the important tasks in the development of such systems is knowledge acquisition. Conventional approaches to knowledge acquisition mainly involve interviewing domain experts and subsequently modeling and transforming the acquired knowledge into some form of knowledge representation. However, a huge amount of knowledge is currently embedded in the academic literature and has the potential to be exploited for knowledge construction. The main inherent issue is that such knowledge is highly unstructured and difficult to transform into a meaningful model. Although a number of automated approaches to acquiring such knowledge have been proposed by [5] and [6], their success has yet to be seen. Such approaches have only been tested on general domains, whereas scientific domains such as the medicinal herbs domain have yet to be explored.
While automated approaches seem to offer promising solutions, humans still play an important role in validating the correctness of the acquired knowledge, particularly in scientific domains. Human experts are still required to construct the TBox of an ontology, while the ABox can be supported by some form of automated approach. The creation of the ABox can be considered ontology population, whereby the TBox is populated with relevant instances. This study, therefore, proposes a set of rules for populating a medicinal herbs domain ontology from unstructured text.
The TBox ontology for this domain was constructed from a series of interviews with domain experts as well as an analysis of available reputable literature. Our proposed approach is based on pattern matching and Named Entity Recognition (NER), whereby semantic relations are identified by analyzing given sentences, and the identified entities are subsequently asserted as instances of concepts of the TBox ontology.
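To make the idea of pattern-based ontology population concrete, the following is a minimal sketch in Python and not the implementation used in this study. It assumes a toy gazetteer in place of a real NER component, and the relation names (`treats`, `hasPart`), concept names, and the two lexical patterns are hypothetical illustrations rather than the relations and patterns identified here. A matched pair is asserted into the ABox only when both entities carry the concept types that the TBox relation expects.

```python
import re

# Hypothetical TBox fragment: relation -> (domain concept, range concept).
TBOX_RELATIONS = {
    "treats": ("Herb", "Disease"),
    "hasPart": ("Herb", "PlantPart"),
}

# Toy gazetteer standing in for a real NER component (assumption).
GAZETTEER = {
    "ginger": "Herb",
    "turmeric": "Herb",
    "nausea": "Disease",
    "rhizome": "PlantPart",
}

# Illustrative lexical patterns; these are not the patterns from the study.
PATTERNS = [
    (re.compile(r"\b(?P<subj>\w+) is used to treat (?P<obj>\w+)", re.IGNORECASE), "treats"),
    (re.compile(r"\bthe (?P<obj>\w+) of (?P<subj>\w+)", re.IGNORECASE), "hasPart"),
]


def populate_abox(sentences):
    """Return (concept assertions, relation assertions) found in the sentences."""
    instances, relations = set(), set()
    for sentence in sentences:
        for regex, relation in PATTERNS:
            domain, range_ = TBOX_RELATIONS[relation]
            for match in regex.finditer(sentence):
                subj = match.group("subj").lower()
                obj = match.group("obj").lower()
                # Assert only when both entities have the types the TBox expects.
                if GAZETTEER.get(subj) == domain and GAZETTEER.get(obj) == range_:
                    instances.add((subj, domain))
                    instances.add((obj, range_))
                    relations.add((subj, relation, obj))
    return instances, relations


sentences = [
    "Ginger is used to treat nausea.",
    "The rhizome of turmeric is dried and ground.",
    "The price of ginger is rising.",  # matches the pattern but fails the type check
]
instances, relations = populate_abox(sentences)
```

The third sentence shows why the entity check matters: the surface pattern "the X of Y" matches, but "price" is not a recognised plant part, so no assertion is made. In practice such type constraints reduce, but do not remove, the need for validation by a domain expert.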
2 Related Research
Due to limited space, we briefly review representative approaches to building an ontology for the herbs domain. There are several studies on ontology building from unstructured text. Hearst's technique [7] has been employed to extract concept terms from the literature and to discover new patterns through corpus exploration. The technique acquires hyponym relations automatically by identifying a set of frequently used, unambiguous lexico-syntactic patterns in the form of regular expressions. Moreover, [8] used these techniques to extract hyponym, meronym and synonym relations from an agricultural domain corpus. Two processes are involved in this corpus-based ontology extraction: (1) the process for finding lexico-syntactic patterns and (2) the process for extracting corpus-based