
Engineering Applications of Artificial Intelligence 26 (2013) 1092–1106

An automatic approach for ontology-based feature extraction from heterogeneous textual resources

Carlos Vicient*, David Sánchez, Antonio Moreno
Intelligent Technologies for Advanced Knowledge Acquisition (ITAKA), Departament d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili, Av. Països Catalans, 26, 43007 Tarragona, Catalonia, Spain

* Corresponding author. Tel.: +34 977256563; fax: +34 977559710.
E-mail addresses: carlos.vicient@urv.cat (C. Vicient), david.sanchez@urv.cat (D. Sánchez), antonio.moreno@urv.cat (A. Moreno).

Article history:
Received 24 February 2012
Received in revised form 24 July 2012
Accepted 10 August 2012
Available online 13 September 2012

Keywords: Information extraction, Feature extraction, Ontologies, Wikipedia

Abstract

Data mining algorithms such as data classification or clustering methods exploit features of entities to characterise, group or classify them according to their resemblance. In the past, many feature extraction methods focused on the analysis of numerical or categorical properties. In recent years, motivated by the success of the Information Society and the WWW, which has made available enormous amounts of textual electronic resources, researchers have proposed semantic data classification and clustering methods that exploit textual data at a conceptual level. To do so, these methods rely on pre-annotated inputs in which text has been mapped to its formal semantics according to one or several knowledge structures (e.g. ontologies, taxonomies). Hence, they are hampered by the bottleneck introduced by the manual semantic mapping process. To tackle this problem, this paper presents a domain-independent, automatic and unsupervised method to detect relevant features from heterogeneous textual resources, associating them to concepts modelled in a background ontology. The method has been applied to raw text resources and also to semi-structured ones (Wikipedia articles). It has been tested in the Tourism domain, showing promising results.

© 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.engappai.2012.08.002

1. Introduction

Since the creation of the World Wide Web, its structure and architecture have been in constant growth and development. Nowadays the Web has turned into what we know as the Social Web, where all users are able to add and modify its content. This fact has resulted in an exponential growth of the available information. However, its lack of structure has brought some problems: the access and retrieval of information is difficult and inaccurate, and the unstructured and mainly textual information can hardly be interpreted by IT applications (Fensel et al., 2002). In order to solve these problems, the Semantic Web (Berners-Lee et al., 2001) has been proposed.

The Semantic Web is an evolving extension of the World Wide Web in which the semantics of information and services are explicitly defined, making it possible for the Web to understand and satisfy the requests of users and computers. One of the basic pillars of the Semantic Web concept is the idea of having explicit semantic information on Web pages that can be used by intelligent agents in order to solve complex problems such as Information Retrieval and Question Answering (Brill, 2003). The final objective of the Semantic Web is to be able to semantically analyse and catalog Web contents. These tasks require a set of structures to model knowledge, and a linkage between the knowledge and Web contents. The Semantic Web relies on two basic components: ontologies and semantic annotations.

On one hand, ontologies are formal, explicit specifications of a shared conceptualization (Guarino, 1998). Ontologies are useful to model knowledge in a formal, abstract way which can be interpreted and exploited by computers. On the other hand, annotations are linkages between the textual contents of a document and their formal semantics, modelled in a knowledge structure such as an ontology, enabling the development of knowledge-based systems (Setchi et al., 2011).

The availability of such an amount of electronic resources has brought a growing interest in the research community in the exploitation of Web contents. Data mining algorithms, and particularly data classification and clustering methods, aim to analyse, characterise, classify or group data. Since classical data analysis methods were designed to deal with numerical data, in recent years researchers have proposed novel semantically-grounded methods that, by exploiting structured knowledge bases like ontologies, taxonomies or folksonomies, are able to semantically interpret textual data such as Web contents (Hotho et al., 2002). In He et al. (2008) the authors propose a data classification method that generalises set-valued textual data (e.g. query logs, clinical records, etc.) relying on domain taxonomies. In Batet et al. (2011, 2010) the authors adapt well-known clustering methods so that they can semantically interpret (i.e. compare, group and aggregate) multi-featured textual

data (e.g. polls, electronic health-care records, etc.), exploiting general purpose and domain-specific ontologies. A similar process is carried out in Martínez et al. (2012), but focusing on semantic micro-aggregation of textual values for privacy-preserving purposes. Even though these methods deal with textual features, they suppose structured inputs rather than raw textual documents, in which words describing an entity have been extracted and associated to concepts whose semantics are modelled in a knowledge base (Hotho et al., 2002).

Unfortunately, semantically-tagged features, which can be useful for semantic data analysis, are scarce nowadays. Due to the semantic annotation bottleneck that characterises manual approaches, most of the existing textual resources are plain text documents or ad-hoc and partially annotated ones (such as Wikipedia articles). Hence, the applicability of semantic-based data mining methods is hampered by the availability of adequately processed input information.

To tackle this problem, this paper proposes a method that, given an input document describing an entity (e.g. a Web resource, a Wikipedia article or a plain text document) and an ontology stating which features should be extracted (e.g. tourism points of interest), is able to detect, extract and semantically annotate relevant textual features describing that entity.

The method has been designed in a general way so that it can be applied to a range of textual documents going from raw plain text to semi-structured resources (e.g. tagged Wikipedia articles). In this last case, the method is able to exploit the pre-processed input to complement its own learning algorithms. The key point of the work is to leverage syntactical parsing and several natural language processing techniques with the knowledge contained in the input ontology, and to use the Web as a corpus to assist the learning process, in order to (1) identify relevant features describing a particular entity in the input document, and (2) associate, if applicable, the extracted features to concepts contained in the input ontology. In this manner, the output of the system consists of tagged features which can be directly exploited by semantically grounded data analysis methods such as clustering (Batet et al., 2011).

The rest of the paper is organised as follows: Section 2 discusses related works proposing annotation systems for textual data. Section 3 provides and discusses the theoretical background, techniques and working environment for the proposed method. Section 4 details and formalises the proposed method, detailing also how it can be applied to different kinds of inputs (unstructured and semi-structured text). Section 5 evaluates the proposed method when dealing with touristic destinations, testing the behaviour and influence of learning parameters and input types and, finally, showing the global performance of the method when analysing different domains. The final section discusses and summarises the main contributions of the work.

2. Related work

In the context of the Semantic Web, several methods have been proposed to detect and annotate relevant entities from electronic resources. First, manual approaches have been developed to assist the user in the annotation process, such as Annotea (Koivunen, 2005), CREAM (Handschuh et al., 2003), NOMOS (Niekrasz and Gruenstein, 2006) or Vannotea (Schroeter et al., 2003). Those systems rely on the skills and will of a community of users to detect and tag entities within Web content. Considering that there are 1 trillion pages on the Web, it is easy to envisage the unfeasibility of manual annotation of Web resources.

On the contrary, some authors have tried to automate some of the stages of the annotation process (proposing supervised systems) in order to overcome the bottleneck of manual solutions. Melita (Ciravegna et al., 2002) is based on user-defined rules and pre-defined annotations, which are employed to propose new ones. Supervised approaches, however, are difficult to apply, due to the effort required to compile a large and representative training set. Cerno is another system for the semi-automated annotation of Web resources (Kiyavitskaya et al., 2005, 2009). The authors propose the use of patterns/annotation schemas (addressed to extract objects such as email addresses, phone numbers, dates and prices) to obtain the candidates to annotate; this set is then annotated by means of a domain conceptual model. That model represents the information of a particular domain through concepts, relationships and attributes (in an entity-relation based syntax). Supervised systems also use extraction rules obtained from a set of pre-tagged data (Califf and Mooney, 2003; Roberts et al., 2007). WebKB (Cafarella et al., 2005) and Armadillo (Alfonseca and Manandhar, 2002) use supervised techniques to extract information from computer science websites. Likewise, S-CREAM (Cunningham et al., 2002) uses Machine Learning techniques to annotate a particular document with respect to an ontology, given a set of annotated examples.

Other supervised systems present a more ad-hoc design and focus on a specific domain of knowledge. In Maedche et al. (2003) a system that works on the Tourism domain is proposed. They combine lexical knowledge, extraction rules and ontologies in order to extract information in the form of instantiated concepts and attributes that are stored in an ontology-like fashion (hotel names, number of rooms, prices, etc.). The most interesting characteristic is the fact that the pre-defined knowledge structures are extended as a result of the Information Extraction process, improving and completing them. They use several Ontology Learning techniques already developed for the OntoEdit system (Staab and Maedche, 2000). A human expert has to validate each extension before continuing. Another system focused on the Tourism domain is presented in Feilmayr et al. (2009). A domain-dependent crawler collects Web pages corresponding to accommodation websites. This corpus is passed to an extraction module based on the GATE framework (Cunningham et al., 2002), which provides a number of text engineering components. It performs an annotation of the Web pages in the corpus, supported by a domain-dependent ontology and rule set. The extracted tokens are ranked as a function of their frequency and relevancy for the domain. Semantic-ART (Sapkota et al., 2011) is another system that aims to annotate texts associated to regulatory information in order to convert them into semantic models. The system assumes that a regulation typically comprises at least one statement (a sentence in the regulation), which must have a subject, an obligation and an action. The presence of an obligation (i.e., sentences containing obligatory verbs) in a paragraph helps to distinguish the regulation from ordinary text. The subjects and objects of these sentences are directly searched in an input ontology. WordNet is used to look for synonyms of these words if a direct matching is not found.

Another domain-dependent system is SOBA (Buitelaar et al., 2008), a sub-component of SmartWeb (a multi-modal dialog system that derives answers from unstructured resources such as the Web), which automatically populates a knowledge base with information extracted from soccer match reports found on the Web. The extracted information is defined with respect to an underlying ontology. The SOBA system consists of a Web crawler, linguistic annotation components and a module for the transformation of linguistic annotations into an ontology-based representation. The first component enables the automatic creation of a soccer corpus, which is kept up-to-date on a daily basis. Linguistic annotation is based on finite-state techniques and unification-based algorithms. It implements basic grammars for the annotation of people, locations, numerals, and date and time expressions. On top of these, rules for extracting soccer-specific entities, such as soccer players, teams and tournaments, are implemented. Finally, data are transformed

into ontological facts, by means of tabular processing (applying wrapper-like techniques) and text matching (by means of F-logic structures specified in a declarative form).

Other systems like KnowItAll (Etzioni et al., 2005) rely on the redundancy of the Web to perform a bootstrapped information extraction process iteratively. Confirmation of the correctness of the obtained information is requested from the user in order to re-execute the process. KnowItAll is self-supervised; instead of using hand-tagged training data, the system selects and labels its own training examples and iteratively bootstraps its learning process. For a given relation, the set of generic patterns is used to automatically instantiate relation-specific extraction rules, which are then used to learn domain-specific extraction rules. The rules are applied to Web pages identified via search-engine queries, and the resulting extractions are assigned a probability using information-theoretic measures derived from search engine hit counts. Next, KnowItAll uses frequency statistics computed by querying search engines to identify which instantiations are most likely to be members of the class. KnowItAll is relation-specific in the sense that it requires a laborious bootstrapping process for each relation of interest, and the set of relations has to be named by the human user in advance. This is a significant obstacle to open-ended extraction, because unanticipated concepts and relations are often encountered while processing text.

Completely automatic and unsupervised annotation systems are scarce. SemTag (Dill et al., 2003) performs automated semantic tagging of large corpora based on the Seeker platform for text analysis, and tags a large number of pages with the terms included in a domain ontology named TAP. This ontology contains lexical and taxonomic information about music, movies, sports, health and other issues, and SemTag detects the occurrence of these entities in Web pages. It disambiguates the sense of the words using neighbour tokens and corpus statistics, picking the best label for a token. KIM (Kiryakov et al., 2004) is another example of an unsupervised domain-independent system. It scans documents looking for entities corresponding to instances in its input ontology.

The system proposed in Zavitsanos et al. (2010) follows a sequence of annotation steps. First, all the ontology concepts (and their stems) are directly searched in the text. In a second stage, WordNet is used to obtain synonyms of the ontology concepts, and the system looks for them in the analysed text. Later, the authors propose the use of a semantic similarity measure (path length in WordNet) which is computed between all the words in the text and all the ontology concepts. Finally, another similarity measure, based on Wikipedia, is also considered. Since only words contained in the ontology, or those corresponding to their synonyms or semantically similar ones, are annotated, this work is limited to the coverage offered by the input knowledge repositories. Named entities or recently minted terms which cannot be found in Wikipedia, for example, are unlikely to result in any annotation.

Another interesting annotation application is presented in Michelson and Knoblock (2007). In this case, the authors use a reference set of elements (e.g., on-line collections containing structured data about cars, comics or general facts) to annotate ungrammatical sources like the texts contained in posts. First of all, the elements of those posts are evaluated using the TF-IDF metric. Then, the most promising tokens are matched with the reference set. In both cases, limitations may be introduced by the availability and coverage of the background knowledge (i.e., ontology or reference set).

From the applicability point of view, Pankow (Cimiano et al., 2004) is the most promising system. It uses a range of well-studied syntactic patterns to mark up candidate phrases in Web pages without having to manually produce an initial set of marked-up Web pages, and without depending on previous knowledge. Its context-driven version, C-Pankow (Cimiano et al., 2005), improves the first by reducing the number of queries to the search engine. However, the final association between text entities and a domain ontology is not addressed.

3. Background

This section discusses the basic characteristics of the working environment and presents the main existing techniques (lexical patterns and Web-scale statistics) on which the proposed method relies.

3.1. The Web as a tool to assist the learning process

When aiming to develop an unsupervised system dealing with textual data, one cannot rely on experts' opinions to validate the results (e.g. to evaluate which extracted terms should be mapped to the input ontology, or which is the best mapping for a given term). However, one can profit from predefined knowledge to assess the adequacy of the extracted conclusions. To be domain-independent, however, no domain-specific knowledge structures or corpora can be used. Hence, to support our proposed method we rely on the Web as a domain-independent corpus which can aid the process of learning entity features.

The Web has many interesting characteristics in comparison with other information sources. As stated in Brill (2003), many classical techniques for information extraction and knowledge acquisition present performance limitations because they typically rely on a limited corpus. According to Brill (2003), the Web is the biggest repository of information available. In fact, the amount and heterogeneity of information on the Web is so high that it can be assumed to approximate the real distribution of information in the world (Cilibrasi and Vitányi, 2006). Moreover, the high redundancy of information, derived from the huge amount of available resources, can represent a measure of its relevance, i.e., the more times a piece of information appears on the Web, the greater its relevance (Brill, 2003; Ciravegna et al., 2003; Etzioni et al., 2004; Rosso et al., 2005). In other words, we can give more confidence to a fact that is stated by a considerable amount of independent sources. Thanks to the aforementioned characteristics, the Web has demonstrated its validity as a corpus to assist learning processes (Etzioni et al., 2004; Sánchez et al., 2009; Velásquez et al., 2011).

In our work the Web is used as a general tool on which the feature extraction process relies to assess the relevancy of potential extractions found in an input document (Section 4.3), and to improve the recall of the feature annotation towards concepts included in the input ontology (Section 4.4). To this end, Web-scale information distribution is exploited to assist the learning process and to extract robust conclusions.

3.2. Learning techniques

In the following, the main techniques on which our work relies to extract potential features and assess their relevancy are presented. They are focused on linguistic patterns and Web-scale statistics.

3.2.1. Linguistic patterns

The problem of semantically annotating/classifying/matching features extracted from text to their formal semantics can be simplified to the problem of discovering the ontological concept of which the discovered feature is an instance (e.g., Barcelona is a city). In essence, instance–concept relationships define a taxonomical link which should be discovered.

There exist many approaches for detecting this kind of relations. As stated in Cimiano et al. (2004), three different learning paradigms can be exploited. First, some approaches rely on the document-based notion of term subsumption (Sanderson and

Croft, 1999). Secondly, some researchers claim that words or terms are semantically similar to the extent to which they share similar syntactic contexts (Bisson et al., 2000; Caraballo, 1999). Finally, several researchers have attempted to find taxonomic relations expressed in texts by matching certain patterns associated to the language in which documents are presented (Ahmad et al., 2003; Berland and Charniak, 1999).

Pattern-based approaches are heuristic methods using regular expressions that have been successfully applied in information extraction (Sánchez, 2010; Sánchez and Isern, 2011). The text is scanned for instances of distinguished lexical-syntactic patterns that indicate a relation of interest. This is especially useful for detecting specialisations of concepts that can represent is-a (taxonomic) relations (Hearst, 1992; Sánchez and Moreno, 2008b) or individual facts (Etzioni et al., 2005). Table 1 shows the patterns proposed by M. Hearst to detect taxonomical relationships (NP stands for a noun phrase).

Table 1
Hearst patterns.

Pattern                              | Example
Such NP as {NP,}* {and|or} NP        | Such countries as Poland
NP {,} such as {NP,}* {and|or} NP    | Cities such as Barcelona
NP {,} including {NP,}* {and|or} NP  | Capital cities including London
NP {,} specially {NP,}* {and|or} NP  | Science fiction films, specially Matrix
NP {,} (and|or) other NP             | The Sagrada Familia and other churches
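As an illustration, the following is a minimal sketch of a matcher for the "NP such as NP" pattern, assuming (as a simplification) that capitalised token sequences approximate the noun phrases an NP chunker would produce; the function names are illustrative and are not part of the original system:

    import re

    # Simplified Hearst pattern: "CONCEPT such as NE1, NE2 (and|or) NE3".
    # A real system would run an NP chunker first (see Section 4.5.1).
    SUCH_AS = re.compile(
        r"(?P<concept>\w+)\s+such\s+as\s+"
        r"(?P<instances>[A-Z]\w*(?:\s[A-Z]\w*)*"
        r"(?:\s*,\s*[A-Z]\w*(?:\s[A-Z]\w*)*)*"
        r"(?:\s*,?\s*(?:and|or)\s+[A-Z]\w*(?:\s[A-Z]\w*)*)?)"
    )

    def extract_hyponyms(sentence: str):
        """Return (concept, [instances]) pairs found by the 'such as' pattern."""
        pairs = []
        for m in SUCH_AS.finditer(sentence):
            instances = re.split(r"\s*,\s*|\s+(?:and|or)\s+", m.group("instances"))
            pairs.append((m.group("concept"), [i for i in instances if i]))
        return pairs

    print(extract_hyponyms("He visited cities such as Barcelona, Tarragona and Girona."))
    # [('cities', ['Barcelona', 'Tarragona', 'Girona'])]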
However, the quality of pattern-based extractions can be compromised by the problems of de-contextualisation and ellipsis. For example, de-contextualisations can easily be found in sentences like "There are several newspapers sited in big cities such as El Pais and El Mundo"; without a more exhaustive linguistic analysis we might erroneously extract El Pais and El Mundo as instances of "city" instead of "newspaper". Hence, a way to validate the correctness of the detected relationships is needed in order to avoid depending solely on individual observations (which can be affected by natural language ambiguity). Another limitation of pattern-based approaches is the fact that they usually present a relatively high precision but typically suffer from low recall, because the patterns are rare in reduced corpora (Cimiano et al., 2004). Fortunately, as stated in Section 4.5.2, this data sparseness problem can be tackled by exploiting the Web (Buitelaar et al., 2004; Velardi et al., 2005) as a corpus.

3.2.2. Web-scale statistics

As stated above, pattern-based analysis is hampered by the fact that conclusions are extracted from individual observations. To base the conclusions on wider evidence, the use of statistical measures about information distribution (term occurrence and co-occurrence, in our case) in a large enough corpus has helped in the past (Lin, 1998). To avoid data sparseness problems that may be derived from statistics computed from reduced corpora, some authors like Brill (2003) have demonstrated the convenience of using a corpus as big as the Web to improve the quality of classical statistical methods.

However, the analysis of such an enormous repository for computing statistics is, in most cases, impracticable. To tackle this problem, some researchers (Cilibrasi and Vitányi, 2006; Cimiano et al., 2004; Etzioni et al., 2005) have used Web search engines to obtain good quality and relevant statistics.

To assess the degree of relatedness between a pair of terms (in our case, for example, between an extracted feature and its potential conceptualization), term collocation measures such as the well-known Pointwise Mutual Information (PMI, Eq. (1)) (Church et al., 1991) can be used. PMI statistically assesses the relation between two terms (a, b) as the conditional probability of a and b co-occurring within the text:

    PMI(a, b) = log2 [ r(ab) / (r(a) r(b)) ]    (1)

where r(ab) is the probability that a and b co-occur. If a and b are statistically independent, then the probability that they co-occur is given by the product r(a)r(b). If they are not independent, and they have a tendency to co-occur, then r(ab) will be greater than r(a)r(b). Therefore the ratio between r(ab) and r(a)r(b) is a measure of the degree of statistical dependence between a and b.

To compute robust PMI values from the Web, term probabilities can be approximated by the chance of retrieving each term from the Web when asking a search engine. Concretely, Eq. (2) approximates the PMI score by assessing term occurrence and co-occurrence probabilities as the hit counts provided by a search engine (Turney, 2001):

    PMI_IR(a, b) = log2 [ (hits(a & b) / #totalWebs) / ((hits(a) / #totalWebs) · (hits(b) / #totalWebs)) ]    (2)
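A minimal sketch of this computation follows, assuming a hypothetical hits() function that wraps a search engine's hit-count API (the engine, the query syntax and the #totalWebs constant are all assumptions, not part of the paper):

    import math

    TOTAL_WEBS = 1e12  # assumed size of the indexed Web; any constant > 0 works

    def hits(query: str) -> int:
        """Hypothetical wrapper around a search engine hit-count API."""
        raise NotImplementedError

    def pmi_ir(a: str, b: str) -> float:
        """Web-adapted PMI (Eq. (2)): log2 of co-occurrence over independence."""
        p_ab = hits(f'"{a}" "{b}"') / TOTAL_WEBS
        p_a = hits(f'"{a}"') / TOTAL_WEBS
        p_b = hits(f'"{b}"') / TOTAL_WEBS
        if p_ab == 0 or p_a == 0 or p_b == 0:
            return float("-inf")  # no evidence of co-occurrence
        return math.log2(p_ab / (p_a * p_b))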
4. A novel ontology-based feature extraction method

This section provides a thorough description and formalization of the proposed methodology, whose aim is to discover those features modelled in an input ontology that can be found in a textual document describing an entity. The general algorithm is introduced in Section 4.1, and the following three sections detail its basic components. Sections 4.5 and 4.6 explain its adaptation to the analysis of plain text documents and semi-structured resources (Wikipedia articles). Section 4.7 analyses the temporal cost of the method according to the type of input.

4.1. General algorithm

Algorithm 1 provides a high-level pseudo-code description of the proposed algorithm. Table 2 contains the explanation of the basic elements of the algorithm.

Table 2
Main definitions used in the feature extraction algorithm.

Element                            | Definition
ae (analysed entity)               | Name of the real entity which is being analysed (e.g. Barcelona)
wd (web document)                  | Web document which contains the information about the ae
do (domain ontology)               | Domain ontology used in order to specify which kind of concepts will be taken into account during the feature extraction process
pd (parsed document)               | Web document parsed in a readable format for the system
PNE (Potential Named Entities)     | List of potential Named Entities (entities extracted from the text that have not been checked and filtered yet)
NE (Named Entities)                | List of selected PNEs, which are those with a relevancy score higher than the NE_THRESHOLD: NE = {x ∈ PNE | NE_SCORE(x, ae) > NE_THRESHOLD}
SC (Subsumer Concepts)             | Abstractions of collections of real entities which share common characteristics. Subsumer concepts and their abstractions are taxonomically related. For example, the subsumer concept of the real entities The Sagrada Familia and St. Peter's Basilica could be basilica
OC (Ontological Classes)           | List of all the classes (i.e., concepts) modelled in the input ontology
SOC (Subsumer Ontological Classes) | Subset of the ontological concepts that correspond to the subsumer concepts of a NE
ac (annotated class)               | Subsumer ontological concept which represents the final annotation of the analysed NE

Several of the proposed functions are abstract and can be overridden in order to adapt the analysis to different kinds of inputs (plain text documents or semi-structured ones). Moreover, since the feature extraction process is guided by the input ontology, by using different domain ontologies (e.g. in Medicine (Spackman, 2004) or Chemical engineering (Morbach et al., 2007)) the system is able to adapt the feature extraction process to the domain of interest. As inputs, the system receives the document to be analysed (e.g. a text describing a city) and the ontology detailing the features to be extracted (e.g. tourism activities or points of interest).

Algorithm 1. Ontology-based feature extraction method.

1  OntologyBasedExtraction(WebDocument wd, String ae, DomainOntology do) {
2    /* Document parsing */
3    pd := parse_document(wd)
4
5    /* Extraction and selection of Named Entities from the document */
6    PNE := extract_potential_NEs(pd)
7    ∀ pne_i ∈ PNE {
8      if NE_Score(pne_i, ae) > NE_THRESHOLD {
9        NE := NE ∪ pne_i
10     }
11   }
12   /* Retrieval of potential subsumer concepts for each NE */
13   ∀ ne_i ∈ NE {
14     SC := extract_subsumer_concepts(ne_i)
15     ne_i := add_subsumer_concepts_list(SC)
16   }
17
18   /* Annotation of NEs with ontological classes */
19   OC := extract_ontological_classes(do)
20   ∀ ne_i ∈ NE {
21     /* Retrieval of Subsumer Ontological Classes (i.e. potential
22        annotations) for each Subsumer Concept of each NE */
23     SC := get_subsumer_concepts_list(ne_i)
24     /* Application of direct matching */
25     SOC := extract_direct_matching(OC, SC)
26     /* If direct matching fails, semantic matching is applied */
27     if |SOC| == 0 {
28       SOC := extract_semantic_matching(OC, SC, ne_i, ae)
29     }
30     /* If a similar ontological class is found, the best
31        annotation is chosen and the annotation is performed */
32     if |SOC| > 0 {
33       SOC := SOC_Score(SOC, ne_i, ae)
34       ac := select_SOC_max_score(SOC, AC_THRESHOLD)
35       ne_i := add_annotation(ac)
36     }
37   }
38   return NE
39 }

In order to discover the relevant features of an entity, we focus on the extraction and selection of Named Entities (NEs) found in the text. A NE is a Noun Phrase which refers to a real world entity and can be considered as an instance of a conceptual abstraction (e.g. Barcelona is a NE and an instance of the conceptual abstraction city). Due to their concrete nature, it is assumed that NEs describe, in a less ambiguous way than general words, the relevant features of a particular entity (Abril et al., 2011).

To select which of the detected NEs are the most related to the analysed entity, a relevance-based analysis relying on Web co-occurrence statistics is performed. Afterwards, the selected NEs are matched to the ontological concepts of which they can be considered instances. In this manner the extracted features are presented in an annotated fashion, easing the posterior application of semantically-grounded data analyses. In the following subsections, the main algorithm steps are described in detail.

4.2. Document parsing

The first step is to parse the input Web document (line 3), which is supposed to describe a particular real world entity, from now on ae. The parse_document function depends on the kind of document that is being analysed. If it is an HTML document, then it is necessary to extract raw text from it by means of HTML parsers which are able to remove headers, templates, HTML tags, etc. Otherwise, if the document comes from a semi-structured source such as Wikipedia, then ad-hoc tools are used to filter and extract the main text.
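A sketch of the HTML case follows, using the third-party BeautifulSoup library as one possible parser (the paper does not specify which parser the authors used):

    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    def parse_document(html: str) -> str:
        """Strip tags, scripts and boilerplate elements, returning raw text."""
        soup = BeautifulSoup(html, "html.parser")
        # Remove non-content elements before extracting text.
        for tag in soup(["script", "style", "header", "footer", "nav"]):
            tag.decompose()
        return soup.get_text(separator=" ", strip=True)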
4.3. Named entity detection

This step consists in extracting relevant named entities from the analysed document. The function extract_potential_NEs (line 6) returns a set of potential Named Entities (PNE). The implementation of this function depends on the type of input, as will be described in Sections 4.5 and 4.6. However, as stated above, only a subset of the elements of PNE really describe the main features of ae; the rest of the elements of PNE could introduce noise because they may be unrelated to the analysed entity (they just happen to appear in the Web page describing the entity but are not part of its basic distinguishing characteristics). Thus, it is necessary to have a way of separating the relevant NEs from the irrelevant ones (NE filtering, line 8). To do that, a Web-based co-occurrence measure that tries to assess the degree of relationship between ae and each NE is used. Concretely, a version of the Pointwise Mutual Information (PMI, introduced in Section 3.2.2) relatedness measure adapted to the Web is computed (Church et al., 1991):

    NE_SCORE(pne_i, ae) = hits(pne_i & ae) / hits(pne_i)    (3)

In the NE_SCORE (Eq. (3)), concept probabilities are approximated by Web hit counts provided by a Web search engine, as proposed by Turney (2001). Since the intention of the algorithm is to rank a set of choices (PNEs) for a given domain (i.e. the ae), the hits(ae) term can be dropped because it has the same value for all choices. NEs that have a score exceeding an empirically determined threshold (NE_THRESHOLD, line 8) are considered relevant, whereas the rest are removed. The value of the threshold determines a compromise between the precision and the recall of the system, as will be shown in the evaluation results presented in Section 5.1.
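A minimal sketch of this filtering step follows, reusing the hypothetical hits() wrapper introduced with the PMI example above (the threshold value shown is purely illustrative):

    def ne_score(pne: str, ae: str) -> float:
        """Eq. (3): co-occurrence of the candidate with the analysed entity."""
        denom = hits(f'"{pne}"')
        return hits(f'"{pne}" "{ae}"') / denom if denom else 0.0

    def filter_named_entities(pnes, ae, threshold=0.1):  # threshold is illustrative
        """Keep only candidates whose NE_SCORE exceeds NE_THRESHOLD."""
        return [pne for pne in pnes if ne_score(pne, ae) > threshold]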
4.4. Semantic annotation

In this work, the aim of the semantic annotation step is to match the extractions found in a text describing an entity with the appropriate classes contained in the input ontology, which models the features of interest.

Some approaches have been proposed in this area. One way to assess the relationship between two terms (which, in our case, would be a NE and an ontology class) is to use a general thesaurus like WordNet to compute a similarity measure based on the number of semantic links among them. However, those measures are hampered by WordNet's limited coverage of NEs and, in consequence, it is usually not possible to compute the similarity between a NE and an ontological class in this way. There are approaches that try to automatically discover taxonomic relationships (Bisson et al., 2000; Sanderson and Croft, 1999), but they require a considerable amount of background documents and linguistic parsing. Finally, another possibility is to compute the co-occurrence between each NE and each ontological class using Web-scale statistics (Turney, 2001), but this solution is not scalable because of the huge amount of required queries (Cimiano et al., 2005).

The method proposed in this paper employs this last technique, but introduces a previous step that reduces the number of queries to be performed. To do so, the semantic annotation step is divided in two parts: the discovery of potential subsumer concepts (line 14) and their matching with the ontology classes (lines 19–38).

4.4.1. Discovering potential subsumer concepts

This first stage is proposed in order to minimise the number of queries (NE, ontology class) to be performed by the final statistical assessor. It may be noticed that the problem of semantic annotation is to find a bridge between the instance level (i.e., a NE) and the conceptual level (i.e., an ontology concept of which the NE is an instance). As stated in Section 3.2.1, NEs and concepts are semantically related by means of taxonomic relationships. Thus, the way to go from the instance level to the conceptual level is by discovering taxonomic abstractions, which are represented by subsumer concepts. So, in this stage, the aim is to automatically discover possible subsumer concepts for each NE.

This is done by means of the function extract_subsumer_concepts, which depends on the kind of input document (see Sections 4.5 and 4.6 for specific details). As a result, we obtain a set of subsumer concepts for each NE (e.g. the subsumer concepts of Porsche could be car, automobile, auto, motorcar and machine). Notice that those concepts are abstractions of the NE and do not depend on any ontology. This means that subsumer concepts do not necessarily match with ontological classes. However, at this point the algorithm is working at a conceptual level (rather than at the instance level), and thus it is possible to match NEs with ontological classes more easily and efficiently in the following step, as detailed in the next section.

4.4.2. Matching subsumers to ontological classes

This stage tries to discover matches between the subsumer concepts of a NE and the ontological classes. To do so, two different cases are considered: Direct Matching and Semantic Matching.

4.4.2.1. Direct matching. First, the system tries to find a direct match between the subsumers of a NE and the ontology classes. This phase begins with the extraction of all the classes contained in the domain ontology (line 19). Then, for each Named Entity ne_i, all its potential subsumer concepts (sc_i ∈ SC) are compared against each ontology class in order to discover lexically similar ontological classes (soc_i ∈ SOC, lines 23–25), i.e., classes whose name matches the subsumer itself or a subset of it (e.g., if one of the potential subsumers is Gothic cathedral, it would match an ontology class called Cathedral).

A stemming algorithm (Porter, 1997) is applied to both the sc_i and the ontology classes in order to discover terms that have the same root (e.g., city and cities). If one (or several) ontology classes match with the potential subsumers, they are included in SOC as candidates for the final annotation of ne_i. This direct matching step is quite easy and computationally efficient; however, its main problem is that, in many cases, subsumers do not appear as ontology classes with exactly the same name. As a result, potentially good candidates for annotation are not discovered.
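A minimal sketch of this stem-based comparison follows, using NLTK's PorterStemmer as one available implementation of Porter (1997); the matching rule shown is a simplification of the "subsumer or subset of it" criterion described above:

    from nltk.stem import PorterStemmer  # pip install nltk

    stemmer = PorterStemmer()

    def direct_matching(ontology_classes, subsumer_concepts):
        """Return ontology classes whose stemmed name appears among the
        stemmed words of some subsumer concept (e.g. 'Gothic cathedral'
        matches the class 'Cathedral'; 'cities' matches 'city')."""
        soc = []
        for oc in ontology_classes:
            oc_stem = stemmer.stem(oc.lower())
            for sc in subsumer_concepts:
                sc_stems = {stemmer.stem(w) for w in sc.lower().split()}
                if oc_stem in sc_stems:
                    soc.append(oc)
                    break
        return soc

    print(direct_matching(["Cathedral", "Museum"], ["Gothic cathedral", "cities"]))
    # ['Cathedral']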
4.4.2.2. Semantic matching. This step faces the non-trivial matching between semantically similar subsumers and ontological classes that, because they are represented with different textual labels, could not be discovered in the previous stage. Details on how this task is performed are given in Algorithm 2, which specifies what is done in line 28 of Algorithm 1.

Algorithm 2. Semantic matching method (line 28 of Algorithm 1).

1  extract_semantic_matching(OC, SC, ne, ae) {
2    ∀ sc_i ∈ SC {
3      SYNSETS := get_wordnet_synsets(sc_i)
4      if |SYNSETS| == 0 {return}
5      else if |SYNSETS| == 1 {
6        RELATED_TERMS := get_related_terms(synset_0)
7      }
8      else { /* when the sc has more than one synset,
               disambiguation is needed */
9        CONTEXT := get_context(ne)
10       CONTEXT := CONTEXT + get_web_snippets(ne, ae)
11       ∀ context_j ∈ CONTEXT {
12         /* the similarity between the context
13            and each synset is calculated */
14         ∀ synset_k ∈ SYNSETS {
15           similarity := get_cosine_distance(context_j, synset_k)
16           update_average_synset(synset_k, similarity)
17         }
18       }
19       disambiguated_synset := get_synset_max_average(SYNSETS)
20       RELATED_TERMS := get_related_terms(disambiguated_synset)
21     }
22   }
23   return extract_direct_matching(OC, RELATED_TERMS)
24 }

The semantic matching step is performed when the direct matching has not produced any result. Its main goal is to increase the number of elements in SOC, so that the direct matching can be re-applied with a wider set of terms. The new potential subsumers are concepts semantically related to any of the initial subsumers (synonyms, hypernyms and hyponyms). As the algorithm works at a conceptual level, WordNet (Fellbaum, 1998) has been used to obtain these related terms and to increase the SOC set.

The main problem of semantic matching is that a word may be polysemous and, before extracting the related concepts from WordNet, it is necessary to discover to which sense it corresponds (i.e., a semantic disambiguation step must be performed to choose the correct synset). To do so, one possible solution is to use the context (i.e., the sentence from which ne_i was extracted, line 9 of Algorithm 2) but, usually, that is not enough to disambiguate the meaning. To minimise this problem, the Web is used to assist the disambiguation.

We propose a Web-based approach that combines the contexts of Named Entities, WordNet definitions and the cosine distance. The Web is used in this case to find new evidences of the relationship between the ne and the ae, in order to increase the contexts (where those terms occur together) that enable a better disambiguation of the meaning of polysemous words. First, the algorithm retrieves the contexts in which the NE appears (line 10 of Algorithm 2), framed in the scope of the original ae (by querying a search engine for common appearances of ae and ne). Then, the set of Web snippets returned by the search engine is analysed. Web snippets (see Fig. 1) are used instead of complete documents because they precisely
contain, in a concise manner, the context of the occurrence of the web query. Several snippets (e.g. 10) can be obtained with a single Web query, so their analysis is much more efficient than that of complete Web resources.

Fig. 1. Snippet of a website obtained by Google for the Tarragona NE. [figure not reproduced]

Then, the system calculates the cosine distance between each snippet and every WordNet synset of the element of SC (lines 11 to 18 of Algorithm 2). The aim is to assess which of the senses of each subsumer concept is the most similar to the context in which the NE appears. The synset with the highest average value is finally selected as the appropriate conceptualization of the subsumer concept (line 19 of Algorithm 2), and the direct matching is repeated with all the new SCs.

Let us consider an example to illustrate this method. Table 3 depicts the input data of the problem. The first three rows are the analysed entity, the named entity and a subsumer concept. The last three rows show the query performed in order to retrieve Web snippets and the two WordNet synsets of the subsumer concept. Table 4 shows a subset of the retrieved snippets, which represent contexts of the NE, and the cosine distance between the context and each of the subsumer's synsets. Synset 1 obtains the highest score, so it will be selected as the most appropriate conceptualization of the SC and, hence, its synonyms, hyponyms and hypernyms will be retrieved in the semantic matching step. In this example, minster, church and church building are used as terms to expand the direct matching between the SC cathedral and the ontology classes.

Table 3
Semantic disambiguation example (part I).

Data     | Value
ae:      | Barcelona
ne_i:    | Sagrada Familia
sc_i:    | Cathedral
Query:   | "Barcelona" + "Sagrada Familia"
Synset 1: | [cathedral] any large and important church
Synset 2: | [cathedral, duomo] the principal Christian church building of a bishop's diocese

Table 4
Semantic disambiguation example (part II).

Snippet/context | Synset 1 | Synset 2
His best known work is the immense but still unfinished church of the Sagrada Família, which has been under construction since 1882, and is still financed by private donations. | 0.16 | 0.11
Review of Barcelona's greatest building the Sagrada Familia by Antonio Gaudi, Photos, and links | 0.0 | 0.12
The Sagrada Familia is the most famous church in Barcelona... As a church, the Sagrada Familia should not only be seen in the artistic point of view | 0.26 | 0.18
The Sagrada Familia (Holy Family) is a church in Barcelona, Spain.... The architect who designed the Sagrada Familia is Antoni Gaudí, the designer of more other... | 0.12 | 0.08
Virtual Tour of Barcelona's sightseeings.... commonly known as the Sagrada Família, is a large Roman Catholic church in Barcelona, Catalonia, Spain,... | 0.28 | 0.10
[...] | [...] | [...]
Final average | 0.14 | 0.12
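The following is a minimal sketch of the per-synset scoring in lines 11–18 of Algorithm 2, using a bag-of-words cosine similarity between each snippet and each synset gloss; this is one plausible reading of get_cosine_distance, as the paper does not give its exact implementation:

    from collections import Counter
    from math import sqrt

    def cosine(a: str, b: str) -> float:
        """Bag-of-words cosine similarity between two texts."""
        va, vb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(va[w] * vb[w] for w in va)
        norm = (sqrt(sum(v * v for v in va.values()))
                * sqrt(sum(v * v for v in vb.values())))
        return dot / norm if norm else 0.0

    def disambiguate(snippets, synset_glosses):
        """Pick the gloss whose average similarity to the snippets is highest."""
        averages = {
            gloss: sum(cosine(s, gloss) for s in snippets) / len(snippets)
            for gloss in synset_glosses
        }
        return max(averages, key=averages.get)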
the Web) to support the annotation in the context of the domain
defined by the input ontology.
4.4.3. Class selection
After applying the matching process between subsumers and 4.5. Extraction of features from raw texts
ontological classes, a final assessor decides if the selected match-
ing is adequate enough (if only one matching was found) or which So far, the generic feature extraction algorithm has been
of the matchings is the most appropriate, if several ones were presented. This section discusses how the functions extract_po-
obtained (line 32 of Algorithm 1). In this case, a Web-based tential_NEs (line 6 of Algorithm 1) and extract_subsumer_concepts
statistical measure is applied again in order to choose the most (line 14 of Algorithm 1) have been implemented in order to apply
representative one (Class Selection, lines 33–34 of Algorithm 1). it to raw text.
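A sketch of this selection step follows, again assuming the hypothetical hits() wrapper from Section 3.2.2 (the AC_THRESHOLD value is illustrative):

    def soc_score(soc: str, ne: str, ae: str) -> float:
        """Eq. (4): co-occurrence of NE and candidate class, contextualised by ae."""
        denom = hits(f'"{ae}" "{soc}"')
        return hits(f'"{ae}" "{ne}" "{soc}"') / denom if denom else 0.0

    def select_annotation(soc_candidates, ne, ae, ac_threshold=0.05):
        """Return the best-scoring candidate above AC_THRESHOLD, or None."""
        scored = [(soc_score(c, ne, ae), c) for c in soc_candidates]
        best_score, best = max(scored) if scored else (0.0, None)
        return best if best_score > ac_threshold else None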
4.5. Extraction of features from raw texts

So far, the generic feature extraction algorithm has been presented. This section discusses how the functions extract_potential_NEs (line 6 of Algorithm 1) and extract_subsumer_concepts (line 14 of Algorithm 1) have been implemented in order to apply it to raw text.

4.5.1. Named entities detection

The main problem related with the detection of NEs in raw text is the fact that they are unstructured and unlimited by nature (Sánchez et al., 2010). This implies that, in most cases, NEs are not contained in classical repositories like WordNet, due to their potentially unbounded number and dynamic character. Different approaches have been proposed in the field of NE detection. Roughly, they can be divided into supervised and unsupervised methods.

Supervised approaches rely on a specific set of extraction rules learned from pre-tagged examples (Fleischman and Hovy, 2002;
Stevenson and Gaizauskas, 2000), or on predefined knowledge bases such as lexicons and gazetteers (Mikheev and Finch, 1997). However, the effort required to assemble large tagged sets or lexicons binds the NE recognition to either a limited domain (e.g., medical imaging) or a small set of predefined, broad categories of interest (e.g., persons, countries, organisations, products), hampering the recall (Pasca, 2004).

In unsupervised approaches like Lamparter et al. (2004), it has been proposed to use a thesaurus as background knowledge (i.e., if a word does not appear in a dictionary, it is considered as a NE). Despite the fact that this approach is not limited by the coverage of the thesaurus, misspelled words are wrongly considered as NEs, whereas correct NEs composed by a set of common words are rejected, providing inaccurate results.

Other approaches take into consideration the way in which NEs are presented in a specific language. Concretely, languages such as English distinguish proper names from other nouns through capitalisation. The main problem is that basing the detection of NEs on individual observations may produce inaccurate results if no additional analyses are applied. For example, a noun phrase may be arbitrarily capitalised to stress its importance or due to its placement within the text. However, this simple idea, combined with linguistic pattern analysis, as it has been applied by several authors (Cimiano et al., 2004; Downey et al., 2007; Pasca, 2004), provides good results without depending on manually annotated examples or specific categories.

We have defined the following unsupervised, domain-independent method to detect NEs when dealing with raw text (i.e., to implement extract_potential_NEs, line 6). First, a natural language processing parser is used. Concretely, the four modules of the OPENNLP¹ parser (Sentence Detector, Tokenizer, Tagging and Chunking) are applied in order to syntactically analyse the input text of the Web document. Then, all the detected Noun Phrases (NP) which contain one or more words beginning with a capital letter are considered as Potential Named Entities (PNE). Table 5 shows an example of the PNEs extracted from the first fragment of text of the Wikipedia article about Tarragona.²

Table 5
Set of extracted NEs from the Tarragona Wikipedia introduction.

Detected sentences          | Extracted PNE | Correct?
[NP Tarragona/EX]           | Tarragona     | Yes
[NP Catalonia/NN]           | Catalonia     | Yes
[NP Spain/NNP]              | Spain         | Yes
[NP Sea/NNP]                | Sea           | No
[NP Tarragonès/VBZ]         | Tarragonès    | Yes
[NP the/VBZ Vegueria/NNPS]  | The Vegueria  | No

¹ http://incubator.apache.org/opennlp/ Last access: July 24th, 2012.
² http://en.wikipedia.org/wiki/Tarragona. Last access: November 9th, 2011.
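A sketch of this selection rule follows, here using NLTK's off-the-shelf tokenizer, POS tagger and regular-expression chunker instead of OpenNLP, purely for illustration (the chunk grammar is a simplification):

    import nltk  # assumes punkt and averaged_perceptron_tagger data are installed

    def extract_potential_nes(text: str):
        """Return noun phrases containing at least one capitalised word."""
        grammar = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")  # simple NP chunks
        pnes = []
        for sentence in nltk.sent_tokenize(text):
            tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
            for subtree in grammar.parse(tagged).subtrees(lambda t: t.label() == "NP"):
                words = [w for w, _ in subtree.leaves()]
                if any(w[0].isupper() for w in words):
                    pnes.append(" ".join(words))
        return pnes

    print(extract_potential_nes("Tarragona is a city in Catalonia."))
    # e.g. ['Tarragona', 'Catalonia']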
4.5.2. Discovering potential subsumer concepts

To extract subsumer concepts of NEs from raw text, it is necessary to look for linguistic evidences stating their implicit taxonomical relationship. As discussed in Section 3.2.1, we use Hearst's taxonomic linguistic patterns, which have proved their effectiveness to retrieve hyponym/hypernym relationships (Hearst, 1992). The Web is used to assist this learning process, looking for pattern matches that provide the required linguistic evidences.

To do so, the system constructs a Web query for each NE and each pattern. Each query is sent to a Web search engine, which returns as a result a set of Web snippets. Finally, all these snippets are analysed in order to extract a list of potential subsumer concepts (i.e., expressions that denote concepts of which the NE may be considered an instance).

Table 6 summarises the linguistic patterns that have been used (CONCEPT represents the retrieved potential subsumer concept and NE the Named Entity that is being studied).

Table 6
Patterns used to retrieve potential subsumer concepts.

Pattern structure      | Query                             | Example
CONCEPT such as NE     | "such as Barcelona"               | Cities such as Barcelona
Such CONCEPT as NE     | "such * as Spain"                 | Such countries as Spain
NE and other CONCEPT   | "Ebre and other"                  | Ebre and other rivers
NE or other CONCEPT    | "The Sagrada Familia or other"    | The Sagrada Familia or other monuments
CONCEPT especially NE  | "especially Tarragona"            | World Heritage Sites especially Tarragona
CONCEPT including NE   | "including London"                | Capital cities including London
4.6. Extracting features from semi-structured documents

In this work, Wikipedia has been used to show the applicability of the proposed method when applied to semi-structured resources. Wikipedia has some particularities which can ease the information extraction when compared with raw text. This work focuses on the exploitation of internal links and category links. The first ones represent connections between terms that appear in a Wikipedia article and other articles that describe them. This is an indication of the fact that the linked term represents a distinguished entity by itself, easing the detection of NEs. On the other hand, category links group different articles (corresponding to entities) in areas that are related in some way and give articles a kind of categorisation. Wikipedia's category system can be thought of as consisting of overlapping trees. Hence, Wikipedia categories can be considered as rough taxonomical abstractions of entities and can be exploited to aid the annotation process.

This section details how the functions extract_potential_NEs (line 6) and extract_subsumer_concepts (line 14) have been adapted to take advantage of these semi-structured data.

4.6.1. Named entities detection

In order to take profit of the manually created Wikipedia link structure, those words tagged with internal links have been considered as potential named entities (PNE, Table 7). In this manner, the analysis is reduced to only those terms that are likely to correspond to distinguished entities (i.e., those that have their own Wikipedia article).

Table 7
Subset of extracted NEs from the Barcelona Wikipedia article.

Wikilinks                           | Correct?
Acre                                | No
Antoni Gaudí                        | Yes
Arc de Triomf                       | Yes
Archeology Museum of Catalonia      | Yes
Barcelona Cathedral                 | Yes
Barcelona Museum of Contemporary Art | Yes
Barcelona Pavilion                  | Yes
Casa Batlló                         | Yes

The problem of the PNEs extracted from the internal links is that, on the one hand, not all of them are directly related with the analysed entity (ae) and, on the other hand, only a subset of PNE are real NEs.

In order to illustrate these problems, the following fragment of text, extracted from the Wikipedia article corresponding to Barcelona, is examined: "Barcelona is the capital and the most populous city of Catalonia and the second largest city in Spain, after Madrid, with a population of 1,621,537 within its administrative limits on a land area of 101.4 km²". In this text, there are four terms internally linked with other Wikipedia articles. Three of them are NEs (Catalonia, Spain and Madrid) and they represent instances of cities/regions/countries, whereas the other one is a common noun which represents a concept (capital). Moreover, Madrid brings information of general purpose that is not directly related with Barcelona and, in consequence, it is not a relevant feature for describing the entity Barcelona.

Due to these potential problems, the set of extracted PNEs (Table 7) has to be filtered by means of the NE score presented in the general algorithm (see Section 4.3), as is done with the PNEs obtained from the analysis of raw text. In this case, however, the algorithm starts with a pre-processed set of NEs whose reliability is supported by the community of Wikipedia contributors.
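A sketch of PNE extraction from raw wiki markup follows, where internal links have the form [[target|anchor]] (a simplification; a production system would use a full wiki parser):

    import re

    WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:\|[^\]]*)?\]\]")

    def extract_wikilink_pnes(wikitext: str):
        """Return the link targets of a Wikipedia article as PNE candidates."""
        return [m.group(1).strip() for m in WIKILINK.finditer(wikitext)]

    sample = "[[Barcelona]] is the [[capital city|capital]] of [[Catalonia]]."
    print(extract_wikilink_pnes(sample))
    # ['Barcelona', 'capital city', 'Catalonia']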

Table 7
Subset of extracted NEs from the Barcelona Wikipedia article.

Wikilinks                            | Correct?
------------------------------------ | --------
Acre                                 | No
Antoni Gaudí                         | Yes
Arc de Triomf                        | Yes
Archeology Museum of Catalonia       | Yes
Barcelona Cathedral                  | Yes
Barcelona Museum of Contemporary art | Yes
Barcelona Pavilion                   | Yes
Casa Batlló                          | Yes

A problem of these categories, however, is the fact that they are, in many cases, too complex or concrete to be modelled in an ontology. For example, The Sagrada Familia article is categorised as Antoni Gaudí buildings, Buildings and structures under construction, Churches in Barcelona, Visitor attractions in Barcelona, World Heritage Sites in Spain, Basilica churches in Spain, etc. These categories are too complex because they involve several concepts or even mix concepts with other NEs. To tackle this problem, we syntactically analyse each category to detect the basic concepts to which it refers. For example, in "Churches in Barcelona" the key concept is "Churches", and in "Buildings and structures under construction" there are two important concepts: "Buildings" and "Structures". To extract the main concepts of each category label, a natural language parser has been used and all the Noun Phrases have been extracted.
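As an illustration of this decomposition, the sketch below parses category labels and keeps the head lemma of each noun phrase; spaCy is used here merely as a stand-in for the (unnamed) parser employed in this work, and the printed outputs are indicative:

```python
import spacy

# Requires the small English pipeline:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def category_concepts(category):
    """Reduce a complex Wikipedia category label to basic concepts by
    keeping the head lemma of each noun phrase found by the parser."""
    return [chunk.root.lemma_ for chunk in nlp(category).noun_chunks]

print(category_concepts("Churches in Barcelona"))
# e.g. ['church', 'Barcelona'] -> 'church' is the useful concept
print(category_concepts("Buildings and structures under construction"))
# e.g. ['building', 'structure', 'construction']
```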
Another limitation of the Wikipedia categories is the fact that they do not always contain enough concepts to perform the matching between them and ontological classes, or those concepts are too concrete to be included in an ontology (and, hence, to enable a direct matching). Fortunately, as mentioned before, Wikipedia categories are included in higher categories. So, we also consider two upper-level categories to improve the recall of the ontology matching. As the Wikipedia categorisation does not define a strict taxonomy, it is not advisable to climb recursively in the category graph because the initial meaning could be lost.

4.7. Computational cost

The runtime of the proposed method mainly depends on the number of queries, since the Web search engine response time is several orders of magnitude higher (around 500 milliseconds) than any other offline analysis performed by our method (e.g. natural language processing, pattern matching, ontological processing, etc., which take a few milliseconds). There are five different steps in which queries are performed: NE detection, NE filtering, subsumer concept extraction, semantic disambiguation and class selection. Since the response time of the Web search engine does not depend on the type and complexity of the query (but on the delay introduced by the Internet connection), the runtime for all query types is expected to be equivalent.

Both plain text and semi-structured text analyses have the same cost for NE filtering, semantic disambiguation and class selection. To rank NEs in the relevance filtering step, two queries are required by the NE_Score function (4) to evaluate each NE. Hence, a total of 2n queries are performed, where n represents the number of NEs. Class selection requires computing as many SOC_Scores (5) as candidates; hence, if c is the total number of class candidates, this results in 2c queries. In the semantic disambiguation stage, only one query is needed for each candidate to obtain the snippets relating each candidate with the analysed entity (i.e. c queries are performed in this step).

Hence, the difference in computational cost between plain text analyses and semi-structured ones lies in the steps of NE detection and extraction of subsumer concepts. Although NE detection is different when analysing plain texts and semi-structured resources, neither of them needs to perform queries, and its cost is considered constant for both approaches. Regarding the extraction of subsumer concepts, in the first approach six queries per NE are performed to discover subsumer concepts by means of Hearst Patterns (6n), whereas in the second approach no queries are needed because SCs are directly extracted from the categories of the tagged entities.

In summary, considering the above expressions, the number of queries needed to analyse plain text is 2n + 6n + 2c + 1c = 8n + 3c, whereas only 2n + 2c + 1c = 2n + 3c are needed when dealing with Wikipedia articles. In both cases the proposed method scales linearly with respect to the number of NEs, but the much lower coefficient for the Wikipedia articles shows how the exploitation of their structure helps to improve the performance of the method.
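These two expressions can be checked with a toy calculation; the workload figures below are arbitrary assumptions:

```python
def queries_plain_text(n, c):
    # 2n (NE filtering) + 6n (Hearst-pattern subsumer extraction)
    # + 2c (class selection) + 1c (semantic disambiguation)
    return 2 * n + 6 * n + 2 * c + 1 * c   # = 8n + 3c

def queries_wiki_tagged(n, c):
    # NE detection and subsumer extraction need no queries here
    return 2 * n + 2 * c + 1 * c           # = 2n + 3c

# Assumed workload: n = 200 NEs, c = 50 class candidates
print(queries_plain_text(200, 50))   # 1750
print(queries_wiki_tagged(200, 50))  # 550
```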
5. Evaluation

This section presents the evaluation of the proposed method, which has been conducted in two directions. In the first part, given a specific evaluation scenario, a detailed picture is presented of the influence of the different parameters involved in the learning process: thresholds, input ontology and document types. In the second part, in order to give a general view of the expected results, a battery of tests for different scenarios (i.e. different input documents and ontological domains) is presented. In all cases, the results have been evaluated according to their precision and recall against ideal results provided by a human expert. To do so, the expert has been requested, for each input document representing an analysed entity (ae) and ontology, to manually select which features found in the document directly or indirectly refer to concepts modelled in the ontology. As a result, a list of ideal features corresponding to concepts found in the ontology for each ae is obtained.

Then, recall is calculated by dividing the number of correct features discovered by our system (i.e. those also found in the ideal set of features) by the number of ideal features stated by the human expert:

RECALL = #Correct_features / #Ideal_features        (6)

Precision is computed as the number of correct features (as above) divided by the total number of features retrieved by our system:

PRECISION = #Correct_features / #Retrieved_features        (7)
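Under these definitions, the evaluation reduces to a simple set comparison; the following sketch mirrors Eqs. (6) and (7) on hypothetical feature sets:

```python
def precision_recall(retrieved, ideal):
    """Eqs. (6)-(7): compare the system's features against the
    expert's ideal feature set for one analysed entity."""
    correct = set(retrieved) & set(ideal)
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(ideal) if ideal else 0.0
    return precision, recall

# Hypothetical feature sets for one analysed entity
ideal = {"church", "museum", "beach", "festival"}
retrieved = {"church", "museum", "architect"}
print(precision_recall(retrieved, ideal))
# (0.666..., 0.5): 2 of 3 retrieved correct; 2 of 4 ideal recovered
```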
Table 8
Subset of extracted potential subsumer concepts for Barcelona NEs.

Articles                             | Potential subsumer concepts
------------------------------------ | ---------------------------
Antoni Gaudí                         | 1852 births, 1926 deaths, architects, roman catholic churches, art nouveau architects, catalan architects, spanish ecclesiastical architects, modernisme architects, 19th century architects, 20th century architects, organic architecture, people, reus…
Arc de Triomf                        | Triumphal arches, gates, moorish revival architecture, 1888 architecture, public art stubs, 1888 works, architecture, architecture, public art, public art, art stubs…
Archeology Museum of Catalonia       | Museums, archaeology museums, Sants-Montjuïc
Barcelona Cathedral                  | Cathedrals, churches, visitor attractions, basilica churches…
Barcelona Museum of Contemporary art | Museums, art museums, galleries, modern art museums, modernist architecture, spots, richard meier buildings, el raval, modern art…
Casa Batlló                          | Visionary environments, antoni gaudí buildings, 1907 architecture, world heritage sites, spain, visitor attractions, eixample, passeig de gràcia, outsider art, 1907 works, 1900s architecture, edwardian architecture…

5.1. Influence of input parameters

In this first battery of tests, we focused on documents describing cities (Barcelona and Canterbury) and on ontologies related to touristic and geo-spatial features. First, the influence of the algorithm thresholds on the results has been studied. They make it possible to configure the system behaviour so that a high precision, a high recall or a balance between both of them can be achieved. Then, different types of input documents have been considered. Specifically, our method has been applied to Wikipedia articles, but analysing them as plain text and as Wiki-tagged documents. Finally, different input ontologies covering the same domain of knowledge have been used to evaluate the influence of the ontology design and coverage on the results. Two ontologies have been considered: a manually built Tourism ontology³, which models concepts related to different kinds of touristic points of interest typically found in Wikipedia articles, and a general ontology modelling geo-spatial concepts (Space⁴) retrieved from the SWOOGLE⁵ search engine. A summary of their structure is shown in Table 9.

³ http://deim.urv.cat/~itaka/CMS2/images/ontologies/tourismowl.owl.
⁴ http://deim.urv.cat/~itaka/CMS2/images/ontologies/space.owl.
⁵ http://swoogle.umbc.edu/. Last access: July 24th, 2012.

Table 9
Description of used ontologies.

Ontology | Taxonomical depth | #classes | Root classes
-------- | ----------------- | -------- | ------------
Tourism  | 5 levels          | 315      | Administrative divisions, buildings, festivals, landmarks, museums, sports
Space    | 6 levels          | 188      | Geographical features, geopolitical entities, places
5.1.1. Learning thresholds

In this section, the influence of the thresholds used to filter named entities (NE_THRESHOLD) and to select annotations (AC_THRESHOLD) is studied. The Wikipedia article of Barcelona, taken as plain text and as a Wiki-tagged document, together with the general Space ontology, has been used as input. Fig. 2 shows precision and recall figures when setting one of the two thresholds to 0 and varying the other one from 0 to 1 for the different types of input documents.

Results show that the AC_THRESHOLD has a larger influence on precision/recall. Notice that NE_THRESHOLD is related to the NE_Score, which is calculated by measuring the level of relatedness between the potential named entity and the analysed entity, whereas AC_THRESHOLD goes further by measuring the relatedness between the analysed entity, the potential named entity and the subsumer candidate to be annotated (SOC_Score). This fact implies that the second threshold is more restrictive because the relatedness involves three elements instead of two. Moreover, since the AC_THRESHOLD is considered after the NE_THRESHOLD, its purpose is twofold: (1) it measures the relatedness between the named entity and its subsumer candidate, facilitating the final annotation, and (2) it contextualises the ontology annotation in the domain of the analysed entity, so that it may filter unwanted named entities. This latter aspect is similar to the purpose of the NE_THRESHOLD, which is to select only those NEs that are related to the analysed entity. Since all NEs selected according to the NE_THRESHOLD have to pass a second, more restrictive threshold, this enables the system to drop irrelevant entities and those that are not related to concepts modelled in the input ontology. However, NE_THRESHOLD is useful to drop some NE candidates before they are analysed in later stages and, hence, to lower the number of required Web queries associated with the evaluation of their subsumer concept candidates.

From a general perspective, we observe an inverse behaviour between precision and recall for both thresholds. Since the final goal of our method is to enable the application of data analysis methods (such as clustering), a high precision may be desirable, even at the cost of a reduced recall. In this case, more restrictive thresholds can be used. This aspect shows an important difference with related works in which no statistical assessor is used to filter extractions (Cimiano et al., 2005). Since no parameter tuning is possible, the precision of their final results may be limited by the fixed criterion for NE selection. Notice also that, when analysing the document as plain text, a total of 202 named entities were detected, of which only 146 were already annotated by Wikipedia (i.e. 72%). So, in comparison with methods like the one proposed by Zavitsanos et al. (2010), which limit the extraction to what is already annotated in input knowledge bases (i.e. WordNet, Wikipedia, etc.), our method is potentially able to improve the extraction and annotation recall.
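For illustration, this two-stage filtering can be sketched as follows; the threshold values and the toy scoring functions merely stand in for the Web-statistics-based NE_Score and SOC_Score measures:

```python
# Illustrative thresholds; Fig. 2 explores the 0..1 range for both.
NE_THRESHOLD = 0.3   # applied to NE_Score (4)
AC_THRESHOLD = 0.5   # applied to SOC_Score (5)

def select_annotations(candidates, ne_score, soc_score):
    """Two-stage filtering: a NE must first pass NE_THRESHOLD; each of
    its subsumer candidates must then pass the stricter AC_THRESHOLD."""
    kept = []
    for ne, subsumers in candidates.items():
        if ne_score(ne) < NE_THRESHOLD:
            continue                      # NE unrelated to the entity
        for sc in subsumers:
            if soc_score(ne, sc) >= AC_THRESHOLD:
                kept.append((ne, sc))     # annotation accepted
    return kept

# Toy scores standing in for the Web-statistics-based measures
scores = {"Casa Batlló": 0.8, "Acre": 0.1}
demo = {"Casa Batlló": ["building", "visitor attraction"],
        "Acre": ["city"]}
print(select_annotations(demo, scores.get, lambda ne, sc: 0.6))
# [('Casa Batlló', 'building'), ('Casa Batlló', 'visitor attraction')]
```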
Fig. 2. Influence of NE_THRESHOLD and AC_THRESHOLD.

5.1.2. Plain text vs. Wiki-tagged document

In this second test, we picked as case studies the Wikipedia articles that describe the cities of Barcelona and Canterbury, analysing them as plain text and as wiki-tagged documents. In all cases, the Space ontology was used. Again, different threshold values were set, as shown in Fig. 3.

The figures show that precision tends to remain stable or to improve when taking into account the additional information provided by Wiki-tagged text, that is, the tagged entities and their associated Wiki-categories. On the contrary, recall is, in many situations, higher when analysing the input document as plain text. This is because, when using plain text, the whole textual content is analysed; hence, there are more possibilities to extract named entities that could correspond to representative features than when relying solely on Wiki-tagged entities. On the other hand, since Wiki entities are tagged by humans, the precision of their resulting annotations is expected to be higher than when performing the automatic NE detection proposed by our method. Moreover, for the Wiki-tagged text, humanly edited categories are used to assist the annotation, which may also contribute to provide more accurate results. In any case, results are close enough to consider the plain text analysis (i.e. both the NE detection and the subsumer candidate learning) almost as reliable as human annotations, at least for feature discovery purposes. It is important to note, however, that the analysis of Wiki-tagged text, as discussed in Section 4.7, is considerably faster than its plain text counterpart, since the number of analyses and queries required to extract and annotate entities is greatly reduced.

Fig. 3. Plain text vs. Wikipedia documents.

5.1.3. Input ontologies

The third test compares the results obtained using different ontologies to assist the feature extraction process of a given entity. In this case, the plain-text and wiki-tagged versions of the Barcelona article have been analysed using the input ontologies Space and Tourism. Again, results for different threshold values are shown in Fig. 4.

Results obtained for the plain-text version of the Barcelona article show a noticeably higher precision when using the Tourism ontology instead of the Space one. Since the former ontology was constructed according to concepts typically appearing in Wikipedia articles describing cities, it helps to provide more appropriate annotations: even though the extracted named entities are equal for both ontologies, the Tourism one is more suited to annotate those that are already tagged in the Wikipedia article. Moreover, since the Tourism ontology contains more concrete concepts than the more general-purpose Space one, the language ambiguity that may affect the subsumer extraction stage and the posterior annotation is minimised, resulting in better precision. The downside is the lowered recall obtained with the Tourism ontology, since solely those named entities that are already tagged in the text (and that, hence, correspond to concepts modelled in the Tourism ontology) are likely to be annotated. Differences between the two ontologies when using Wiki-tagged text are much smaller. In this case, the fact that NEs correspond solely to those already tagged in the Wikipedia article produces more similar annotation accuracies. Recall is, however, significantly higher for the Tourism ontology, since NE subsumers, which correspond to Wikipedia categories in this case, are more likely to match directly (as described in Section 4.4.2.1) to concepts of an ontology that is specially designed to cover topics and entities usually referred to in Wikipedia documents. This avoids the inherent ambiguity of the semantic matching process (described in Section 4.4.2.2) that is carried out when a direct matching is not found, which may hamper the results.

Fig. 4. Influence of domain ontologies.

5.2. General evaluation

In this second part of the evaluation, general tests have been performed for several entities belonging to two clearly differentiated domains: touristic destinations and movies. For each domain, eight different entities/documents have been analysed, using their corresponding Wikipedia articles. For touristic destinations, the Tourism ontology introduced above has been used, whereas for movies, a Film⁶ ontology retrieved from SWOOGLE has been used. This latter one models concepts related to the whole production process of a film, including directors, producers, main actors/actresses starring in the film, etc. The structure of both ontologies is summarised in Table 10.

⁶ http://deim.urv.cat/~itaka/CMS2/images/ontologies/film.owl.

Table 10
Description of used ontologies.

Ontology | Taxonomic depth | # Classes | Main classes
-------- | --------------- | --------- | ------------
Tourism  | 5 levels        | 315       | Administrative divisions, buildings, festivals, landmarks, museums, sports
Film     | 4 levels        | 20        | Person, product, company

Table 11 depicts the precision and recall obtained for the different cities analysed in the tourism domain, whereas Table 12 shows the results of the evaluation of films.

Table 11
Evaluation results from Wikipedia descriptions of different cities using the Tourism ontology.

Articles         | Precision (%) | Recall (%)
---------------- | ------------- | ----------
Cracow           | 80            | 36.3
London           | 75.8          | 66.6
New York City    | 69.2          | 75
Poznan           | 66.6          | 44.4
Saint Petersburg | 66.6          | 57.1
Stockholm        | 73.6          | 56
Vancouver        | 45.4          | 50
Warsaw           | 79.1          | 70.3
Average          | 69.5          | 57.0

Table 12
Evaluation results from Wikipedia descriptions of different films using the Film ontology.

Articles            | Precision (%) | Recall (%)
------------------- | ------------- | ----------
Batman Begins       | 75            | 52
First Blood (Rambo) | 87.5          | 77.7
Inception           | 81.2          | 72.2
Matrix              | 87.5          | 57
Oldboy              | 57.1          | 66.6
Pulp Fiction        | 78            | 61.1
Requiem for a dream | 60            | 50
The Fountain        | 88.8          | 69.2
Average             | 76.9          | 63.2

Comparing both domains, we observe slightly better results for the film domain, with averaged precision and recall figures of 76.9% and 63.2%, respectively. This is related to the lower ambiguity inherent to film-related entities, in comparison with touristic ones. For example, in the film domain, most NEs refer to
person names (directors, producers, stars), which are hardly ambiguous due to their concreteness. Hence, the annotation to ontological concepts is usually correct. On the contrary, a NE such as "Barcelona" may refer either to a city or to a football team, both annotations being almost equally representative; if both concepts are found in the ontology, the annotation is more likely to fail in this case.

Analysing individual entities, we observe better precisions and recalls for English/American entities (e.g., American films like First Blood or British cities like London), which usually result in a larger amount of available Web resources for the corresponding entities and, hence, in more robust statistics.

As a general conclusion, considering recall figures, we observe that our method was able to extract, in both cases, more than half of the features manually marked by the domain expert. Moreover, precisions were around 70–75% on average, which produces usable results for further data analysis. It is important to note that these results, which have been automatically obtained, suppose a considerable reduction of the human effort required to annotate textual resources.

6. Conclusions

One of the basic pillars of the Semantic Web concept is the idea of having explicit semantic information that can be used by intelligent agents in order to solve complex problems of Information Retrieval and Question Answering and to semantically analyse and catalogue electronic contents. This fact has motivated the creation of new data mining techniques like semantic clustering (Batet et al., 2011), which are able to exploit the semantics of data. However, these methods assume that input contents were annotated in advance so that relevant features can be detected and interpreted.

This work aimed to extract and pre-process data from different kinds of inputs (i.e., plain textual documents, Web resources or
semi-structured documents like Wikipedia articles) in order to generate the required semantically tagged data for the aforementioned semantic data analysis techniques. In order to reach our goals, several well-known techniques and tools have been used: natural language processing parsers have been useful to analyse texts and detect named entities, Hearst Patterns have been used to discover potential subsumer concepts of named entities, and Web-scale statistics complemented with co-occurrence measures have been calculated to score and filter potential named entities and to verify whether the final semantic annotation of subsumer concepts is applicable. The main contribution in comparison with related works is the proposal of a novel way to go from the named entity level to the conceptual level by using the Web as a general learning source, so that the recall of the annotation can be improved regardless of the coverage limitations of named entities presented by the input ontology or WordNet. This enables a final annotation that minimises the number of queries performed to
the Web, since the final matching is performed at a conceptual level (Cimiano et al., 2004). Moreover, statistical assessors have been tuned to better assess the suitability of extraction and annotation at each stage, minimising the inherent language ambiguity of Web queries. Our method has also been designed in a general way so that it can be applied to different kinds of inputs and exploit semi-structured information (like Wikipedia annotation) to improve the performance. Finally, being unsupervised and domain-independent, the implemented methodology can be applied in different domains and without human supervision.

The evaluation performed for different entities and domains produced usable results that, by carefully tuning the algorithm thresholds, reach the high precisions needed for applying them to data mining algorithms. The evaluation has also shown the benefits of using semi-structured inputs (Wikipedia articles), producing quality results with a lower computational cost.

As further work, it is a priority to study the quality of the extracted features when using them in semantic data mining algorithms (Batet et al., 2011; Martínez et al., 2012). Another important line of work is to study how to reduce the number of queries (e.g., using only a subset of Hearst Patterns) to Web search engines, as they are the slowest part of the algorithm and introduce a dependency on external resources. Finally, the annotation process could be extended in order to take advantage of ontology properties (e.g. data or object properties) instead of solely focusing on taxonomic knowledge. In this manner, entity property-value tuples (e.g. the population or area of a city) could also be extracted from text. Thanks to the generic design of our method, extraction patterns other than taxonomic ones can be used for that purpose, detecting non-taxonomic relations (Sánchez and Moreno, 2008a; Sánchez et al., 2012) such as part-of (Sánchez, 2010). Statistical assessors and Web queries should also be modified accordingly.

Acknowledgements

This work was partially supported by the Universitat Rovira i Virgili (predoctoral grant of C. Vicient, 2010BRDI-06-06) and the Spanish Government through the project DAMASK—Data Mining Algorithms with Semantic Knowledge (TIN2009-11005).

References

Abril, D., Navarro-Arribas, G., Torra, V., 2011. On the declassification of confidential documents. In: Torra, V., Narukawa, Y., Yin, J., Long, J. (Eds.), Modeling Decisions for Artificial Intelligence. Springer, Berlin/Heidelberg, pp. 235–246.
Ahmad, K., Tariq, M., Vrusias, B., Handy, C., 2003. Corpus-based thesaurus construction for image retrieval in specialist domains. In: Sebastiani, F. (Ed.), Advances in Information Retrieval. Springer, Berlin/Heidelberg, p. 76.
Alfonseca, E., Manandhar, S., 2002. Improving an ontology refinement method with hyponymy patterns. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, LREC 2002, Las Palmas, Spain.
Batet, M., Valls, A., Gibert, K., 2011. Semantic clustering based on ontologies: an application to the study of visitors in a natural reserve. In: Proceedings of the 3rd International Conference on Agents and Artificial Intelligence, ICAART 2011, pp. 283–289.
Batet, M., Valls, A., Gibert, K., Sánchez, D., 2010. Semantic clustering using multiple ontologies. In: Proceedings of the 13th International Conference on the Catalan Association for Artificial Intelligence, pp. 207–216.
Berland, M., Charniak, E., 1999. Finding parts in very large corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, College Park, Maryland, pp. 57–64.
Berners-Lee, T., Hendler, J., Lassila, O., 2001. The Semantic Web—a new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Sci. Am. 284, 34–43.
Bisson, G., Nédellec, C., Cañamero, D., 2000. Designing clustering methods for ontology building: the Mo'K workbench. In: Staab, S., Maedche, A., Nédellec, C., Wiemer-Hastings, P. (Eds.), Proceedings of the First Workshop on Ontology Learning, OL'2000, Berlin, Germany.
Brill, E., 2003. Processing natural language without natural language processing. In: Gelbukh, A.F. (Ed.), Proceedings of the 4th International Conference on Intelligent Text Processing and Computational Linguistics, CICLing 2003. Springer, Berlin/Heidelberg, Mexico City, Mexico, pp. 360–369.
Buitelaar, P., Cimiano, P., Frank, A., Hartung, M., Racioppa, S., 2008. Ontology-based information extraction and integration from heterogeneous data sources. Int. J. Hum.-Comput. Stud. 66, 759–788.
Buitelaar, P., Olejnik, D., Sintek, M., 2004. A Protégé plug-in for ontology extraction from text based on linguistic analysis. In: Proceedings of the 1st European Semantic Web Symposium (ESWS), pp. 31–44.
Cafarella, M., Downey, D., Soderland, S., Etzioni, O., 2005. KnowItNow: fast, scalable information extraction from the web. In: Proceedings of the Human Language Technology Conference/Conference on Empirical Methods in Natural Language Processing, HLT/EMNLP 2005. Association for Computational Linguistics, Vancouver, Canada, pp. 563–570.
Califf, M.E., Mooney, R.J., 2003. Bottom-up relational learning of pattern matching rules for information extraction. J. Mach. Learn. Res. 4, 177–210.
Caraballo, S.A., 1999. Automatic construction of a hypernym-labeled noun hierarchy from text. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, College Park, Maryland, pp. 120–126.
Church, K., Gale, W., Hanks, P., Hindle, D., 1991. Using statistics in lexical analysis. In: Zernik, U. (Ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum Associates, Hillsdale, New Jersey, USA, pp. 115–164.
Cilibrasi, R.L., Vitányi, P.M.B., 2006. The Google similarity distance. IEEE Trans. Knowl. Data Eng. 19, 370–383.
Cimiano, P., Handschuh, S., Staab, S., 2004. Towards the self-annotating web. In: Proceedings of the 13th International Conference on World Wide Web, WWW 2004. ACM, New York, USA, pp. 462–471.
Cimiano, P., Ladwig, G., Staab, S., 2005. Gimme' the context: context-driven automatic semantic annotation with C-PANKOW. In: Ellis, A., Hagino, T. (Eds.), Proceedings of the 14th International Conference on World Wide Web. ACM, Chiba, Japan, pp. 462–471.
Ciravegna, F., Dingli, A., Guthrie, D., Wilks, Y., 2003. Integrating information to bootstrap information extraction from Web sites. In: Proceedings of the Workshop on Intelligent Information Integration, IJCAI'03.
Ciravegna, F., Dingli, A., Petrelli, D., Wilks, Y., 2002. User-system cooperation in document annotation based on information extraction. In: Gómez-Pérez, A., Benjamins, R. (Eds.), Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web. Springer, Berlin/Heidelberg, Sigüenza, Spain, pp. 122–137.
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., 2002. GATE: a framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, ACL 2002, Philadelphia, US.
Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., McCurley, K.S., Rajagopalan, S., Tomkins, A., 2003. A case for automated large-scale semantic annotation. Web Semantics: Sci. Serv. Agents World Wide Web 1, 115–132.
Downey, D., Broadhead, M., Etzioni, O., 2007. Locating complex named entities in Web text. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI 2007. AAAI, Hyderabad, India, pp. 2733–2739.
Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D.S., Yates, A., 2004. Web-scale information extraction in KnowItAll (preliminary results). In: Proceedings of the 13th International Conference on World Wide Web. ACM, New York, NY, USA, pp. 100–110.
Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D.S., Yates, A., 2005. Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165, 91–134.
Feilmayr, C., Parzer, S., Pröll, B., 2009. Ontology-based information extraction from tourism websites. J. Inf. Technol. 11, 183–196.
Fellbaum, C., 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts.
Fensel, D., Bussler, C., Ding, Y., Kartseva, V., Klein, M., Korotkiy, M., Omelayenko, B., Siebes, R., 2002. Semantic web application areas. In: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB).
Fleischman, M., Hovy, E., 2002. Fine grained classification of named entities. In: Proceedings of the 19th International Conference on Computational Linguistics—vol. 1. Association for Computational Linguistics, Taipei, Taiwan, pp. 1–7.
Guarino, N., 1998. Formal ontology in information systems. In: Guarino, N. (Ed.), Proceedings of the 1st International Conference on Formal Ontology in Information Systems, FOIS 1998. IOS Press, Trento, Italy, pp. 3–15.
Handschuh, S., Staab, S., Studer, R., 2003. Leveraging metadata creation for the Semantic Web with CREAM. In: Günter, A., Kruse, R., Neumann, B. (Eds.), Proceedings of the 26th Annual German Conference on AI, KI 2003. Springer, Berlin/Heidelberg, Hamburg, Germany, pp. 19–33.
He, Z., Xu, X., Deng, S., 2008. k-ANMI: a mutual information based clustering algorithm for categorical data. Inf. Fusion 9, 223–233.
Hearst, M.A., 1992. Automatic acquisition of hyponyms from large text corpora. In: Kay, M. (Ed.), Proceedings of the 14th Conference on Computational Linguistics—vol. 2, COLING 92. Morgan Kaufmann Publishers, Nantes, France, pp. 539–545.
Hotho, A., Maedche, A., Staab, S., 2002. Ontology-based text document clustering. Künstliche Intelligenz, 48–54.
Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D., 2004. Semantic annotation, indexing, and retrieval. J. Web Semantics 2, 49–79.
Kiyavitskaya, N., Zeni, N., Cordy, J.R., Mich, L., Mylopoulos, J., 2005. Semi-automatic semantic annotations for Web documents. In: Bouquet, P., Tummarello, G. (Eds.), Proceedings of the 2nd Italian Semantic Web Workshop on Semantic Web Applications and Perspectives, SWAP 2005. CEUR-WS, Trento, Italy, pp. 210–225.
Kiyavitskaya, N., Zeni, N., Cordy, J.R., Mich, L., Mylopoulos, J., 2009. Cerno: light-weight tool support for semantic annotation of textual documents. Data Knowl. Eng. 68, 1470–1492.
Koivunen, M.-R., 2005. Annotea and semantic Web supported collaboration (invited talk). In: Dzbor, M., Takeda, H., Vargas-Vera, M. (Eds.), Proceedings of the Workshop on End User Aspects of the Semantic Web at the 2nd Annual European Semantic Web Conference, UserSWeb 05. CEUR Workshop Proceedings, Heraklion, Crete, pp. 5–17.
Lamparter, S., Ehrig, M., Tempich, C., 2004. Knowledge extraction from classification schemas. In: Proceedings of the International Conference on Ontologies, Databases and Applications of Semantics (ODBASE). Springer, pp. 618–636.
Lin, D., 1998. Automatic retrieval and clustering of similar words. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, COLING-ACL 1998. ACL/Morgan Kaufmann, Montreal, Quebec, Canada, pp. 768–774.
Maedche, A., Neumann, G., Staab, S., 2003. Bootstrapping an ontology-based information extraction system. In: Szczepaniak, P.S., Segovia, J., Kacprzyk, J., Zadeh, L.A. (Eds.), Intelligent Exploration of the Web. Physica-Verlag, pp. 345–359.
Martínez, S., Sánchez, D., Valls, A., 2012. Semantic adaptive microaggregation of categorical microdata. Comput. Secur. 31, 653–672.
Michelson, M., Knoblock, C.A., 2007. An automatic approach to semantic annotation of unstructured, ungrammatical sources: a first look. In: Knoblock, C.A., Lopresti, D., Roy, S., Subramaniam, L.V. (Eds.), Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, IJCAI-2007, Hyderabad, India, pp. 123–130.
Mikheev, A., Finch, S., 1997. A workbench for finding structure in texts. In: Proceedings of the Fifth Conference on Applied Natural Language Processing. Association for Computational Linguistics, Washington, DC, pp. 372–379.
Morbach, J., Yang, A., Marquardt, W., 2007. OntoCAPE—a large-scale ontology for chemical process engineering. Eng. Appl. Artif. Intell. 20, 147–161.
Niekrasz, J., Gruenstein, A., 2006. NOMOS: a Semantic Web software framework for annotation of multimodal corpora. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC 06, Genoa, Italy, pp. 21–27.
Pasca, M., 2004. Acquisition of categorized named entities for web search. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management. ACM, Washington, D.C., USA, pp. 137–145.
Porter, M.F., 1997. An algorithm for suffix stripping. In: Sparck Jones, K., Willett, P. (Eds.), Readings in Information Retrieval. Morgan Kaufmann Publishers Inc., pp. 313–316.
Roberts, A., Gaizauskas, R., Hepple, M., Davis, N., Demetriou, G., Guo, Y., Kola, J., Roberts, I., Setzer, A., Tapuria, A., Wheeldin, B., 2007. The CLEF corpus: semantic annotation of clinical text. In: Proceedings of the AMIA 2007 Annual Symposium. American Medical Informatics Association, Chicago, USA, pp. 625–629.
Rosso, P., Montes, M., Buscaldi, D., Pancardo, A., Villaseñor, A., 2005. Two Web-based approaches for noun sense disambiguation. In: Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2005. Springer-Verlag, LNCS 3406, Mexico D.F., pp. 261–273.
Sánchez, D., 2010. A methodology to learn ontological attributes from the Web. Data Knowl. Eng. 69, 573–597.
Sánchez, D., Batet, M., Valls, A., Gibert, K., 2009. Ontology-driven web-based semantic similarity. J. Intell. Inf. Syst. 35, 383–413.
Sánchez, D., Isern, D., 2011. Automatic extraction of acronym definitions from the Web. Appl. Intell. 34, 311–327.
Sánchez, D., Isern, D., Millán, M., 2010. Content annotation for the Semantic Web: an automatic Web-based approach. Knowl. Inf. Syst. 27, 393–418.
Sánchez, D., Moreno, A., 2008a. Learning non-taxonomic relationships from web documents for domain ontology construction. Data Knowl. Eng. 63, 600–623.
Sánchez, D., Moreno, A., 2008b. Pattern-based automatic taxonomy learning from the Web. AI Commun. 21, 27–48.
Sánchez, D., Moreno, A., Del Vasto-Terrientes, L., 2012. Learning relation axioms from text: an automatic Web-based approach. Expert Syst. Appl. 39, 5792–5805.
Sanderson, M., Croft, B., 1999. Deriving concept hierarchies from text. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, Berkeley, California, United States, pp. 206–213.
Sapkota, K., Aldea, A., Duce, D.A., Younas, M., Bañares-Alcántara, R., 2011. Semantic-ART: a framework for semantic annotation of regulatory text. In: Proceedings of the Fourth Workshop on Exploiting Semantic Annotations in Information Retrieval. ACM, Glasgow, Scotland, UK, pp. 23–24.
Schroeter, R., Hunter, J., Kosovic, D., 2003. Vannotea—a collaborative video indexing, annotation and discussion system for broadband networks. In: Handschuh, S., Koivunen, M.-R., Dieng-Kuntz, R., Staab, S. (Eds.), Proceedings of the Knowledge Markup and Semantic Annotation Workshop, K-CAP 03. ACM, Sanibel, Florida, pp. 9–26.
Setchi, R., Silva, J.D.V., Ríos, S.A., Cao, C., 2011. Special issue on semantic information and engineering systems. Eng. Appl. Artif. Intell. 24, 1303–1304.
Spackman, K.A., 2004. SNOMED CT milestones: endorsements are added to already-impressive standards credentials. Healthcare Informatics: The Business Magazine for Information and Communication Systems 21, 54–56.
Staab, S., Maedche, A., 2000. Ontology engineering beyond the modeling of concepts and relations. In: Benjamins, R., Gómez-Pérez, A., Guarino, N. (Eds.), Proceedings of the ECAI-2000 Workshop on Ontologies and Problem-Solving Methods, Berlin, Germany.
Stevenson, M., Gaizauskas, R., 2000. Using corpus-derived name lists for named entity recognition. In: Proceedings of the Sixth Conference on Applied Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, pp. 290–295.
Turney, P.D., 2001. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In: De Raedt, L., Flach, P. (Eds.), Proceedings of the 12th European Conference on Machine Learning, ECML 2001. Springer-Verlag, Freiburg, Germany, pp. 491–502.
Velardi, P., Navigli, R., Cucchiarelli, A., Neri, F., 2005. Evaluation of OntoLearn, a methodology for automatic learning of domain ontologies. In: Buitelaar, P., Cimiano, P., Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, pp. 92–105.
Velásquez, J.D., Dujovne, L.E., L'Huillier, G., 2011. Extracting significant Website Key Objects: a Semantic Web mining approach. Eng. Appl. Artif. Intell. 24, 1532–1541.
Zavitsanos, E., Tsatsaronis, G., Varlamis, I., Paliouras, G., 2010. Scalable semantic annotation of text using lexical and Web resources. In: Proceedings of the 6th Hellenic Conference on Artificial Intelligence, pp. 287–296.