An Automatic Approach For Ontology-Based Feature Extraction From Heterogeneous Textual Resources

Article history:
Received 24 February 2012
Received in revised form 24 July 2012
Accepted 10 August 2012
Available online 13 September 2012

Keywords:
Information extraction
Feature extraction
Ontologies
Wikipedia

Abstract

Data mining algorithms such as data classification or clustering methods exploit features of entities to characterise, group or classify them according to their resemblance. In the past, many feature extraction methods focused on the analysis of numerical or categorical properties. In recent years, motivated by the success of the Information Society and the WWW, which has made available enormous amounts of textual electronic resources, researchers have proposed semantic data classification and clustering methods that exploit textual data at a conceptual level. To do so, these methods rely on pre-annotated inputs in which text has been mapped to its formal semantics according to one or several knowledge structures (e.g. ontologies, taxonomies). Hence, they are hampered by the bottleneck introduced by the manual semantic mapping process. To tackle this problem, this paper presents a domain-independent, automatic and unsupervised method to detect relevant features from heterogeneous textual resources, associating them to concepts modelled in a background ontology. The method has been applied to raw text resources and also to semi-structured ones (Wikipedia articles). It has been tested in the Tourism domain, showing promising results.

© 2012 Elsevier Ltd. All rights reserved.
1. Introduction

Since the creation of the World Wide Web, its structure and architecture have been in constant growth and development. Nowadays the Web has turned into what we know as the Social Web, where all users are able to add and modify its content. This fact has resulted in an exponential growth of the available information. However, its lack of structure has brought some problems: the access and retrieval of information is difficult and inaccurate, and the unstructured and mainly textual information can hardly be interpreted by IT applications (Fensel et al., 2002). In order to solve these problems, the Semantic Web (Berners-Lee et al., 2001) has been proposed.

The Semantic Web is an evolving extension of the World Wide Web in which the semantics of information and services are explicitly defined, making it possible for the Web to understand and satisfy the requests of users and computers. One of the basic pillars of the Semantic Web concept is the idea of having explicit semantic information on Web pages that can be used by intelligent agents in order to solve complex problems such as Information Retrieval and Question Answering (Brill, 2003). The final objective of the Semantic Web is to be able to semantically analyse and catalog Web contents. These tasks require a set of structures to model knowledge, and a linkage between the knowledge and Web contents. The Semantic Web relies on two basic components: ontologies and semantic annotations.

On one hand, ontologies are formal, explicit specifications of a shared conceptualization (Guarino, 1998). Ontologies are useful to model knowledge in a formal abstract way which can be interpreted and exploited by computers. On the other hand, annotations are linkages between the textual contents of a document and their formal semantics, modelled in a knowledge structure such as an ontology, enabling the development of knowledge-based systems (Setchi et al., 2011).

The availability of such an amount of electronic resources has brought a growing interest in the research community in the exploitation of Web contents. Data mining algorithms, and particularly data classification and clustering methods, aim to analyse, characterise, classify or group data. Since classical data analysis methods were designed to deal with numerical data, in recent years researchers have proposed novel semantically-grounded methods that, by exploiting structured knowledge bases like ontologies, taxonomies or folksonomies, are able to semantically interpret textual data such as Web contents (Hotho et al., 2002). In He et al. (2008) the authors propose a data classification method that generalises set-valued textual data (e.g. query logs, clinical records, etc.) relying on domain taxonomies. In Batet et al. (2011, 2010) the authors adapt well-known clustering methods so that they can semantically interpret (i.e. compare, group and aggregate) multi-featured textual

* Corresponding author. Tel.: +34 977256563; fax: +34 977559710.
E-mail addresses: carlos.vicient@urv.cat (C. Vicient), david.sanchez@urv.cat (D. Sánchez), antonio.moreno@urv.cat (A. Moreno).
http://dx.doi.org/10.1016/j.engappai.2012.08.002
C. Vicient et al. / Engineering Applications of Artificial Intelligence 26 (2013) 1092–1106 1093
data (e.g. polls, electronic health-care records, etc.), exploiting general purpose and domain-specific ontologies. A similar process is carried out in Martínez et al. (2012), but focusing on the semantic micro-aggregation of textual values for privacy-preserving purposes. Even though these methods deal with textual features, they suppose structured inputs rather than raw textual documents, in which words describing an entity have been extracted and associated to concepts whose semantics are modelled in a knowledge base (Hotho et al., 2002).

Unfortunately, semantically-tagged features, which can be useful for semantic data analysis, are scarce nowadays. Due to the semantic annotation bottleneck that characterises manual approaches, most of the existing textual resources are plain text documents or ad-hoc and partially annotated ones (such as Wikipedia articles). Hence, the applicability of semantic-based data mining methods is hampered by the availability of adequately processed input information.

To tackle this problem, this paper proposes a method that, given an input document describing an entity (e.g. a Web resource, a Wikipedia article or a plain text document) and an ontology stating which features should be extracted (e.g. tourism points of interest), is able to detect, extract and semantically annotate relevant textual features describing that entity.

The method has been designed in a general way so that it can be applied to a range of textual documents going from raw plain text to semi-structured resources (e.g. tagged Wikipedia articles). In the latter case, the method is able to exploit the pre-processed input to complement its own learning algorithms. The key point of the work is to leverage syntactical parsing and several natural language processing techniques with the knowledge contained in the input ontology, and to use the Web as a corpus to assist the learning process, in order to (1) identify relevant features describing a particular entity in the input document, and (2) associate, if applicable, the extracted features to concepts contained in the input ontology. In this manner, the output of the system consists of tagged features which can be directly exploited by semantically grounded data analysis methods such as clustering (Batet et al., 2011).

The rest of the paper is organised as follows: Section 2 discusses related works proposing annotation systems for textual data. Section 3 provides and discusses the theoretical background, techniques and working environment for the proposed method. Section 4 details and formalises the proposed method, explaining also how it can be applied to different kinds of inputs (unstructured and semi-structured text). Section 5 evaluates the proposed method when dealing with touristic destinations, testing the behaviour and influence of learning parameters and input types; finally, the global performance of the method when analysing different domains is shown. The final section discusses and summarises the main contributions of the work.

2. Related work

In the context of the Semantic Web, several methods have been proposed to detect and annotate relevant entities from electronic resources. First, manual approaches have been developed to assist the user in the annotation process, such as Annotea (Koivunen, 2005), CREAM (Handschuh et al., 2003), NOMOS (Niekrasz and Gruenstein, 2006) or Vannotea (Schroeter et al., 2003). Those systems rely on the skills and will of a community of users to detect and tag entities within Web content. Considering that there are 1 trillion pages on the Web, it is easy to envisage the unfeasibility of manual annotation of Web resources.

On the contrary, some authors have tried to automate some of the stages of the annotation process (proposing supervised systems) in order to overcome the bottleneck of manual solutions. Melita (Ciravegna et al., 2002) is based on user-defined rules and pre-defined annotations, which are employed to propose new ones. Supervised approaches, however, are difficult to apply, due to the effort required to compile a large and representative training set. Cerno is another system for the semi-automated annotation of Web resources (Kiyavitskaya et al., 2005, 2009). The authors propose the use of patterns/annotation schemas (addressed to extract objects such as email addresses, phone numbers, dates and prices) to obtain the candidates to annotate, and then this set is annotated by means of a domain conceptual model. That model represents the information of a particular domain through concepts, relationships and attributes (in an entity-relation based syntax). Supervised systems also use extraction rules obtained from a set of pre-tagged data (Califf and Mooney, 2003; Roberts et al., 2007). WebKB (Cafarella et al., 2005) and Armadillo (Alfonseca and Manandhar, 2002) use supervised techniques to extract information from computer science websites. Likewise, S-CREAM (Cunningham et al., 2002) uses Machine Learning techniques to annotate a particular document with respect to an ontology, given a set of annotated examples.

Other supervised systems present a more ad-hoc design and focus on a specific domain of knowledge. In Maedche et al. (2003) a system that works on the Tourism domain is proposed. It combines lexical knowledge, extraction rules and ontologies in order to extract information in the form of instantiated concepts and attributes that are stored in an ontology-like fashion (hotel names, number of rooms, prices, etc.). The most interesting characteristic is the fact that the pre-defined knowledge structures are extended as a result of the Information Extraction process, improving and completing them. They use several Ontology Learning techniques already developed for the OntoEdit system (Staab and Maedche, 2000). A human expert has to validate each extension before continuing. Another system focused on the Tourism domain is presented in Feilmayr et al. (2009). A domain-dependent crawler collects Web pages corresponding to accommodation websites. This corpus is passed to an extraction module based on the GATE framework (Cunningham et al., 2002), which provides a number of text engineering components. It performs an annotation of the Web pages in the corpus, supported by a domain-dependent ontology and rule set. The extracted tokens are ranked as a function of their frequency and relevancy for the domain. Semantic-ART (Sapkota et al., 2011) is another system that aims to annotate texts associated to regulatory information to convert them into semantic models. The system assumes that a regulation typically comprises at least one statement (a sentence in the regulation), which must have a subject, an obligation and an action. The presence of an obligation (i.e., sentences containing obligatory verbs) in a paragraph helps to distinguish the regulation from ordinary text. The subjects and objects of these sentences are directly searched in an input ontology. WordNet is used to look for synonyms of these words if a direct matching is not found.

Another domain-dependent system is SOBA (Buitelaar et al., 2008), a sub-component of the SmartWeb (a multi-modal dialog system that derives answers from unstructured resources such as the Web), which automatically populates a knowledge base with information extracted from soccer match reports found on the Web. The extracted information is defined with respect to an underlying ontology. The SOBA system consists of a Web crawler, linguistic annotation components and a module for the transformation of linguistic annotations into an ontology-based representation. The first component enables the automatic creation of a soccer corpus, which is kept up-to-date on a daily basis. Linguistic annotation is based on finite-state techniques and unification-based algorithms. It implements basic grammars for the annotation of people, locations, numerals and date and time expressions. On top of these, rules for extracting soccer-specific entities, such as soccer players, teams and tournaments, are implemented. Finally, data are transformed
into ontological facts, by means of tabular processing (applying wrapper-like techniques) and text matching (by means of F-logic structures specified in a declarative form). However, the final association between text entities and a domain ontology is not addressed.

Other systems like KnowItAll (Etzioni et al., 2005) rely on the redundancy of the Web to perform a bootstrapped information extraction process iteratively. The confirmation of the correctness of the obtained information is requested from the user to re-execute the process. KnowItAll is self-supervised; instead of using hand-tagged training data, the system selects and labels its own training examples and iteratively bootstraps its learning process. For a given relation, the set of generic patterns is used to automatically instantiate relation-specific extraction rules, which are then used to learn domain-specific extraction rules. The rules are applied to Web pages identified via search-engine queries, and the resulting extractions are assigned a probability using information-theoretic measures derived from search engine hit counts. Next, KnowItAll uses frequency statistics computed by querying search engines to identify which instantiations are most likely to be members of the class. KnowItAll is relation-specific in the sense that it requires a laborious bootstrapping process for each relation of interest, and the set of relations has to be named by the human user in advance. This is a significant obstacle to open-ended extraction because unanticipated concepts and relations are often encountered while processing text.

Completely automatic and unsupervised annotation systems are scarce. SemTag (Dill et al., 2003) performs automated semantic tagging from large corpora based on the Seeker platform for text analysis, and tags a large number of pages with the terms included in a domain ontology named TAP. This ontology contains lexical and taxonomic information about music, movies, sports, health, and other issues, and SemTag detects the occurrence of these entities in Web pages. It disambiguates the sense of the words using neighbour tokens and corpus statistics, picking the best label for a token. KIM (Kiryakov et al., 2004) is another example of an unsupervised, domain-independent system. It scans documents looking for entities corresponding to instances in its input ontology.

The system proposed in Zavitsanos et al. (2010) follows a sequence of annotation steps. First, all the ontology concepts (and their stems) are directly searched in the text. In a second stage, WordNet is used to obtain synonyms of the ontology concepts, and the system looks for them in the analysed text. Later, the authors propose the use of a semantic similarity measure (path length in WordNet) which is computed between all the words in the text and all the ontology concepts. Finally, another similarity measure based on Wikipedia is also considered. Since only words contained in the ontology or those corresponding to their synonyms or semantically similar ones are annotated, this work is limited to the coverage offered by the input knowledge repositories. Named entities or recently minted terms which cannot be found in Wikipedia, for example, are unlikely to result in any annotation.

Another interesting annotation application is presented in Michelson and Knoblock (2007). In this case, the authors use a reference set of elements (e.g., on-line collections containing structured data about cars, comics or general facts) to annotate ungrammatical sources like texts contained in posts. First of all, the elements of those posts are evaluated using the TF-IDF metric. Then, the most promising tokens are matched with the reference set. In both cases, limitations may be introduced by the availability and coverage of the background knowledge (i.e., ontology or reference set).

From the applicability point of view, Pankow (Cimiano et al., 2004) is the most promising system. It uses a range of well-studied syntactic patterns to mark up candidate phrases in Web pages without having to manually produce an initial set of marked-up Web pages, and without depending on previous knowledge. Its context-driven version, C-Pankow (Cimiano et al., 2005), improves the first by reducing the number of queries to the search engine.

3. Background

This section discusses the basic characteristics of the working environment and presents the main existing techniques (lexical patterns and Web-scale statistics) on which the proposed method relies.

3.1. The Web as a tool to assist the learning process

When aiming to develop an unsupervised system dealing with textual data, one cannot rely on experts' opinions to validate the results (e.g. to evaluate which extracted terms should be mapped to the input ontology or which is the best mapping for a given term). However, one can profit from predefined knowledge to assess the adequacy of the extracted conclusions. To be domain-independent, however, no domain-specific knowledge structures or corpora can be used. Hence, to support our proposed method we rely on the Web as a domain-independent corpus which can aid the process of learning entity features.

The Web has many interesting characteristics in comparison with other information sources. As stated in Brill (2003), many classical techniques for information extraction and knowledge acquisition present performance limitations because they typically rely on a limited corpus. According to Brill (2003), the Web is the biggest repository of information available. In fact, the amount and heterogeneity of information on the Web is so high that it can be assumed to approximate the real distribution of information in the world (Cilibrasi and Vitányi, 2006). Moreover, the high redundancy of information, derived from the huge amount of available resources, can represent a measure of its relevance, i.e., the more times a piece of information appears on the Web, the greater its relevance (Brill, 2003; Ciravegna et al., 2003; Etzioni et al., 2004; Rosso et al., 2005). In other words, we can give more confidence to a fact that is enounced by a considerable amount of independent sources. Thanks to the aforementioned characteristics, the Web has demonstrated its validity as a corpus to assist learning processes (Etzioni et al., 2004; Sánchez et al., 2009; Velásquez et al., 2011).

In our work the Web is used as a general tool on which the feature extraction process relies to assess the relevancy of potential extractions found in an input document (Section 4.3), and to improve the recall of the feature annotation towards concepts included in the input ontology (Section 4.4). To this end, Web-scale information distribution is exploited to assist the learning process and to extract robust conclusions.

3.2. Learning techniques

In the following, the main techniques on which our work relies to extract potential features and assess their relevancy are presented. They are focused on linguistic patterns and Web-scale statistics.

3.2.1. Linguistic patterns

The problem of semantically annotating/classifying/matching features extracted from text to their formal semantics can be simplified to the problem of discovering the ontological concept of which the discovered feature is an instance (e.g., Barcelona is a city). In essence, instance–concept relationships define a taxonomical link which should be discovered.

There exist many approaches for detecting this kind of relations. As stated in Cimiano et al. (2004), three different learning paradigms can be exploited. First, some approaches rely on the document-based notion of term subsumption (Sanderson and Croft, 1999).
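To make the pattern-based paradigm concrete, the sketch below applies two classic lexico-syntactic (Hearst-style) patterns of the kind used for instance–concept discovery (e.g., detecting that Barcelona is a city). The two patterns and the naive regexes are our own illustrative assumptions, not the exact pattern set of the systems discussed above; real systems combine such patterns with POS tagging and noun-phrase chunking rather than bare word matching.

```python
import re

# Two Hearst-style patterns suggesting an instance-concept (hyponymy) link.
# Deliberately simplified: an "instance" is approximated by a capitalised word.
PATTERNS = [
    # "<concept> such as <Instance>", e.g. "cities such as Barcelona"
    re.compile(r"(?P<concept>\w+) such as (?P<instance>[A-Z]\w+)"),
    # "<Instance> and other <concept>", e.g. "Barcelona and other cities"
    re.compile(r"(?P<instance>[A-Z]\w+) and other (?P<concept>\w+)"),
]

def extract_is_a(sentence):
    """Return (instance, concept) pairs suggested by the patterns."""
    pairs = []
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            pairs.append((match.group("instance"), match.group("concept")))
    return pairs

print(extract_is_a("Tourists love cities such as Barcelona."))
# [('Barcelona', 'cities')]
print(extract_is_a("Barcelona and other cities attract visitors."))
# [('Barcelona', 'cities')]
```

In a Web-scale setting, such patterns are typically instantiated as search-engine queries rather than run over a local corpus, which is what makes them usable without a background document collection.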
basic elements of the algorithm. Several of the proposed functions are abstract and can be overwritten in order to adapt the analysis to different kinds of inputs (plain text documents or semi-structured ones). Moreover, since the feature extraction process is guided by the input ontology, by using different domain ontologies (e.g. in Medicine (Spackman, 2004) or Chemical Engineering (Morbach et al., 2007)), the system is able to adapt the feature extraction process to the domain of interest. As inputs, the system receives the document to be analysed (e.g. a text describing a city) and the ontology detailing the features to be extracted (e.g. tourism activities or points of interest).

describe, in a less ambiguous way than general words, the relevant features of a particular entity (Abril et al., 2011).

To select which of the detected NEs are the most related to the analysed entity, a relevance-based analysis relying on Web co-occurrence statistics is performed. Afterwards, the selected NEs are matched to the ontological concepts of which they can be considered instances. In this manner the extracted features are presented in an annotated fashion, easing the posterior application of semantically-grounded data analyses. In the following subsections, the main algorithm steps are described in detail.
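The co-occurrence-based relevance filter can be pictured with the following sketch, which scores each NE by the share of its Web occurrences that co-occur with the analysed entity. The `TOY_HITS` table, the `hits` stub and the `relevance` score are illustrative assumptions standing in for real search-engine queries; this is a crude conditional-probability proxy, not the paper's actual NE_Score function.

```python
# Toy hit-count table standing in for a Web search engine; keys are either a
# single phrase or an (analysed entity, NE) pair queried together.
TOY_HITS = {
    "Barcelona": 500_000,
    "Sagrada Familia": 40_000,
    "Acre": 90_000,
    ("Barcelona", "Sagrada Familia"): 30_000,
    ("Barcelona", "Acre"): 120,
}

def hits(*terms):
    """Number of Web pages matching all the given terms (stubbed here)."""
    key = terms[0] if len(terms) == 1 else terms
    return TOY_HITS.get(key, 0)

def relevance(ne, ae):
    """Fraction of the NE's occurrences that co-occur with the analysed
    entity ae: a simple proxy for Web co-occurrence relevance."""
    if hits(ne) == 0:
        return 0.0
    return hits(ae, ne) / hits(ne)

# 'Sagrada Familia' co-occurs strongly with 'Barcelona', while 'Acre' barely
# does, so a threshold (say 0.1, an arbitrary choice) keeps the former and
# discards the latter.
print(relevance("Sagrada Familia", "Barcelona"))  # 0.75
print(relevance("Acre", "Barcelona"))
```

With real hit counts the same shape of score lets the filter discard NEs that merely happen to appear in the document (such as Acre in the Barcelona article) without any domain-specific resource.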
hampered by WordNet's limited coverage of NEs and, in consequence, it is usually not possible to compute the similarity between a NE and an ontological class in this way. There are approaches that try to automatically discover taxonomic relationships (Bisson et al., 2000; Sanderson and Croft, 1999), but they require a considerable amount of background documents and linguistic parsing. Finally, another possibility is to compute the co-occurrence between each NE and each ontological class using Web-scale statistics (Turney, 2001), but this solution is not scalable because of the huge amount of required queries (Cimiano et al., 2005).

The method proposed in this paper employs this last technique, but introduces a previous step that reduces the number of queries to be performed. To do so, the semantic annotation step is divided into two parts: the discovery of potential subsumer concepts (line 14) and their matching with the ontology classes (lines 19–38).

4.4.1. Discovering potential subsumer concepts

This first stage is proposed in order to minimise the number of (NE, ontology class) queries to be performed by the final statistical assessor. It may be noticed that the problem of semantic annotation is to find a bridge between the instance level (i.e., a NE) and the conceptual level (i.e., an ontology concept of which the NE is an instance). As stated in Section 3.2.1, NEs and concepts are semantically related by means of taxonomic relationships. Thus, the way to go from the instance level to the conceptual level is by discovering taxonomic abstractions, which are represented by subsumer concepts. So, in this stage, the aim is to automatically discover possible subsumer concepts for each NE.

This is done by means of the function extract_subsumer_concepts, which depends on the kind of input document (see Sections 4.5 and 4.6 for specific details). As a result, we obtain a set of subsumer concepts for each NE (e.g. the subsumer concepts of Porsche could be car, automobile, auto, motorcar and machine). Notice that those concepts are abstractions of the NE and they do not depend on any ontology. This means that subsumer concepts do not necessarily match with ontological classes. However, at this point the algorithm is working at a conceptual level (rather than at the instance level) and thus it is possible to match NEs with ontological classes more easily and efficiently in the following step, as detailed in the next section.

4.4.2. Matching subsumers to ontological classes

This stage tries to discover matches between the subsumer concepts of a NE and the ontological classes. To do so, two different cases are considered: Direct Matching and Semantic Matching.

4.4.2.1. Direct matching. First, the system tries to find a direct match between the subsumers of a NE and the ontology classes. This phase begins with the extraction of all the classes contained in the domain ontology (line 19). Then, for each Named Entity ne_i, all its potential subsumer concepts (sc_i ∈ SC) are compared against each ontology class in order to discover lexically similar ontological classes (soc_i ∈ SOC, lines 23–25), i.e., classes whose name matches the subsumer itself or a subset of it (e.g., if one of the potential subsumers is Gothic cathedral, it would match an ontology class called Cathedral).

A stemming algorithm (Porter, 1997) is applied to both the sc_i and the ontology classes in order to discover terms that have the same root (e.g., city and cities). If one (or several) ontology classes match the potential subsumers, they are included in SOC as candidates for the final annotation of ne_i. This direct matching step is quite simple and computationally efficient; however, its main problem is that, in many cases, subsumers do not appear as ontology classes with exactly the same name. As a result, potentially good candidates for annotation are not discovered.

4.4.2.2. Semantic matching. This step faces the non-trivial matching between semantically similar subsumers and ontological classes that, because they were represented with different textual labels, could not be discovered in the previous stage. Details on how this task is performed are given in Algorithm 2, which specifies what is done in line 28 of Algorithm 1.

Algorithm 2. Semantic matching method (line 28, Algorithm 1).

 1  extract_semantic_matching(OC, SC, ne, ae) {
 2    ∀ sc_i ∈ SC {
 3      SYNSETS := get_wordnet_synsets(sc_i)
 4      if |SYNSETS| == 0 {return}
 5      else if |SYNSETS| == 1 {
 6        RELATED_TERMS ← get_related_terms(synset_0)
 7      }
 8      else { /* when the sc has more than 1 synset, disambiguation is needed */
 9        CONTEXT := get_context(ne)
10        CONTEXT := CONTEXT + get_web_snippets(ne, ae)
11        ∀ context_j ∈ CONTEXT {
12          /* the similarity between the context
13             and each synset is calculated */
14          ∀ synset_k ∈ SYNSETS {
15            similarity := get_cosine_distance(context_j, synset_k)
16            update_average_synset(synset_k, similarity)
17          }
18        }
19        disambiguated_synset := get_synset_max_average(SYNSETS)
20        RELATED_TERMS := get_related_terms(disambiguated_synset)
21      }
22    }
23    return extract_direct_matching(OC, RELATED_TERMS)
24  }

The semantic matching step is performed when the direct matching has not produced any result. Its main goal is to increase the number of elements in SOC, so that the direct matching can be re-applied with a wider set of terms. The new potential subsumers are concepts semantically related to any of the initial subsumers (synonyms, hypernyms and hyponyms). As the algorithm works at a conceptual level, WordNet (Fellbaum, 1998) has been used to obtain these related terms and to increase the SOC set.

The main problem of semantic matching is that a word may be polysemous and, before extracting the related concepts from WordNet, it is necessary to discover to which sense it corresponds (i.e., a semantic disambiguation step must be performed to choose the correct synset). To do so, one possible solution is to use the context (i.e., the sentence from which ne_i was extracted, line 9 of Algorithm 2) but, usually, that is not enough to disambiguate the meaning. To minimise this problem, the Web is used to assist the disambiguation.

We propose a Web-based approach that combines the contexts of Named Entities, WordNet definitions and the cosine distance. The Web is used in this case to find new evidence of the relationship between the ne and the ae, in order to increase the contexts (where those terms occur together) that enable a better disambiguation of the meaning of polysemous words. First, the algorithm retrieves the contexts in which the NE appears (line 10 of Algorithm 2), framed in the scope of the original ae (by querying a search engine for common appearances of ae and ne). Then, the set of Web snippets returned by the search engine is analysed. Web snippets (see Fig. 1) are used instead of complete documents because they precisely
Table 3 Table 4
Semantic disambiguation example (part I). Semantic disambiguation example (part II).
Table 6
Patterns used to retrieve potential subsumer concepts.
Table 7 Both plain text and semi-structured text analyses have the
Subset of extracted NEs from Barcelona Wikipedia same cost for NE filtering, semantic disambiguation and class
article.
selection. To rank NEs in the relevance filtering step, two queries
Wikilinks Correct? are required by the NE_Score function (4) to evaluate each NE.
Hence, a total of 2n queries are performed where n represents the
Acre No number of NEs. Class selection requires computing as many
Antoni Gaudı́ Yes
SOC_Scores (5) as candidates. Hence, being c the total number
Arc de Triomf Yes
Archeology Museum of Catalonia Yes
[Table fragment: Barcelona Cathedral — Yes; Barcelona Museum of Contemporary Art — Yes; Barcelona Pavilion — Yes; Casa Batlló — Yes]

A problem of these categories, however, is the fact that they are, in many cases, too complex or concrete to be modelled in an ontology. For example, the Sagrada Familia article is categorised as Antoni Gaudí buildings, Buildings and structures under construction, Churches in Barcelona, Visitor attractions in Barcelona, World Heritage Sites in Spain, Basilica churches in Spain, etc. These categories are too complex because they involve several concepts or even mix concepts with other NEs. To tackle this problem, we syntactically analyse each category to detect the basic concepts to which it refers. For example, in "Churches in Barcelona" the key concept is "Churches", and in "Buildings and structures under construction" there are two important concepts: "Buildings" and "Structures". To extract the main concepts of each category, a natural language parser has been used, and all the Noun Phrases have been extracted.

Another limitation of the Wikipedia categories is the fact that they do not always contain enough concepts to perform the matching between them and ontological classes, or those concepts are too concrete to be included in an ontology (and, hence, to enable a direct matching). Fortunately, as mentioned before, Wikipedia categories are included in higher categories, so we consider two upper-level categories to improve the recall of the ontology matching. As the Wikipedia categorisation does not define a strict taxonomy, it is not advisable to climb recursively in the category graph, because the initial meaning could be lost.

4.7. Computational cost

The runtime of the proposed method mainly depends on the number of queries, since the Web search engine response time (around 500 milliseconds) is several orders of magnitude higher than any other offline analysis performed by our method (e.g. natural language processing, pattern matching, ontological processing, etc., which take a few milliseconds). There are five different steps in which queries are performed: NE detection, NE filtering, subsumer concept extraction, semantic disambiguation and class selection. Since the response time of the Web search engine does not depend on the type and complexity of the query (but on the delay introduced by the Internet connection), the runtime for all query types is expected to be equivalent.

of class candidates, this results in 2c queries. In the semantic disambiguation stage, only one query is needed for each candidate to obtain the snippets relating each candidate with the analysed entity (i.e. c queries are performed in this step). Hence, the difference in computational cost between plain-text analyses and semi-structured ones lies in the steps of NE detection and extraction of subsumer concepts. Although NE detection is different when analysing plain texts and semi-structured resources, neither of them needs to perform queries, and its cost is considered constant for both approaches. Regarding the extraction of subsumer concepts, in the first approach six queries per NE are performed to discover subsumer concepts by means of Hearst Patterns (6n), whereas in the second approach no queries are needed because SCs are directly extracted from the categories of the tagged entities.

In summary, considering the above expressions, the number of queries needed to analyse plain text is 2n + 6n + 2c + 1c = 8n + 3c, whereas only 2n + 2c + 1c = 2n + 3c are needed when dealing with Wikipedia articles. In both cases the proposed method scales linearly with respect to the number of NEs, but the much lower coefficient for the Wikipedia articles shows how exploiting their structure improves the performance of the method.

5. Evaluation

This section presents the evaluation of the proposed method, which has been conducted in two directions. In the first part, given a specific evaluation scenario, a detailed picture is presented of the influence of the different parameters involved in the learning process: thresholds, input ontology and document types. In the second part, in order to give a general view of the expected results, a battery of tests for different scenarios (i.e. different input documents and ontological domains) is presented.

In all cases, the results have been evaluated according to their precision and recall against ideal results provided by a human expert. To do so, the expert has been requested, for each input document representing an analysed entity (ae) and ontology, to manually select which features found in the document directly or indirectly refer to concepts modelled in the ontology. As a result, a list of ideal features corresponding to concepts found in the ontology for each ae is obtained.

Then, recall is calculated by dividing the number of correct features discovered by our system (i.e. those also found in the ideal set of features) by the amount of ideal features stated by the
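The category-parsing step can be sketched with a simple heuristic, used here as a stand-in for the full natural language parser the method relies on: cut the category name at its first preposition and split the remaining head on coordinations. The function name, preposition list and splitting rules below are illustrative assumptions, not the actual implementation.

```python
import re

# Prepositions that typically introduce a modifier in a category name
# (illustrative list, not the parser's real grammar).
PREPOSITIONS = r"\b(?:in|of|under|from|by|at|for)\b"

def key_concepts(category: str) -> list[str]:
    """Approximate Noun Phrase extraction over a Wikipedia category name."""
    # Keep only the head before the first prepositional modifier.
    head = re.split(PREPOSITIONS, category, maxsplit=1)[0]
    # Coordinated heads ("X and Y", "X, Y") yield several concepts.
    parts = re.split(r",|\band\b", head)
    return [p.strip() for p in parts if p.strip()]

print(key_concepts("Churches in Barcelona"))                       # ['Churches']
print(key_concepts("Buildings and structures under construction"))  # ['Buildings', 'structures']
```

A real implementation would use a parser's noun-phrase chunks instead of regular expressions, but the two worked examples match the ones discussed in the text.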
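The bounded two-level climb described above can be sketched as a breadth-first walk over the category graph that stops at a fixed depth. The parent links below are a toy illustration, not real Wikipedia data, and the function is an assumed sketch rather than the paper's code.

```python
from collections import deque

# Illustrative parent-category links (not real Wikipedia data).
PARENTS = {
    "Churches in Barcelona": ["Churches in Catalonia", "Religious buildings in Barcelona"],
    "Churches in Catalonia": ["Churches in Spain"],
    "Religious buildings in Barcelona": ["Buildings in Barcelona"],
    "Churches in Spain": ["Churches in Europe"],
}

def upper_categories(category: str, max_levels: int = 2) -> set[str]:
    """Collect ancestor categories up to max_levels; climbing further
    risks drifting away from the initial meaning, since the Wikipedia
    categorisation is not a strict taxonomy."""
    found, frontier = set(), deque([(category, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_levels:
            continue  # do not expand beyond the allowed depth
        for parent in PARENTS.get(node, []):
            if parent not in found:
                found.add(parent)
                frontier.append((parent, depth + 1))
    return found

print(sorted(upper_categories("Churches in Barcelona")))
```

With max_levels=2, "Churches in Europe" (three levels up in the toy graph) is deliberately not reached.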
C. Vicient et al. / Engineering Applications of Artificial Intelligence 26 (2013) 1092–1106 1101
Precision is computed as the number of correct features (as above) divided by the total number of features retrieved by our system:

PRECISION = #Correct_features / #Retrieved_features    (7)

[Table: Wikipedia categories associated with each analysed named entity]

Antoni Gaudí — 1852 births, 1926 deaths, architects, roman catholic churches, art nouveau architects, catalan architects, spanish ecclesiastical architects, modernisme architects, 19th century architects, 20th century architects, organic architecture, people, reus…
Arc de Triomf — Triumphal arches, gates, moorish revival architecture, 1888 architecture, public art stubs, 1888 works, architecture, public art…
Archeology Museum of Catalonia — Museums, archaeology museums, Sants-Montjuïc…
Barcelona Cathedral — Cathedrals, churches, visitor attractions, basilica churches…
Barcelona Museum of Contemporary Art — Museums, art museums, galleries, modern art museums, modernist architecture, spots, richard meier buildings, el raval, modern art…
Casa Batlló — Visionary environments, antoni gaudí buildings, 1907 architecture, world heritage sites, spain, visitor attractions, eixample, passeig de gràcia, outsider art, 1907 works, 1900s architecture, edwardian architecture…

5.1. Influence of input parameters

In this first battery of tests, we focused on documents describing cities (Barcelona and Canterbury), and on ontologies related to touristic and geo-spatial features. First, the influence of the algorithm thresholds on the results has been studied. They make it possible to configure the system behaviour so that a high precision, a high recall or a balance between both of them can be achieved. Then, different types of input documents have been considered. Specifically, our method has been applied to Wikipedia articles, analysing them both as plain text and as Wiki-tagged documents. Finally, different input ontologies covering the same domain of knowledge have been used to evaluate the influence of the ontology design and coverage on the results. Two ontologies have been considered: a manually built Tourism ontology, which models concepts related to different kinds of touristic points of interest typically found in Wikipedia articles, and a general ontology modelling geo-spatial concepts (Space) retrieved from the SWOOGLE search engine. A summary of their structure is shown in Table 9.

Table 9
Description of used ontologies.

Ontology   Taxonomical depth   #Classes   Root classes
Tourism    5 levels            315        Administrative divisions, buildings, festivals, landmarks, museums, sports
Space      6 levels            188        Geographical features, geopolitical entities, places

5.1.1. Learning thresholds
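The precision and recall computations against the expert's ideal feature set can be sketched directly; the function and the toy feature names below are illustrative, not taken from the evaluation.

```python
def precision_recall(retrieved: set[str], ideal: set[str]) -> tuple[float, float]:
    """Precision (Eq. 7) and recall of retrieved features against the
    expert's ideal feature set."""
    correct = retrieved & ideal                 # features also in the ideal set
    precision = len(correct) / len(retrieved)   # #Correct / #Retrieved
    recall = len(correct) / len(ideal)          # #Correct / #Ideal
    return precision, recall

# Toy example: 3 of 4 retrieved features are correct; the expert listed 6.
p, r = precision_recall({"church", "museum", "park", "beach"},
                        {"church", "museum", "park", "cathedral", "tower", "bridge"})
print(p, r)  # 0.75 0.5
```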
provided by Wiki-tagged text, that is, the tagged entities and their associated Wiki-categories. On the contrary, recall is, in many situations, higher when analysing the input document as plain text. This is because, when using plain text, the whole textual content is analysed. Hence, there are more possibilities to extract named entities that could correspond to representative features than when relying solely on Wiki-tagged entities. On the other hand, since Wiki entities are tagged by humans, the precision of their resulting annotations is expected to be higher than when performing the automatic NE detection proposed by our method. Moreover, for the Wiki-tagged text, humanly edited categories are used to assist the annotation, which may also contribute to provide more accurate results. In any case, results are close enough to consider the plain text analysis (i.e. both the NE detection and the subsumer candidate learning) almost as reliable as human annotations, at least for feature discovery purposes. It is important to note, however, that the analysis of Wiki-tagged text, as discussed in Section 4.7, is considerably faster than its plain text counterpart, since the number of analyses and queries required to extract and annotate entities is greatly reduced.

5.1.3. Input ontologies

The third test compares the results obtained using different ontologies to assist the feature extraction process of a given entity. In this case, the plain-text and Wiki-tagged versions of the Barcelona article have been analysed using the input ontologies Space and Tourism. Again, results for different threshold values are shown in Fig. 4.

Results obtained for the plain-text version of the Barcelona article show a noticeably higher precision when using the Tourism ontology instead of the Space one. Since the former ontology was constructed according to concepts typically appearing in Wikipedia articles describing cities, it helps to provide more appropriate annotations: even though the extracted named entities are equal for both ontologies, the Tourism one is better suited to annotate those that are already tagged in the Wikipedia article. Moreover, since the Tourism ontology contains more concrete concepts than the more general-purpose Space one, the language ambiguity that may affect the subsumer extraction stage and the posterior annotation is minimised, resulting in better precision. The downside is the lowered recall obtained with the Tourism ontology, since solely those named entities that are already tagged in the text (and that, hence, correspond to concepts modelled in the Tourism ontology) are likely to be annotated. Differences between the two ontologies when using Wiki-tagged text are much reduced. In this case, the fact that NEs correspond solely to those already tagged in the Wikipedia article produces more similar annotation accuracies. Recall is, however, significantly higher for the Tourism ontology, since NE subsumers, which correspond to Wikipedia categories in this case, are more likely to match directly (as described in Section 4.4.2.1) to concepts of an ontology that is specially designed to cover topics and entities usually referred to in Wikipedia documents. This avoids the inherent ambiguity of the semantic matching process (described in Section 4.4.2.2) that is carried out when a direct matching is not found, which may hamper the results.

5.2. General evaluation

In this second part of the evaluation, general tests have been performed for several entities belonging to two well differentiated domains: touristic destinations and movies. For each domain, eight different entities/documents have been analysed, using their corresponding Wikipedia articles. For touristic destinations, the Tourism ontology introduced above has been used, whereas for movies, a Film ontology (http://deim.urv.cat/~itaka/CMS2/images/ontologies/film.owl) retrieved from SWOOGLE has been used. This latter one models concepts related to the whole production process of a film, including directors, producers, main actors/actresses starring in the film, etc. The structure of both ontologies is summarised in Table 10.

Table 11 depicts the precision and recall obtained for the different cities analysed in the tourism domain, whereas Table 12 shows the results of the evaluation of films.

Comparing both domains, we observe slightly better results for the film domain, with averaged precision and recall figures of 76.9% and 63.2%, respectively. This is related to the lower ambiguity inherent to film-related entities, in comparison with touristic ones. For example, in the film domain, most NEs refer to
person names (directors, producers, stars), which are hardly ambiguous due to their concreteness. Hence, annotation to ontological concepts is usually correct. On the contrary, a NE such as "Barcelona" may refer either to a city or to a football team, both annotations being almost equally representative; if both concepts are found in the ontology, the annotation is more likely to fail in this case.

Analysing individual entities, we observe better precisions and recalls for English/American entities (e.g., American films like First Blood or British cities like London), which usually have a larger amount of available Web resources and, hence, more robust statistics.

As a general conclusion, considering recall figures, we observe that our method was able to extract, in both cases, more than half of the features manually marked by the domain expert. Moreover, precisions were around 70–75% on average, which produces usable results for further data analysis. It is important to note that these results, which have been automatically obtained, suppose a considerable reduction of the human effort required to annotate textual resources.

6. Conclusions

One of the basic pillars of the Semantic Web concept is the idea of having explicit semantic information that can be used by intelligent agents in order to solve complex problems of Information Retrieval and Question Answering, and to semantically analyse and catalogue electronic contents. This fact has motivated the creation of new data mining techniques, such as semantic clustering (Batet et al., 2011), which are able to exploit the semantics of data. However, these methods assume that input contents were annotated in advance, so that relevant features can be detected and interpreted.

This work aimed to extract and pre-process data from different kinds of inputs (i.e., plain textual documents, Web resources or
Table 10
Description of used ontologies.

Ontology   Taxonomic depth   # Classes   Main classes

Table 11
Evaluation results from Wikipedia descriptions of different cities using Tourism ontology.

Articles   Precision (%)   Recall (%)

Table 12
Evaluation results from Wikipedia descriptions of different films using Film ontology.

Articles               Precision (%)   Recall (%)
Batman Begins          75              52
First Blood (Rambo)    87.5            77.7
Inception              81.2            72.2
Matrix                 87.5            57
Oldboy                 57.1            66.6
Pulp Fiction           78              61.1
Requiem for a dream    60              50
The Fountain           88.8            69.2
Average                76.9            63.2

the Web, since the final matching is performed at a conceptual level (Cimiano et al., 2004). Moreover, statistical assessors have been tuned to better assess the suitability of extraction and annotation at each stage, minimising the inherent language ambiguity of Web queries. Our method has also been designed in a general way, so that it can be applied to different kinds of inputs and can exploit semi-structured information (like Wikipedia annotation) to improve the performance. Finally, being unsupervised and domain-independent, the implemented methodology can be applied in different domains and without human supervision.

The evaluation performed for different entities and domains produced usable results that, by carefully tuning the algorithm thresholds, reach the high precisions needed for applying them to data mining algorithms. The evaluation has also shown the benefits of using semi-structured inputs (Wikipedia articles), producing quality results with a lower computational cost.

As further work, it is a priority to study the quality of the extracted features when using them in semantic data mining algorithms (Batet et al., 2011; Martínez et al., 2012). Another important line of work is to study how to reduce the number of queries (e.g., using only a subset of Hearst Patterns) to Web search engines, as they are the slowest part of the algorithm and introduce a dependency on external resources. Finally, the annotation process could be extended to take advantage of ontology properties (e.g. data or object properties) instead of solely focusing on taxonomic knowledge. In this manner, entity property–value tuples (e.g. the population or area of a city) could also be extracted from text. Thanks to the generic design of our method, extraction patterns other than taxonomic ones can be used for that purpose, detecting non-taxonomic relations (Sánchez and Moreno, 2008a; Sánchez et al., 2012) such as part-of (Sánchez, 2010). Statistical assessors and Web queries should also be modified accordingly.

Acknowledgements

This work was partially supported by the Universitat Rovira i Virgili (predoctoral grant of C. Vicient, 2010BRDI-06-06) and the Spanish Government through the project DAMASK-Data Mining Algorithms with Semantic Knowledge (TIN2009-11005).

References

Abril, D., Navarro-Arribas, G., Torra, V., 2011. On the declassification of confidential documents. In: Torra, V., Narakawa, Y., Yin, J., Long, J. (Eds.), Modeling Decisions for Artificial Intelligence. Springer, Berlin/Heidelberg, pp. 235–246.
Ahmad, K., Tariq, M., Vrusias, B., Handy, C., 2003. Corpus-based thesaurus construction for image retrieval in specialist domains. In: Sebastiani, F. (Ed.), Advances in Information Retrieval. Springer, Berlin/Heidelberg, pp. 76.
Alfonseca, E., Manandhar, S., 2002. Improving an ontology refinement method with hyponymy patterns. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, LREC 2002, Las Palmas, Spain.
Batet, M., Valls, A., Gibert, K., 2011. Semantic clustering based on ontologies: an application to the study of visitors in a natural reserve. In: Proceedings of the 3rd International Conference on Agents and Artificial Intelligence, ICAART 2011, pp. 283–289.
Batet, M., Valls, A., Gibert, K., Sánchez, D., 2010. Semantic clustering using multiple ontologies. In: Proceedings of the 13th International Conference of the Catalan Association for Artificial Intelligence, pp. 207–216.
Berland, M., Charniak, E., 1999. Finding parts in very large corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, College Park, Maryland, pp. 57–64.
Berners-Lee, T., Hendler, J., Lassila, O., 2001. The Semantic Web—a new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Sci. Am. 284, 34–43.
Bisson, G., Nédellec, C., Cañamero, D., 2000. Designing clustering methods for ontology building: the Mo'K workbench. In: Staab, S., Maedche, A., Nédellec, C., Wiemer-Hastings, P. (Eds.), Proceedings of the First Workshop on Ontology Learning, OL'2000, Berlin, Germany, August 25, 2000.
Brill, E., 2003. Processing natural language without natural language processing. In: Gelbukh, A.F. (Ed.), Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2003. Springer Berlin/Heidelberg, Mexico City, Mexico, pp. 360–369.
Buitelaar, P., Cimiano, P., Frank, A., Hartung, M., Racioppa, S., 2008. Ontology-based information extraction and integration from heterogeneous data sources. Int. J. Hum.–Comput. Stud. 66, 759–788.
Buitelaar, P., Olejnik, D., Sintek, M., 2004. A Protégé plug-in for ontology extraction from text based on linguistic analysis. In: Proceedings of the 1st European Semantic Web Symposium (ESWS), pp. 31–44.
Cafarella, M., Downey, D., Soderland, S., Etzioni, O., 2005. KnowItNow: fast, scalable information extraction from the web. In: Proceedings of the Human Language Technology Conference/Conference on Empirical Methods in Natural Language Processing, HLT/EMNLP 2005. Association for Computational Linguistics, Vancouver, Canada, pp. 563–570.
Califf, M.E., Mooney, R.J., 2003. Bottom-up relational learning of pattern matching rules for information extraction. J. Mach. Learn. Res. 4, 177–210.
Caraballo, S.A., 1999. Automatic construction of a hypernym-labeled noun hierarchy from text. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, College Park, Maryland, pp. 120–126.
Church, K., Gale, W., Hanks, P., Hindle, D., 1991. Using statistics in lexical analysis. In: Zernik, U. (Ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. Lawrence Erlbaum Associates, Hillsdale, New Jersey, USA, pp. 115–164.
Cilibrasi, R.L., Vitányi, P.M.B., 2006. The Google similarity distance. IEEE Trans. Knowl. Data Eng. 19, 370–383.
Cimiano, P., Handschuh, S., Staab, S., 2004. Towards the self-annotating web. In: Proceedings of the 13th International Conference on World Wide Web, WWW 2004. ACM, New York, USA, pp. 462–471.
Cimiano, P., Ladwig, G., Staab, S., 2005. Gimme' the context: context-driven automatic semantic annotation with C-PANKOW. In: Ellis, A., Hagino, T. (Eds.), Proceedings of the 14th International Conference on World Wide Web. ACM, Chiba, Japan, pp. 462–471.
Ciravegna, F., Dingli, A., Guthrie, D., Wilks, Y., 2003. Integrating information to bootstrap information extraction from Web sites. In: Proceedings of the Workshop on Intelligent Information Integration, IJCAI'03.
Ciravegna, F., Dingli, A., Petrelli, D., Wilks, Y., 2002. User-system cooperation in document annotation based on information extraction. In: Gómez-Pérez, A., Benjamins, R. (Eds.), Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management. Ontologies and the Semantic Web. Springer Berlin/Heidelberg, Sigüenza, Spain, pp. 122–137.
Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V., 2002. GATE: a framework and graphical development environment for robust NLP tools and applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, ACL 2002, Philadelphia, US.
Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., McCurley, K.S., Rajagopalan, S., Tomkins, A., 2003. A case for automated large-scale semantic annotation. Web Semantics: Sci. Serv. Agents World Wide Web 1, 115–132.
Downey, D., Broadhead, M., Etzioni, O., 2007. Locating complex named entities in Web text. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI 2007. AAAI, Hyderabad, India, pp. 2733–2739.
Etzioni, O., Cafarella, M., Downey, D., Kok, S., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D.S., Yates, A., 2004. Web-scale information extraction in KnowItAll (preliminary results). In: Proceedings of the 13th International Conference on World Wide Web. ACM, New York, NY, USA, pp. 100–110.
Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D.S., Yates, A., 2005. Unsupervised named-entity extraction from the web: an experimental study. Artif. Intell. 165, 91–134.
Feilmayr, C., Parzer, S., Pröll, B., 2009. Ontology-based information extraction from tourism websites. J. Inf. Technol. 11, 183–196.
Fellbaum, C., 1998. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts.
Fensel, D., Bussler, C., Ding, Y., Kartseva, V., Klein, M., Korotkiy, M., Omelayenko, B., Siebes, R., 2002. Semantic web application areas. In: Proceedings of the 7th
International Workshop on Applications of Natural Language to Information Systems (NLDB).
Fleischman, M., Hovy, E., 2002. Fine grained classification of named entities. In: Proceedings of the 19th International Conference on Computational Linguistics—Vol. 1. Association for Computational Linguistics, Taipei, Taiwan, pp. 1–7.
Guarino, N., 1998. Formal ontology in information systems. In: Guarino, N. (Ed.), Proceedings of the 1st International Conference on Formal Ontology in Information Systems, FOIS 1998. IOS Press, Trento, Italy, pp. 3–15.
Handschuh, S., Staab, S., Studer, R., 2003. Leveraging metadata creation for the Semantic Web with CREAM. In: Günter, A., Kruse, R., Neumann, B. (Eds.), Proceedings of the 26th Annual German Conference on AI, KI 2003. Springer Berlin/Heidelberg, Hamburg, Germany, pp. 19–33.
He, Z., Xu, X., Deng, S., 2008. k-ANMI: a mutual information based clustering algorithm for categorical data. Inf. Fusion 9, 223–233.
Hearst, M.A., 1992. Automatic acquisition of hyponyms from large text corpora. In: Kay, M. (Ed.), Proceedings of the 14th Conference on Computational Linguistics—Vol. 2, COLING 92. Morgan Kaufmann Publishers, Nantes, France, pp. 539–545.
Hotho, A., Maedche, A., Staab, S., 2002. Ontology-based text document clustering. Künstliche Intelligenz, 48–54.
Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D., 2004. Semantic annotation, indexing, and retrieval. J. Web Semantics 2, 49–79.
Kiyavitskaya, N., Zeni, N., Cordy, J.R., Mich, L., Mylopoulos, J., 2005. Semi-automatic semantic annotations for Web documents. In: Bouquet, P., Tummarello, G. (Eds.), Proceedings of the 2nd Italian Semantic Web Workshop on Semantic Web Applications and Perspectives, SWAP 2005. CEUR-WS, Trento, Italy, pp. 210–225.
Kiyavitskaya, N., Zeni, N., Cordy, J.R., Mich, L., Mylopoulos, J., 2009. Cerno: light-weight tool support for semantic annotation of textual documents. Data Knowl. Eng. 68, 1470–1492.
Koivunen, M.-R., 2005. Annotea and Semantic Web supported collaboration (invited talk). In: Dzbor, M., Takeda, H., Vargas-Vera, M. (Eds.), Proceedings of the Workshop on End User Aspects of the Semantic Web at the 2nd Annual European Semantic Web Conference, UserSWeb 05, CEUR Workshop Proceedings, Heraklion, Crete, pp. 5–17.
Lamparter, S., Ehrig, M., Tempich, C., 2004. Knowledge extraction from classification schemas. In: Proceedings of the International Conference on Ontologies, Databases and Applications of SEmantics (ODBASE). Springer, pp. 618–636.
Lin, D., 1998. Automatic retrieval and clustering of similar words. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, COLING-ACL 1998. ACL/Morgan Kaufmann, Montreal, Quebec, Canada, pp. 768–774.
Maedche, A., Neumann, G., Staab, S., 2003. Bootstrapping an ontology-based information extraction system. In: Szczepaniak, P.S., Segovia, J., Kacprzyk, J., Zadeh, L.A. (Eds.), Intelligent Exploration of the Web. Physica-Verlag, pp. 345–359.
Martínez, S., Sánchez, D., Valls, A., 2012. Semantic adaptive microaggregation of categorical microdata. Comput. Secur. 31, 653–672.
Michelson, M., Knoblock, C.A., 2007. An automatic approach to semantic annotation of unstructured, ungrammatical sources: a first look. In: Knoblock, C.A., Lopresti, D., Roy, S., Subramaniam, L.V. (Eds.), Proceedings of the Workshop on Analytics for Noisy Unstructured Text Data, IJCAI-2007, Hyderabad, India, pp. 123–130.
Mikheev, A., Finch, S., 1997. A workbench for finding structure in texts. In: Proceedings of the Fifth Conference on Applied Natural Language Processing. Association for Computational Linguistics, Washington, DC, pp. 372–379.
Morbach, J., Yang, A., Marquardt, W., 2007. OntoCAPE—a large-scale ontology for chemical process engineering. Eng. Appl. Artif. Intell. 20, 147–161.
Niekrasz, J., Gruenstein, A., 2006. NOMOS: a Semantic Web software framework for annotation of multimodal corpora. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC 06, Genoa, Italy, pp. 21–27.
Pasca, M., 2004. Acquisition of categorized named entities for web search. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management. ACM, Washington, D.C., USA, pp. 137–145.
Porter, M.F., 1997. An algorithm for suffix stripping. In: Sparck Jones, K., Willett, P. (Eds.), Readings in Information Retrieval. Morgan Kaufmann Publishers Inc., pp. 313–316.
Roberts, A., Gaizauskas, R., Hepple, M., Davis, N., Demetriou, G., Guo, Y., Kola, J., Roberts, I., Setzer, A., Tapuria, A., Wheeldin, B., 2007. The CLEF corpus: semantic annotation of clinical text. In: Proceedings of the Annual Symposium AMIA 2007. American Medical Informatics Association, Chicago, USA, pp. 625–629.
Rosso, P., Montes, M., Buscaldi, D., Pancardo, A., Villaseñor, A., 2005. Two Web-based approaches for noun sense disambiguation. In: International Conference on Computational Linguistics and Intelligent Text Processing, CICLing-2005. Springer Verlag, LNCS (3406), Mexico D.F., pp. 261–273.
Sánchez, D., 2010. A methodology to learn ontological attributes from the Web. Data Knowl. Eng. 69, 573–597.
Sánchez, D., Batet, M., Valls, A., Gibert, K., 2009. Ontology-driven web-based semantic similarity. J. Intell. Inf. Syst. 35, 383–413.
Sánchez, D., Isern, D., 2011. Automatic extraction of acronym definitions from the Web. Appl. Intell. 34, 311–327.
Sánchez, D., Isern, D., Millán, M., 2010. Content annotation for the Semantic Web: an automatic Web-based approach. Knowl. Inf. Syst. 27, 393–418.
Sánchez, D., Moreno, A., 2008a. Learning non-taxonomic relationships from web documents for domain ontology construction. Data Knowl. Eng. 63, 600–623.
Sánchez, D., Moreno, A., 2008b. Pattern-based automatic taxonomy learning from the Web. AI Commun. 21, 27–48.
Sánchez, D., Moreno, A., Del Vasto-Terrientes, L., 2012. Learning relation axioms from text: an automatic Web-based approach. Expert Syst. Appl. 39, 5792–5805.
Sanderson, M., Croft, B., 1999. Deriving concept hierarchies from text. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, Berkeley, California, United States, pp. 206–213.
Sapkota, K., Aldea, A., Duce, D.A., Younas, M., Bañares-Alcántara, R., 2011. Semantic-ART: a framework for semantic annotation of regulatory text. In: Proceedings of the Fourth Workshop on Exploiting Semantic Annotations in Information Retrieval. ACM, Glasgow, Scotland, UK, pp. 23–24.
Schroeter, R., Hunter, J., Kosovic, D., 2003. Vannotea—a collaborative video indexing, annotation and discussion system for broadband networks. In: Handschuh, S., Koivunen, M.-R., Dieng-Kuntz, R., Staab, S. (Eds.), Proceedings of the Knowledge Markup and Semantic Annotation Workshop, K-CAP 03. ACM, Sanibel, Florida, pp. 9–26.
Setchi, R., Silva, J.D.V., Ríos, S.A., Cao, C., 2011. Special issue on semantic information and engineering systems. Eng. Appl. Artif. Intell. 24, 1303–1304.
Spackman, K.A., 2004. SNOMED CT milestones: endorsements are added to already-impressive standards credentials. Healthcare Informatics: The Business Magazine for Information and Communication Systems 21, 54–56.
Staab, S., Maedche, A., 2000. Ontology engineering beyond the modeling of concepts and relations. In: Benjamins, R., Gómez-Pérez, A., Guarino, N. (Eds.), Proceedings of the ECAI-2000 Workshop on Ontologies and Problem-Solving Methods, Berlin, Germany.
Stevenson, M., Gaizauskas, R., 2000. Using corpus-derived name lists for named entity recognition. In: Proceedings of the Sixth Conference on Applied Natural Language Processing. Association for Computational Linguistics, Seattle, Washington, pp. 290–295.
Turney, P.D., 2001. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. In: De Raedt, L., Flach, P. (Eds.), Proceedings of the 12th European Conference on Machine Learning, ECML 2001. Springer-Verlag, Freiburg, Germany, pp. 491–502.
Velardi, P., Navigli, R., Cucchiarelli, A., Neri, F., 2005. Evaluation of OntoLearn, a methodology for automatic learning of domain ontologies. In: Buitelaar, P., Cimiano, P., Magnini, B. (Eds.), Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, pp. 92–105.
Velásquez, J.D., Dujovne, L.E., L'Huillier, G., 2011. Extracting significant Website Key Objects: a Semantic Web mining approach. Eng. Appl. Artif. Intell. 24, 1532–1541.
Zavitsanos, E., Tsatsaronis, G., Varlamis, I., Paliouras, G., 2010. Scalable semantic annotation of text using lexical and Web resources. In: Proceedings of the 6th Hellenic Conference on Artificial Intelligence, pp. 287–296.