Professional Documents
Culture Documents
Effective Semantic Search Using Thematic Similarity: Journal of King Saud University - Computer and Information Sciences
Effective Semantic Search Using Thematic Similarity: Journal of King Saud University - Computer and Information Sciences
National University of Sciences and Technology (NUST), School of Electrical, Engineering & Computer Science H-12,
Islamabad, Pakistan
KEYWORDS Abstract Most existing semantic search systems expand search keywords using domain ontology
Semantic search; to deal with semantic heterogeneity. They focus on matching the semantic similarity of individual
Thematic similarity; keywords in a multiple-keywords query; however, they ignore the semantic relationships that exist
Semantic heterogeneity; among the keywords of the query themselves. The systems return less relevant answers for these
RDF triples; types of queries. More relevant documents for a multiple-keywords query can be retrieved if the sys-
Information retrieval tems know the relationships that exist among multiple keywords in the query. The proposed search
methodology matches patterns of keywords for capturing the context of keywords, and then the rel-
evant documents are ranked according to their pattern relevance score. A prototype system has
been implemented to validate the proposed search methodology. The system has been compared
with existing systems for evaluation. The results demonstrate improvement in precision and recall
of search.
ª 2013 King Saud University. Production and hosting by Elsevier B.V. All rights reserved.
1319-1578 ª 2013 King Saud University. Production and hosting by Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.jksuci.2013.10.006
162 S. Khan, J. Mustafa
axioms are also defined to define new concepts that can be 2002) for evaluation, and the results demonstrate improvement
introduced in ontology and to apply logical inference (Ding in precision and recall of semantic searching.
et al., 2004). Semantic similarity refers to semantic closeness, The remainder of this paper is structured as follows: Sec-
proximity, or nearness. It indicates similarity between different tion 2 reviews the current approaches to semantic search tech-
concepts and their relationships. There are three types of niques and their proposed systems. Section 3 explains our
semantic similarity: (a) surface, (b) structure, and (c) thematic proposed searching methodology in detail. Section 4 illustrates
similarity (Poole et al., 1995; Zhong et al., 2002; Zhu et al., a walk-through example for demonstrating the proposed
2002; Montes-Y-Gomez et al., 2000). Surface and structure methodology. Section 5 discusses the evaluation of the proto-
similarity focus individually on concepts and relationships, type system, and Section 6 concludes the paper.
respectively, whereas thematic similarity considers the pattern
(i.e., combination) of concepts and the relationship that exists 2. Related work
among them. The term ‘‘keyword’’ stands for either a concept
or relationship of domain ontology alternatively in this paper. Several methods for determining semantic similarity between
Existing typical semantic search systems (Bonino et al., keywords, i.e., either concepts or relationships, have been pro-
2004; Fang et al., 2005; Varelas et al., 2005) expand individual posed in the literature. These methods are classified into three
keywords through domain ontology to deal with different main categories (Varelas et al., 2005). We discuss first the
semantic heterogeneity challenges such as synonymy. For methods in this section and then describe existing systems that
example, a search for the concept writer can be expanded have applied the methods.
through domain ontology to the keywords writer and author.
The search, looking only for a keyword writer may have fewer 2.1. Semantic similarity methods
results than the search looking for writer and author. The exist-
ing systems focus on matching the semantic similarity of indi-
2.1.1. Edge counting methods
vidual keywords (i.e., they apply either surface or structure
similarity) and apply Boolean operators if multiple keywords These methods measure semantic similarity between two key-
are given in a query. They ignore the semantic relationships words as a function of length of the path (i.e., distance) linking
that exist among the multiple keywords themselves. keywords and their position in their respective hierarchy (Rodri-
If a user inputs a multiple keywords query, for example, guez and Egenhofer, 2003; Varelas et al., 2005). This similarity
‘‘pipe in computer science domain,’’ conventional IR systems calculation simply relies on counting the number of edges sepa-
retrieve thousands of documents where pipe might be used rating two keywords by an ‘Is-A’ relation in ontology (Rada
as (a) a tube of any kind, (b) a device for smoking, (c) a musi- et al., 1989). This technique assumes that the semantic difference
cal instrument or (d) a portion of memory that can be used by between upper-level keywords in a hierarchy is greater than the
one process to pass information to another process in com- semantic difference between lower-level keywords. In other
puter. Sometimes none of search results may be relevant to a words, general concepts are less similar than two specialized
user requirement. The systems return less relevant answers concepts. Because the specialized concepts may appear more
for multiple keywords queries although they expand individual similar than general ones, depth is taken into account by calcu-
keywords in a query with different semantic relationships. lating either the maximum depth in the hierarchy (Leacock
More relevant documents for a multiple keywords query et al., 1998) or the depth of the most specific concept, while sub-
can be retrieved if systems know the meanings and relation- suming the two compared concepts/relationships (Hirst et al.,
ships that exist among the multiple keywords themselves in 1998; Wu et al., 1994). Semantic similarity between concepts
the query. By keywords pattern, we mean a combination of is calculated with reference to its closest common parent (ccp).
at least two concepts and their relationship that exists in the
domain ontology. A pattern can represent the context/theme, 2.1.2. Information content methods
that is, circumstances in which something happens or should These methods measure the difference in information of two
be considered. Therefore, the existing systems (Bonino et al., concepts as a function of their probability of occurrence in a
2004; Fang et al., 2005; Varelas et al., 2005; Rinaldi, 2009; corpus. They are also known as term frequency (tf)/inverse
Alipanah et al., 2010; Yang et al., 2011) cannot resolve the document frequency (idf). In these methods, two concepts
semantic heterogeneity issue of polysemy because it requires are similar to an extent to which they share information in
identification of the context of keywords to comprehend their common. Therefore, the information content value for each
actual semantics. Moreover, the existing systems also ignore concept in the hierarchy is calculated using its frequency in
other important relationships, such as semantic neighborhoods the corpus (Resnik, 1999).
(Rodriguez and Egenhofer, 2003), that can also contribute to
useful search results. 2.1.3. Feature-based methods
To overcome the limitations of existing semantic searching These methods measure similarity between two concepts either as
systems, we need to represent the context of keywords through a function of their properties or characteristics. These methods
keyword patterns for effective searching using thematic similarity assume two concepts are similar if they have more common char-
(Khan et al., 2006; Poole et al., 1995). The proposed system con- acteristics than non-common characteristics (Tversky, 1977).
centrates on searching keyword patterns and not on the individ-
ual keywords. We employed Resource Description Framework
(RDF) triples to describe the keyword patterns of document 2.2. Existing systems
metadata and search queries. We have developed a prototype sys-
tem for the validation of the proposed solution. The system was DOSE (Bonino et al., 2004) uses tf/idf based on a Vector Space
compared with existing systems (Fang et al., 2005; Shah et al., Model (VSM) for keywords. This system extended the tradi-
Effective semantic search using thematic similarity 163
tional VSM by including taxonomic relationships (i.e., special- words are processed in the context of the information in which
ization and generalization) of keywords for query expansion. they are retrieved. Their query is a list of terms to retrieve and
They expand a query vector through relationships and com- a domain of interest. Then, they apply a ranking to the re-
pute semantic similarity values between a document vector trieved results.
and the expanded query vector using the cosine of the angle be- In Alipanah et al. (2010), the authors proposed a weighting
tween them. mechanism to find the expansion of concepts from ontologies.
In Fang et al. (2005), the authors employ tf/idf based on a They determined expanded terms/concepts in each ontology
traditional Vector Space Model (VSM) for keywords. They ex- (i.e., document) on the basis of semantic similarity, density
tend the model by considering semantic relationships (i.e., di- and betweenness to a user query. Then, they use the idea of
rect, strong, normal, weak and irrelevant) of keywords for co-occurrence across ontologies. Similar concepts are deter-
query expansion. They define weights for these relationships mined by their name and structural similarity in each ontology.
(i.e., the weights of direct, strong, normal, weak and irrelevant At the end of expansion, the system generates a set of terms
are 1.0, 0.7, 0.4, 0.2 and 0.0, respectively). A user query is ex- along with weights and ranks them according to weights. In
panded through these relationships. The tf/idf values of query Yang et al. (2011), the authors retrieve textual information
keywords in a document are adjusted by multiplying weights via word meanings rather than lexical forms. The authors ap-
of the semantic relationships that exist between the query ply WorldNet for word sense disambiguation, and then, this
and document keywords. Then, documents are ranked accord- semantic information is annotated in the documents in a
ing to the relevance score. RDF used for semantic searching. To the best of our
In Varelas et al. (2005), the authors use a Semantic Similar- knowledge, none of these techniques measure semantic
ity Retrieval Model (SSRM) for keywords. They extend the relationships among keywords patterns (i.e., RDF triples)
model by including semantic relationships (i.e., synonyms, and then produce a ranked result to facilitate meeting the
hyponyms and hypernyms) of keywords for query expansion user’s query request.
and assign weights to relationships depending on the position
of keywords in taxonomy. They expand a user query to a spe- 3. Proposed searching methodology
cific threshold weight. The tf/idf values of query keywords in a
document are adjusted by multiplying the weights of the In this section, we discuss our searching methodology, which
semantic relationships that exist between the query and docu- performs context-driven semantic search by matching a user
ment keywords. They rank documents according to the gained query, given in RDF triples (i.e., user-given patterns), with
weights. RDF triples of a digital document.
The system presented in Shah et al. (2002) computes the fre-
quency of the RDF triples, instead of keywords, in a docu- 3.1. Query expansion
ment. The system calculates the similarity on the basis of the
RDF triples’ frequency in the document and employs only
inference rules for semantic searching. The system does not ex- In the proposed system, we take into account the following
pand user queries through semantic relationships. Therefore, semantic relationships: (a) synonyms, (b) semantic neighbor-
this system cannot resolve the semantic heterogeneity issue hood and (c) hyponyms (i.e., Is-A relation-ship). Hyponyms
of polysemy where the context of keywords is needed to com- are handled in the RDF by the built-in property of subclass
prehend the accurate semantics of keywords required. Some (i.e., rdfs:subClassOf). We have designed two additional prop-
systems (Khan et al., 2006; Zhong et al., 2002; Zhu et al., erties: synOf and neighborOf in the RDF for handling the
2002) use the edge counting method (e.g., distance-based meth- remaining two relationships. A user query can be expanded
od) in conceptual graph (CG) for semantic search. The basic by deducing inferences through rules from existing RDF tri-
intuition in conceptual graph (CG) matching is to calculate ples. In the following subsections, we describe the properties
semantic matching by comparing arcs. The comparison of and rules designed in this research.
CG arcs concentrates on the thematic behavior of concepts
and relationships (i.e., keywords), which is a representative 3.1.1. SynOf property
of a given context. We have borrowed their notion of semantic The synOf property states that different individuals can be
matching for RDF triples and extended their searching tech- same (i.e., equivalence relationship). This property may be
niques by computing tf/idf of these triples, i.e., ranked-result. used to create a number of different names that refer to the
In Khan et al. (2004), the authors developed a concept- same individual. It can also refer to acronyms and lexical vari-
based model that uses domain-dependent ontologies for ants. Fig. 1 shows the RDF graph of the synOf property. The
responding to user requests. They apply an automatic query following statement represents the RDF of synOf property in
expansion technique on user queries that are expressed in nat- N3 notation:
ural language. This automatic expansion technique selects only
relevant and controlled expansion. The AKTiveRank (Alani uri : synOfardfs : Property
and Brewster, 2005) system ranks ontologies using graph anal-
ysis measures. The authors also apply Swoogle1 to measure the 3.1.2. neighborOf property
semantic relatedness before applying their proposed technique. The neighborOf property is used to explore the semantic neigh-
Both techniques cannot handle polysemous heterogeneity. For borhood of a concept or relationship. The semantic neighbor-
the Dynamic Semantic Engine (DySE) (Rinaldi, 2009), the hood n of a concept c is the set Ci of concepts whose distance d
authors designed a context-driven approach in which key- to the concept c is an integer number r greater than zero, which
is called the radius of the semantic neighborhood (Rodriguez
1
http://swoogle.umbc.edu/. and Egenhofer, 2003).
164 S. Khan, J. Mustafa
3.2.2. Relationship similarity Let freqij be the frequency of the triple ti in the document dj.
Likewise, the similarity between two relationships is defined as Then, the normalized frequency tfij of the triple ti in dj is the
follows: ratio of the term frequency of ti to the maximum term fre-
quency of any triple in the document dj.idfi is the inverse doc-
simr ðr1 ; r2 Þ ¼ 1 dr ðr1 ; r2 Þ ð6Þ ument frequency for ti given by:
The distance between two relations is also calculated by their
N
respective positions in the relation hierarchy. The only differ- idfi ¼ log ð9Þ
ni
ence is that a relation hierarchy was constructed manually by
us. There are some exceptions that if relations r1 and r2 are The final tf:idf weight of ith triple to jth document is calculated
synonyms or acronyms of each other, then the distance will as follows:
be set to zero and consequently, the similarity between these Wij ¼ tfij idfi ð10Þ
two relations will be one.
The ranking algorithm combines two factors: (i) the RDF tri-
3.2.3. RDF triples similarity ple score using Eq. (7) and (ii) its relevance to a document indi-
cated by Wij using Eq. (10). The documents relevance, R(d),
A user query and data source RDF triples are matched to find
can be calculated as follows:
their similarity. The final triple similarity matching formula
based on combining Eq. (5) (for concepts similarity) and Eq. X
n
RðdÞ ¼ simðqi ; si Þ ðWij þ kÞ ð11Þ
(6) (for relations similarity) is as follows:
i¼0
simsub ðqisub ; sjsub Þ where d represents a document, n is the total number of triples
Y
n Y
m
simðq; sÞ ¼ simr ðqir ; sjr Þ ð7Þ in a document and k is used to normalize the effect of partial
i¼0 j¼0
simobj ðqiobj ; sjobj Þ and imprecise RDF triples. We used 1 (one) as the default va-
lue for k. The documents are ranked according to their rele-
where qsub, qobj and ssub, sobj are matched concepts, whereas qr vance score and returned to a user.
and sr are matched relations of the RDF triple query q, and the
RDF triple source s, respectively. sim (q, s) is the overall sim- 4. Walk-through example
ilarity between the query q and source s RDF triples. Here, i
and j represent ith and jth subject or object or relation of the
query and source RDF triples, respectively. To demonstrate our proposed methodology, we consider an
example. We employed RDF to represent keyword patterns
of data sources and queries and represented their RDF triples
3.3. Documents ranking (R(d))
with graph notations in this paper. Figs. 3–5 show the metada-
ta triples of documents 1, 2 and 3 (i.e., s1, s2 and s3), respec-
Identified relevant documents are ranked according to their tively. Table 2 shows the frequency of triples in the given
relevance to a user query. The relevance of a document is com- documents. Fig. 6 illustrates the concepts hierarchy and their
puted by extending a tf.idf weighting scheme (Zhong et al., milestone value with respect to their position in the taxonomy.
2002) for triples instead of keywords. Let N be the total num- This modified segment is adopted from the WordNet2
ber of documents and ni the number of documents in which the ontology.
triple ti appears.
freqij
tfij ¼ ð8Þ
maxðfreqij Þ 2
i http://wordnet.princeton.edu/.
166 S. Khan, J. Mustafa
Figure 9 F-measure of RDF based VSM, IR framework and the proposed framework.
capture the context of keywords. The thematic similarity ap- ACM CIKM International Conference on Information and
proach has been used for information retrieval to capture the Knowledge Management. Washington, DC, USA, pp. 652–659.
context of concepts. A user submits an RDF triples query. Fang, W.D., Zhang, L., Wang, Y.X., Dong, S.B., 2005. Toward a
The query is expanded through synonyms and a semantic semantic search engine based on ontologies. In: Proceedings of
2005 International Conference on Machine Learning and Cyber-
neighborhood using distance based-approaches with the help
netics, vol. 3. IEEE, pp. 1913–1918.
of domain ontology. The relevance between the RDF triples Hirst, G., St-Onge, D., 1998. WordNet: an electronic lexical database.
of a document and a user query is measured, and the relevant In: Lexical chains as representations of context for the detection
documents are identified. The documents are ranked according and correction of malapropisms. The MIT Press, Cambridge, MA,
to their importance. The contribution of this research work is pp. 305–332.
to combine existing measures and design a novel semantic Khan, L., McLeod, D., Hovy, E., 2004. Retrieval effectiveness of an
search methodology for thematic similarity to handle semantic ontology-based model for information selection. The VLDB
heterogeneity, particularly polysemy. Journal 13 (1), 71–85.
The results of the experiments performed on the system Khan, S., Marvon, F., 2006. Identifying relevant sources in query
indicate improvements in precision and recall and encourage reformulation. In: Proceedings of the 8th International Conference
on Information Integration and Web-based Applications &
new efforts in this direction. The proposed search methodol-
Services (iiWAS). Yogyakarta, Indonesia, pp. 99–130.
ogy can be easily extended using recent measures for semantic Kobayashi, M., Takeda, K., 2000. Information retrieval on the web.
relatedness and ranking methods. Moreover, we intend to ACM Computing Surveys (CSUR) 32 (2), 144–173.
automate the process of generating RDF triples from docu- Leacock, C., Miller, G.A., Chodorow, M., 1998. Using corpus
ments, as we generated them manually in this research to eval- statistics and wordnet relations for sense identification. Computa-
uate the prototype system. Lastly, we urge augmentation of the tional Linguistics 24 (1), 147–165.
system for other heterogeneities i.e., incomplete and incompat- Lee, C.Y., Soo, V.W., 2005. Ontology-based information retrieval and
ible RDF triples. In the current system, we do not consider extraction. In: Proceedings of 3rd International Conference on
partially (i.e., incomplete) matched RDF triples that may con- Information Technology: Research and Education (ITRE). IEEE,
tain important information. pp. 265–269.
Montes-Y-Gomez, M., Lopez-Lopez, A., Gelbukh, A.F., 2000. Infor-
mation retrieval with conceptual graph matching. In: Proceedings
References of 11th International Conference on Database and Expert Systems
Applications (DEXA). London, UK, pp. 312–321.
Alani, H., Brewster, C., 2005. Ontology ranking based on the analysis Poole, J., Campbell, J.A., 1995. A novel algorithm for matching
of concept structures. In: Proceedings of the 3rd International conceptual and related graphs. In: Proceedings of the 3rd Interna-
Conference on Knowledge Capture. ACM, pp. 51–58. tional Conference on Conceptual Structures: Applications, Imple-
Alipanah, N., Parveen, P., Menezes, S., Khan, L., Seida, S.B., mentation and Theory, vol. 954, Lecture Notes in Computer
Thuraisingham, B., 2010. Ontology-driven query expansion meth- Science. Springer-Verlag, pp. 293–307.
ods to facilitate federated queries. In: Proceedings of 2010 IEEE Rada, R., Mili, H., Bicknell, E., Blettner, M., 1989. Development and
International Conference on Service-Oriented Computing and application of a metric on semantic nets. IEEE Transactions on
Applications (SOCA). IEEE, pp. 1–8. Systems, Man and Cybernetics 19 (1), 17–30.
Baeza-Yates, R., Ribeiro-Neto, B., et al, 1999. In: Modern Informa- Resnik, P., 1999. Semantic similarity in a taxonomy: an information-
tion Retrieval, vol. 82. Addison-Wesley, New York. based measure and its application to problems of ambiguity in
Blasio, J.D., Kawamura, T., Hasegawa, T., 2004. Catalog search natural language. Journal of Artificial Intelligence Research 11, 95–
engine: Semantics applied to products search. In: Proceedings of 130.
4th International Workshop on Knowledge Markup and Semantic Rinaldi, A.M., 2009. An ontology-driven approach for semantic
Annotation (SemAnnot 2004), vol. 184, pp. 11–20. Available from: information retrieval on the web. ACM Transactions on Internet
<http://ceur-ws.org/Vol-184/>. Technology (TOIT) 9 (3), 10.
Bonino, D., Corno, F., Farinetti, L., Bosca, A., 2004. Ontology driven Rodriguez, M.A., Egenhofer, M.J., 2003. Determining semantic
semantic search. WSEAS Transaction on Information Science and similarity among entity classes from different ontologies. IEEE
Application 1 (6), 1597–1605. Transactions on Knowledge and Data Engineering 15 (2), 442–456.
Ding, L., Finin, T.W., Joshi, A., et al., 2004. Swoogle: a search and Shah, U., Finin, T.W., Joshi, A., 2002. Information retrieval on the
metadata engine for the semantic web. In: Proceedings of the 2004 semantic web. In: Proceedings of the ACM International Confer-
Effective semantic search using thematic similarity 169
ence on Information and Knowledge Management (CIKM). Yang, Che-Yu, Wu, Shih-Jung, 2011. A wordnet based information
McLean, VA, USA, pp. 461–468, November. retrieval on the semantic web. In: Networked Computing and
Tversky, A., 1977. Features of similarity. Psychological Review 84 (4), Advanced Information Management (NCM), 7th International
327–352. Conference. IEEE, pp. 324–328.
Uschold, M., Gruninger, M., 2004. Ontologies and semantics for Zhong, J., Zhu, H., Li, J., Yu, Y., 2002. Conceptual graph matching
seamless connectivity. ACM SIGMod Record 33 (4), 58–64. for se-mantic search. In: Proceedings of 10th International Con-
Varelas, G., Voutsakis, E., Raftopoulou, P., et al, 2005. Semantic ference on Conceptual Structures: Integration and Interfaces
similarity methods in wordnet and their application to information (ICCS). Borovets, Bulgaria, pp. 92–196, July.
retrieval on the web. In: Proceedings of the 7th Annual ACM Zhu, H., Zhong, J., Li, J., Yu, Y., 2002. An approach for semantic
International Workshop on Web Information and Data Manage- search by matching RDF graphs. In: Proceedings of the 15th
ment. ACM, pp. 10–16. International Florida Artificial Intelligence Research Society Con-
Wu, Z., Palmer, M., 1994. Verbs semantics and lexical selection. In: ference. AAAI Press, pp. 450–454.
Proceedings of the 32nd Annual Meeting on Association for
Computational Linguistics. ACL, pp. 133–138.