
2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)

Semantic Heuristic Search in Collaborative Networks:


Measures and Contexts

Valentina Franzoni(1), Marco Mencacci, Paolo Mengoni, Alfredo Milani(1)(2)


(1) Department of Mathematics and Computer Science, University of Perugia, Italy
(2) Department of Computer Science, Hong Kong Baptist University, Hong Kong
{valentina.franzoni, milani}@dmi.unipg.it

Abstract—Relating, connecting and navigating between concepts represent a major challenge for machine intelligence. On the other hand, collaborative repositories provide a large base of knowledge already filtered, structured, linked and meaningful from a human semantic point of view. Although these repositories are machine accessible, they have no formal explicit semantic tagging to help automatic navigation in them. In this paper we present a randomized approach, based on the Heuristic Semantic Walk (HSW), for searching a collaborative network in order to extract meaningful semantic chains between concepts. The method is based on the use of heuristics defined on semantic proximity measures, which can be easily computed from general search engine statistics. Information from multiple random chains can be used to compute semantic distances between the concepts, as well as to determine the underlying semantic context. The proposed method addresses major issues posed by collaborative networks, such as large dimensions, high connectivity degree and the dynamical evolution of online networks, which make classical search methods inefficient and unfeasible. In this study the HSW model has been experimented on Wikipedia. Tests held with the well-known WordSim353 benchmark for human evaluation show that the proposed model is comparable to the best state-of-the-art results, while being the only web-based approach. Other potential applications range from query expansion and argumentation mining to the simulation of user navigation.

Keywords—heuristic search; semantic networks; semantic similarity measures; random walk; data mining; argumentation mining

I. INTRODUCTION

The ability to relate sets of concepts, navigate among them, and understand the underlying context is a major issue for machine intelligence applications and man-machine interfaces that aim at interacting by managing, classifying, and understanding human knowledge. A great opportunity is represented by the increasing availability of web-based collaborative repositories, built by the users for their mutual benefit and knowledge sharing. These online repositories, e.g., picture and media sharing services, fora and discussion group websites, provide a large source of information already filtered, structured, linked and meaningful from a human semantic point of view. Although these repositories are machine accessible, they have no formal semantic structures to be exploited for automatic navigation, and they pose several important and challenging issues that make traditional techniques for network search unfeasible. Typical issues arise in the case of collaborative web-based semantic networks, such as Wikipedia: the graph size can be extremely large, exceeding the main memory available to any algorithm and ruling out memory-intensive algorithms, or algorithms that require loading the whole graph into working memory. The connectivity of the nodes is extremely high, leading to a high branching factor, which rules out algorithms based on maintaining a large fringe of unexpanded nodes. The cost of node expansion is also very high, since the online exploration of the graph requires querying and exchanging data on the collaborative network, which is virtually distributed over the Web. The graph can dynamically change, thus making it difficult to implement forms of pre-computation, proxies, or techniques of dynamic programming.

On the other hand, the Wikipedia collaborative network represents a network of explanations that embeds semantic information in the topology of links. Wikipedia provides the explanation of a concept by an article in natural language, containing links to other articles concerning the sub-concepts most relevant to its understanding. The Wikipedia network is then a collaborative semantic network, built by a community of users who add explanations and links for the mutual general benefit of other human users. Similar explicit collaborative actions take place in nearly all social networks and repositories: referencing a resource, labelling/tagging objects in order to make them clearer to other users, adding friendships, commenting, and ranking other users' messages positively/negatively are all examples of actions that implicitly build semantic networks among resources, objects and users.

Although many different semantic relationships can exist between two concepts s1 and s2 corresponding to two linked articles (e.g., class/subclass, part_of, opposite), the explanatory nature of the Wikipedia semantic network simply suggests that the existence of a link implies the existence of a relevant semantic relationship (s1, s2) between them. In other words, this means that, given s1, then s2 is also relevant. Starting from this idea, the intermediate concept nodes in a path between two given concepts over the semantic network, i.e., a semantic chain, can represent hidden or underlying concepts, which are implicit in the context of the initially given concepts.

The problem of semantic chain search can then be reduced to a problem of search in the semantic graph. To establish which concepts are implied by a pair of terms, the path between the terms can be considered.

978-1-4799-4143-8/14 $31.00 © 2014 IEEE
DOI 10.1109/WI-IAT.2014.27
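The reduction of semantic chain search to path search in a graph can be sketched on a toy network; the link table and function below are illustrative only (not actual Wikipedia data, nor the paper's implementation), and at Wikipedia scale such a blind search is exactly what the following sections argue is unfeasible:

```python
from collections import deque

# Toy fragment of a concept network: each concept links to related concepts,
# as Wikipedia anchor links do (illustrative data only).
LINKS = {
    "Mars":      ["Planet", "Mythology"],
    "Planet":    ["Science", "Solar System"],
    "Science":   ["Scientist"],
    "Mythology": ["Religion"],
}

def semantic_chain(source, goal):
    """Breadth-first search for one path (semantic chain) from source to goal."""
    frontier = deque([[source]])
    visited = {source}
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for succ in LINKS.get(node, []):
            if succ not in visited:
                visited.add(succ)
                frontier.append(path + [succ])
    return None  # no chain found

print(semantic_chain("Mars", "Scientist"))
# ['Mars', 'Planet', 'Science', 'Scientist']
```

On this toy graph the chain recovered is exactly the Mars->Planet->Science->Scientist example discussed later in the paper; the intermediate nodes Planet and Science are the candidate "hidden concepts" of the pair.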
The basic search scheme used in this work is the Heuristic Semantic Walk (HSW) with random tournaments, an informed randomized search strategy that can be instantiated with different heuristics based on various proximity measures, in order to find paths between concepts in a semantic network. The main idea is to use the information collected on the most traversed paths in a set of HSW runs in order to determine the semantic context underlying the source and target concepts. Statistics on traversed nodes and paths are combined in a ranking measure to evaluate the relevance of any intermediate node/concept. The relevance ranking is then used to extract a subnetwork from the collaborative network, which can represent the context underlying the initial source and target concepts. Although the idea of defining a context as the elements of an argumentation chain, i.e., a path, is not new to artificial intelligence, dating back to the earliest work on theorem proving [21][22], the collaborative semantic network domain poses special new challenges for its direct application, such as multiple paths existing among concepts, computationally intractable big networks, and no established algorithms for path finding. In this paper we propose an innovative solution both for the issue of computability and for the problem of selecting among multiple explanation paths. The randomized Heuristic Semantic Walk search approach is used in order to avoid complex network analysis, while a context subnetwork, instead of a simple single path, is extracted from multiple random runs.

The notion of a context based on an explanation network is especially relevant to different areas such as:

Automatic Explanation/Argumentation of Concepts. Consider for instance two Wikipedia entries like Mars and Scientist, and the path Mars->Planet->Science->Scientist. The intermediate concepts in the chain can be used to generate an explanation of the relationship between Mars and Scientist.

Natural Language Understanding. The intermediate concepts in the chain can be used to provide a context in order to disambiguate the meaning of ambiguous terms or sentences used in natural dialogues. By focusing on meanings that are consistent with the underlying context, it is possible to focus on Mars as a planet in the Solar System rather than as the ancient Greek god Mars.

Query expansion. The goal is to add a set of suitable keywords to the search terms of a query [1][4][12], in order to retrieve objects that have not been explicitly indexed under the original terms. For instance, a query consisting of the two terms Mars and Scientist could likely be expanded using the underlying words Planet and Science, returning documents or images indexed under those additional keywords.

Other fields of application of such a computation of semantic similarity range from the evolution of search engines to information retrieval, from automated translation to automated classification of texts, and from social network analysis to recommender systems and probabilistic models [24][25][26].

Collaborative semantic networks, and in particular Wikipedia, seem to be the best available knowledge base for these purposes, because they are:

• Knowledge sharing and explanation oriented. Information is added and filtered by a huge number of users, with the explicit purpose of knowledge sharing. Articles and links in Wikipedia can provide explanations of concepts and relationships.

• Dynamically updated. Collaborative concept networks are continuously refined by a multitude of users, whose aim is to improve the quality of information and to share new information as soon as the interest in that information arises.

• Open and collaborative. Artificially crafted semantic networks, e.g., ontologies such as WordNet, tend to reflect the biases of their authors and to be more static, since they have a small and specialised number of authors. On the other hand, collaborative networks such as Wikipedia are updated and edited continuously by a huge number of users, reflecting the knowledge of the whole web community. A number of recent studies, in fact, affirm that the results of the efforts of big communities of non-specialized users are better than those of small groups of specialists, being both more precise1 and more often updated, thus better reflecting reality.

This paper is organised as follows. In sections II and III the Heuristic Semantic Walk scheme and its random tournament variations are presented. Section IV analyses the different proximity measures used as heuristics for the HSW. After discussing semantic path and context finding in section V, experiments and results are shown and compared in section VI.

II. SEMANTIC WALK STRATEGIES

Classical graph search algorithms cannot be applied to path search in large collaborative networks because of their size and their structural features, such as huge memory requirements, high branching factors, unknown dimension, high cost of identifying and accessing nodes, high connectivity, and loops. The HSW search tries to overcome these limitations by combining a blind random search with an informed heuristic search.

A. Non-informed Walks

Since the branching factor on Wikipedia can be very high (e.g., the page "Rome" includes more than 500 links), Best-First Search (BFS) [5] approaches are deeply penalized by the exponential growth of memory requirements. On the other hand, Depth-First (DFS) strategies can fail to find the goal, wandering in the huge network, or they can eventually find very suboptimal, i.e., non-meaningful, paths [9][13]. Preliminary experiments have shown, as expected, that DFS behaves like a

1 “Given enough eyeballs, all bugs are shallow”: Linus's Law for open source, as formulated by Eric S. Raymond.

single-run random walk, where the high connectivity of Wikipedia nodes can eventually lead to a solution path, but one of high length and small relevance [14]. The Iterative Deepening Search (IDS) approach, which can guarantee completeness, is highly inefficient with high branching factors and is further penalized by the high cost of online access to candidate nodes. We conclude that non-informed walk strategies are not suitable for efficient path search in the online collaborative semantic network scenario.

B. Heuristic Walks

Heuristic Walks are informed search strategies, which use a heuristic measure to estimate a score for each candidate node and to sort the candidates for expansion. The evaluation is performed in terms of closeness/relatedness to the goal, measuring the semantic proximity of the current node to the goal node.

Among heuristic algorithms we certainly have to consider A* [23] and its many extensions and variations, since its optimality properties could make it an ideal candidate for semantic heuristic walk strategies. The core of the A* class of algorithms is the selection function f(n) = c(n) + h(n), where c(n) represents the cost of reaching n from the source node (the path distance, in our case), while h(n) is the heuristic estimate of the distance of n from the goal. In A*, a node ne in the fringe F of unexpanded nodes is selected for expansion if ne = argmin(n∈F) f(n), i.e., if the value of the evaluation function is minimal for ne. On the other side, when the target is distant from the source, the A* fringe F will grow too large with real online semantic networks, making the approach impractical.

A Greedy Best-First (GBF) approach avoids maintaining the path history of different nodes. In simple GBF, the evaluation f(n) = h(n) does not consider previous costs, so only the nodes of the fringe that are closest to the goal are chosen. The problem of a growing fringe has led to a particular implementation, Local GBF (LGBF), where no fringe memory is maintained except for the current path, and only the descendants of the current node are considered in the fringe. LGBF implements a local search with loop avoidance control.

III. THE HEURISTIC SEMANTIC WALK MODEL

The Heuristic Semantic Walk (HSW) model [15][17] describes a framework for browsing an online semantic network in order to find meaningful connections between pairs of concepts. The goal of the HSW is to return paths between a pair of terms (s, g) following the semantic links (e.g., the anchor links from the text of a starting Wikipedia source article s to a goal article g). The main feature of the model is the use of a semantic proximity measure to drive the search towards the best successor node ni in a set of candidates, i.e., the one closest to the goal g.

Definition: a semantic network graph G = (V, E) is a pair where V is a set of vertices/concepts (e.g., the entries in Wikipedia), and E ⊆ (V×V) is a set of oriented edges representing the links between concepts in the network (e.g., the anchor links in the text of a Wikipedia article toward a referenced article).

Definition: the semantic path finding problem, or Path(s,g), given a semantic network G = (V, E) and two nodes s, g ∈ V, consists in finding whether a sequence of vertices (v0, v1, ..., vn) exists such that v0 = s, vn = g, and for each i ∈ [0, n−1] the edge (vi, vi+1) ∈ E. Similarly, the Shortest Path search problem SPath(s,g) can be defined straightforwardly.

A general scheme can be defined in order to characterize the class of HSW solutions to the semantic path problem SPath.

Definition: a Heuristic Semantic Walk instance, or HSW, is an algorithm for SPath problems characterized by a semantic collaborative network graph G = (V, E); a search engine S, which can return statistics about the occurrence of elements of ℘(V), the set of all subsets of V; a proximity measure m defined in terms of those statistics; and a search strategy, or walk strategy, A using heuristics derived from m. Note that since ℘(V) denotes the set of all subsets of V, the engine/repository S should be able to return statistics for co-occurrences of objects. Most engines/repositories are in fact able to return statistics for queries like Q = (a ∧ b ∧ c). Since the actual returned objects are not relevant, for our purposes the engine S can be formally defined as follows:

Definition: a search engine/repository S is a function S: ℘(V) → N, where S(Q) returns the number of occurrences of Q according to the engine/repository [1][4], i.e., the statistics will reflect the frequency of contextual use of the terms in Q, according to the community related to the domain that feeds objects to S.

Given the previous general scheme, we will use a notation in which HSW(S,m,A) on G denotes an instance of HSW for a semantic network G where the search strategy A uses a heuristic hm computed from source S, and HSW(A) denotes a non-informed algorithm with a walk strategy A. For example, HSW(Bing, NGD, A*) for Wikipedia denotes an A* search algorithm operating on Wikipedia, which uses a heuristic derived from the Normalized Google Distance (NGD) measure, computed on the search engine Bing.

A. Heuristic Weighted Random Walk

Experiments with HSW(S, m, LGBF) on Wikipedia have shown, despite the use of different proximity measures as heuristics, that biases and redundancies in outgoing links can lead to spiral-like behaviors, where the walk repeatedly moves around the target without converging. After preliminary experiments on HSW(S,m,A*) and HSW(S,m,LGBF), which suggested that the high connectivity of collaborative semantic networks allows near-optimal solution paths to be found in a straightforward way, we decided to add a degree of randomization to the original LGBF, in order to eliminate non-convergence problems.

The idea of the Heuristic Weighted Random Walk (HWRW), or Random Tournament Heuristic Semantic Walk (rt-HSW), is to choose among the set of successors of the current node by holding a random tournament among the candidate successors, using the probability distribution induced on the candidates by the values of the proximity measure between the candidate concepts and the target concept. No memory of unexpanded nodes is maintained, except for the current solution path. In HWRW, the proximity measure values need to be normalized to [0,1] in order to build the probability distribution. The probability of the candidate ni in the random tournament is proportional to its

estimated proximity to the goal, normalized over the candidates, and is defined as

P(ni) = h(ni, g) / Σnj∈succ(n) h(nj, g),  ∀ ni ∈ succ(n)   (1)

where g is the target node, the nj are the successors of the current node, and h(nj, g) is the heuristic derived from the proximity measure. The hypothesis underlying this approach is that Wikipedia pages of close concepts are close, i.e., have a short path linking them; therefore the use of a suitable proximity measure that reflects the human judgement of closeness/relatedness is important in order to guarantee the adequacy of the heuristic.

Fig. 1: Example of the decision stage of HSW(S,h,HWRW)

Fig. 1 shows an example of the random tournament held in the decision stage of HSW(S,h,HWRW) from a starting concept start to a target concept end. The computation of the values h(nj, end) requires querying a given search engine in order to obtain the appropriate data. In the given example, the candidate node n2 will most probably be chosen, carrying the highest similarity to the goal.

IV. SEARCH ENGINE-BASED SEMANTIC PROXIMITY MEASURES USED AS HEURISTICS

Since search engines are based on documents that are dynamically updated by a great number of users, using information on indexed terms provided by a search engine can be a valid approach to evaluate the semantic proximity of pairs of terms, or groups of terms, over the Web. The general idea is to use a search engine or an object repository search tool S as a black box, to which queries are submitted and from which useful statistics are extracted about the occurrence of a term or a set of terms in the indexed object set, as simply as by counting the number of results for term/object queries. A proximity measure m is then used as a heuristic h(n) = m(n, g) to evaluate the most promising node n to browse, with the aim of reaching the goal node.2 Since we normalize all the measures, distance and proximity measures can be compared as complementary.

The proximity measures considered and experimented are:

• Confidence (CF) [4] is an asymmetric statistical measure, used in rule mining for measuring trust in a rule X→Y, i.e., given the number of transactions that contain X, the Confidence in the rule indicates the percentage of transactions that also contain Y.

CF(X→Y) = f(X ∧ Y) / f(X)   (2)

From a probabilistic point of view, Confidence approximates the conditional probability, such that

CF(x→y) = P(y|x) = P(x ∧ y) / P(x) = f(x ∧ y) / f(x)   (3)

where P(x) = f(x)/N is the a priori likelihood function for statistical inference.

• Pointwise Mutual Information (PMI) [3][16] is a point-to-point measure of association used in statistics and information theory. The mutual information between two particular events w1 and w2, in this case two words w1 and w2 in webpages indexed by a search engine, is defined as:

PMI(w1, w2) = log2 [ P(w1, w2) / (P(w1) P(w2)) ]   (4)

PMI(w1, w2) is an approximate measure of how much a word gives information about the other word of the pair, in particular the quantity of information provided by the occurrence of the event w2 about the occurrence of the event w1. A high value of PMI represents a decrease of uncertainty. PMI has been successfully used in [10] to recognize synonyms, using only word counts, although on low-frequency data PMI does not provide reliable results. PMI is in fact a good measure of independence, since values near zero indicate independence, but at the same time it is a bad measure of dependence, since the dependency score is related to the frequency of the individual words.3

• χ2 (Chi-square, CHI) measures the significance of the relation between two categorical variables, checking whether the observed frequencies differ significantly from the frequencies given by the theoretical distribution. The intuition behind Chi-square is that two events are associated only when they are more related than by pure chance. The χ2

2 We denote by f(x) = S(x) and f(x,y) = S(x ∧ y) the cardinalities of the results of a query to S, with search terms x and x ∧ y respectively, and by N the number of documents/objects indexed by the search engine S. The probability estimate can be directly computed from the frequency, as P(x) = f(x)/N, summarizing the frequency-based approach to probability, whenever the total N is known or can be realistically approximated with a value greater than f(x) for each possible x in the considered domain or context.

3 In case of perfect dependence: PMI(w1, w2) = log2(1/P(w2)). In case of perfect independence: PMI(w1, w2) = log2[P(w1)P(w2) / (P(w1)P(w2))] = log2 1 = 0.

coefficient has also been used in algorithms for community discovering [8].

χ2 = n (ad − bc)2 / [(a + b)(c + d)(a + c)(b + d)]   (5)

where a, b, c, d represent the numbers of documents in which w1 and w2 occur or do not occur, according to a = f(w1 ∧ w2), b = f(w1 ∧ ¬w2), c = f(¬w1 ∧ w2), d = f(¬w1 ∧ ¬w2), and n = N = a + b + c + d.

• Normalized Google Distance (NGD) has been introduced in [2] as a measure of semantic relation, based on the assumption that similar concepts occur together in a large number of documents on the Web, i.e., the frequency of documents returned by a query on a search engine S approximates the distance between related semantic concepts. The NGD between two terms x and y is formally defined as follows:

NGD(x, y) = [max{log f(x), log f(y)} − log f(x, y)] / [log M − min{log f(x), log f(y)}]   (6)

where f(x), f(y) and f(x, y) are the cardinalities of the results returned by S for the queries x, y and x ∧ y respectively, and M is the number of pages indexed by S, or a value reasonably greater than f(x) for each possible x.

• PMING Distance (PMING) [4][13] consists of NGD and PMI locally normalized, with a correction factor of weight ρ, which depends on the differential of NGD and PMI. More formally, the PMING distance of two terms x and y in a context W is defined, for f(x) ≥ f(y), as a function PMING: W×W → [0,1]

(7)

where μ1 and μ2 are constant values which depend on the context of evaluation, defined as:

μ1 = max PMI(x, y),  μ2 = max NGD(x, y),  with x, y ∈ W.

V. SEMANTIC PATHS AND CONTEXTS

As discussed in the introduction, this work is based on the notion, widely accepted in artificial intelligence, that two terms s and g are correlated if they can be linked by a path in the considered semantic network S (in our case Wikipedia), while the path between s and g represents an argumentation of why/how they are correlated. On the other hand, we have observed that, because of the high connectivity degree and the small network diameter of online collaborative semantic networks like Wikipedia, a shortest path may not be meaningful, while multiple paths exist between pairs of nodes. In the HSW, the role of the heuristic is to drive the search toward the goal, i.e., along meaningful paths, i.e., semantic chains, which better correlate s and g, while the role of the randomized strategy HWRW is to guarantee the exploration of multiple paths. It appears that considering as a context the set of all the possible paths linking s and g is neither practical nor useful.

In order to characterize the semantic context implicitly defined by s and g, the randomized search strategy is repeated for a number of runs, and the subnetwork thus generated (see Fig. 2) is analyzed in order to extract the actual semantic context. For instance, it is interesting to estimate which nodes are traversed by more paths, i.e., appear in more explanations, as well as to estimate their path proximity to the goal.

Fig. 2: semantic chains and context

Relevant parameters which have been considered in order to include a candidate node in the context are:
1. Number of semantic chains between the start and goal nodes in which the node appears.
2. Minimum distance from the start to the goal nodes over all the chains.
3. Average position in the chains, which represents the traversing velocity toward the goal node.

These parameters, computed through the randomized executions, estimate the centrality/connectivity degree, the path distance, and the correlation between the candidate node and the start/goal pair. Moreover, in order to introduce a ranking on the candidate nodes to be included in the context, a node rank value is assigned as the ratio between 1. and 2.:

rank = #chains / minimum distance   (8)

Finally, different thresholds, respectively θ1, θ2, θ3 and θ4, have been introduced in order to filter out outlier candidate nodes. In the following, for visibility and readability reasons, tag clouds and dispersion graphs have been used:

• Tag clouds (see Fig. 3) basically show parameter 1., where bigger terms correspond to the most traversed nodes.

• Dispersion graphs (see Fig. 4) represent the above parameters 1 to 3, where the x and y coordinates respectively show the average distance and the minimum distance from a term to g, and the bubble diameter represents the number of chains in which the term appears.
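The per-node context parameters and the rank of Eq. (8) can be sketched as follows. This is a toy illustration under one plausible reading of the parameters (distances measured from each node to the goal); the function and field names are ours, not the paper's:

```python
from collections import defaultdict

def rank_context_nodes(chains):
    """For the intermediate nodes of a set of semantic chains (paths from s to g),
    estimate: 1. the number of chains in which the node appears, 2. the minimum
    distance of the node from the goal over all chains, 3. the average position
    of the node in the chains; rank = ratio between 1. and 2. (Eq. 8)."""
    count = defaultdict(int)
    min_dist = {}
    positions = defaultdict(list)
    for chain in chains:
        goal_index = len(chain) - 1
        for pos, node in enumerate(chain[1:-1], start=1):  # intermediate nodes only
            count[node] += 1
            dist_to_goal = goal_index - pos
            min_dist[node] = min(min_dist.get(node, dist_to_goal), dist_to_goal)
            positions[node].append(pos)
    return {
        n: {
            "chains": count[n],
            "min_dist": min_dist[n],
            "avg_pos": sum(positions[n]) / len(positions[n]),
            "rank": count[n] / max(min_dist[n], 1),
        }
        for n in count
    }

# Two hypothetical chains from the runs for the pair (Mars, Scientist):
chains = [
    ["Mars", "Planet", "Science", "Scientist"],
    ["Mars", "Planet", "Astronomy", "Science", "Scientist"],
]
stats = rank_context_nodes(chains)
```

Here Science appears in both chains at distance 1 from the goal, so it gets the highest rank and would dominate the resulting tag cloud; the thresholds θ1..θ4 of the text would then be applied to these statistics to filter outliers.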

Fig. 3: Tag cloud for the pair (Mars, Scientist)

Fig. 4: Dispersion graph for the pair (Telephone, Communication)

VI. EXPERIMENTS AND COMPARISON OF RESULTS

The non-informed pure Random Walk and 12 combinations of search strategies and heuristics have been experimented. For each heuristic (CF, NGD, PMI and χ2), both Random Tournament-based (rt-HSW) and Best-First-based (β-HSW) Semantic Walks were tested (see Table 1).

Table 1: Search strategies and heuristics of the tests

         rt-HSW        β-HSW
  CF     rt-HSW-CF     β-HSW-CF
  NGD    rt-HSW-NGD    β-HSW-NGD
  PMI    rt-HSW-PMI    β-HSW-PMI
  χ2     rt-HSW-χ2     β-HSW-χ2

A. Data Set

The experiments of the present study were made on the well-known Word Similarity 353 data set [18], which consists of a test collection of English word pairs along with human-assigned semantic similarity evaluations. In particular, the collection consists of 2 sets, the first including 153 and the second 200 word pairs. Each pair is linked to an average vote ranging from 1 to 10. The data set has been used as a reference for comparison purposes by several studies on semantic similarity, using corpus-based, knowledge-based and hybrid algorithms that can induce a ranking on a series of word pairs, comparable by the Spearman's ρ and Pearson's r correlation metrics [19].

B. Filtering Phases

The whole Wikipedia network is too large to be explored with classical search algorithms, having4 10,345,655 nodes and 300,481,663 links, for a total dimension of more than 264 GB and a diameter of ≈6, close to a random network.

Although Wikipedia links are purposely introduced by editors for knowledge sharing and explanation, several of them are not related to semantic meanings. For instance, in addition to trivial filtering, such as not considering navigation menus or fixed parts of the website, the so-called infobox (the information box which appears on the right-hand side of Wikipedia content pages) often includes digits such as geographic values (e.g., number of residents or width of a region) and coordinates. Although a scattering effect would be in part corrected by the heuristic model, following such links can lead into loops or to extremely divergent results, with a final non-convergence of the heuristic search. The page itself can as well include links to disambiguation pages or very general pages, which act as hubs in the network.

On the other side, because of the model of the standard Wikipedia page, and since it is written by humans for explanation purposes, the first section of a Wikipedia content page is more related to the essential definition of the concept [6][11], and thus more significant for deriving the semantic meaning than the subsequent sections describing particulars, history, bibliography, et cetera.

For these kinds of reasons, a preprocessing phase and an online pruning phase were required, in order both to filter out useless links and to prune hubs or too-general pages, avoiding searching on trivial/meaningless paths in the network, with the aim of obtaining results of higher semantic quality with respect to natural language and human evaluation. In particular:

• Only the anchor links in the main content of the article are parsed.

• A maximum number of candidate links [7] is defined: in our experiments, a threshold of 15 links on the content of the page is considered relevant.

• Namespace pages, disambiguation notices and links to external resources and blank pages (i.e., dead ends) are filtered out.

• Pages about dates, numbers, international abbreviations, geographic and etymological references are dynamically pruned, since they don't carry a semantic value.

C. Evaluation Criteria

An evaluation of the experimental results has been held on the basis of numerical parameters and human evaluation. The results have been discussed and compared to previous results in the

4 At the time of the experiments of our study.

literature. The HSW approach was tested on Wikipedia, Regarding human evaluation on tag clouds, Confidence has
comparing rt-HSW to β-HSW and to the pure Random Walk the best performance, with a value of 4.4 over 5 and an
(RW) with respect to numerical parameters: improvement of about 50% with respect to PMI, the latter
ranking second.
 Convergence/divergence target found/not found
after a threshold of 1000 steps (i.e., links). Finally, the Spearman’s ρ and Pearson’s r metrics show that
all Rt-HSWs score is among the first best 12 algorithms for the
 Length of the chain. benchmark on word similarity WordSym353 data set [19], with
 Minimal solution: completeness (i.e., found/not the Rt-HSW-Confidence ranking at the 9 th place (Table 2 shows
found) and velocity (i.e., earliest run in which the the first 21 algorithm on World Similarity 353 including the
solution is found). HSW). In comparison to the Corpus-based and Knowledge-
based algorithms present in [19], however, the web-based
 Ranking, with respect to literature on the Word approach of the HSW is in general less computation-costly than
Similarity 353 data base, [19] evaluated using the the algorithms which gains the first places, and relies on a
Spearman’s ρ and Pearson’s r metrics in order to dynamic collaborative source (i.e, the Web).
produce a ranking comparable to the human
evaluation rates provided by the data set.
On the other side, an evaluation was also made in order to assess the quality of chains and contexts, both from an objective point of view and from a subjective (i.e., human) one. Regarding the human evaluation, some tag clouds (i.e., contexts) were chosen in specific fields (e.g., medical) and submitted to a group of specialists of the considered field, who assigned a value ranging from 0 to 5 to the clouds generated by each heuristic. Since the clouds are too numerous for all of them to be evaluated by humans (over 48,500 in the total of the experiments), in order to provide an evaluation that covers all the results and is objective, the percentage of semantic chains that could be considered relevant for the explanation of concepts was counted. In particular, several convergence limits were evaluated, considering thresholds of 5, 15, 50, 100 and 200 steps. The main idea is that, for explanation/argumentation purposes, a long chain cannot be considered relevant, because the human mind cannot jump over more than 5 steps of abstraction without losing focus. Furthermore, Wikipedia has a diameter of ≈6 [20], which provides a good reference point for comparison. In any case, beyond 5 steps, the longer the chain grows, the less relevant it is as an explanation, hardly remaining useful beyond 15 steps.

Fig. 5: Chains average length
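The objective side of this evaluation reduces to counting, for each convergence threshold, the share of chains not longer than the threshold. A minimal sketch of that counting (the length list used for testing is invented for illustration; diverged runs are encoded as None):

```python
def relevant_fraction(chain_lengths, thresholds=(5, 15, 50, 100, 200)):
    # For each threshold t, fraction of chains of at most t steps.
    # Diverged runs (None) never count as relevant.
    total = len(chain_lengths)
    return {
        t: sum(1 for n in chain_lengths if n is not None and n <= t) / total
        for t in thresholds
    }
```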
D. Discussion of Results
rt-HSW improves results over all the heuristics with respect to the average length of the semantic chain (see Figs. 5, 6 and 7). The best performance was detected for rt-HSW-Conf, the heuristic based on Confidence (38.8% improvement, i.e., shorter chains).

Non-informed strategies generally produce a shorter average chain, but have a high divergence rate. Particularly poor results were recorded for local Best-first search (β-HSW_nla); Best-first with loop avoidance (β-HSW) was therefore implemented to analyze Best-first more accurately, memorizing explored nodes and avoiding repeated expansions. rt-HSW produces chains with an average length comparable to β-HSW, but with a much better convergence rate. Confidence produces in general the best results over rt-HSW (NGD is the best on β-HSW), also at the convergence limit of 5. Also considering the optimality of the minimum path, i.e., the velocity of finding the minimum path, heuristic-driven randomized strategies win over Best-first.

Fig. 6: Evaluation on different convergence thresholds
Fig. 7: Relevant chains on different heuristics
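To make the comparison concrete, the following sketch shows one way a proximity heuristic can drive the randomized walk: each step draws the next link with probability proportional to the neighbor's semantic proximity to the target. This is a simplified reading of rt-HSW, not the authors' implementation; neighbors, count and the tiny test graph are hypothetical stand-ins for Wikipedia links and search-engine statistics.

```python
import random

def confidence(a, b, count):
    # Confidence-style proximity: an estimate of P(b | a) from
    # occurrence/co-occurrence counts (e.g., search-engine hits).
    return count((a, b)) / count(a) if count(a) else 0.0

def hsw_step(current, target, neighbors, proximity, rng):
    # One randomized heuristic move: roulette-wheel selection over the
    # out-links, weighted by proximity to the target.
    cand = neighbors(current)
    if not cand:                 # dead end: stay put (walk will time out)
        return current
    weights = [max(proximity(n, target), 1e-9) for n in cand]
    return rng.choices(cand, weights=weights, k=1)[0]

def hsw_walk(source, target, neighbors, proximity, max_steps=1000,
             rng=random):
    # Walk until the target is found or the divergence threshold is hit;
    # max_steps=1000 mirrors the threshold used in the experiments.
    chain = [source]
    while chain[-1] != target and len(chain) <= max_steps:
        chain.append(hsw_step(chain[-1], target, neighbors, proximity, rng))
    return chain if chain[-1] == target else None  # None = diverged
```

Best-first (β-HSW) corresponds to replacing the weighted draw with a deterministic max over the same weights; the randomized draw trades a slightly longer average chain for the much better convergence rate discussed above.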
VII. CONCLUSIONS

A Weighted Randomized Heuristic Semantic Walk scheme for path finding in collaborative semantic networks has been presented. The experiments, held on the collective knowledge base Wikipedia with a range of different proximity measures used in the literature, show that the proposed approach leads to better results than classical non-informed search methods in terms of convergence rates, optimality and suitability with respect to human evaluation. In particular, HSW with heuristic randomized search returns the path that connects two concepts far faster than an uninformed blind search, and provides more relevant paths using Confidence as a heuristic. The proposed scheme with rt-HSW-Conf provides excellent performance, ranking 9th with respect to state-of-the-art results in the literature (see Table 2) on the standard benchmark WordSimilarity-353, while being the first approach based on a web-based collaborative dynamic source.

REFERENCES
[1] Bollegala D., Matsuo Y., Ishizuka M., A Web Search Engine-Based Approach to Measure Semantic Similarity between Words. IEEE Transactions on Knowledge and Data Engineering (2011)
[2] Cilibrasi R., Vitanyi P., The Google Similarity Distance. ArXiv.org (2004)
[3] Church K.W., Hanks P., Word association norms, mutual information and lexicography. In ACL 27 (1989)
[4] Franzoni V., Milani A., PMING Distance: A Collaborative Semantic Proximity Measure. WI-IAT, vol. 2, pp. 442-449, IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (2012)
[5] Kurant M., Markopoulou A., Thiran P., On the bias of BFS. ITC (2010)
[6] Milne D., Witten I.H., An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. WIKIAI (2008)
[7] Yeh E., Ramage D., Manning C.D., Agirre E., Soroa A., WikiWalk: Random walks on Wikipedia for Semantic Relatedness. In Proc. Graph-based Methods for Natural Language Processing (2009)
[8] Newman M.E.J., Fast algorithm for detecting community structure in networks. University of Michigan, MI (2003)
[9] Cao G., Gao J., et al., Extending query translation to cross-language query expansion with markov chain models. CIKM, ACM (2007)
[10] Turney P., Mining the web for synonyms: PMI-IR versus LSA on TOEFL. In Proc. ECML (2001)
[11] Xu Z., Luo X., Yu J., Xu W., Measuring semantic similarity between words by removing noise and redundancy in web snippets. Concurrency Computat.: Pract. Exper., 23 (2011)
[12] Leung C.H.C., Chan A.W.S., Milani A., Liu J., Li Y., Intelligent Social Media Indexing and Sharing Using an Adaptive Indexing Search Engine. ACM TIST 3(3): 47 (2012), doi:10.1145/2168752.2168761
[13] Franzoni V., Leung C.H.C., Li Y.X., Milani A., Collective Evolutionary Concept Distance Based Query Expansion for Effective Web Document Retrieval. ICCSA, pp. 657-672, Springer (2013)
[14] Gori M., Pucci A., A random-walk based scoring algorithm with application to recommender systems for large-scale e-commerce. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006)
[15] Franzoni V., Milani A., Heuristic Semantic Walk. In Computational Science and Its Applications, pp. 643-656, Springer (2013)
[16] Lew M.S., Sebe N., Djeraba C., Jain R., Content-based multimedia information retrieval: State of the art and challenges. ACM Trans. Multimedia Comput. Commun. Appl. (2006)
[17] Franzoni V., Milani A., Heuristic semantic walk for concept chaining in collaborative networks. International Journal of Web Information Systems, Vol. 10, No. 1 (2014)
[18] Finkelstein L., Gabrilovich E., Matias Y., Rivlin E., Solan Z., Wolfman G., Ruppin E., Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems, 20(1):116-131, January 2002
[19] State-of-the-art ranking of algorithms on the WordSimilarity-353 data set: http://aclweb.org/aclwiki/index.php?title=WordSimilarity-353_Test_Collection_(State_of_the_art)
[20] Newman M., Barabási A.-L., Watts D.J., The Structure and Dynamics of Networks. Princeton University Press, Princeton, NJ (2006)
[21] Newell A., Shaw J.C., Simon H.A., Empirical explorations of the logic theory machine: a case study in heuristic. In Proc. of IRE-AIEE-ACM '57, pp. 218-230, ACM, New York, USA (1957)
[22] Kowalski R., A Proof Procedure Using Connection Graphs. Journal of the ACM, Vol. 22, No. 4, pp. 572-595 (1975)
[23] Hart P.E., Nilsson N.J., Raphael B., A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics, SSC-4(2): 100-107 (1968)
[24] Cheng V.C., Leung C.H.C., Liu J., Milani A., Probabilistic Aspect Mining Model for Drug Reviews. IEEE Transactions on Knowledge and Data Engineering, vol. 99, p. 1 (2014)
[25] Milani A., Santucci V., Community of scientist optimization: An autonomy oriented approach to distributed optimization. AI Communications, 25(2) (2012)
[26] Baioletti M., Milani A., et al., Experimental evaluation of pheromone models in ACOPlan. Ann. Math. Artif. Intell. 62(3-4): 187-217 (2011)
Table 2: Ranking on WordSimilarity-353

Rank | Algorithm | Reference | Type | Spearman's ρ | Pearson's r
1 | CLEAR | Halawi et al. (2012) | Corpus-based | 0.81 | N/A
2 | Y&Q | Yih and Qazvinian (2012) | Hybrid | 0.81 | N/A
3 | TSA | Radinsky et al. (2011) | Hybrid | 0.8 | N/A
4 | ESA | Gabrilovich and Markovitch (2007) | Corpus-based | 0.748 | 0.503
5 | Multi-lingual SSA | Hassan et al. (2011) | Corpus-based | 0.713 | 0.674
6 | Multi-prototype | Huang et al. (2012) | Corpus-based | 0.71 | N/A
7 | HSMN+csmRNN | Luong et al. (2013) | Corpus-based | 0.65 | N/A
8 | SSA | Hassan and Mihalcea (2011) | Knowledge-based | 0.622 | 0.629
9 | rt-HSW-CONF | UNIPG-HKBU | Web-based | 0.586 | 0.532
10 | LSA | Landauer et al. (1997) | Corpus-based | 0.581 | 0.563
11 | LSA | Landauer et al. (1997) | Corpus-based | 0.581 | 0.492
12 | C&W | Collobert and Weston (2008) | Corpus-based | 0.5 | N/A
13 | rt-HSW-NGD | UNIPG-HKBU | Web-based | 0.496 | 0.503
14 | rt-HSW-PMI | UNIPG-HKBU | Web-based | 0.493 | 0.495
15 | rw-HSW | UNIPG-HKBU | Web-based | 0.478 | 0.491
16 | rt-HSW-CHI | UNIPG-HKBU | Web-based | 0.472 | 0.472
17 | ROGET | Jarmasz (2003) | Knowledge-based | 0.415 | 0.536
18 | Resnik | Resnik (1995) | Knowledge-based | 0.353 | 0.365