Abstract—Relating, connecting and navigating between concepts represent a major challenge for machine intelligence. On the other hand, collaborative repositories provide a large base of knowledge already filtered, structured, linked and meaningful from a human semantic point of view. Although these repositories are machine accessible, they have no formal explicit semantic tagging to help automatic navigation in them. In this paper we present a randomized approach, based on the Heuristic Semantic Walk (HSW), for searching a collaborative network in order to extract meaningful semantic chains between concepts. The method is based on the use of heuristics defined on semantic proximity measures, which can be easily computed from general search engine statistics. Information from multiple random chains can be used to compute semantic distances between the concepts, as well as to determine the underlying semantic context. The proposed method addresses major issues posed by collaborative networks, such as large dimensions, high connectivity degree and the dynamical evolution of online networks, which make classical search methods inefficient and unfeasible. In this study the HSW model has been experimented on Wikipedia. Tests held with the well-known WordSim353 benchmark for human evaluation show that the proposed model is comparable to the best state-of-the-art results, while being the only web-based approach. Other potential applications range from query expansion and argumentation mining to the simulation of user navigation.

Keywords—heuristic search; semantic networks; semantic similarity measures; random walk; data mining; argumentation mining

I. INTRODUCTION

The ability of relating sets of concepts, navigating among them, and understanding the underlying context is a major issue for machine intelligence applications and man-machine interfaces aiming at interacting by managing, classifying, and understanding human knowledge. A great opportunity is represented by the increasing availability of web-based collaborative repositories, built by the users for their mutual benefit and knowledge sharing. These online repositories, e.g., picture and media sharing sites, fora and discussion group websites, provide a large source of information already filtered, structured, linked and meaningful from a human semantic point of view. Although these repositories are machine accessible, they have no formal semantic structures to be exploited for automatic navigation, and they pose several important and challenging issues that make traditional techniques for network search unfeasible. Typical issues arise in the case of collaborative web-based semantic networks, such as Wikipedia: the graph size can be extremely large, exceeding the main memory available to any algorithm and ruling out memory-intensive algorithms, or algorithms that require loading the whole graph into working memory. The connectivity of the nodes is extremely high, leading to a high branching factor, which rules out algorithms based on maintaining a large fringe of unexpanded nodes. The cost of node expansion is also very high, since the online exploration of the graph requires querying and exchanging data on the collaborative network, which is virtually distributed over the Web. The graph can dynamically change, thus making it difficult to implement forms of pre-computation, proxies, or techniques of dynamic programming.

On the other hand, the Wikipedia collaborative network represents a network of explanations that embeds semantic information in the topology of links. Wikipedia provides the explanation of a concept by an article in natural language, containing links to other articles concerning sub-concepts, the most relevant to its understanding. The Wikipedia network is then a collaborative semantic network, built by a community of users, who add explanations and links for the mutual general benefit of other human users. Similar explicit collaborative actions take place in nearly all social networks and repositories: referencing a resource, labelling/tagging objects in order to make them clearer to other users, adding friendships, commenting, and ranking other users' messages positively/negatively are all examples of actions that implicitly build semantic networks among resources, objects and users.

Although many different semantic relationships can exist between two concepts s1 and s2 corresponding to two linked articles (e.g., class/subclass, part_of, opposite), the explanatory nature of the Wikipedia semantic network simply suggests that the existence of a link implies the existence of a relevant semantic relationship (s1, s2) between them. In other words, this means that, given s1, then s2 is also relevant. Starting from this idea, the intermediate concept nodes in a path between two given concepts over the semantic network, i.e., a semantic chain, can represent hidden or underlying concepts, which are implicit in the context of the initially given concepts.

The problem of the semantic chain search can then be reduced to a problem of search in the semantic graph. To establish which concepts are implied by a pair of terms, the path between the terms can be considered.
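The reduction above can be illustrated with a toy sketch: a directed link graph stored as an adjacency map, and a breadth-first search returning a shortest link path whose intermediate nodes are the implied concepts. This is only an illustration of the SPath problem, not the paper's method (on the real Wikipedia graph this kind of exhaustive search is exactly what the paper argues is infeasible); the article names and links below are hypothetical.

```python
from collections import deque

def semantic_chain(graph, s, g):
    """Breadth-first search for a shortest link path (semantic chain)
    between concepts s and g in a directed link graph."""
    queue = deque([[s]])
    visited = {s}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == g:
            return path  # intermediate nodes are the implied concepts
        for succ in graph.get(node, ()):
            if succ not in visited:
                visited.add(succ)
                queue.append(path + [succ])
    return None  # no chain exists

# Toy fragment of a link network (hypothetical articles and links).
toy = {
    "Telephone": ["Communication", "Alexander Graham Bell"],
    "Alexander Graham Bell": ["Inventor"],
    "Communication": ["Information"],
}
print(semantic_chain(toy, "Telephone", "Information"))
# → ['Telephone', 'Communication', 'Information']
```

Here the intermediate node "Communication" plays the role of the hidden concept implied by the pair (Telephone, Information).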
…single-run random walk, where the high connectivity of Wikipedia can eventually lead to a solution path, but one of high length and small relevance [14]. The Iterative Deepening Search (IDS) approach, which can guarantee completeness, is highly inefficient with high branching factors and is further penalized by the high cost of online access to candidate nodes. We conclude that non-informed walk strategies are not suitable for efficient path search in the online collaborative semantic network scenario.

Heuristic Walks

Heuristic Walks are informed search strategies, which make use of a heuristic measure to estimate a score for each candidate node, and to sort the candidates for expansion. The evaluation is performed in terms of closeness/relatedness to the goal, measuring the semantic proximity of the current node to the goal node.

Among heuristic algorithms, we certainly have to consider A* [23] and its many extensions and variations, since its optimality properties could make it an ideal candidate for semantic heuristic walk strategies. The core of the A* class of algorithms is the selection function f(n) = c(n) + h(n), where c(n) represents the cost to reach n from the source node (the path distance, in our case), while h(n) is the heuristic estimate of the distance of n from the goal. In A*, a node ne in the fringe F of unexpanded nodes is selected for expansion if f(ne) = min{f(n) : n ∈ F}, i.e., if the value of the evaluation function is minimal for ne. On the other side, when the target is distant from the source, the A* fringe F will grow too large with real online semantic networks, making the approach unpractical.

A Greedy Best-First (GBF) approach avoids maintaining the path history of different nodes. In simple GBF, the evaluation f(n) = h(n) does not consider previous costs, so only the nodes of the fringe that are the closest to the goal are chosen. The problem of a growing large fringe has led to a particular implementation, Local GBF (LGBF), where no fringe memory is maintained except for the current path, and only the descendants of the current node are considered in the fringe. LGBF implements a local search with loop avoidance control.

III. THE HEURISTIC SEMANTIC WALK MODEL

The Heuristic Semantic Walk (HSW) model [15][17] describes a framework for browsing an online semantic network in order to find meaningful connections between pairs of concepts. The goal of the HSW is to return paths between a pair of terms (s, g) following the semantic links (e.g., the anchor links from the text of a starting Wikipedia source article s to a goal article g). The main feature of the model is the use of a semantic proximity measure to drive the search towards the best successor node ni in a set of candidates, i.e., the one closest to the goal g.

Definition: a semantic network graph G = (V, E) is defined by a pair where V is a set of vertices/concepts (e.g., the entries in Wikipedia), and E ⊆ (V × V) is a set of oriented edges representing the links between concepts in the network (e.g., the anchor links in the text of a Wikipedia article toward a referenced article).

Definition: the semantic path finding problem, or Path(s,g), given a semantic network G = (V, E) and two nodes s, g ∈ V, consists in finding whether a sequence of vertices (v0, v1, ..., vn) exists such that v0 = s, vn = g and for each i ∈ [0, n−1] the edge (vi, vi+1) ∈ E. Similarly, the Shortest Path search problem SPath(s,g) can be defined straightforwardly.

A general scheme can be defined in order to characterize the class of HSW solutions to the semantic path problem SPath.

Definition: a Heuristic Semantic Walk instance, or HSW, is an algorithm for SPath problems characterized by a semantic collaborative network graph G = (V, E); a search engine S, which can return statistics about the occurrence of elements in the set of all subsets of V, denoted ℘(V); a proximity measure m defined in terms of those statistics; and a search strategy, or walk strategy, A using heuristics derived from m. Note that ℘(V) denotes the set of all subsets of V; in other words, the engine/repository S should be able to return statistics for co-occurrences of objects. Most engines/repositories are in fact able to return statistics for queries like Q = (a ∧ b ∧ c). Since the actual returned objects are not relevant for our purposes, the engine S can be formally defined as follows:

Definition: a search engine/repository S is a function S: ℘(V) → N, where S(Q) returns the number of occurrences of Q according to the engine/repository [1][4], i.e., the statistics will reflect the frequency of contextual use of the terms in Q, according to the community of the domain that feeds objects to S.

Given the previous general scheme, we will use a notation in which HSW(S,m,A) denotes an instance of HSW for a semantic network where the search strategy A uses a heuristic hm computed from source S, and HSW(A) denotes a non-informed algorithm with a walk strategy A. For example, HSW(Bing, NGD, A*) for Wikipedia denotes an A* search algorithm operating on Wikipedia, which uses a heuristic derived from the Normalized Google Distance (NGD) measure, computed on the search engine Bing.

A. Heuristic Weighted Random Walk

Experiments with HSW(S, m, LGBF) on Wikipedia have shown, despite the use of different proximity measures as heuristics, that biases and redundancies in outgoing links can lead to spiral-like behaviors where the walk repeatedly moves around the target without converging. After preliminary experiments on HSW(S,m,A*) and HSW(S,m,LGBF), which suggested that the high connectivity of collaborative semantic networks allows near-optimal solution paths to be found in a straightforward way, we decided to add a degree of randomization to the original LGBF, in order to eliminate non-convergence problems.

The idea of the Heuristic Weighted Random Walk (HWRW), or Random Tournament Heuristic Semantic Walk (rt-HSW), is to choose among a set of successors for the current node by making a random tournament among the candidate successors, using the probability distribution induced on the candidates by the values of the proximity measure between the candidate concepts and the target concept. No memory of unexpanded nodes is maintained, except for the current solution path. In HWRW, the proximity measure values need to be normalized to [0,1] in order to build the probability distribution. The probability of the candidate ni in the random tournament is proportional to its estimated proximity to the goal, normalized over the candidates, and it is defined as

P(ni) = h(ni, g) / Σ_{∀nj ∈ succ(n)} h(nj, g)    (1)

where g is the target node, the nj are the successors of the current node n, and h(nj, g) is the heuristic derived from the proximity measure. The hypothesis underlying this approach is that Wikipedia pages of close concepts are close, i.e., have a short path linking them; therefore the use of a suitable proximity measure that reflects the human judgement of closeness/relatedness is important in order to guarantee the adequacy of the heuristic.

…node n to browse, with the aim of reaching the goal node.² Since we normalize all the measures, distance and proximity measures can be compared as complementary.

(² We denote by N the number of documents/objects which are indexed by the search engine S. The probability estimate can be directly computed from the frequency, as P(x) = f(x)/N, summarizing the frequency-based approach to probability, whenever the total N is known or can be realistically approximated with a value greater than f(x) for each possible x in the considered domain or context.)

The proximity measures considered and experimented are:

Confidence (CF) [4] is an asymmetric statistical measure, used in rule mining for measuring trust in a rule X→Y, i.e., given the number of transactions that contain X, the Confidence in the rule indicates the percentage of transactions that also contain Y:

CF(w1 → w2) = f(w1 ∧ w2) / f(w1)    (2)

Pointwise Mutual Information (PMI) compares the joint probability of observing w1 and w2 together with the probability of observing them independently, PMI(w1, w2) = log2 ( P(w1, w2) / (P(w1) P(w2)) ). In case of perfect independence:

PMI(w1, w2) = log2 ( P(w1) P(w2) / (P(w1) P(w2)) ) = log2 1 = 0

…coefficient has also been used in algorithms for community discovery [8]:

χ² = n (ad − bc)² / ((a+b)(c+d)(a+c)(b+d))    (5)

where a, b, c, d represent the numbers of documents in which w1 and w2 occur according to a = w1 ∧ w2, b = w1 ∧ ¬w2, c = w2 ∧ ¬w1, d = ¬w1 ∧ ¬w2, and n = N = a+b+c+d.

Normalized Google Distance (NGD) has been introduced in [2] as a measure of semantic relatedness, based on the assumption that similar concepts occur together in a large number of documents in the Web, i.e., the frequency of documents returned by a query on a search engine S approximates the distance between related semantic concepts. The NGD between two terms x and y is formally defined as follows:

NGD(x, y) = ( max{log f(x), log f(y)} − log f(x, y) ) / ( log N − min{log f(x), log f(y)} )

…in such a context, computing the set of all the possible paths linking s and g is neither practical nor useful. In order to characterize the semantic context implicitly defined by s and g, the randomized search strategy is repeated for a number of runs, and the subnetwork thus generated (see Fig. 2) is analyzed in order to extract the actual semantic context. For instance, it is interesting to estimate which nodes are traversed by more paths, i.e., appear in more explanations, as well as to estimate their path proximity to the goal.
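The rt-HSW step and walk described above can be sketched as follows; this is a minimal illustration, not the authors' implementation. The `successors` function (e.g., the filtered anchor links of the current article) and the `proximity` heuristic (one of CF, PMI, χ², NGD, assumed already normalized to [0,1]) are placeholder parameters. The next node is drawn with the probability of Eq. (1), and only the current path is kept in memory, for loop avoidance.

```python
import random

def rt_hsw_step(candidates, proximity, goal, rng=random):
    """One random-tournament step (Eq. 1): pick the next node with
    probability proportional to its normalized proximity to the goal."""
    weights = [proximity(n, goal) for n in candidates]
    total = sum(weights)
    if total == 0:
        return rng.choice(candidates)  # no heuristic signal: uniform fallback
    r = rng.random() * total  # sample a point in the cumulative weight mass
    acc = 0.0
    for node, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return node
    return candidates[-1]

def rt_hsw(successors, proximity, s, g, max_steps=1000, rng=random):
    """Heuristic Weighted Random Walk: follow links from s toward g,
    remembering only the current path, until g is reached or the
    step budget is exhausted (divergence threshold)."""
    path = [s]
    while path[-1] != g and len(path) <= max_steps:
        # Loop avoidance: never revisit a node already on the current path.
        cands = [n for n in successors(path[-1]) if n not in path]
        if not cands:
            return None  # dead end: this run diverges
        path.append(rt_hsw_step(cands, proximity, g, rng))
    return path if path[-1] == g else None
```

Repeating `rt_hsw` for several runs and aggregating the returned paths yields the subnetwork from which the semantic context is extracted.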
Fig. 3: Tag cloud for the pair (Mars, Scientist)

Fig. 4: Dispersion graph for the pair (Telephone, Communication)

VI. EXPERIMENTS AND COMPARISON OF RESULTS

The non-informed pure Random Walk and 12 combinations of search strategies and heuristics have been experimented. For each heuristic (CF, NGD, PMI and χ²), both Random Tournament-based (rt-HSW) and Best First-based Semantic Walks (β-HSW) were tested (see Table 1).

Table 1: Search strategies and heuristics of the tests

          rt-HSW         β-HSW
  CF      rt-HSW-CF      β-HSW-CF
  NGD     rt-HSW-NGD     β-HSW-NGD
  PMI     rt-HSW-PMI     β-HSW-PMI
  χ²      rt-HSW-χ²      β-HSW-χ²

A. Data Set

The experiments of the present study were made on the well-known Word Similarity 353 data set [18], which consists of a test collection of English word pairs along with human-assigned semantic similarity evaluations. In particular, the collection consists of 2 sets, the first including 153 and the second 200 word pairs. Each pair is linked to an average vote ranging from 1 to 10. The data set has been used as a reference for comparison purposes by several studies on semantic similarity, using corpus-based, knowledge-based and hybrid algorithms that can induce a ranking on a series of word pairs, comparable by the Spearman's ρ and Pearson's r metrics of correlation [19].

B. Filtering Phases

The whole Wikipedia network is too large to be explored with classical search algorithms, having 10,345,655 nodes and 300,481,663 links, for a total dimension of more than 264 Gb and a diameter of ≈6, close to a random network.

Although Wikipedia links are purposely introduced by editors for knowledge sharing and explanation, several of them are not related to semantic meanings. For instance, in addition to trivial filtering such as not considering navigation menus or fixed parts of the website, the so-called infobox (the information box which appears on the right-hand side of Wikipedia content pages) often includes digits such as geographic values (e.g., number of residents or width of a region) and coordinates. Although such a scattering effect would be in part corrected by the heuristic model, following such links can lead into loops or extremely divergent results, with a final non-convergence of the heuristic search. The page itself can as well include links to disambiguation pages or very general pages, which act as hubs in the network.

On the other side, because of the model of the Wikipedia standard page, and since it is written by humans for explanation purposes, the first section of the Wikipedia content page is more related to the essential definition of the concept [6][11], and thus more significant for deriving the semantic meaning, than the subsequent sections describing particulars, history, bibliography, et cetera.

For these kinds of reasons, a preprocessing and an online pruning phase were required, in order both to filter out useless links and to prune hubs or too-general pages, avoiding searching on trivial/meaningless paths in the network, with the aim of obtaining results with a higher semantic quality with respect to natural language and human evaluation. In particular:

- Only the anchor links in the main content of the article are parsed.
- A maximum number of candidate links [7] is defined: in our experiments, a threshold of 15 links on the content of the page is considered relevant.
- Namespace and disambiguation notices, and links to external resources and blank pages (i.e., dead ends), are filtered out.
- Pages about dates, numbers, international abbreviations, and geographic and etymological references are dynamically pruned, since they don't carry a semantic value.

C. Evaluation Criteria

An evaluation of the experimental results has been held on the basis of numerical parameters and human evaluation. The results have been discussed and compared to previous results in the literature. The HSW approach was tested on Wikipedia, comparing rt-HSW to β-HSW and to the pure Random Walk (RW) with respect to numerical parameters:

- Convergence/divergence: target found/not found after a threshold of 1000 steps (i.e., links).
- Length of the chain.
- Minimal solution: completeness (i.e., found/not found) and velocity (i.e., earliest run in which the solution is found).
- Ranking with respect to the literature on the Word Similarity 353 data base [19], evaluated using the Spearman's ρ and Pearson's r metrics in order to produce a ranking comparable to the human evaluation rates provided by the data set.

On the other side, an evaluation was also made in order to establish the quality of chains and contexts, both from an objective point of view and on a subjective (i.e., human) evaluation. Regarding the human evaluation, some tag clouds (i.e., contexts) were chosen in specific fields (e.g., medical) and submitted to a group of specialists of the considered field, who assigned a value ranging from 0 to 5 to the clouds generated from each heuristic. Since the clouds are too numerous to be all evaluated by humans (over 48,500 in the total of the experiments), in order to provide an evaluation of all the results, and an objective one, the percentage of semantic chains that could be considered relevant for the explanation of concepts was counted. In particular, some convergence limits were evaluated, considering thresholds of 5, 15, 50, 100 and 200. The main idea is that for explanation/argumentation purposes a long chain cannot be considered relevant, because the human mind cannot jump over more than 5 steps of abstraction without losing focus. Furthermore, Wikipedia has a diameter of ≈6 [20], which can be a good point of reference for comparison aims. Anyway, over 5 steps, the longer the chain, the less relevant it is as an explanation, hardly going over 15.

D. Discussion of Results

Rt-HSW improves results on all the heuristics with respect to the average length of the semantic chain (see Fig. 5, 6 and 7). The best performance was detected for Rt-HSW-Conf, the heuristic based on Confidence (38.8% improvement, i.e., shorter chains).

Fig. 5: Chains average length

Non-informed strategies generally produce a shorter average chain, but have a high divergence rate. Very bad results were recorded in particular for local Best-first search (β-HSW_nla); Best-first with loop avoidance (β-HSW) has then been implemented to better analyze Best-first, by memorizing explored nodes and avoiding repetition of the same expansion.

Rt-HSW produces chains with an average length comparable to β-HSW, but with a much better convergence rate. Confidence produces in general the best results over Rt-HSW (NGD is the best on β-HSW), also on the convergence limit of 5. Also considering the optimality of the minimum path, i.e., the velocity of finding the minimum path, heuristic-driven randomized strategies win over Best-first.

Fig. 6: Evaluation on different convergence thresholds

Regarding human evaluation on tag clouds, Confidence has the best performance, with a value of 4.4 over 5 and an improvement of about 50% with respect to PMI, the latter ranking second.

Finally, the Spearman's ρ and Pearson's r metrics show that all Rt-HSW scores are among the first best 12 algorithms for the benchmark on the WordSim353 word similarity data set [19], with Rt-HSW-Confidence ranking at the 9th place (Table 2 shows the first 21 algorithms on Word Similarity 353 including the HSW). In comparison to the corpus-based and knowledge-based algorithms present in [19], however, the web-based approach of the HSW is in general less computation-costly than the algorithms which gain the first places, and relies on a dynamic collaborative source (i.e., the Web).

Fig. 7: Relevant chains on different heuristics
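The ranking comparison above relies on Spearman's ρ between the human similarity votes and the scores induced by a proximity measure on the same word pairs. A stdlib-only sketch of that scoring (tie-aware ranks, then Pearson correlation on the ranks) might look like:

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(human, method):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(human), ranks(method)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A measure that ranks the 353 pairs exactly as the human votes do scores ρ = 1; the published ranking in [19] orders algorithms by this value.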
VII. CONCLUSIONS

A Weighted Randomized Heuristic Semantic Walk scheme for path finding in collaborative semantic networks has been presented. The experiments, held on the collective knowledge base Wikipedia with a range of different proximity measures used in the literature, show that the proposed approach leads to better results than classical non-informed search methods in terms of convergence rates, optimality and suitability compared to human evaluation. In particular, HSW with heuristic randomized search returns the path connecting two concepts far faster than an uninformed blind search, and provides more relevant paths using Confidence as a heuristic. The proposed scheme with Rt-HSW-Conf provides excellent performance, ranking 9th with respect to state-of-the-art results in the literature (see Table 2) on the standard benchmark WordSim353, while being the first one based on a web-based collaborative dynamic source.

Table 2: Ranking on WordSim353

REFERENCES

[1] Bollegala D., Matsuo Y., Ishizuka M., A Web Search Engine-Based Approach to Measure Semantic Similarity between Words. IEEE Transactions on Knowledge and Data Engineering (2011)
[2] Cilibrasi R., Vitanyi P., The Google Similarity Distance. ArXiv.org (2004)
[3] Church K. W., Hanks P., Word association norms, mutual information and lexicography. In ACL 27 (1989)
[4] Franzoni V., Milani A., PMING Distance: A Collaborative Semantic Proximity Measure. WI-IAT, vol. 2, pp. 442-449, IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology (2012)
[5] Kurant M., Markopoulou A., Thiran P., On the bias of BFS. ITC (2010)
[6] Milne D., Witten I.H., An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. WIKIAI (2008)
[7] Yeh E., Ramage D., Manning C.D., Agirre E., Soroa A., WikiWalk: Random walks on Wikipedia for Semantic Relatedness. In Proc. Graph-based Methods for Natural Language Processing (2009)
[8] Newman M.E.J., Fast algorithm for detecting community structure in networks. University of Michigan, MI (2003)
[9] Cao G., Gao J., et al., Extending query translation to cross-language query expansion with Markov chain models. CIKM, ACM (2007)
[10] Turney P., Mining the web for synonyms: PMI versus LSA on TOEFL. In Proc. ECML (2001)
[11] Xu Z., Luo X., Yu J., Xu W., Measuring semantic similarity between words by removing noise and redundancy in web snippets. Concurrency Computat.: PE, 23 (2011)
[12] Leung C.H.C., Chan A.W.S., Milani A., Liu J., Li Y., Intelligent Social Media Indexing and Sharing Using an Adaptive Indexing Search Engine. ACM TIST 3(3): 47 (2012), doi:10.1145/2168752.2168761
[13] Franzoni V., Leung C.H.C., Li Y.X., Milani A., Collective Evolutionary Concept Distance Based Query Expansion for Effective Web Document Retrieval. ICCSA, 657-672, Springer (2013)
[14] Gori M., Pucci A., A random-walk based scoring algorithm with application to recommender systems for large-scale e-commerce. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006)
[15] Franzoni V., Milani A., Heuristic Semantic Walk. In Computational Science and Its Applications, 643-656, Springer (2013)
[16] Lew M.S., Sebe N., Djeraba C., Jain R., Content-based multimedia information retrieval: State of the art and challenges. ACM Trans. Multimedia Comp. Com. App. (2006)
[17] Franzoni V., Milani A., Heuristic semantic walk for concept chaining in collaborative networks. International Journal of Web Information Systems, Vol. 10, Iss. 1 (2014)
[18] Finkelstein L., Gabrilovich E., et al., Ruppin E., Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems, 20(1): 116-131, January 2002
[19] State-of-the-art ranking of algorithms on the WordSim353 data set: http://aclweb.org/aclwiki/index.php?title=WordSimilarity-353_Test_Collection_(State_of_the_art)
[20] Newman M., Barabási A.-L., Watts D.J., The Structure and Dynamics of Networks. Princeton, NJ: Princeton University Press (2006)
[21] Newell A., Shaw J.C., Simon H.A., Empirical explorations of the logic theory machine: a case study in heuristic. In Proc. of IRE-AIEE-ACM '57, 218-230, ACM, New York, USA (1957)
[22] Kowalski R., A Proof Procedure Using Connection Graphs. Journal of the ACM, Vol. 22, No. 4, pp. 572-595 (1975)
[23] Hart P.E., Nilsson N.J., Raphael B., A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics SSC4 (2): 100-107 (1968)
[24] Cheng V.C., Leung C.H.C., Liu J., Milani A., Probabilistic Aspect Mining Model for Drug Reviews. IEEE Transactions on Knowledge and Data Engineering, vol. 99, p. 1 (2014)
[25] Milani A., Santucci V., Community of scientist optimization: An autonomy oriented approach to distributed optimization. AI Communications, 25(2) (2012)
[26] Baioletti M., Milani A., et al., Experimental evaluation of pheromone models in ACOPlan. Ann. Math. Artif. Intell. 62(3-4): 187-217 (2011)