Silvia Giannini
SisInfLab & DEI, Politecnico di Bari, Via Orabona 4, 70125 Bari, Italy
s.giannini@deemail.poliba.it
1 Introduction
The Web, seen as a global information space, is evolving from a Web of
Documents to a Web of Data. This effort is actively pursued by the Linking Open
Data 1 (LOD) project, which outlines best practices for publishing, sharing, and
connecting structured data on the Web. These are summarized by four basic
guidelines [1]: (i) use of global Uniform Resource Identifiers (URIs) for naming
things (e.g., documents, objects or even abstract concepts); (ii) use of a
standardized access mechanism (HTTP URIs) to look up unique URIs and enable
dereferencing of associated information and descriptions on the Web; (iii)
adoption of the Resource Description Framework 2 (RDF) as the machine-readable
and open standardized data format to represent the semantics and informative
content of resources; (iv) connection of data URIs among different data sources
through the adoption of RDF links, in order to enrich the knowledge about a
resource. RDF links represent the glue of the Web of Data, capturing a fair
amount of knowledge and creating a complex network of interrelated resources
[2]. An RDF link simply states that one piece of data has some kind of
connection (e.g., a relationship, an identity or a vocabulary link) to another
piece of data. RDF links enable retrieval mechanisms, that is, the navigation
among data sources exploited by Linked Data browsers, and data crawling, useful
for searching the Linked Data space by using queries more expressive than the
simple keyword search [3]. The most recent statistic on the LOD Cloud (September
2011) reports that over 31 billion facts are currently stored as RDF triples in
295 data sources, with more than 500 million links among them 3.
1 http://linkeddata.org
2 http://www.w3.org/standards/techs/rdf#w3c_all
W. Abramowicz (Ed.): BIS 2013 Workshop, LNBIP 160, pp. 220–231, 2013.
© Springer-Verlag Berlin Heidelberg 2013
RDF Data Clustering 221
Against this critical mass of RDF data, mechanisms for efficiently storing,
indexing and querying data have to be put in place [4,5]. Moreover, in order to
build knowledge-intensive applications with LOD, machine learning techniques
for mining the Web of Data need to be developed. Aimed at answering these
compelling requirements, RDF clustering may be adopted for:
described before for the conversion of a links partition P into a nodes partition
C.
In the following, the basic concepts of the graph clustering algorithm proposed
in [18] are illustrated. It has been selected among those available in the
literature for its capability of reconciling the principles of overlapping
communities and hierarchy as aspects of the same phenomenon. A wide simulation
campaign shows that the link clustering method developed by the authors
outperforms every other tested method over several networks, varying in size and
topology. It is chosen due to its simplicity and efficiency on large-scale
networks, and because it reveals the effective cluster structure described by
external meta-data. Moreover, it is completely automatic and parameter free. The
definition of similarity between links adopted in [18] considers only the local
network topology as available information, i.e., the neighborhood of a node is
employed in the similarity evaluation. In particular, the similarity is computed
only between pairs of links that share a node k, since it is unlikely that
disjoint links are more similar than adjacent links. The similarity
$S(e_{ik}, e_{jk})$ between links $e_{ik}$ and $e_{jk}$ sharing node k can be
evaluated with the Jaccard index

$$ S(e_{ik}, e_{jk}) = \frac{|N[i] \cap N[j]|}{|N[i] \cup N[j]|}, \tag{1} $$

where N[i] represents the set constituted by node i and its neighbors. A
single-linkage agglomerative hierarchical clustering is then applied to derive
the hierarchy of link communities. Each link is initially assigned to a
different cluster; then, the pair of links with the largest similarity is
selected and the respective clusters are merged, repeating until all links are
members of the same community (ties are agglomerated simultaneously).
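The similarity of Eq. (1) and the single-linkage agglomeration can be sketched as follows. This is only a minimal illustration, not the reference implementation of [18]. It assumes an undirected simple graph given as a list of `(u, v)` tuples; the function names and the toy graph are hypothetical. Ties are detected by exact equality of the similarity values, which is adequate for a small example but may need a tolerance on real data.

```python
from itertools import combinations


def closed_neighborhoods(edges):
    """Map each node i to N[i]: the node itself plus its neighbors."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, {u}).add(v)
        nbrs.setdefault(v, {v}).add(u)
    return nbrs


def link_similarities(edges):
    """Jaccard similarity of Eq. (1) for every pair of links sharing a node k."""
    nbrs = closed_neighborhoods(edges)
    incident = {}
    for e in edges:
        for k in e:
            incident.setdefault(k, []).append(e)
    sims = {}
    for k, links in incident.items():
        for e1, e2 in combinations(links, 2):
            i = e1[0] if e1[1] == k else e1[1]  # endpoint of e1 other than k
            j = e2[0] if e2[1] == k else e2[1]  # endpoint of e2 other than k
            sims[(e1, e2)] = len(nbrs[i] & nbrs[j]) / len(nbrs[i] | nbrs[j])
    return sims


def single_linkage_levels(edges, sims):
    """Return one (similarity, link partition) entry per dendrogram level."""
    parent = {e: e for e in edges}

    def find(e):
        # Union-find lookup with path halving.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    ordered = sorted(sims.items(), key=lambda kv: -kv[1])
    levels, idx = [], 0
    while idx < len(ordered):
        s, merged = ordered[idx][1], False
        # All pairs with the same similarity are merged in the same step.
        while idx < len(ordered) and ordered[idx][1] == s:
            (e1, e2), _ = ordered[idx]
            r1, r2 = find(e1), find(e2)
            if r1 != r2:
                parent[r1] = r2
                merged = True
            idx += 1
        if merged:
            clusters = {}
            for e in edges:
                clusters.setdefault(find(e), []).append(e)
            levels.append((s, list(clusters.values())))
    return levels


# Toy graph (hypothetical): a triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
sims = link_similarities(edges)
levels = single_linkage_levels(edges, sims)
```

On this toy graph the links (a, c) and (b, c) have similarity 1 and are merged first; the link (a, b) joins them at similarity 0.75, and the pendant link (c, d) is absorbed last at similarity 0.25.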
Although meaningful structures exist at each level of the dendrogram, to
extract the best community structure representing the data it is necessary to
cut the dendrogram at a certain level. The cut level is chosen according to the
partition density

$$ D_P = \frac{2}{M} \sum_{c} m_c \, \frac{m_c - (n_c - 1)}{(n_c - 2)(n_c - 1)}, \tag{2} $$

where, given a link partition $P = \{P_c\}_{c=1,\dots,m}$, M is the total number
of links in the network, $m_c$ is the number of links of the c-th cluster of P,
and $n_c$ is the number of nodes induced by the links in $P_c$. The partition
density is an objective function measuring the quality of a link partition in
terms of link density inside communities, and it is evaluated at each level of
the link dendrogram. The dendrogram is then cut at the level identified by the
maximum value of $D_P$. Note that the most time-consuming phase is the
computation of the link similarity for each pair of links sharing a node. The
worst case is represented by a complete directed graph, in which the number M of
links is equal to N(N − 1), where N is the number of nodes of the network.
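The partition density of Eq. (2) and the choice of the cut level can be sketched as follows. This is a minimal illustration under stated assumptions: links are `(u, v)` tuples, a partition is a list of link clusters, and a cluster made of a single link (n_c = 2) is assumed to contribute zero, since the ratio in Eq. (2) is undefined there. The function names and the toy partitions are hypothetical.

```python
def partition_density(partition, total_links):
    """D_P of Eq. (2) for a list of link clusters (each a list of (u, v) links)."""
    acc = 0.0
    for cluster in partition:
        m_c = len(cluster)
        n_c = len({node for link in cluster for node in link})
        if n_c > 2:  # assumption: a cluster with n_c = 2 contributes 0
            acc += m_c * (m_c - (n_c - 1)) / ((n_c - 2) * (n_c - 1))
    return 2.0 * acc / total_links


def best_cut(partitions, total_links):
    """Return the partition (dendrogram level) with maximum partition density."""
    return max(partitions, key=lambda p: partition_density(p, total_links))


# Two candidate levels for a triangle a-b-c with a pendant node d attached to c.
triangle_level = [[("a", "b"), ("a", "c"), ("b", "c")], [("c", "d")]]
root_level = [[("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]]
chosen = best_cut([root_level, triangle_level], total_links=4)
```

For the triangle level, the triangle cluster has m_c = 3 and n_c = 3, so D_P = (2/4) · 3 · (3 − 2)/((3 − 2)(3 − 1)) = 0.75, while the root level with all four links in one cluster gives D_P = 1/3. The cut is therefore placed at the triangle level.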
Fig. 1. A sample RDF graph generated by the benchmark [19]. On the logical level,
the schema-graph (gray) and the data-graph (white) are distinguished. Dashed lines
represent edges labeled with rdf:type; sc stands for rdfs:subClassOf. URIs and blank
nodes are represented by ellipses, with blank nodes identified by the prefix _:;
literals are represented by quoted strings.
4 Preliminary Results
In this section, the algorithm proposed in [18] is adopted to find a set
$C = \{C_1, \dots, C_m\}$ of node-clusters, possibly overlapping, over RDF(S)
graphs. In order to evaluate the algorithm behavior over RDF(S) graphs, it has
first been tested on a small synthetic data-set generated with the SP2Bench
benchmark [19]. The benchmark provides a data generator for the creation of
arbitrarily large DBLP-like RDF documents, mirroring the key characteristics and
social-world distribution of the real DBLP 4 data-set. It makes it possible to
obtain a controlled data-set, useful for a first manual inspection of the
resulting clusters and for the definition of algorithm refinements in view of
its adoption over real large-scale RDF graphs. Starting with a triple count
limit fixed to 50, the generator ends up in a consistent state resulting in a
total number of 266 triples, including data and schema information (see Fig. 1).
In detail, the description of 20 Articles and one Journal issue is created,
together with the schema definition. The execution of the algorithm of [18] on
the data-set results in 53 clusters, one of which corresponds to a separate
connected component of the RDF graph, originated by the property vertices of
$G_s$ describing the semantics of the link labels $L_R$ and $L_A$ in $L_d$. The
remaining clusters, obtained by cutting the link dendrogram at the level defined
by the partition density optimization, can be classified in three categories:
4 http://dblp.uni-trier.de/
226 S. Giannini
The considerable amount of RDF data-sets being made available on the Web,
and linked to each other, poses the question of how to efficiently exploit the
knowledge therein. As outlined in the Introduction, clustering seems to be a
good mining function in this perspective. The first RDF clustering techniques
presented in the literature rely on traditional data-clustering methods [21].
However, the RDF data-model is not intuitively suited for standard machine
learning algorithms. As a matter of fact, traditional data-clustering is based
on instances, and not on large interconnected graphs. Therefore, a first step in
the application of these methods over RDF graphs requires the extraction of
instances, where the term instance refers to a subgraph relevant for a resource
description. Unfortunately, instance extraction can be a time-consuming,
data-dependent phase (for a review of instance extraction methods and possible
solutions to the problem of instance-based clustering of RDF data see [22]).
Then, a classical hierarchical agglomerative or partitional algorithm can be
applied to build a tree of clusters. In this step, an important role is played
by the distance metric, which allows for evaluating the similarity between each
pair of extracted instances. A classification of different distance measures for
RDF instance clustering is adopted in [22].
First, the traditional feature-vector based clustering is revised. Paths of the
RDF subgraphs representing instances are mapped to features, and the value of
a feature is represented by the set of nodes reachable through that path. The
distance between two instances is then computed with a vector similarity notion
inspired by the Dice coefficient.
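A Dice-inspired distance over such feature vectors can be sketched as follows. The exact formula used in [22] is not reproduced here; this sketch assumes that an instance is a dictionary mapping a path (a tuple of predicate names) to the set of nodes reachable through it, that the per-feature Dice coefficients are averaged over the union of the two feature sets, and that the distance is one minus that average. All names and sample values are hypothetical.

```python
def dice(a, b):
    """Dice coefficient between two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def feature_vector_distance(inst_a, inst_b):
    """Distance in [0, 1]: one minus the mean per-path Dice coefficient."""
    paths = set(inst_a) | set(inst_b)
    if not paths:
        return 0.0
    # A path missing from one instance is treated as an empty node set.
    sim = sum(dice(inst_a.get(p, set()), inst_b.get(p, set())) for p in paths)
    return 1.0 - sim / len(paths)


# Hypothetical instances: path -> set of nodes reachable through that path.
article_1 = {("creator",): {"alice", "bob"}, ("type",): {"Article"}}
article_2 = {
    ("creator",): {"alice"},
    ("type",): {"Article"},
    ("journal",): {"journal_1"},
}
d = feature_vector_distance(article_1, article_2)
```

Here the three paths contribute Dice coefficients of 2/3, 1 and 0 respectively, so the distance is 1 − 5/9 = 4/9.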
Graph-based distance measures exploit the topology of the RDF graph in the
similarity evaluation. Conceptual similarity and relational similarity, computed
respectively through the evaluation of the node overlap and of the edge overlap
between the two subgraphs representing the pair of instances under evaluation,
can be combined for the extraction of a distance value [23]. The work in [24]
proposes instead kernel functions for RDF(S) data, i.e., specific for directed
labeled graphs, based on the concepts of intersection graph and intersection
tree between the instance-subgraphs, or the instance-trees, of the two examined
resources.
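The combination of node overlap and edge overlap can be sketched as follows. The precise definitions used in [23] are not reproduced here; as an assumption, both overlaps are measured with the Jaccard ratio, edges are compared as full (subject, predicate, object) triples, and the two similarities are mixed with a hypothetical weight `alpha`.

```python
def overlap(a, b):
    """Jaccard overlap between two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def graph_distance(triples_a, triples_b, alpha=0.5):
    """One minus a weighted mix of node overlap and edge overlap."""
    nodes_a = {n for s, _, o in triples_a for n in (s, o)}
    nodes_b = {n for s, _, o in triples_b for n in (s, o)}
    conceptual = overlap(nodes_a, nodes_b)                # node overlap
    relational = overlap(set(triples_a), set(triples_b))  # edge overlap
    return 1.0 - (alpha * conceptual + (1.0 - alpha) * relational)


# Hypothetical instance subgraphs for two resources x and y.
inst_x = {("x", "knows", "bob"), ("x", "type", "Person")}
inst_y = {("y", "knows", "bob"), ("y", "type", "Person")}
dist = graph_distance(inst_x, inst_y)
```

The two instances share two of four nodes and no labeled edge, so with `alpha = 0.5` the distance is 1 − 0.25 = 0.75. Comparing full triples makes the edge overlap strict; comparing only predicate labels would be a looser alternative.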
On the other hand, when an ontological background expressed with a semantic
language is available, an ontologically-based distance measure can be
adopted. Maedche and Zacharias developed and applied an ontological similarity
metric for RDF data conforming to a well-defined ontology [25]. It is expressed
as the weighted combination of three dimensions: the taxonomy, the relation
and the attribute similarity. It does not require the instance extraction
phase, but it turned out to be more expensive than other similarity measures.
A different approach for clustering RDF statements exploiting the ontological
knowledge was proposed by Delteil et al. in [26]. It relies on the construction
of a subsumption hierarchy, systematically generating the most specific
generalization of all possible sets of resources, using both the intension and
extension of newly formed concepts. Unfortunately, both ontologically-based
methods cited here are not applicable to real Semantic Web data, which are often
noisy, incomplete and inconsistent (a list of assumptions that the RDF data-set
at hand has to satisfy for these methods to be applicable is reported in [22]).
The most noticeable limitation of traditional data-clustering techniques over
RDF graphs is that they do not fully exploit the native data model, i.e., the
graph model, which makes the instance extraction phase necessary. More recently,
the investigation of other suitable methods for RDF clustering has been
proposed; in particular, the techniques that fall into the category of graph
mining. The goal of graph clustering is to partition graph vertices into several
densely connected components, based on various criteria (vertex connectivity,
neighborhood similarity, etc.). Many graph clustering methods are solely based
on the network structure (see [27] for a comparison). A recent work by Alzogbi
and Lausen [28] treats the problem of identifying similar structures inside an
RDF graph, in order to reduce its size while keeping its structure as unchanged
as possible. It combines bisimulation with an agglomerative data-clustering
step. However, their work aims at finding sets of nodes of an RDF graph which
cannot be distinguished by looking at the sequence of predicates of their paths.
This means that the method is focused only on paths, and not on property-object
patterns. In many real applications, both the graph topological structure and
the attribute similarity of vertices, or relations, have to be taken into
account. In fact, an ideal graph clustering method should generate clusters
which have a cohesive intra-cluster structure with homogeneous vertex
properties. Although they do not work directly on RDF graphs, the authors of
[29] propose an augmented version of the graph that makes it possible to also
take attribute similarity into account in the graph partitioning task, making
the approach interesting for this state-of-the-art analysis. The augmented graph
partitioning is then realized through a k-medoids clustering method, using a
pairwise distance measure based on vertex neighborhood similarity. However, the
algorithm relies on the a priori definition of the number of clusters to
extract.
Community detection methods, while also concerned with dividing a graph
into well-connected partitions, are more exploratory in spirit, and aim at
extracting the number and size of partitions from the data [11]. Therefore, they
seem to represent a good candidate for refurbishing or improving RDF clustering
methods. To the best of my knowledge, there is no analysis of the behavior of
community detection algorithms on RDF graphs for RDF clustering purposes.
6 Conclusions
References
11. Fortunato, S.: Community detection in graphs. Physics Reports 486(3), 75–174
(2010)
12. W3C: http://www.w3.org/tr/r2rml/ (September 27, 2012)
13. W3C: http://www.w3.org/tr/rdfa-lite/ (June 07, 2012)
14. Augenstein, I., Padó, S., Rudolph, S.: LODifier: Generating linked data from
unstructured text. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O.,
Presutti, V. (eds.) ESWC 2012. LNCS, vol. 7295, pp. 210–224. Springer,
Heidelberg (2012)
15. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of
communities in large networks. Journal of Statistical Mechanics: Theory and
Experiment 2008(10), P10008 (2008)
16. Tran, T., Cimiano, P., Rudolph, S., Studer, R.: Ontology-based interpretation of
keywords for semantic search. In: Aberer, K., et al. (eds.) ASWC 2007 and ISWC
2007. LNCS, vol. 4825, pp. 523–536. Springer, Heidelberg (2007)
17. Evans, T., Lambiotte, R.: Line graphs, link partitions, and overlapping
communities. Physical Review E 80(1), 016105 (2009)
18. Ahn, Y.Y., Bagrow, J.P., Lehmann, S.: Link communities reveal multiscale
complexity in networks. Nature 466(7307), 761–764 (2010)
19. Schmidt, M., Hornung, T., Lausen, G., Pinkel, C.: SP2Bench: a SPARQL performance
benchmark. In: IEEE 25th International Conference on Data Engineering, ICDE
2009, pp. 222–233. IEEE (2009)
20. Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N., Barabási, A.L.: Hierarchical
organization of modularity in metabolic networks. Science 297(5586), 1551–1555
(2002)
21. Fanizzi, N., d’Amato, C.: A hierarchical clustering method for semantic knowledge
bases. In: Apolloni, B., Howlett, R.J., Jain, L. (eds.) KES 2007, Part III. LNCS
(LNAI), vol. 4694, pp. 653–660. Springer, Heidelberg (2007)
22. Grimnes, G.A., Edwards, P., Preece, A.D.: Instance based clustering of semantic
web resources. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M.
(eds.) ESWC 2008. LNCS, vol. 5021, pp. 303–317. Springer, Heidelberg (2008)
23. Grimnes, G.A., Edwards, P., Preece, A.D.: Learning meta-descriptions of the FOAF
network. In: McIlraith, S.A., Plexousakis, D., van Harmelen, F. (eds.) ISWC 2004.
LNCS, vol. 3298, pp. 152–165. Springer, Heidelberg (2004)
24. Lösch, U., Bloehdorn, S., Rettinger, A.: Graph kernels for RDF data. In: Simperl,
E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) ESWC 2012. LNCS,
vol. 7295, pp. 134–148. Springer, Heidelberg (2012)
25. Maedche, A., Zacharias, V.: Clustering ontology-based metadata in the semantic
web. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) PKDD 2002. LNCS (LNAI),
vol. 2431, pp. 348–360. Springer, Heidelberg (2002)
26. Delteil, A., Faron-Zucker, C., Dieng, R.: Learning ontologies from RDF annotations.
In: Workshop on Ontology Learning (2001)
27. Rattigan, M.J., Maier, M., Jensen, D.: Graph clustering with network structure
indices. In: Proceedings of the 24th International Conference on Machine Learning,
pp. 783–790. ACM (2007)
28. Alzogbi, A., Lausen, G.: Similar structures inside RDF-graphs (2013)
29. Zhou, Y., Cheng, H., Yu, J.X.: Graph clustering based on structural/attribute
similarities. Proceedings of the VLDB Endowment 2(1), 718–729 (2009)