FreGraPaD: Frequent RDF Graph Patterns Detection for Semantic Data Streams
Abstract: Nowadays, high volumes of data are generated and published at a very high velocity by real-time systems, such as social networks, e-commerce, weather stations and sensors, producing heterogeneous data streams. To take advantage of linked data and offer interoperable solutions, semantic Web technologies have been used. To analyze these huge volumes of data, different stream mining algorithms exist such as compression or load-shedding. Nevertheless, most of them need many passes through the data and often store part of it on disk. If we want to apply efficient compression on semantic data streams, we need to first detect frequent graph patterns in RDF streams. In this article, we present FreGraPaD, an algorithm that detects those patterns in a single pass, using exclusively internal memory and following a data structure oriented approach. Experimental results clearly confirm the good accuracy of FreGraPaD in detecting frequent graph patterns from semantic data streams.
I. INTRODUCTION
Today's information production is reaching an astronomical volume which is overwhelming the capacity of systems to manage it and make use of it. Every second, 29 000 gigabytes (GB) of information are published in the world, i.e. 2.5 exabytes a day or 912.5 exabytes a year¹. The Digital Universe Study² declares that the production reached 4.4 zettabytes, and forecasts that it will exceed 44 zettabytes around 2020. Much of these impressive volumes of data are generated continuously as streams on the Web. In fact, this data is produced by sensors like weather stations or GPS, or by different applications such as electronic commerce or social networks where users have become part of the web. Thus, every second, nearly 126 RFID chips are sold, 5 900 tweets are sent, 43 000 YouTube videos are viewed, 100 000 searches are done on Google and more than 510 000 Facebook comments are posted¹.
These high volumes of data are generated and published at a very high velocity, producing heterogeneous data streams. The need for knowledge extraction from these continuous data streams has favoured some initiatives such as the Semantic Sensor Web (SSW)³ which have semanticized their descriptions and their observation data using semantic web technologies [6] and gave rise to semantic data streams. Today, these initiatives increase interoperability and provide useful information, such as contextual data, related to target applications. However, given the specificity of this type of streams, neither the semantic web technologies nor those of Data Stream Management Systems (DSMS) (see [1] and [3]) are adapted to process this new type of data streams. This has favoured the emergence of a new research axis from the semantic web community and led researchers to propose new systems called RDF Stream Processors (RSP), to deal with this new kind of streams: C-SPARQL [5], CQELS [18], SPARQL Stream [9], Sparkwave [15], EP-SPARQL [2] and Streaming SPARQL [7]. Recently, this interest has also led to the creation of the W3C RDF Stream Processing Community Group (RSP)⁴ whose purpose is the definition of "a common model for producing, transmitting and continuously querying RDF Streams"⁴. Unfortunately, the proposed systems are limited [16] and become fallible as soon as their maximum supported speed is reached and/or the resources of the system hosting them are saturated (overload). This necessarily leads to a significant increase in response time and an inevitable degradation in quality of their response and sometimes even to the crash of the system.
In a limited system resources environment, it becomes essential to know the properties of the received data (type, size, structure, etc.) and use them to anticipate and prevent costly processing tasks by doing some "low-cost" preprocessing ones. Existing techniques such as compression or load-shedding (see [4] and [21]) could help to reduce the size of incoming data. To do that, some pre-processing tasks are necessary. Nevertheless, these tasks (i) need many passes through the data and often (ii) use disk space to store part of their data. Studying the RDF stream and detecting the related graph patterns will help to improve those pre-processing tasks.
¹ http://www.planetoscope.com/developpement-durable/Internet-
² http://france.emc.com/collateral/analyst-reports/idc-digital-universe-2014.pdf
³ http://en.wikipedia.org/wiki/Semantic_Sensor_Web
⁴ http://www.w3.org/community/rsp/
We propose in this paper a novel algorithm: FreGraPaD which stands for Frequent RDF Graph Patterns Detection. It
handles the limitations listed above. It detects frequent RDF graph patterns in a single pass, using exclusively internal memory and following a data structure oriented approach. Experimental results clearly confirm the good effectiveness of our algorithm in detecting frequent graph patterns from semantic data streams.
FreGraPaD is, as far as we know, the first algorithm for detecting frequent RDF graph patterns in the semantic data stream processing area. It is based on two main key features: (k1) the intrinsic graph-based data model and (k2) the regularity of the RDF graph structures in semantic data streams. It addresses three challenges: (c1) to be a single pass algorithm; (c2) to be a memory oriented one; and (c3) to avoid false positives. To address these challenges, FreGraPaD is mainly based on two data structures: (1) Bit Vector, in order to identify the RDF graph patterns and optimize memory space usage; (2) Hash table, to identify, detect and hold predicates and graph patterns.
The rest of the paper is organized as follows: Section II presents the requirements of our approach. Section III presents a critical overview of graph stream and semantic data stream processing approaches. In Section IV, we review some needed basic foundations and we present the Frequent RDF Graph Patterns Detection (FreGraPaD) algorithm in Section V. Section VI reports the empirical evaluation. Finally, we conclude and give some research perspectives in Section VII.
II. REQUIREMENTS
Before we examine the various existing techniques of dynamic data stream processing, we detail some of the main requirements that our proposed algorithm must satisfy:
• R1: Pass-Efficiency. An algorithm should minimize the number of passes.
• R2: Memory Efficiency. An algorithm should minimize disk access.
• R3: Data structure oriented. An algorithm should use data structures instead of their values.
These requirements serve as a base to compare the existing approaches to ours.
III. RELATED WORK
Over the last decade, there has been considerable interest in designing algorithms that include data structures of dynamic graphs, for processing massive graphs in the data stream model [17].
Recent works (see [8] and [11]) propose different stream mining algorithms to mine frequent patterns in graph streams, using a specialized data structure called DSMatrix (Data Stream Matrix). It is, to the best of our knowledge, the newest approach that proves its efficiency to discover collections of frequently co-occurring connected edges, when applied on streams of linked graph structured data. Nevertheless, the proposed algorithms need many passes over the DSMatrix to detect the connected edges: a first pass to detect singleton pattern graphs, a second pass to detect graph patterns with two connected edges and so on. Besides, this data structure is saved on disk when exclusive memory use is recommended in such situations. Hence, they do not satisfy requirements R1 and R2.
Two recent works consider that semantic data streams are constituted of a small set of RDF schema and have a very regular RDF graph structure according to a graph-based data model. In the first one, authors propose RDSZ [14] (RDF Differential Stream compressor based on Zlib [12]), an approach for lossless RDF stream compression. Their main idea is based on a differential item encoding mechanism, by representing new items in the stream on the basis of the previously processed ones. The results are then compressed using the popular streaming compressor Zlib. The second work proposes ERI (Efficient RDF Interchange Format) for RDF data streams [13]. Inspired by EXI [20] (Efficient XML Interchange format), the authors present their straightforward approach which considers the list of all triples with the same subject. Thus, they called it subject-molecule and then took this grouping as the method by default. However, even if the authors of RDSZ and ERI, in their respective works, state that they (i) exploit the structural similarities in RDF data streams and (ii) detect frequent patterns, they need at least two passes over the data. Therefore, they do not satisfy R1.
We summarize the presented state of the art in Table I and then present some preliminary concepts in the following section before detailing our approach.
TABLE I: Comparative study of existing approaches
Approaches          Single pass (R1)   Memory oriented (R2)   Structure oriented (R3)
RDSZ [14]           ✗                  ✓                      ✗
ERI [13]            ✗                  ✓                      ✓
DSMatrix [8], [11]  ✗                  ✗                      ✓
FreGraPaD           ✓                  ✓                      ✓
IV. PRELIMINARIES
This section presents some definitions related to this paper: adjacency matrix as a data structure to represent graphs, the RDF Data Model and the RDF Data Stream Model.
A. Adjacency Matrix
One of the most known methods to represent a graph is the adjacency matrix, where a graph G of n nodes is represented by an n × n dimension matrix M[n × n], where M[i, j] is set to 1 if nodes i and j are connected to each other, 0 else.
$$\forall i, j \in [1, n],\quad M[i, j] = \begin{cases} 1 & \text{if } \{i, j\} \in G \\ 0 & \text{else.} \end{cases}$$
When the graph is undirected, the adjacency matrix is symmetric, which means that half of the memory used by the matrix, i.e. $n^2/2$, is uselessly occupied (Figure 1a). When the graph is directed, M[i, j] is set to 1 if node i is connected to node j, 0 else. This leads to a sparse matrix as shown in Figure 1b. Thus, only the half of the occupied memory space is necessary. One particular case of directed graph is the directed
star graph where only one node is connected to all the others. This leads to use only one row of the matrix ($1/n$ of the matrix) as illustrated in Figure 1c.
Fig. 1: Adjacency Matrix for: (a) graph, (b) directed graph and (c) directed star graph.
B. From RDF data model to RDF data stream model
The RDF data model⁵ is based upon the idea of making statements about resources as triples, in the form of (Subject, Predicate, Object) expressions. A triple (s, p, o) ∈ IB × I × IBL is called an RDF triple, where:
- I, B, and L are pair-wise disjoint domains of respectively Internationalized Resource Identifiers (IRIs), Blank nodes and Literals; and,
- IL = I ∪ L, IB = I ∪ B and IBL = I ∪ B ∪ L are the respective unions.
Thus, by extension, a Semantic data stream S is usually defined as an infinite set of ⟨(s, p, o) : [t]⟩, where (s, p, o) is a timestamped RDF triple at t, i.e.
$$S^{\leq t} = \{\langle (s, p, o) : [t'] \rangle \in S \mid t' \leq t\}$$
⁵ http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/
Thereby, a collection of linked RDF triples forms a directed and labeled multi-graph, where nodes represent the resources and each edge represents the labeled link between two resources. Thus, as a graph stream, an RDF stream S can also be represented as a continuous and unbounded sequence of ⟨G : [t]⟩ where G is a timestamped RDF graph at t time.
$$S^{\leq t} = \{\langle G : [t'] \rangle \in S \mid t' \leq t\}$$
$$G_i = \{(s_{i1}, p_{i1}, o_{i1}), (s_{i2}, p_{i2}, o_{i2}), \ldots, (s_{im}, p_{im}, o_{im})\}$$
$$G_i = \{(s_i, p_{i1}, o_{i1}), (s_i, p_{i2}, o_{i2}), \ldots, (s_i, p_{im}, o_{im})\}$$
By construction, we can deduce the following property:
a) Property: Each graph in an RDF Stream can be represented as a directed star-graph $G_i(V, E)$, where $V = \{v_0, v_1, \ldots, v_m\}$ is the set of vertices with $v_0$ as the central vertex and $v_i$ (i = 1..m) the leaf vertices and $E = \{(v_0, v_1), (v_0, v_2), \ldots, (v_0, v_m)\}$ is the set of edges labeled with the predicates.
b) Definition (Graph Pattern): Let $P = \{p_1, \ldots, p_n\}$ be the set of predicates in the graph stream S, $GP_i = \{p_i \in P \mid i \leq n\}$ a subset of P and $G_i \in S$ a graph in the stream. $GP_i$ is a graph pattern of $G_i$ iff
$$\forall p_i\ (p_i \in GP_i \rightarrow p_i \in G_i)$$
Each RDF graph $G_i$ in the stream S is constructed according to an RDF graph pattern $GP_i$, to which we associate a binary vector GraphBV. We say that the graph $G_i$ satisfies the pattern $GP_i$, if for all predicates $p_i$ in $G_i$, GraphBV[i − 1] = 1, thus:
$$\forall i \in [0, m-1],\quad GraphBV[i] = \begin{cases} 1 & \text{if } p_{i+1} \in G_i \\ 0 & \text{else.} \end{cases}$$
Our approach takes advantage of the fact that semantic data streams are created according to a known ontology describing the concepts and the relations between them. Thus, most studied RDF data streams are composed according to a very frequent schema of RDF data structure. We show in Listing 1 an example of a small RDF graph describing a resource using Turtle notation, where we can read the type of the resource (Person), his full name ("Jack Franc"), his email address (JF@mysite.org) and his title (Dr.). The graph representation is shown in Figure 2.
The next subsections present the main contribution of the paper: FreGraPaD, an algorithm that detects the frequent RDF graph patterns in the semantic data stream in a single pass through three steps: (1) the bit vector construction, to identify and hold the graph structure; (2) the predicates hash-table construction, which detects the predicate patterns in the stream and holds them; (3) the graphs hash-table construction, which detects and holds the RDF graph patterns.
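The graph-pattern definition and the GraphBV bit vector given above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names are hypothetical, the graph is assumed to arrive as a plain list of predicate labels, and the predicate names below (`rdf:type`, `foaf:name`, and so on) are invented stand-ins for the Person example, since Listing 1 itself is not reproduced here.

```python
from typing import Iterable, List, Set


def graph_bit_vector(graph_predicates: Iterable[str],
                     known_predicates: List[str]) -> List[int]:
    """Build GraphBV: bit i is 1 iff known_predicates[i] occurs in the graph."""
    present = set(graph_predicates)
    return [1 if p in present else 0 for p in known_predicates]


def is_graph_pattern(pattern: Set[str], graph_predicates: Iterable[str]) -> bool:
    """GP is a graph pattern of G iff every predicate of GP also occurs in G."""
    return pattern <= set(graph_predicates)


# Hypothetical predicate labels; the ordering of `known` plays the role of p1..pn.
known = ["rdf:type", "foaf:name", "foaf:mbox", "foaf:title", "foaf:age"]
person = ["rdf:type", "foaf:name", "foaf:mbox", "foaf:title"]

bv = graph_bit_vector(person, known)
```

Because each graph is a star around one subject, the list of its predicates is enough to describe its structure, which is why a single row of bits can stand in for the whole adjacency matrix.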
does not consider any minimum threshold of the frequency, this is left to the appreciation of the user and its use-case.
C. Proof of concept: frequent RDF graph patterns detection using FreGraPaD
Figure 4 illustrates how our algorithm detects patterns over an example of 5 RDF graphs composing an extract of a semantic data stream. The figure shows also the evolution of: the bit vector, the PHT and GHT tables whenever a graph is processed.
At the reception of the first graph (Graph1), the algorithm checks one by one the presence of its predicates a, b, c and d in PHT. Each corresponding bit in the bit vector is then set to 1 and the index incremented respectively from 0 to 1, 2 and then 3, because all of them are new predicates. This gives 1111 (15) as value for the bit vector. Thus, each predicate is inserted in PHT with its corresponding index <a, 0>, <b, 1>, <c, 2> and <d, 3>. After processing the last predicate (d), the algorithm checks the presence of the constructed bit vector (constituting the graph pattern 1111 (15)) in GHT. As a first pattern, it is inserted with frequency 1, i.e. <15, 1>.
When the second graph (Graph2) is received, the bit vector is initialized to 0. Then, FreGraPaD discovers that: (i) its two first predicates a and b already exist in PHT, so the bits corresponding to their PHT indexes (0 and 1) in the bit vector are set to 1 and no insertion in PHT is needed; (ii) the third predicate (e) is a new one, so the PHT index is incremented (from 3 to 4) and its corresponding bit is set to 1 leading to the insertion of <e, 4> in PHT; (iii) Graph2 has a new pattern 10011 (19) which gives rise to the insertion of <19, 1> in GHT.
The algorithm continues so on with no insertion in PHT, an insertion of <6, 1> in GHT for 110 as new detected pattern for Graph3, an incrementation of the frequency of pattern 19 for Graph4 and finally an incrementation of the frequency of pattern 6 for Graph5. At the end, FreGraPaD returns GHT and PHT giving 3 graph patterns 15, 19 and 6 based on 5 predicates a, b, c, d and e. Note that, at each step, we mark by asterisk (∗) the concerned fields in each hash table.
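The decimal values quoted in this walkthrough follow from reading each bit vector as a binary number in which the predicate with PHT index k contributes $2^k$ (index 0 is the rightmost digit). As a worked check, assuming Graph3 holds the predicates b and c, which is what its pattern 110 implies:

```latex
\begin{align*}
\text{Graph1 } \{a, b, c, d\} &: \; 2^0 + 2^1 + 2^2 + 2^3 = 1111_2 = 15 \\
\text{Graph2 } \{a, b, e\}    &: \; 2^0 + 2^1 + 2^4 = 10011_2 = 19 \\
\text{Graph3 } \{b, c\}       &: \; 2^1 + 2^2 = 110_2 = 6
\end{align*}
```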
Algorithm 1: Frequent RDF Graph Patterns Detection
Data: RDF Stream
Result: Graph patterns Hash Table, Predicates Hash Table
begin
    HashTable PHT<predicate, index>;
    HashTable GHT<graph, frequency>;
    Int i ← 0;                                  /* Index initialization */
    foreach graph ∈ RDF Stream do
        BitVector GraphBV ← 0;                  /* Pattern bit vector initialization */
        foreach predicate ∈ graph do
            if predicate ∈ PHT then
                ind ← PHT.get(predicate);       /* Get the predicate's bit index */
            else
                ind ← i;
                PHT.put(predicate, ind);        /* New predicate */
                i ← i + 1;
            end
            GraphBV[ind] ← 1;                   /* Set the corresponding bit to 1 */
        end
        if GraphBV ∈ GHT then
            frequency ← GHT.get(GraphBV) + 1;   /* Increment the pattern's frequency */
            GHT.put(GraphBV, frequency);
        else
            GHT.put(GraphBV, 1);                /* New graph pattern */
        end
    end
    return GHT, PHT;                            /* The graph patterns and predicate hash tables */
end
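A minimal runnable sketch of Algorithm 1 in Python follows. It is an illustration under stated assumptions rather than the authors' code: each graph is assumed to arrive as an iterable of predicate labels (subjects and objects already stripped), the bit vector is held in a Python integer, and the function name `fregrapad` is hypothetical. The sample stream reproduces the five-graph walkthrough; the contents of Graph3, Graph4 and Graph5 are inferred from the pattern values 6 and 19 given in the text.

```python
from typing import Dict, Hashable, Iterable, Tuple


def fregrapad(stream: Iterable[Iterable[Hashable]]
              ) -> Tuple[Dict[int, int], Dict[Hashable, int]]:
    """Count graph patterns in one pass; returns (GHT, PHT)."""
    pht: Dict[Hashable, int] = {}   # predicate -> bit index
    ght: Dict[int, int] = {}        # bit vector (as an int) -> frequency
    for graph in stream:
        graph_bv = 0                # pattern bit vector initialization
        for predicate in graph:
            # Known predicates keep their index; new ones get the next free index.
            ind = pht.setdefault(predicate, len(pht))
            graph_bv |= 1 << ind    # set the corresponding bit to 1
        ght[graph_bv] = ght.get(graph_bv, 0) + 1
    return ght, pht


stream = [
    ["a", "b", "c", "d"],   # Graph1
    ["a", "b", "e"],        # Graph2
    ["b", "c"],             # Graph3 (inferred from pattern 110)
    ["b", "a", "e"],        # Graph4: same predicates as Graph2, different order
    ["c", "b"],             # Graph5: same predicates as Graph3, different order
]
ght, pht = fregrapad(stream)
```

Using an integer as the bit vector keeps the pattern hashable, so it can serve directly as the GHT key, and it makes the pattern independent of the order in which predicates appear in a graph.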
are two datasets delivered by Linked Observation Data. They represent sensor observations of different weather parameters. Those observations represent meteorological phenomena like humidity, temperature, pressure, visibility or precipitation. Finally, the well known Flickr⁹ and DbPedia¹⁰ datasets, which are generally static datasets such as the DBLP dataset.
⁹ https://www.flickr.com/
¹⁰ http://wiki.dbpedia.org/
B. Evaluation and Results
Table 2 summarizes the different experiments that we conducted to evaluate FreGraPaD. It lists the experimental datasets, reporting: number of triples, number of RDF graphs, number of predicates detected by FreGraPaD vs. ERI, number of RDF graph patterns detected by FreGraPaD vs. RDSZ and finally the range of the frequency of the detected RDF graph patterns in each dataset. Note that we give a unique value when our algorithm detects a unique RDF graph pattern, as for the petrol dataset. Thus, this shows that FreGraPaD does not detect false positives.
Table 2 shows that our algorithm gives the same results as ERI when dealing with homogeneous datasets and surpasses it when dealing with heterogeneous and irregular ones (e.g. DbPedia). It also shows that FreGraPaD largely outperforms RDSZ, as seen for the AEMET-1 dataset. This is due to the order of the predicates, which changes in some RDF graphs in the stream. Our algorithm considers that two RDF graphs composed of the same predicates in different order have the same graph pattern, which makes sense in the semantic web. As an example, the RDF graph {:J :Type :Person; :Name "Jean"; :Age 25.} is semantically the same RDF graph as {:J :Type :Person; :Age 25; :Name "Jean".}.
The last column of Table 2 gives the frequency range of the graph patterns, from the least frequent detected pattern to the most frequent one. For example, for the Flickr dataset the frequency goes from 1 to 3 122 147, and from 59 098 to 19 233 458 for the Katrina dataset, which illustrates how frequent some RDF graph patterns are even if they form a very small set of data structures. These results indicate that our approach is promising for the domain, especially for further treatment such as sketching, compressing or load-shedding semantic data streams.
Even if there is no limitation on the size of a bit vector in theory, some platforms could have restrictions on this data structure. Our experiments show that our algorithm is very useful in the case of homogeneous RDF data streams, which is very frequent in the domain, especially when the consumer does not know the data structure delivered by the producer. We show how frequent a very reduced set of RDF graph patterns can be in an unlimited semantic data stream. In some cases, the data is sent according to the same structure, as a unique pattern for the whole stream (e.g. petrol). Thus, we expect that using FreGraPaD in RDF data stream compression will save a significant amount of memory. Then, instead of millions of data values, systems will deal with only a very reduced set of data structures. Flickr and LOD datasets are good illustrations of this case.
VII. CONCLUSION
We present in this paper a novel algorithm to deliver efficiently the exact set of graph patterns detected in a processed RDF data stream. The experiments conducted in this paper have shown that we are able to detect frequent patterns in semantic data streams. This has been done in one pass with reduced memory space, despite the huge volume of the transmitted data in this particular kind of stream. Thus, it satisfies the R1 and R2 requirements. In addition, it satisfies R3 since it considers the data structure instead of the data itself.
In the immediate future, we will use the detected RDF graph patterns as a container for handling data in semantic data streams. We will use them to compress and transmit semantic data streams by exploiting their frequency to reduce considerably the volume of the RDF data stream. We plan to design a lossless algorithm using the detected patterns returned by FreGraPaD.
In our future work, we plan to use the graph bit vector data structure with an RDF Stream Processor on the consumer side to get the query pattern. The objective is to improve the efficiency of the RSP system by selecting the necessary data and load-shedding the unnecessary ones, by applying very simple Boolean operations such as AND or XOR.
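The Boolean filtering mentioned as future work could look like the sketch below. This is a guess at one possible reading, not a method described in the paper: it assumes the query pattern is encoded as a bit mask over the same PHT indexes as the graph patterns, and that a graph is kept when it carries every predicate the query needs. The function names and the sample query mask are hypothetical.

```python
def covers_query(graph_bv: int, query_bv: int) -> bool:
    """AND test: True if the graph pattern contains every predicate of the query."""
    return (graph_bv & query_bv) == query_bv


def differing_predicates(graph_bv: int, other_bv: int) -> int:
    """XOR: bits set where exactly one of the two patterns has the predicate."""
    return graph_bv ^ other_bv


query = 0b00011            # hypothetical query needing the predicates at indexes 0 and 1
patterns = [15, 19, 6]     # the three patterns of the proof-of-concept example

kept = [p for p in patterns if covers_query(p, query)]
shed = [p for p in patterns if not covers_query(p, query)]
diff = differing_predicates(15, 19)
```

Each test costs one machine-level operation per pattern, so the decision to keep or shed a graph can be taken from its pattern alone, before any of its data values are examined.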
ACKNOWLEDGMENTS
This work is partially funded by the French National Research Agency (ANR) project CAIR (ANR-14-CE23-0006).
REFERENCES
[23] R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. ACM SIGMOD Record, 22(2):207–216, June 1993.
[1] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, C. Erwin,
E. Galvez, M. Hatoun, A. Maskey, A. Rasin, et al. Aurora: a data
stream management system. In Proceedings of the 2003 ACM SIGMOD
international conference on Management of data. ACM, 2003.
[2] D. Anicic, P. Fodor, S. Rudolph, and N. Stojanovic. Ep-sparql: a unified
language for event processing and stream reasoning. In Proceedings of
the 20th international conference on World wide web, WWW ’11, pages
635–644, New York, NY, USA, 2011. ACM.
[3] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, I. Nishizawa,
J. Rosenstein, and J. Widom. Stream: the stanford stream data manager
(demonstration description). In Proceedings of the 2003 ACM SIGMOD
international conference on Management of data, pages 665–665. ACM,
2003.
[4] B. Babcock, M. Datar, and R. Motwani. Load shedding for aggregation
queries over data streams. In Data Engineering, 2004. Proceedings. 20th
International Conference on, pages 350–361, March 2004.
[5] D. F. Barbieri, D. Braga, S. Ceri, and M. Grossniklaus. An execution
environment for c-sparql queries. In Proceedings of the 13th International
Conference on Extending Database Technology, EDBT ’10, pages 441–
452, New York, NY, USA, 2010. ACM.
[6] T. Berners-Lee, J. Hendler, O. Lassila, et al. The semantic web. Scientific
american, 284(5):28–37, 2001.
[7] A. Bolles, M. Grawunder, and J. Jacobi. Streaming sparql extending
sparql to process data streams. In Proceedings of the 5th ESWC on
The semantic web: research and applications, ESWC’08, pages 448–462,
Berlin, Heidelberg, 2008. Springer-Verlag.
[8] P. Braun, J. J. Cameron, A. Cuzzocrea, F. Jiang, and C. K. Leung.
Effectively and efficiently mining frequent patterns from dense graph
streams on disk. Procedia Computer Science, 35:338–347, 2014.
[9] J.-P. Calbimonte, O. Corcho, and A. J. G. Gray. Enabling ontology-based
access to streaming data sources. In ISWC2010, pages 96–111, 2010.
[10] Ó. Corcho, D. Garijo Verdejo, J. Mora, M. Poveda Villalon,
D. Vila Suero, B. Villazón-Terrazas, P. Rozas, and G. A. Atemezing.
Transforming meteorological data into linked data. Semantic Web, 2012.
[11] A. Cuzzocrea, F. Jiang, and C. K. L. Leung. Frequent subgraph mining
from streams of linked graph structured data. pages 237–244, 2015.
[12] P. Deutsch and J.-L. Gailly. Zlib compressed data format specification
version 3.3. Technical report, 1996.
[13] J. D. Fernández, A. Llaves, and O. Corcho. Efficient rdf interchange
(eri) format for rdf data streams. In The Semantic Web–ISWC 2014, pages
244–259. Springer.
[14] N. Fernández, J. Arias, L. Sánchez, D. Fuentes-Lorenzo, and Ó. Corcho.
Rdsz: An approach for lossless rdf stream compression. In The Semantic
Web: Trends and Challenges, pages 52–67. Springer, 2014.
[15] S. Komazec, D. Cerri, and D. Fensel. Sparkwave: continuous schema-
enhanced pattern matching over RDF data streams. In DEBS2012, pages
58–68. ACM.
[16] A. Margara, J. Urbani, F. van Harmelen, and H. Bal. Streaming the
web: Reasoning over dynamic data. Web Semantics: Science, Services
and Agents on the World Wide Web, 0(0), 2014.
[17] A. McGregor. Graph stream algorithms: A survey. SIGMOD Rec.,
43(1):9–20, May 2014.
[18] D. L. Phuoc. A Native and Adaptive Approach for Linked Stream Data
Processing. Phd thesis, Digital Enterprise Research Institute, National
University of Ireland, Galway, 2013.
[19] E. Prud'hommeaux and G. Carothers. RDF 1.1 Turtle: Terse RDF triple
language. W3C Recommendation, 25 February 2014.
[20] J. Schneider, T. Kamiya, D. Peintner, and R. Kyusakov. Efficient xml
interchange (exi) format 1.0. W3C Proposed Recommendation, 20, 2011.
[21] N. Tatbul, U. Çetintemel, S. B. Zdonik, M. Cherniack, and M. Stone-
braker. Load shedding in a data stream manager. In VLDB, pages 309–
320, 2003.
[22] B. Babcock, M. Datar, and R. Motwani. Load shedding in data stream
systems. In Data Streams, pages 127–147. Springer US, 2007.