
FreGraPaD: Frequent RDF Graph Patterns Detection for semantic data streams


Fethi Belghaouti∗, Amel Bouzeghoub∗, Zakia Kazi-Aoul† and Raja Chiky†
∗ SAMOVAR, Telecom SudParis, CNRS, Université Paris-Saclay,
9 rue Charles Fourier, 91011 Evry Cedex, France
www.telecom-sudparis.eu
Email: firstname.lastname@telecom-sudparis.eu
† Institut Supérieur d'Electronique de Paris,
28 Rue Notre-Dame des Champs, 75006 Paris, France
www.isep.fr
Email: zkazi,raja.chiky@isep.fr

Abstract—Nowadays, high volumes of data are generated and published at a very high velocity by real-time systems, such as social networks, e-commerce, weather stations and sensors, producing heterogeneous data streams. To take advantage of linked data and offer interoperable solutions, semantic Web technologies have been used. To analyze these huge volumes of data, different stream mining techniques exist, such as compression or load-shedding. Nevertheless, most of them need many passes through the data and often store part of it on disk. If we want to apply efficient compression on semantic data streams, we first need to detect frequent graph patterns in RDF streams. In this article, we present FreGraPaD, an algorithm that detects those patterns in a single pass, using exclusively internal memory and following a data structure oriented approach. Experimental results confirm the good accuracy of FreGraPaD in detecting frequent graph patterns from semantic data streams.

I. INTRODUCTION

Today's information production is reaching an astronomical volume which is overwhelming the capacity of systems to manage it and make use of it. Every second, 29 000 gigabytes (GB) of information are published in the world, i.e. 2.5 exabytes a day or 912.5 exabytes a year¹. The Digital Universe Study² reports that production reached 4.4 zettabytes and forecasts that it could exceed 44 zettabytes around 2020. Much of this data is generated continuously as streams on the Web. It is produced by sensors, such as weather stations or GPS devices, or by applications such as electronic commerce or social networks, where users have become part of the Web. Thus, every second, nearly 126 RFID chips are sold, 5 900 tweets are sent, 43 000 YouTube videos are viewed, 100 000 searches are done on Google and more than 510 000 Facebook comments are posted¹.

These high volumes of data are generated and published at a very high velocity, producing heterogeneous data streams. The need for knowledge extraction from these continuous data streams has favoured initiatives such as the Semantic Sensor Web (SSW)³, which have semanticized their descriptions and their observation data using semantic web technologies [6] and gave rise to semantic data streams. Today, these initiatives increase interoperability and provide useful information, such as contextual data, related to target applications. However, given the specificity of this type of stream, neither semantic web technologies nor Data Stream Management Systems (DSMS) (see [1] and [3]) are adapted to process it. This has favoured the emergence of a new research axis in the semantic web community and led researchers to propose new systems called RDF Stream Processors (RSP) to deal with this new kind of stream: C-SPARQL [5], CQELS [18], SPARQL Stream [9], Sparkwave [15], EP-SPARQL [2] and Streaming SPARQL [7]. Recently, this interest has also led to the creation of the W3C RDF Stream Processing Community Group (RSP)⁴, whose purpose is the definition of "a common model for producing, transmitting and continuously querying RDF Streams"⁴. Unfortunately, the proposed systems are limited [16] and become fallible as soon as their maximum supported speed is reached and/or the resources of the system hosting them are saturated (overload). This necessarily leads to a significant increase in response time, an inevitable degradation in the quality of their responses and sometimes even a crash of the system.

In an environment with limited system resources, it becomes essential to know the properties of the received data (type, size, structure, etc.) and use them to anticipate and prevent costly processing tasks by doing some "low-cost" preprocessing ones. Existing techniques such as compression or load-shedding (see [4] and [21]) could help to reduce the size of incoming data. To do that, some pre-processing tasks are necessary. Nevertheless, these tasks (i) need many passes through the data and (ii) often use disk space to store part of their data. Studying the RDF stream and detecting the related graph patterns will help to improve those pre-processing tasks.

¹ http://www.planetoscope.com/developpement-durable/Internet-
² http://france.emc.com/collateral/analyst-reports/idc-digital-universe-2014.pdf
³ http://en.wikipedia.org/wiki/Semantic_Sensor_Web
⁴ http://www.w3.org/community/rsp/

We propose in this paper a novel algorithm: FreGraPaD, which stands for Frequent RDF Graph Patterns Detection. It
handles the limitations listed above. It detects frequent RDF graph patterns in a single pass, using exclusively internal memory and following a data structure oriented approach. Experimental results confirm the effectiveness of our algorithm in detecting frequent graph patterns from semantic data streams.

FreGraPaD is, as far as we know, the first algorithm for detecting frequent RDF graph patterns in the semantic data stream processing area. It is based on two key features: (k1) the intrinsic graph-based data model and (k2) the regularity of the RDF graph structures in semantic data streams. It addresses three challenges: (c1) to be a single-pass algorithm; (c2) to be memory oriented; and (c3) to avoid false positives. To address these challenges, FreGraPaD relies mainly on two data structures: (1) a bit vector, to identify the RDF graph patterns and optimize memory usage; (2) hash tables, to identify, detect and hold predicates and graph patterns.

The rest of the paper is organized as follows: Section II presents the requirements of our approach. Section III presents a critical overview of graph stream and semantic data stream processing approaches. In Section IV, we review some needed basic foundations, and we present the Frequent RDF Graph Patterns Detection (FreGraPaD) algorithm in Section V. Section VI reports the empirical evaluation. Finally, we conclude and give some research perspectives in Section VII.

II. REQUIREMENTS

Before we examine the various existing techniques of dynamic data stream processing, we detail the main requirements that our proposed algorithm must satisfy:
• R1: Pass-Efficiency. An algorithm should minimize the number of passes.
• R2: Memory Efficiency. An algorithm should minimize disk access.
• R3: Data structure oriented. An algorithm should use data structures instead of their values.
These requirements serve as a basis to compare the existing approaches to ours.

III. RELATED WORK

Over the last decade, there has been considerable interest in designing algorithms that include data structures of dynamic graphs, for processing massive graphs in the data stream model [17].

Recent works (see [8] and [11]) propose different stream mining algorithms to mine frequent patterns in graph streams, using a specialized data structure called DSMatrix (Data Stream Matrix). It is, to the best of our knowledge, the newest approach that proves its efficiency in discovering collections of frequently co-occurring connected edges when applied to streams of linked graph structured data. Nevertheless, the proposed algorithms need many passes over the DSMatrix to detect the connected edges: a first pass to detect singleton pattern graphs, a second pass to detect graph patterns with two connected edges, and so on. Besides, this data structure is saved on disk, whereas exclusive memory use is recommended in such situations. Hence, they do not satisfy requirements R1 and R2.

Two recent works consider that semantic data streams are constituted of a small set of RDF schemas and have a very regular RDF graph structure according to a graph-based data model. In the first one, the authors propose RDSZ [14] (RDF Differential Stream compressor based on Zlib [12]), an approach for lossless RDF stream compression. Their main idea is based on a differential item encoding mechanism, representing new items in the stream on the basis of the previously processed ones. The results are then compressed using the popular streaming compressor Zlib. The second work proposes ERI (Efficient RDF Interchange format) for RDF data streams [13]. Inspired by EXI [20] (Efficient XML Interchange format), the authors present a straightforward approach which considers the list of all triples with the same subject. They call it a subject-molecule and take this grouping as the default method. However, even if the authors of RDSZ and ERI state in their respective works that they (i) exploit the structural similarities in RDF data streams and (ii) detect frequent patterns, they need at least two passes over the data. Therefore, they do not satisfy R1.

We summarize the presented state of the art in Table I and then present some preliminary concepts in the following section before detailing our approach.

TABLE I: Comparative study of existing approaches

Approaches           Single pass (R1)   Memory oriented (R2)   Structure oriented (R3)
RDSZ [14]            ✗                  ✓                      ✗
ERI [13]             ✗                  ✓                      ✓
DSMatrix [8], [11]   ✗                  ✗                      ✓
FreGraPaD            ✓                  ✓                      ✓

IV. PRELIMINARIES

This section presents some definitions related to this paper: the adjacency matrix as a data structure to represent graphs, the RDF Data Model and the RDF Data Stream Model.

A. Adjacency Matrix

One of the best-known methods to represent a graph is the adjacency matrix, where a graph $G$ of $n$ nodes is represented by an $n \times n$ matrix $M$, where $M[i,j]$ is set to 1 if nodes $i$ and $j$ are connected to each other, 0 otherwise.

$$\forall i, j \in [1, n],\quad M[i,j] = \begin{cases} 1 & \text{if } \{i, j\} \in G \\ 0 & \text{otherwise.} \end{cases}$$

When the graph is undirected, the adjacency matrix is symmetric, which means that half of the memory used by the matrix, i.e. $n^2/2$, is uselessly occupied (Figure 1a). When the graph is directed, $M[i,j]$ is set to 1 if node $i$ is connected to node $j$, 0 otherwise. This leads to a sparse matrix as shown in Figure 1b. Thus, only half of the occupied memory space is necessary. One particular case of directed graph is the directed
star graph where only one node is connected to all the others. This leads to using only one row of the matrix ($1/n$ of it), as illustrated in Figure 1c.

Fig. 1: Adjacency Matrix for: (a) graph, (b) directed graph and (c) directed star graph.

B. From RDF data model to RDF data stream model

The RDF data model⁵ is based upon the idea of making statements about resources as triples, in the form of (Subject, Predicate, Object) expressions. A triple $(s, p, o) \in IB \times I \times IBL$ is called an RDF triple, where:
- $I$, $B$ and $L$ are pair-wise disjoint domains of, respectively, Internationalized Resource Identifiers (IRIs), Blank nodes and Literals; and,
- $IL = I \cup L$, $IB = I \cup B$ and $IBL = I \cup B \cup L$ are the respective unions.

Thus, by extension, a semantic data stream $S$ is usually defined as an infinite set of $\langle (s, p, o) : [t] \rangle$, where $(s, p, o)$ is a timestamped RDF triple at $t$, i.e.

$$S^{\le t} = \{\langle (s, p, o) : [t'] \rangle \in S \mid t' \le t\}$$

⁵ http://www.w3.org/TR/1999/REC-rdf-syntax-19990222/

Thereby, a collection of linked RDF triples forms a directed and labeled multi-graph, where nodes represent the resources and each edge represents the labeled link between two resources. Thus, as a graph stream, an RDF stream $S$ can also be represented as a continuous and unbounded sequence of $\langle G : [t] \rangle$, where $G$ is a timestamped RDF graph at time $t$.

$$S^{\le t} = \{\langle G : [t'] \rangle \in S \mid t' \le t\}$$

Based on this definition, the next section develops how we define and detect RDF graph patterns in the stream, using the bit vector data structure.

V. FREQUENT RDF GRAPH PATTERNS DETECTION: FREGRAPAD

In this section, we present our adopted definitions and approach for detecting frequent RDF graph patterns in semantic data streams. Then, as a proof of concept, we illustrate its execution over a graph stream model example. Finally, we present the FreGraPaD algorithm.

Let $S = \{G_1, \ldots, G_n\}$ be an RDF stream composed of a series of graphs where each $G_i$ is a set of triples:

$$G_i = \{(s_{i1}, p_{i1}, o_{i1}), (s_{i2}, p_{i2}, o_{i2}), \ldots, (s_{im}, p_{im}, o_{im})\}$$

These graphs generally describe one resource with object and data properties. A unique subject is then linked via these properties to several objects (using object properties) or values (using data properties). The graph is thus reduced to a set of triples having the same subject:

$$s_i = s_{i1} = s_{i2} = \ldots = s_{im}$$

$$G_i = \{(s_i, p_{i1}, o_{i1}), (s_i, p_{i2}, o_{i2}), \ldots, (s_i, p_{im}, o_{im})\}$$

By construction, we can deduce the following property:

a) Property: Each graph in an RDF stream can be represented as a directed star graph $G_i(V, E)$, where $V = \{v_0, v_1, \ldots, v_m\}$ is the set of vertices, with $v_0$ as the central vertex and $v_i$ $(i = 1..m)$ the leaf vertices, and $E = \{(v_0, v_1), (v_0, v_2), \ldots, (v_0, v_m)\}$ is the set of edges labeled with the predicates.

b) Definition (Graph Pattern): Let $P = \{p_1, \ldots, p_n\}$ be the set of predicates in the graph stream $S$, $GP_i = \{p_i \in P \mid i \le n\}$ a subset of $P$ and $G_i \in S$ a graph in the stream. $GP_i$ is a graph pattern of $G_i$ iff

$$\forall p_i\, (p_i \in GP_i \rightarrow p_i \in G_i)$$

Each RDF graph $G_i$ in the stream $S$ is constructed according to an RDF graph pattern $GP_i$, to which we associate a binary vector $GraphBV$. We say that the graph $G_i$ satisfies the pattern $GP_i$ if, for all predicates $p_i$ in $G_i$, $GraphBV[i - 1] = 1$, thus:
$$\forall\, i \in [0, m - 1],\quad GraphBV[i] = \begin{cases} 1 & \text{if } p_{i+1} \in G_i \\ 0 & \text{otherwise.} \end{cases}$$

Our approach takes advantage of the fact that semantic data streams are created according to a known ontology describing the concepts and the relations between them. Thus, most studied RDF data streams are composed according to a very frequent schema of RDF data structure. We show in Listing 1 an example of a small RDF graph describing a resource using Turtle notation, where we can read the type of the resource (Person), his full name ("Jack Franc"), his email address (JF@mysite.org) and his title (Dr.). The graph representation is shown in Figure 2.

Listing 1: RDF example describing a person (Jack Franc) using Turtle notation

    @prefix rdf: <http:.../22-rdf-syntax-ns#>.
    @prefix contact: <http:.../pim/contact#>.

    <http://www.w3.org/People/EM/contact#JF>
        rdf:type contact:Person ;
        contact:fullName "Jack Franc" ;
        contact:mailbox <mailto:JF@mysite.org> ;
        contact:personalTitle "Dr." .

Fig. 2: An RDF graph describing a person.

Thus, based on the foregoing, we consider that the RDF data structures in the stream follow a directed star graph model, like the one used in abbreviated triple groups in Turtle [19].

As stated before, RDF graphs could be represented using an adjacency matrix, as is usually the case in most approaches. However, despite the efficiency of this data structure, it is not suitable for RDF graphs, since this matrix will be very sparse, using only $1/n$ of the occupied space. For this reason, we decide to reduce the associated adjacency matrix to a bit vector.

The next subsections present the main contribution of the paper: FreGraPaD, an algorithm that detects the frequent RDF graph patterns in the semantic data stream in a single pass through three steps: (1) the bit vector construction, to identify and hold the graph structure; (2) the predicates hash-table construction, which detects the predicates in the stream and holds them; (3) the graphs hash-table construction, which detects and holds the RDF graph patterns.

Fig. 3: Construction of the graph pattern using a bit vector.

A. Bit vector and predicates hash-table construction

The Predicates Hash Table (PHT) contains all the predicates detected in the RDF graphs of an input stream. PHT is an indexed table into which each newly detected predicate is inserted. Each graph is represented as a bit vector GraphBV. The bit at index i of this vector is set to 1 if the graph contains the predicate stored in PHT at the same index i, and to 0 otherwise. As an example, Figure 3 shows that the RDF graph pattern currently being processed includes predicates $p_1$, $p_2$, $p_3$ and $p_4$, which gives 1111 (15) as the value of its corresponding bit vector.

B. RDF graph patterns detection - Graphs hash-table construction

The Graph Hash Table (GHT) is also a key-value table. It contains the decimal values of the pattern bit vectors as keys and their corresponding frequencies as values. As already stated in the example depicted in Figure 3, the corresponding decimal value of the bit vector is 15. The frequency of this pattern is incremented in GHT. Thus, as soon as the bit vector of a processed RDF graph is generated, FreGraPaD checks the presence of its pattern in the GHT. If the pattern already exists, its frequency is incremented; otherwise, a new RDF graph pattern has been detected. A new entry is thus inserted in GHT with the bit vector's value as a key and 1 as a frequency value (frequency = 1 for the first occurrence of the pattern).

Note that deliberately FreGraPaD
does not consider any minimum frequency threshold; this is left to the appreciation of the user and their use case.

C. Proof of concept: frequent RDF graph patterns detection using FreGraPaD

Figure 4 illustrates how our algorithm detects patterns over an example of 5 RDF graphs composing an extract of a semantic data stream. The figure also shows the evolution of the bit vector and of the PHT and GHT tables whenever a graph is processed.

Fig. 4: RDF data stream example: Detecting frequent graph patterns with FreGraPaD algorithm. (*) accessed fields.

At the reception of the first graph (Graph1), the algorithm checks one by one the presence of its predicates a, b, c and d in PHT. Each corresponding bit in the bit vector is then set to 1 and the index is incremented respectively from 0 to 1, 2 and then 3, because all of them are new predicates. This gives 1111 (15) as the value of the bit vector. Thus, each predicate is inserted in PHT with its corresponding index: <a, 0>, <b, 1>, <c, 2> and <d, 3>. After processing the last predicate (d), the algorithm checks the presence of the constructed bit vector (constituting the graph pattern 1111 (15)) in GHT. As a first pattern, it is inserted with frequency 1, i.e. <15, 1>.

When the second graph (Graph2) is received, the bit vector is initialized to 0. Then, FreGraPaD discovers that: (i) its first two predicates a and b already exist in PHT, so the bits corresponding to their PHT indexes (0 and 1) in the bit vector are set to 1 and no insertion in PHT is needed; (ii) the third predicate (e) is a new one, so the PHT index is incremented (from 3 to 4) and its corresponding bit is set to 1, leading to the insertion of <e, 4> in PHT; (iii) Graph2 has a new pattern 10011 (19), which gives rise to the insertion of <19, 1> in GHT.

The algorithm continues in the same way with no insertion in PHT, an insertion of <6, 1> in GHT for 110 as a newly detected pattern for Graph3, an increment of the frequency of pattern 19 for Graph4 and finally an increment of the frequency of pattern 6 for Graph5. At the end, FreGraPaD returns GHT and PHT, giving 3 graph patterns (15, 19 and 6) based on 5 predicates (a, b, c, d and e). Note that, at each step, we mark with an asterisk (∗) the concerned fields in each hash table.
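The decimal values quoted in this walkthrough (15, 19 and 6) are consistent with reading the bit at PHT index 0 as the least significant bit of the vector. The short Python sketch below makes that convention explicit. The function name `pattern_value` is illustrative only and is not part of FreGraPaD, and Python is used here purely for brevity.

```python
def pattern_value(indexes):
    """Return the integer whose bit k is 1 for every PHT index k in `indexes`."""
    value = 0
    for k in indexes:
        # Assumed convention: PHT index 0 maps to the least significant bit.
        value |= 1 << k
    return value


# Graph1 uses predicates a, b, c, d (PHT indexes 0 to 3).
graph1 = pattern_value([0, 1, 2, 3])
# Graph2 uses predicates a, b, e (PHT indexes 0, 1 and 4).
graph2 = pattern_value([0, 1, 4])

print(graph1, format(graph1, "b"))  # 15 1111
print(graph2, format(graph2, "b"))  # 19 10011
```

Under the same convention, pattern 110 (6) of Graph3 corresponds to PHT indexes 1 and 2, i.e. predicates b and c.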
Algorithm 1: Frequent RDF Graph Patterns Detection
Data: RDF Stream
Result: Graph patterns Hash Table, Predicates Hash Table
 1  begin
 2    HashTable PHT<predicate, index>;
 3    HashTable GHT<graph, frequency>;
 4    Int i ← 0;                              /* Index initialization */
 5    foreach graph ∈ RDF Stream do
 6    begin
 7      BitVector GraphBV ← 0;                /* Pattern's bit vector initialization */
 8      foreach predicate ∈ graph do
 9      begin
10        if predicate ∈ PHT then
11          ind ← PHT.get(predicate);         /* Get the predicate's bit index */
12        else
13          ind ← i;
14          PHT.put(predicate, ind);          /* New predicate */
15          i ← i + 1
16        end
17        GraphBV[ind] ← 1;                   /* Set the corresponding bit to 1 */
18      end
19      ;
20      if GraphBV ∈ GHT then
21        frequency ← GHT.get(GraphBV) + 1;   /* Increment the pattern's frequency */
22        GHT.put(GraphBV, frequency);
23      else
24        GHT.put(GraphBV, 1);                /* New graph pattern */
25      end
26    end
27    ;
28    return GHT, PHT;                        /* The graph patterns and predicate hash tables */
29  end
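Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' Java prototype. Each graph is assumed to arrive as a list of predicate names, the bit vector is held in a Python integer with PHT index 0 as the least significant bit, and the function name `fregrapad` is hypothetical. The predicates of Graph3 and Graph5 are inferred from the pattern 110 (6) given in the proof of concept.

```python
from collections import Counter


def fregrapad(stream):
    """Single pass over `stream`, an iterable of graphs given as lists of predicates.

    Returns (ght, pht): pattern value -> frequency, and predicate -> bit index.
    """
    pht = {}         # Predicates Hash Table
    ght = Counter()  # Graph Hash Table
    for graph in stream:
        graph_bv = 0  # bit vector reset for each graph (line 7)
        for predicate in graph:
            # Known predicates keep their index; new ones take the next free index.
            ind = pht.setdefault(predicate, len(pht))
            graph_bv |= 1 << ind  # set the corresponding bit (line 17)
        # A missing key counts as 0, so a new pattern gets frequency 1.
        ght[graph_bv] += 1
    return dict(ght), pht


stream = [
    ["a", "b", "c", "d"],  # Graph1
    ["a", "b", "e"],       # Graph2
    ["b", "c"],            # Graph3 (inferred from pattern 110)
    ["a", "b", "e"],       # Graph4
    ["b", "c"],            # Graph5
]
ght, pht = fregrapad(stream)
print(ght)  # {15: 1, 19: 2, 6: 2}
print(pht)  # {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
```

Since no minimum frequency threshold is built in, a caller who wants one can filter the result afterwards, for instance with `{p: f for p, f in ght.items() if f >= 2}`, where the value 2 is an arbitrary example.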

D. FreGraPaD algorithm

In this section, we explain the FreGraPaD algorithm (Algorithm 1), which takes the RDF data stream as input and returns as output the graph patterns and the corresponding predicates in GHT and PHT respectively.

At the beginning of the algorithm, we initialize to 0 the index i of the corresponding predicate's bit in the bit vector GraphBV of the graph pattern (line 4). Then, for each graph in the stream (set of linked triples with the same subject), the bit vector GraphBV is set to 0 (line 7). For each predicate of the current graph, FreGraPaD checks if it already exists in PHT. If it is the case, the index ind is taken from PHT; otherwise the new predicate is inserted in PHT and the index is incremented (lines 8 to 16). The corresponding bit in GraphBV is set to 1 (line 17). After setting to 1 all corresponding predicates' bits in GraphBV, the algorithm checks the presence of the current graph pattern in GHT (line 20). In case of a positive answer, FreGraPaD increments its corresponding frequency (line 21); otherwise a new graph pattern is detected and inserted in GHT with frequency = 1 (line 24). At the end, the algorithm returns the two hash tables GHT and PHT, containing respectively the detected RDF graph patterns with their frequencies and the predicates forming the RDF graph patterns with their corresponding indexes. Note that the graph bit vector represents, at the same time, the identifier of the RDF graph pattern and its structure.

E. Case study

We present in Figure 5 an extract of the Linked Observation Data (LOD)⁶ dataset as an example of an RDF stream on which we apply our algorithm. The figure shows, in the middle, the evolution of the RDF stream over time and, on the left, the constructed bit vectors corresponding to the detected graph patterns and their frequencies. As we can see, the first three graph patterns (11111, 1100001 and 10000001) are composed respectively of the predicates in Listing 2. These three graph patterns will be very frequent and only two other predicates will be detected, as shown in Table II for Katrina's and Charley's datasets.

Listing 2: Predicates detected by FreGraPaD in LOD stream as shown in Figure 5.

    [Type, om-owl:observedProperty, om-owl:procedure, om-owl:result, om-owl:samplingTime]
    [Type, om-owl:floatValue, om-owl:uom]
    [Type, #inXSDDateTime]

⁶ http://wiki.knoesis.org/index.php/LinkedSensorData#Linked_Observation_Data
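Because the bit vector is both the identifier and the structure of a pattern, a pattern can be turned back into its predicates with PHT alone. The Python sketch below illustrates this on the three patterns of the case study. It assumes that the PHT indexes follow the order in which the predicates appear in Listing 2 (an assumption that is consistent with the patterns 11111, 1100001 and 10000001), and the helper name `decode` is hypothetical.

```python
def decode(pattern, pht):
    """List the predicates whose PHT bit is set in `pattern`, in index order."""
    by_index = sorted(pht.items(), key=lambda item: item[1])
    return [name for name, ind in by_index if (pattern >> ind) & 1]


# Assumed PHT for the LOD extract: indexes follow the order of Listing 2.
pht = {
    "Type": 0,
    "om-owl:observedProperty": 1,
    "om-owl:procedure": 2,
    "om-owl:result": 3,
    "om-owl:samplingTime": 4,
    "om-owl:floatValue": 5,
    "om-owl:uom": 6,
    "#inXSDDateTime": 7,
}

for bits in ("11111", "1100001", "10000001"):
    # int(bits, 2) parses the binary string written most significant bit first.
    print(bits, decode(int(bits, 2), pht))
```

Each printed list matches one bracketed group of Listing 2, which is why no separate description of a pattern needs to be stored next to its frequency.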
Fig. 5: FreGraPaD: RDF graph patterns detection on LOD data stream.

VI. EVALUATION AND DISCUSSION

We implemented a first prototype of our algorithm FreGraPaD in Java. To validate it and test its performance, we used 8 datasets (described below), most of which were also used by the RDSZ and ERI algorithms, against which we compared our results. We compared the detected predicates to ERI's results to prove the validity of our approach in terms of predicate detection; we then compared the detected RDF graph patterns to the results of RDSZ to prove the performance of FreGraPaD. In this section we present the datasets used in the conducted experiments, a case study to illustrate the proposal and the evaluation results with a discussion.

A. Datasets description

AEMET-1 and AEMET-2 are two datasets provided by the Spanish Meteorological Office (AEMET). They represent meteorological information, taken from weather stations in Spain [10] according to different schemas. The Petrol dataset provides metadata about credit card transactions in petrol stations, furnished by a Spanish start-up (Localidata⁷). The DBLP dataset provides a comprehensive list of research papers in computer science from the DBLP computer science bibliography⁸, which contains the metadata of nearly 2 million publications, written by more than 1 million authors in different journals or conference proceedings series.

⁷ http://www.localidata.com/
⁸ http://www.dblp.org

Charley and Katrina

TABLE II: RDF Graph patterns detection on different semantic datasets. ( - : not experimented by RDSZ)

DataSet       # RDF Triples   # RDF Graphs   # RDF Predicates       # RDF Graph Patterns    Graph Patterns
                                             (FreGraPaD vs. ERI)    (FreGraPaD vs. RDSZ)    frequency range
aemet-1       1 018 815       33 095         59/59                  24/1 459                1 to 13 619
aemet-2       2 788 429       398 347        7/7                    1/2                     398 347
petrol        3 356 616       419 577        8/8                    1/1                     419 577
Flickr        49 107 168      5 490 006      23/23                  25/-                    1 to 3 122 147
DBLP          60 139 735      3 799 856      26/27                  296/-                   1 to 1 207 367
Charley       108 644 569     25 303 346     10/10                  5/5                     45 111 to 11 648 606
Katrina       179 128 408     41 600 926     10/10                  5/5                     59 098 to 19 233 458
DbPedia 3-8   431 440 396     29 688 668     54 782/57 986          1 333 098/-             1 to 8 051 080

are two datasets delivered by Linked Observation Data. They represent sensor observations of different weather parameters. Those observations represent meteorological phenomena like humidity, temperature, pressure, visibility or precipitation. Finally, the well-known Flickr⁹ and DbPedia¹⁰ datasets are generally static datasets, like the DBLP dataset.

⁹ https://www.flickr.com/
¹⁰ http://wiki.dbpedia.org/

B. Evaluation and Results

Table II summarizes the different experiments that we conducted to evaluate FreGraPaD. It lists the experimental datasets, reporting: the number of triples, the number of RDF graphs, the number of predicates detected by FreGraPaD vs. ERI, the number of RDF graph patterns detected by FreGraPaD vs. RDSZ and finally the range of the frequency of the detected RDF graph patterns in each dataset. Note that we give a unique value when our algorithm detects a unique RDF graph pattern, as for the petrol dataset. This shows that FreGraPaD does not detect false positives.

Table II shows that our algorithm gives the same results as ERI when dealing with homogeneous datasets and surpasses it when dealing with heterogeneous and irregular ones (e.g. DbPedia). It also shows that FreGraPaD largely outperforms RDSZ, as seen for the AEMET-1 dataset. This is due to the order of the predicates, which changes in some RDF graphs in the stream. Our algorithm considers that two RDF graphs composed of the same predicates in a different order have the same graph pattern, which makes sense in the semantic web. As an example, the RDF graph {:J :Type :Person; :Name "Jean"; :Age 25.} is semantically the same RDF graph as {:J :Type :Person; :Age 25; :Name "Jean".}.

The last column of Table II gives the frequency range of the graph patterns, from the least frequent detected pattern to the most frequent one. For example, the frequency goes from 1 to 3 122 147 for the Flickr dataset and from 59 098 to 19 233 458 for the Katrina dataset, which illustrates how frequent some RDF graph patterns are even if they form a very small set of data structures. These results show that our approach is promising for the domain, especially for further treatment such as sketching, compressing or load-shedding semantic data streams.

Even if there is no limitation on bit vector size in theory, some platforms could have restrictions on this data structure. Our experiments show that our algorithm is very useful in the case of homogeneous RDF data streams, which is very frequent in the domain, especially when the consumer does not know the data structure delivered by the producer. We show how frequent a very reduced set of RDF graph patterns can be in an unlimited semantic data stream. In some cases, the data is sent according to the same structure, as a unique pattern for the whole stream (e.g. petrol). Thus, we expect that using FreGraPaD for RDF data stream compression will save a significant amount of memory. Instead of millions of data values, systems will then deal with only a very reduced set of data structures. The Flickr and LOD datasets are good illustrations of this case.

VII. CONCLUSION

We present in this paper a novel algorithm to deliver efficiently the exact set of graph patterns detected in a processed RDF data stream. The experiments conducted in this paper have shown that we are able to detect frequent patterns in semantic data streams. This has been done in one pass with reduced memory space, in contrast to the huge volume of the transmitted data in this particular kind of stream. Thus, it satisfies requirements R1 and R2. In addition, it satisfies R3 since it considers the data structure instead of the data itself.

In the immediate future, we will use the detected RDF graph patterns as a container for handling data in semantic data streams. We will use them to compress and transmit semantic data streams by exploiting their frequency to reduce considerably the volume of the RDF data stream. We plan to design a lossless algorithm using the detected patterns returned by FreGraPaD.

In our future work, we plan to use the graph bit vector data structure with an RDF Stream Processor on the consumer side to get the query pattern. The objective is to improve the efficiency of the RSP system by selecting the necessary data and load-shedding the unnecessary ones, by applying very simple boolean operations such as AND or XOR.
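The paper only names the operators, so the following Python sketch is one possible reading of that idea rather than the authors' design. It assumes that a query is encoded as a bit vector over the same PHT as the graph patterns. The function names `covers` and `differing_bits` and the example query are hypothetical.

```python
def covers(graph_bv, query_bv):
    """True when every predicate bit required by the query is set in the graph pattern."""
    return (graph_bv & query_bv) == query_bv


def differing_bits(bv1, bv2):
    """Bit vector marking the PHT indexes on which two patterns disagree."""
    return bv1 ^ bv2


# GHT of the Figure 4 example: pattern value -> frequency.
ght = {15: 1, 19: 2, 6: 2}
# Hypothetical query that needs the predicates at PHT indexes 0 and 1 (a and b).
query = 0b00011

kept = [pattern for pattern in ght if covers(pattern, query)]
print(kept)                                  # [15, 19]
print(format(differing_bits(15, 19), "b"))   # 11100
```

With such a test, graphs whose pattern is not in `kept` (here pattern 6) could be dropped before any triple is parsed, which is the load-shedding effect described above.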
ACKNOWLEDGMENTS

This work is partially funded by the French National Research Agency (ANR) project CAIR (ANR-14-CE23-0006).

REFERENCES

[1] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, C. Erwin, E. Galvez, M. Hatoun, A. Maskey, A. Rasin, et al. Aurora: a data stream management system. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data. ACM, 2003.
[2] D. Anicic, P. Fodor, S. Rudolph, and N. Stojanovic. EP-SPARQL: a unified language for event processing and stream reasoning. In Proceedings of the 20th international conference on World wide web, WWW '11, pages 635–644, New York, NY, USA, 2011. ACM.
[3] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, I. Nishizawa, J. Rosenstein, and J. Widom. STREAM: the Stanford stream data manager (demonstration description). In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 665–665. ACM, 2003.
[4] B. Babcock, M. Datar, and R. Motwani. Load shedding for aggregation queries over data streams. In Data Engineering, 2004. Proceedings. 20th International Conference on, pages 350–361, March 2004.
[5] D. F. Barbieri, D. Braga, S. Ceri, and M. Grossniklaus. An execution environment for C-SPARQL queries. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT '10, pages 441–452, New York, NY, USA, 2010. ACM.
[6] T. Berners-Lee, J. Hendler, O. Lassila, et al. The semantic web. Scientific American, 284(5):28–37, 2001.
[7] A. Bolles, M. Grawunder, and J. Jacobi. Streaming SPARQL extending SPARQL to process data streams. In Proceedings of the 5th ESWC on The semantic web: research and applications, ESWC'08, pages 448–462, Berlin, Heidelberg, 2008. Springer-Verlag.
[8] P. Braun, J. J. Cameron, A. Cuzzocrea, F. Jiang, and C. K. Leung. Effectively and efficiently mining frequent patterns from dense graph streams on disk. Procedia Computer Science, 35:338–347, 2014.
[9] J.-P. Calbimonte, O. Corcho, and A. J. G. Gray. Enabling ontology-based access to streaming data sources. In ISWC2010, pages 96–111, 2010.
[10] Ó. Corcho, D. Garijo Verdejo, J. Mora, M. Poveda Villalon, D. Vila Suero, B. Villazón-Terrazas, P. Rozas, and G. A. Atemezing. Transforming meteorological data into linked data. Semantic Web, 2012.
[11] A. Cuzzocrea, F. Jiang, and C. K. L. Leung. Frequent subgraph mining from streams of linked graph structured data. pages 237–244, 2015.
[12] P. Deutsch and J.-L. Gailly. Zlib compressed data format specification version 3.3. Technical report, 1996.
[13] J. D. Fernández, A. Llaves, and O. Corcho. Efficient RDF interchange (ERI) format for RDF data streams. In The Semantic Web–ISWC 2014, pages 244–259. Springer.
[14] N. Fernández, J. Arias, L. Sánchez, D. Fuentes-Lorenzo, and Ó. Corcho. RDSZ: An approach for lossless RDF stream compression. In The Semantic Web: Trends and Challenges, pages 52–67. Springer, 2014.
[15] S. Komazec, D. Cerri, and D. Fensel. Sparkwave: continuous schema-enhanced pattern matching over RDF data streams. In DEBS2012, pages 58–68. ACM.
[16] A. Margara, J. Urbani, F. van Harmelen, and H. Bal. Streaming the web: Reasoning over dynamic data. Web Semantics: Science, Services and Agents on the World Wide Web, 0(0), 2014.
[17] A. McGregor. Graph stream algorithms: A survey. SIGMOD Rec., 43(1):9–20, May 2014.
[18] D. L. Phuoc. A Native and Adaptive Approach for Linked Stream Data Processing. PhD thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2013.
[19] E. Prud'hommeaux, G. Carothers, and L. Machina. RDF 1.1 Turtle: terse RDF triple language. W3C Recommendation, 25 February 2014.
[20] J. Schneider, T. Kamiya, D. Peintner, and R. Kyusakov. Efficient XML interchange (EXI) format 1.0. W3C Proposed Recommendation, 20, 2011.
[21] N. Tatbul, U. Çetintemel, S. B. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In VLDB, pages 309–320, 2003.
[22] B. Babcock, M. Datar, and R. Motwani. Load shedding in data stream systems. In Data Streams, pages 127–147. Springer US, 2007.
[23] R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Record, 22(2):207–216. ACM, June 1993.
