Professional Documents
Culture Documents
KisungLee RDF Cloud
KisungLee RDF Cloud
KisungLee RDF Cloud
I. I +T,-D!CTI-+
Clou% computing infrastructures are .i%ely recognize% as an attracti/e computing platform for efficient 0ig %ata pro1 cessing 0ecause it minimizes the upfront o.nership cost for the large1scale computing infrastructure %eman%e% 0y 0ig %ata analytics# 2ith Lin"ing -pen Data community pro3ect an% 2orl% 2i%e 2e0 Consortium 425C6 a%/ocating ,D7 4,esource Description 7rame.or"6 8*9 as a stan%ar% %ata mo%el for 2e0 resources, .e ha/e .itnesse% a stea%y gro.th of 0oth 0ig ,D7 %atasets an% large an% gro.ing num0er of %omains an% applications capturing their %ata in ,D7 an% performing 0ig %ata analytics o/er 0ig ,D7 %atasets# 7or e:ample, more than ;< 0illion ,D7 triples are pu0lishe% as of =arch <(>< on Lin"e% Data 8><9 an% a0out *#?* 0illion triples are pro/i%e% 0y the Data1go/ 2i"i 8@9 as of 7e0ruary <(>5# ,ecently the !K 4!nite% King%om6 go/ernment is pu0lishing ,D7 a0out its legislation 8)9 .ith a SA ,QL 4a stan%ar% &uery language for ,D76 &uery interface for its %ata sources 8;9# Ba%oop =ap,e%uce programming mo%el an% Ba%oop Dis1 tri0ute% 7ile System 4BD7S6 are one of the most popular %istri0ute% computing technologies for %istri0uting 0ig %ata processing across a large cluster of compute no%es in the
Clou%# Bo.e/er, processing the huge ,D7 %ata using Ba%oop =ap,e%uce an% BD7S poses a num0er of ne. technical challenges# 7irst, .hen /ie.ing a 0ig ,D7 %ataset as an ,D7 graph, it typically consists of millions of /ertices 4su03ects or o03ects of ,D7 triples6 connecte% 0y millions or 0illions of e%ges 4pre%icates of ,D7 triples6# Thus triples are correlate% an% connecte% in many %ifferent .ays# ,an%om partitioning of 0ig %ata into chun"s through either horizontal 40y triples6 or /ertical partitioning 40y su03ect, o03ect or pre%icate6 is no longer a /ia0le solution 0ecause %ata partitions generate% 0y such a simple partitioning metho% ten% to ha/e high correlation .ith one another# Thus, most of the ,D7 &ueries nee% to 0e processe% through multiple roun%s of %ata shipping across partitions hoste% in multiple compute no%es in the Clou%# Secon%, BD7S 4an% its attache% storage systems6 is e:cellent for managing 0ig ta0le li"e %ata .here ro. o03ects are in%epen%ent an% thus 0ig %ata can 0e simply %i/i%e% into e&ual1size% chun"s .hich can 0e store% in a %istri0ute% manner an% processe% in parallel efficiently an% relia0ly# Bo.e/er, BD7S is not optimize% for processing 0ig ,D7 %atasets of high correlation# Therefore, e/en simple retrie/al &ueries can 0e &uite inefficient to run on BD7S# Thir% 0ut not the least, Ba%oop =ap,e%uce programming mo%el is optimize% for 0atch1oriente% processing 3o0s o/er 0ig %ata rather than real1time re&uest an% respon% types of 3o0s# Thus, .ithout correlation preser/ing %ata
partitioning, Ba%oop =ap,e%uce alone is neither a%e&uate for han%ling ,D7 &ueries nor suita0le for structure10ase% reasoning on ,D7 graphs# 2ith these challenges in min%, in this paper, .e present a Scala0le an% yet customiza0le %ata A rtitioning frame.or", calle% SA , for %istri0ute% processing of 0ig ,D7 graph %ata# -ur %ata partitioning frame.or" has t.o uni&ue features# 7irst, .e intro%uce a suite of /erte:1centric %ata partitioning 0uil%ing 0loc"s, calle% e:ten%e% /erte: 0loc"s, to allo. efficient an% yet customiza0le %ata partitioning of large heterogeneous ,D7 graphs 0y preser/ing the 0asic /erte: structure# Cy efficient, .e mean that the SA %ata partitions can support fast processing of 0ig %ata of %ifferent sizes an% comple:ity# Cy customiza0le, .e mean that one partitioning techni&ue may not fit all# Thus the SA partitions are 0y %esign a%apti/e to %ifferent %ata processing %eman%s in terms of structural correlations# Secon%, .e propose a selection of scala0le par1 allel processing techni&ues to %istri0ute the structure% /erte: 0loc"10ase% partitions across a cluster of compute no%es in a manner that minimizes inter1no%e communication cost 0y localizing most of the &ueries to in%epen%ent partitions an% 0y ma:imizing intra1no%e processing# Cy partitioning an% %istri0uting 0ig ,D7 %ata using structure% /erte: 0loc"s, .e can consi%era0ly re%uce the inter1no%e communication cost of comple: &uery processing 0ecause most ,D7 &ueries can
http
type
Proceedings
1-108 Journal 1
Journal
star
Journal
SELECT ?yr #
!E"E
bi-vertex
% Pr 1 o oc l 1 u & e
$nProc1
na l
? ( o u r n a l r d ) * t y p e J o u r n a l +
type
bloc k
? ( o u r n a l t i t l e , J o u r n a l 1 , 1-.0 $ n / 1 $nPro -0 title c1 P 0 r o 1 c * e e di n g s 1-.0 45 3 ' c a Per g' r o http t son $n r i m 333 Pro 1 ti c pl c 1 cl cr e l e e x ea 1
to r
?yr Jo urn al 1
ae r 1 & r e es a t o o n r
SEL EC T 6$S T$7 CT ? type pers on ? ? na& na& ep e e ? ! nr E" as E# o & ? en inpr oc rd)* type $npr ocee ding s ? inpr oc crea tor ? pers on + ? pers on na& e? na& e+ t i t l e n o t e
P e r s o n
Pn
aP e& e e
r r s s o o n 1 n
3 3 3
Pers on/ 4 1 P 1 0 e a r 8s s tr o ac n t
$n type Pr oc ee di ng s
Person outv er te x bl o c
1 k Prsc'rticle o
r
0 e e / a l u a t e % l o c a ll y o n a p a rt iti o n s e r / e r . it h o u t r e & u ir i n g % a t
6 a t a 7 o d e 9 P ue 3 ry u $n te r) ac e + : + E ;e + cu to < r =
cu st o mi za 0l e for pa rtit io ni ng an % %i str i0 uti ng 0i g , D 7 %a ta se ts of %i /e rs e siz es an % str uc tur es , an % eff ec ti/ e for pr oc es si ng
4 o r s o 1 c a l l e %
t i o n s h i p
c t e %
D a c h e % g e i s % i r e c t e % , e m a n a t i n g f r o m i t s s u 0 3 e c t / e r t e : t o i t s o 0
3ect /erte: # 7ig# >4a6 sho.s an e:am ple ,D7 graph 0ase% on the struct ure of SA<C
ench 4SA ,QL Aerf orma nce Cenc hmar "6 8<(9#
SA ,QL is a SQL1 li"e stan% ar% &uery langu age for ,D7 recom men% e% 0y 25C# =ost SA , QL &uerie s consis t of multip le triple patter ns, .hich are simila r to ,D7 triples e:cep t that in each triple
patter n, the su03ec t, pre%ic ate an% o03ect may 0e a /aria0l e# 2e categ orize SA , QL &uerie s into t.o typesE star an% compl e:, 0ase% on the 3oin chara cteristi cs of triple patter ns# Star &uerie s consis t of su03ec t1 su03ec t 3oins an% each 3oin /aria0l e is the su03ec t of all the triple patter ns in/ol/ e%# 2e refer to the remai ning &uerie s as compl
ex &ueri es# 7ig# >406 sho. s t.o e:a mple SA ,QL &uer y grap hs# The first is a star &uer y re&u estin g the year of pu0li catio n of Four nal > an% the seco n% is a com ple: &uer y re&u estin g the nam es of all pers ons .ho are an auth or of at least one pu0li catio n of inpr oce e%in
g type# SA ,QL &uery proce ssing can 0e /ie. e% as fin%in g matc hing su0gr aphs in the ,D7 grap h .her e ,D7 terms from those su0gr aphs may 0e su0st itute% for the &uery /aria 0les#
frame.or" SA # The first prototype of SA is implemente% on top of Ba%oop =ap,e%uce an% BD7S# The Ba%oop SA frame.or" consists of one coordinator an% a set of worker no%es 4G=s6 in the SA cluster# The coor%inator ser/es as the +ame+o%e of BD7S an% the Fo0Trac"er of Ba%oop =ap,e%uce an% each .or"er ser/es as the Data+o%e of BD7S
B. S
y st e m A rc hi te ct ur e 7ig # < s"etc hes the archit ectur e of our ,D7 %ata partiti oning
an% the Tas"Trac"er of Ba%oop =ap,e%uce# To efficiently store the generate% partitions, .e utilize an ,D71specific storage system on each .or"er no%e# The core componen ts of our ,D7 %ata partitionin g frame1 .or" are the partition 0loc" generator an% %istri0utor# The generator uses a /erte:1centric approach to construct partition 0loc"s such that all triples of each partition 0loc" are resi%ing in the same .or"er no%e# 7or a 0ig ,D7 graph .ith a huge num0er of /ertices, .e nee% to carefully %istri0ute all gener1 ate% partition 0loc"s across a cluster of .or"er no%es using a partition %istri0ution mechanism#
2 e p r o / i % e a n e ff i c i e n t % i s t ri 0 u t e % i m p l e m e n t a ti o n o f o u r % a t a p a r ti ti o n i n g s y s
t e m o n t o p o f B a % o o p = a p , e % u c e a n % B D 7 S #
t h i s s e c ti o n . e % e s c ri 0 e t h e S A
% a t a p a r ti ti III. o n i n g f r a m e . o r " , f o c u s i n g o n t h e I t
. o c o r e c o m p o n e n t s E c o n s tr u c ti n g t h e p a rt it i o n 0 l o c " s a n % % i s tr i 0 u ti n g t h e m a c r
oss multi ple .or" er no%e s# In or%e r to pro/i %e partit ionin g mo% els that are cust omiz a0le to %iffer ent proc essi ng nee% s, .e %e/i se thre e type s of partit ion 0loc "s 0ase % on the /erte : struc ture an% the /erte : acce ss patte rn# The goal of cons tructi ng partit ion 0loc
"s is to assi gn all triple s of each partit ion 0loc " to the sam e .or" er no%e in or%e r to supp ort effici ent &uer y proc essi ng# In ter m s of eff ici en t &u er y pr oc es sin g, th er e ar e t. o %if fer en t type s of proc essi ngE intra -VM proc
e s s i n g
n t h a t l o c a l l y
a a n % & s u e i e a n r r t y c e h r Q i n V c g M a n t p h r 0 e o e 1 s f u c u 0 e l g s l r s y a i p n e h g : s # e c m C u a y t t e c i % h n i t i n r n g a 1 p t G a h = r e a p l t r l r o e i c l p e l s o e s n i p n e a g a t , c t h e . r e G n =s m e 0 o a y f
a Q t , o r . i s t i h m o p u l t y a s n e y n % c s o o Q r % t i o n a a t l i l o n . o f r r " o e m r o n n o e % e G s = 4 t G o = s a 6 n , o t . h i e t r h # o u T t h e u s c i o n o g r % B i a n %
o o p , a n % t h e n m e r g e s t h e p a r t i a l r e s u l t s r e c e i / e % f r o m a l l G = s t
o gen erat e the final resul ts of Q# Cy inter 1G= proc essi ng, .e mea n that a &uer y Q as a .hol e cann ot 0e e:ec ute% on any G=, an% it nee %s to 0e %eco mpo se% into
a set of su0& uerie s such that each su0& uery can 0e e/al uate % 0y intra1 G= proc essi ng# Thus , the proc essi ng of Q re&ui res multi ple roun %s of coor %inat ion an% %ata trans fer acro ss the
cluster of .or"ers using Ba%oop# In contrast to intra1G= processing, inter1G= communication cost can 0e e:tremely high, especially .hen the num0er of su0&ueries is not small an% the size of interme%iate results to 0e transferre% across the net.or" of .or"er no%es is large#
1*66 Journal1
$nProc 1
'rticle
A. o
ns tru cti n! "x te nd ed Ve rte x #l oc ks 2 e c e n t e r o u r % a t a p a r t i t i o n i n g o n a / e r t e :
title note
Cy /erte:1centric, it means that .e construct a /erte: 0loc" for each /erte:# Cy e:ten%ing a /erte: 0loc" to an e:ten%e% /erte: 0loc" .e can assign more triples .hich are close to efficient &uery processing# for
%efine% as V #
in "
in "
J 4V$ % "$ % K
V in in J
Cefore .e formally %efine the concept of /erte: 0loc" an% the concept of e:ten%e% /erte: 0loc", .e first %efine some 0asic concepts of ,D7 graphs# De&inition >' 4,D7 Graph6 n ,D7 graph is a %irecte%,
$
6 such % that l $ $ in
in
i n
the verte* block of $ as the com0ination of 0oth out1 /erte: 0loc" an% in1/erte: 0loc", namely the 0i1/erte: 0loc", an% is (i formally (i (i %efine% as V #
2 r e y r e / f e e r r t t e o : / e 0 r t l e o : c $ " a s t h e a n c h o r / e r t e : o f t h e / e r t e : 0 l o c " c e n t e r e % a t h a s a n a n c h o r / e r t e : # 2 e r e f e r t o t h o s e
i n a / e r t e : 0 l o c " , . h i c h a r e n o t t h e a n c h o r / e r t e : , a s t h e ( o r d
(i (i
out in
% l" 6 .here V
( i
$
J {$} {$ |$
V% 4$% $
6 "$
% $6
"$
or 4$ }#
is a set of an% " is a multiset of/ertices %irecte% e%ges 4i#e#, or%ere% pairs of /ertices6# %irecte% e%ge 4u% $ " %enotes a6 triple in the ,D7 mo%el from su03ect u to o03ect $# K" is a set of a/aila0le
J {4$% o6.e |4$% %efine o6 "} # Similarly, a set ofo(*ect e%ges/erte: 4triples6 .hose is $ as the in,tri les of a /erte: $ V , %enote% 0y $ J {4s% $6|4s% $6 "}#les 2eof %efine bi, tri a /erte: $ V as the union of its out,
tri les an% in,tri les, %enote%
%efine .hose a set of e%ges 4triples6 su(*ect /erte: is $ $ as the out,tri les out of /erte: , %enote% 0y "
a map from an e%ge to its la0el 4" K" 6# In ,D7 %atasets, multiple triples may ha/e the same su03ect an% o03ect# Thus " is a multiset instea% of a set# 2e consi%er that t.o triples in an ,D7 %ataset are correlate% if they share the same su03ect, o03ect or pre%icate# 7or simplicity, only /erte:10ase% correlation is consi%ere% in this paper# Thus .e only consi%er three types of correlate% triples as follo.sE De&inition <' 4Different types of correlate% Let J4 V% "% K" % 7or ltriples6 6 0e " an ) ,D7 graph#
$
"
each .e /erte: $ V , in
/ e r $ t # i c D e / s e
>4 : c6 sh 0 o l .s o thr c ee " %if fer i en n /e t rte h : e 0l oc "s s of a a m /e e rte : p A a e r r t s i o t n i > o # n , C y . e l o c c a a n t i e n f g f i a c l i l e n t t r l i y p l p e r s o c o e f s s a a / l 7 e l i r g t & # e u
eries .hich re&ue st only triple s %irect ly conn ecte% to a /erte : 0eca use those &ueri es can 0e proce sse% using intra1 G= proce ssing# 7or e:am ple, to proce ss the star &uery in 7ig# >406, .e can run the &uery on each G= .itho ut any coor% inatio n .ith other G=s 0eca use it is guara ntee% that all triple
s conn ecte% to a /erte : .hich can 0e su0sti tute% for the /aria 0le L 3our nal are locate % in the same G=# D/ en tho ugh the /ert e: 0lo c"s are 0en efic ial for star &ue ries , .hich are com mon in practi ce, there are some ,D7 %atas ets in .hich chain &ueri es or more compl e: &ueri es may re&ue
st 0y " J" " # $ $ $ +o. .e %efine the concept of $ertex (lock, the 0asic 0uil%ing 0loc" for graph partitioning# /erte: 0loc" can 0e represente% 0y a /erte: ID an% its connecte% triples# n intuiti/e .ay to construct the /erte: 0loc" of a /erte: $ is to inclu%e all connecte% triples 4i#e#, 0i1 triples6 of $ regar%less of their %irection# Bo.e/er, for some ,D7 %atasets, all their &ueries may re&uest only triples in one %irection from a /erte:# 7or e:ample, to e/aluate the first SA ,QL &uery in 7ig# >406, the &uery processor nee%s only outtriples of a /erte: .hich may 0e su0stitute% for /aria0le L 3ournal# Since one partitioning techni&ue cannot fit all, .e nee% to pro/i%e customiza0le options that can 0e efficiently use% for %ifferent ,D7 %atasets an% SA ,QL &uery types# In a%%ition to &uery efficiency, partition 0loc"s shoul% also minimize the triple re%un%ancy 0y consi%ering only necessary triples an% re%ucing the %is" space of generate% partitions# This moti/ates us to intro%uce three %ifferent .ays of %efining the concept of /erte: 0loc"s 0ase% on the e%ge %irection of triples# De&inition 5'"% 4Gerte: Let ) J 4V% K" %"ut, l"0loc"6 6 0e an ,D7 graph#
(i
out in
o u t V #
o u u t o t
more triples that are 0eyon% those %irectly connecte% triples# Consi%er the secon% &uery in 7ig# >406# Clearly there is no guarantee that any triple connecting from Lperson to Lname is locate% in the same partition .here those connecte% triples of /erte: L inproc are locate%# Thus inter1G= processing may 0e re&uire%# This moti/ates us to intro%uce the concept of k1 hop e:ten% e% /erte: 0loc" 4k >6# Gi/en a /erte: $ V in an ,D7 graph ), k1hop extended $ertex (lock of $ inclu%es not only %irectly connecte% triples 0ut also all near0y triples .hich are in%irectly connecte% to the /erte: .ithin k1hop ra%ius in )# Thus, a /erte: 0loc" of $ is the same as the +-hop e:ten%e% /erte: 0loc" of $# Similar to the /erte: 0loc", .e %efine three %ifferent .ays of %efining the e:ten%e% /erte: 0loc" 0ase% on the e%ge %irection of triplesE k-hop extended out-$ertex (lock, k-hop extended in-$ertex (lock an% k-hop extended (i-$ertex (lock# Similarly, gi/en a k1hop e:ten%e% /erte: 0loc" anchore% at $, .e refer to those /ertices that are not the anchor /erte: of the e:ten%e% /erte: 0loc" as the 0or%er /ertices# 7or e:ample, 7ig# 5 sho.s the <1hop e:ten%e% 0i1/erte: 0loc" of /erte: Aerson># Due to the space limitation, .e omit the formal %efinitions of the three types of the k1hop extended $ertex (lock in this paper# out 2e 0elo. %iscuss 0riefly ho. to set the system1 %efine%
parameter k# 7or a gi/en ,D7 %ataset, .e can %efine k 0ase% on the common types of &ueries .e .ish to pro/i%e fast e/aluation# If the ra%ius of most of such &ueries is 5 hops or less, .e can set k J 5# The higher k /alue is, the higher %egree of triple replication each %ata partition .ill ha/e an% thus more &ueries can 0e e/aluate% using intra1G= processing# In fact, .e fin% that the most fre&uent &ueries ha/e > hop or < hops as their ra%ius# -nce the system1%efine% k /alue is set, in or%er to %etermine if a gi/en &uery can 0e e/aluate% using intra1G= processing or it has to pay inter1G= processing cost, all .e nee% to %o is to chec" .hether e/ery /erte: in the correspon%ing &uery graph can 0e co/ere% 0y any k1hop e:ten%e% /erte: 0loc"# Concretely, if there is any /erte: in the &uery graph .hose k1 hop e:ten%e% /erte: 0loc" inclu%es all triple patterns 4e%ges6 of the &uery graph, then .e can say that the &uery can 0e e/aluate% using intra1G= processing un%er the current k1hop %ata partitioning scheme# In the first prototype of SA system, .e implement our ,D7 %ata partitioning schemes using Ba%oop =ap,e%uce an% BD7S# 7irst, users can use the SA configuration to set some system1%efine% parameters, inclu%ing the num0er of .or"er no%es use% for partitioning, say n, the type of e:ten%e% /erte: 0loc"s 4out1/erte:, in1/erte: or 0i1/erte:6 an% the k /alue of e:ten%e% /erte: 0loc"s# -nce the selection is entere%, the SA %ata partitioner .ill launch a set of Ba%oop 3o0s to partition the triples of the gi/en %ataset into n partitions 0ase% on the %irection an% the k /alue of the e:ten%e% /erte: 0loc"s# The first Ba%oop 3o0 rea%s the input ,D7 %ataset store% in BD7S chun"s# 7or each chun", it e:amines triples one 0y one an% groups triples 0y t.o partitioning parametersE 4i6 the type of e:ten%e% /erte: 0loc"s 4out1/erte:, in1/erte: or 0i1 /erte:6, .hich %efines the %irection of the k1hop e:tension on the input ,D7 graphM 4ii6 the k /alue of k1hop e:ten%e% /erte: 0loc"s# If the parameters are set to generate >1hop e:ten%e% (i-$ertex 0loc"s, each triple is assigne% to t.o e:ten%e% /erte: 0loc"s 4or one 0loc" if the su03ect an% o03ect of the triple is the same entity6# This can 0e /ie.e% as in%e:ing this triple 0y 0oth its su03ect an% o03ect# Bo.e/er, if the parameters are set to generate >1hop e:ten%e% out-$ertex or in-$ertex 0loc"s, each triple .ill 0e assigne% to one e:ten%e% /erte: 0loc" 0ase% on its su03ect or o03ect respecti/ely# 2e 0elo. %escri0e ho. the SA %ata partitioner ta"es the input ,D7 %ata an% uses a cluster of compute no%es to partition it into n partitions 0ase% on the settings of <1hop e:ten%e% out-$ertex 0loc"s# In our first prototype, the SA %ata partitioner is implemente% in t.o phasesE Ahase > generates /erte: 0loc"s or >1hop e:ten%e% /erte: 0loc"s, an% Ahase < generates k1hop e:ten%e% /erte: 0loc"s for k , ># If k J >, only Ahase > is nee%e%# In the follo.ing %iscussion, .e assume that the ,D7 %ataset to 0e partitione% has 0een loa%e% into BD7S# The first Ba%oop 3o0 is initialize% .ith a set of map tas"s in .hich each map tas" rea%s a fi:e%1size %ata chun" from BD7S 4the chun" size is initialize% as a part of the BD7S configuration6# 7or each map tas", it e:amines triples in a chun" one 0y one an% returns a "ey1/alue pair .here the "ey is the su03ect /erte: an% the /alue is the remaining part 4pre%icate an% o03ect6 of the triple# The local com0ine function of this Ba%oop 3o0 groups all "ey1 /alue pairs that ha/e the same "ey into a local su03ect10ase%
triple group for each su03ect# The re%uce function rea%s the outputs of all maps to further merge all local su03ect10ase% groups to form a /erte: 0loc" .ith the su03ect as the anchor /erte:# Dach /erte: group .ill ha/e one anchor /erte: an% 0e store% in BD7S as a .hole# !pon the completion of this first Ba%oop 3o0, .e o0tain the full set of >1hop e:ten%e% out$ertex 0loc"s for the ,D7 %ataset# In the ne:t Ba%oop 3o0, .e e:amine each triple against the set of >1hop e:ten%e% out1/erte: 0loc"s, o0taine% in the pre/ious phase, to %etermine .hich e:ten%e% /erte: 0loc"s shoul% inclu%e this triple in or%er to meet the k1hop guarantee# 2e call this phase the controlle% triple replication# Concretely, .e perform a full scan of the ,D7 %ataset an% e:amine in .hich >1hop e:ten%e% out1/erte: 0loc"s this triple shoul% 0e a%%e% in t.o steps# 7irst, for each triple, one "ey1/alue pair is generate% .ith the su03ect of the triple as the "ey an% the remaining part 4pre%icate an% o03ect6 of the triple as the /alue# Then .e group all triples .ith the same su03ect into a local su03ect10ase% triple group# In the secon% step, for each >1hop e:ten%e% out1/erte: 0loc" an% each su03ect .ith its local triple group, .e test if the su03ect /erte: matches the e:ten%e% out1 /erte: 0loc"# Cy matching, .e mean that the su03ect /erte: is the 0or%er /erte: of the e:ten%e% out1/erte: 0loc"# If a match is foun%, then .e create a "ey1/alue pair .ith the uni&ue ID of the e:ten%e% out1/erte: 0loc" as the "ey an% the su03ect /erte: of the gi/en local su03ect10ase% triple group as the /alue# This ensures that all out1triples of this su03ect /erte: are a%%e% to the matching e:ten%e% out1 /erte: 0loc"# If there is no matching e:ten%e% out1/erte: 0loc" for the su03ect /erte:, the triple group of the su03ect /erte: is %iscar%e%# This process is repeate% until all the su03ect10ase% local triple groups are e:amine% against the set of e:ten%e% out1/erte: 0loc"s# 7inally, .e get a set of <1 hop e:ten%e% out1/erte: 0loc"s 0y the re%uce function, .hich merges all "ey1/alue pairs .ith the same "ey together such that the su03ect10ase% triple groups .ith the same e:ten%e% out1/erte: 0loc" ID 4"ey6 are integrate% into the e:ten%e% out1/erte: 0loc" to o0tain the <1hop e:ten%e% out1/erte: 0loc" an% store it as a BD7S file# 7or k , <, .e .ill repeat the a0o/e process until the k- hop e:ten%e% /erte: 0loc"s are constructe%# In our e:periment section .e report the performance of the SA partitioner in terms of 0oth the time comple:ity an% %ata characteristics of generate% partitions to illustrate .hy partitions generate% 0y the SA partitioner is much more efficient for ,D7 graph &ueries than ran%om partitioning 0y BD7S chun"s#
the %efinition of the e:ten%e% /erte: 0loc", a triple may 0e replicate% in se/eral e:ten%e% /erte: 0loc"s# If /erte: 0loc"s ha/ing many common triples can 0e assigne% to the same partition, the storage cost can 0e re%uce% consi%era0ly 0ecause %uplicate triples are eliminate% on each .or"er no%e# 2hen many triples are replicate% on multiple partitions, the enlarge% partition size .ill increase the &uery processing cost on each .or"er no%e# Therefore, re%ucing the num0er of replicate% triples is also an important factor critical to efficient &uery processing# Thir%, .hen selecting the %istri0ution mechanism, .e nee% to consi%er the processing time for the %istri0ution of e:ten%e% /erte: 0loc"s# This criterion is also important for ,D7 %atasets that re&uire more fre&uent up%ates# In the first prototype of SA partitioner, .e implement t.o alternati/e techni&ues for %istri0ution of e:ten%e% /erte: 0loc"s to the set of .or"er no%esE hashing10ase% an% graph partitioning10ase%# The hashing10ase% techni&ue can achie/e the 0alance% partitions an% ta"e less time to complete the %istri0ution# It assigns the e:ten%e% /erte: 0loc" of /erte: $ to a partition using the hash /alue of $# 2e can also achie/e the secon% goal re%ucing replication 0y %e/eloping a smart hash function 0ase% on the semantics of the gi/en %ataset# 7or e:ample, if .e "no. that /ertices ha/ing the same !,I prefi: are closely connecte% in the ,D7 graph, then .e can use the !,I prefi: to get the hash /alue, .hich .ill assign those /ertices to the same partition# The graph partitioning10ase% %istri0ution techni&ue utilizes the partitioning result of the minimum cut /erte: partitioning algorithm 8>;9# The minimum cut /erte: partitioning algorithm %i/i%es the /ertices of a graph into n smaller partitions 0y minimizing the num0er of e%ges 0et.een %ifferent partitions# 2e generate a /erte:1to1.or"er1no%e map 0ase% on the result of the minimum cut /erte: partitioning algorithm# Then .e assign the e:ten%e% /erte: 0loc" of each /erte: to the .or"er no%e using the /erte:1to1no%e map# Since it is li"ely that close /ertices are inclu%e% in the same partition an% thus place% on the same .or"er no%e, this techni&ue focuses on re%ucing replication 0y assigning close /ertices, .hich are more li"ely to ha/e common triples in their e:ten%e% /erte: 0loc"s, to the same partition#
IV. D NAD,I=D+TS
In this section, .e report the e:perimental e/aluation of our ,D7 %ata partitioning frame.or"# -ur e:periments are performe% using not only %ifferent 0enchmar" %atasets 0ut also /arious real %atasets# 2e first intro%uce 0asic character1 istics of %atasets use% for our e/aluation# 2e categorize the e:perimental results into t.o setsE 4>6 2e sho. the effects of %ifferent types of /erte: 0loc"s an% the %istri0ution techni&ues in terms of the 0alance% partitions, replicate% triples an% %ata partitioning an% %istri0ution time# 4<6 2e con%uct the e:periments on &uery processing latency#
A. Setup
2e use a cluster of <> machines 4one is the coor%inator6 on Dmula0 8<>9E each has >< GC , =, one <#? GBz *?10it &ua% core Neon D;;5( processor an% t.o <;(GC @<(( rpm S T %is"s# The net.or" 0an%.i%th is a0out ?( =COs# 2hen .e measure the &uery processing time, .e perform fi/e col% runs un%er the same setting an% sho. the &astest time to remo/e any possi0le 0ias pose% 0y -S an%Oor net.or" acti/ity# To store each generate% partition on each machine, .e use ,D715N 8>)9 /ersion (#5#; .hich is a ,D7 storage system# 2e also use Ba%oop /ersion (#<(#<(5 running on Fa/a >#*#( to run /arious partitioning algorithms an% 3oin the interme%iate results generate% 0y su0&ueries# In a%%ition to our ,D7 %ata partitioning frame.or", .e ha/e implemente% a ran%om partitioning for comparison, .hich ran%omly assigns each triple to a partition# ,an%om partitioning is often use% as the %efault %ata partitioner in Ba%oop =ap,e%uce an% BD7S# To sho. the 0enefits of %ifferent 0uil%ing 0loc"s of our partitioning frame.or", .e e:periment .ith fi/e %ifferent e:ten%e% /erte: 0loc"sE >1hop e:ten%e% out1/erte: 0loc" 4+out6, >1hop e:ten%e% in1/erte: 0loc" 4+-in6, >1hop e:ten%e% 0i1 /erte: 0loc" 4+-(i6, <1hop e:ten%e% out1/erte: 0loc" 4 -- out6 an% <1hop e:ten%e% in1/erte: 0loc" 4 --in6# lso hashing1 0ase% an% graph partitioning10ase% %istri0ution techni&ues are compare% in terms of 0alance% partitions, triple replication an% the time to complete the %istri0ution# 2e use the graph parti1 tioner =DTIS 8?9 /ersion ;#(#< .ith its %efault configuration to implement the graph partitioning10ase% %istri0ution#
B. Datasets
2e use t.o %ifferent 0enchmar" generators an% three %iffer1 ent real %atasets for our e:periments# To generate 0enchmar" %atasets, .e utilize t.o famous ,D7 0enchmar"sE L!C= 4Lehigh !ni/ersity Cenchmar"6 8>>9 ha/ing a uni/ersity %o1 main an% SA<Cench 8<(9 ha/ing a DCLA %omain# 2e
4i6 L!C=;(( 4*@= triples6 an% L!C=<((( 4<*@= triples6 co/ering ;(( an% <((( uni/ersities respecti/ely 4ii6 SA<C1 >((= an% SA<C1<((= ha/ing a0out >(( an% <(( million triples respecti/ely# 7or real %atasets, .e use 4iii6 DCLA 4;@= triples6 8>9 containing 0i0liographic %escriptions in computer science, 4i/6 7ree0ase 4>(>= triples6 8<9 .hich is a large "no.le%ge 0ase, an% 4/6 DCpe%ia 4<))= triples6 859 .hich is structure% information from 2i"ipe%ia# 2e ha/e %elete% any %uplicate triples using one Ba%oop 3o0# 7ig# ? sho.s some 0asic metrics of the se/en ,D7 %atasets# The %atasets generate% from each 0enchmar" ha/e almost the
100000
100000000
@su8(ects AlogB
100000 10000 1000 100 10 1
10000000 1000000
@su8(ects AlogB
100000 10000 1000 100 10 1
er of pr e%i cat es
(d)
1 1 10 E C 100 0 1 E C 1 1 E C / 1 E C 0 1 E C 4 1 E C . @outgoing 1 E C ? 1 E C 5
7 u &
' '%g Type+ s perPre Su8(dica ect tes per Typ e AlogB
0 0 0 0 1 0 0 1 0 0
@ s u
1 0 0 1 0 0
1 0 0 0 0 1 0 0 0 1 0 0 1 0 1
1 0 0 1 0 0 1 0 1
L D 3 = . L 0 D 0
1 1
@ s u
/ 0 0 &
1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1
0 0
1 0 0 0 0 0 0 0 0
LD3 =.0 0 LD3 =/0 00 SP/ 3110 0& SP/ 31/0 0&
+um Ty A pe re (i) @out goin g s %i -ut edg es pe c e Alog B r at % Su e (j) g 03e s Ine%g e ct p e er % %is T i tri y s 0u p t tio e r n i 0 u t i o n 7ig# ?E Casi c Char acter istics of Se/e n ,D7 %ata sets
same charact eristics regar%l ess of their %ata size, similar to those reporte % in 8'9# -n the other han%, the real %ataset s ha/e totally %ifferen t charact eristics# 7or
e:ampl e, .hile DCLA has <) pre%icat es, DCpe%i a has a0out ;),((( pre%icat es# To measur e the metrics 0ase% on types 47ig# ?4f6, ?4g6, ?4h66, .e ha/e counte%
those triples ha/ing rd&.type as their pre%icat e# The %istri0ut ion of the num0er of outgoin gOincom ing e%ges in the three C. real %ataset s 47ig# ?4%6 an% ?4e66 ha/e /ery similar po.er la.1li"e %istri0ut ions# It is interesti ng to note that, e/en though a huge num0er of o03ects ha/e less than >(( incomin g e%ges, there are still many o03ects ha/ing a signific ant num0er of incomin g e%ges 4e#g# more than >((,((
( e%ges6 in real %ataset s# Cenchm ar" %ataset s ha/e totally %ifferent %istri0u1 tions as sho.n in 7ig# ?4i6 an% ?436# ) e n e r a t i n ! P a r t i t i o n s 7ig# ; sho.s the partition %istri0uti on in terms of the num0er of triples for %ifferent 0uil%ing 0loc"s using the hashing1 0ase% an% graph partitioni ng1 0ase% %istri0uti on
o. that .e can get alm ost perf ectl y 0al anc e% part itio ns if .e e:t en% onl y out goi ng tripl es .h en .e con stru ct the e:t en% e% /ert e: 0lo c"s an% use the has hin g1 0as e% %ist ri0u tion tec hni &ue # The se res ults are rela te% to the
out goi ng e%g e %ist ri0u tion in 7ig# ?4%6 an% ?4i6# Sin ce mo st su03 ects ha/ e less tha n >(( out goi ng tripl es an% ther e is no su03 ect ha/i ng a hug e nu m0 er of out goi ng tripl es, .e can gen erat e 0al anc e% part ition s 0y %ist ri0u ting tripl
e s 0 a s e % o n s u 0 3 e c t s # 7 o r e : a m p l e , t h e m a : i m u m n u m 0 e r o f o u t g o i n g tr i p l e s o n L
e al s o . el l 0 al a n c e % . h e n . e e :t e n % o nl y o ut g oi n g e % g e s 0 e c a u s e = D TI S h a s g e n er at e % al m o st u
n i f o r m
n c e %
t h / a e n r t t e h : o s p e a r u t s i i t n i g o n t s h , e e / e n t h o u g h h a s h i n g 1 0 a s e %
. th at it is ha r% to ge ne rat e 0ala nce% partit ions if .e inclu %e inco ming triple s for our e:te n%e% /erte : 0loc "s, rega r%les s of the %istri 0utio n tech1 ni&u e# Sinc e there are man y o03e cts ha/i ng a huge num 0er of inco ming triple s as sho .n in 7ig# ?4e6 an% ?436,
partit ions ha/in g such high in1 %egr ee /ertic es may 0e muc h 0igg er than the other partit ions# 7or e:a mple , each of t.o o03ec t /ertic es on L!C =<( (( has a0ou t >* millio n inco ming triple s .hic h is
a0out *Q of all triples# Ta0le I sho.s the norma lize% stan%a r% %e/iati on 4i#e#, the stan%a r% %e/iati on %i/i%e % 0y the mean6 in terms of the num0 er of triples on gener ate% partitio ns to compa re the 0alanc e 0et.e en using the hashin g1 0ase% an% graph partitio ning1 0ase% %istri0 ution techni &ues#
T C L D I E C a l a
0loc"s using the hashin o f g1 0ase% p an% a graph r partitio t i ning1 t 0ase% i %istri0 o ution n techni s &ues# 4 Since + the o hashin r m g1 a 0ase% l %istri0 i ution z techni e % &ue is incorp s orate% t into a n our % imple a mentat r ion % using % Ba%oo e p for / efficie i nt a proces t i sing, o .e %o n not 6 Dataset separa L!C=<((( 4hashing6 te the L!C=<((( 4graph6 %istri0 SA<C1<((= 4hashing6 SA<C1<((= 4graph6 ution DCLA 4hashing6 DCLA 4graph6 time 7ree0ase 4hashing6 an% 7ree0ase 4graph6 the DCpe%ia 4hashing6 DCpe%ia 4graph6 e:ten% e% Ta0l /erte: e II 0loc" sho.s constr the uction e:ten% time# e% T /erte: 0loc" C L constr D uction time I I for E %iffere D nt : t 0uil%in e g n
n c e
s un # = . D e I TI t S f co i i ns r s i% s er t i a0 n ly n t /a e e rie e r s % e fo s r t t %if o i fe n re c g nt o %a n t ta / o se ts# Dataset e L!C=<((( 4hashing6 r n Si L!C=<((( 4graph6 t o nc SA<C1<((= 4hashing6 t e SA<C1<((= 4graph6 DCLA 4hashing6 t e ea DCLA 4graph6 ch 7ree0ase 4hashing6 h 7ree0ase 4graph6 e t lin DCpe%ia 4hashing6 h e DCpe%ia 4graph6 , a in 7or D t a the 7 = graph t D partitio % h TI ning1 a e S 0ase% t in %istri0u a f pu tion s o t techni& e r fil ue, on t m e the s a re other t pr han%, i es there n c en is an t o ts a%%itio o n a nal / /e proces = e rt sing D r e: time T s an to run I i % =DTIS S o its as n co sho.n i nn 0elo.# n t ec Concre p i te tely, to u m % run the t e /e graph rti partitio f t ce ning i o s, using l . =DTIS e r e
% e % / e rt e : 0 l o c " c o n s tr u c ti o n ti m e 4 s e c 6
e s s i n g # u r i m p l e m e n t a t i o n c o n s i s t s o f f o u r B a % o o p = a p , e % u c e 3 o 0 s
h s u 0 3 e c t t o i t s I D , 4 5 6 c o n / e r t e a c h o 0 3 e c t t o i t s I D , a n % ? 6 g r o u p a l l c o n n e c t e % / e r t i c e s o f
100 -0
@Tripl es A&illio nB
80 50 ?0 .0 40 00 /0 10 0
-0
rando&
@Triples A&illionB
80 50 ?0 .0 40 00
1-out
18
rando&
@Triples A&illionB
1? 14 1/ 10 8 ? 4 / 0
1-out
18
rando&
@Triples A&illionB
1? 14 1/ 10 8 ? 4 / 0
1-out
-0
@Triples A&illionB
80 50 ?0 .0 40 00
1 / 0 4 . ? 5 8 - 10 11 1/ 10 14 1. 1? 15 18 1- /0
/0 10 0 1 / 0 4 . ? 5 8 - 10 11 1/ 10 14 1. 1? 15 18 1- /0
Partition
1 / 0 4 . ? 5 8 - 10 11 1/ 10 14 1. 1? 15 18 1- /0
1 / 0 4 . ? 5 8 - 10 11 1/ 10 14 1. 1? 15 18 1- /0
/0 10 0 1 / 0 4 . ? 5 8 - 10 11 1/ 10 14 1. 1? 15 18 1- /0
Partition
Partition
Partition
Partition
(e)
DCpe %i a 4 h a s hi n g 6
@ T r i p l e s
18
A . ? 5 8 & - 10 11 1/ i 10 14 1. 1? 15 18 1- /0 l l Partition i o n B
1 / 0 4
(f)
-0
ra nd o& 1ou t
@Triples A&illionB
80
1in 18i
50
8 0 5 0 ? 0 . 0 4 0 0 0 / 0 1 0 0
/out /in
?0 .0 40 00 /0 10 1 ? 11 1? 0 / 5 1/ 15 0 8 10 18 4 14 1. 10 1. /0
/ 4 ? 8 10 1/ 14 1? 18 /0
Partit ion
Partition
Partiti on
(h)
DCL A 4 g r a p h 6 7ig # ;E Aa rtiti on Dis tri0 uti on
(i)
% rs e/en though it r has the u smallest n num0er n of i triples# It n too" g a0out 5; t hours to i con/ert m the e 7ree0as e to a o =DTIS f input = file# 2e e D 0elie/e a T that this c I is h S 0ecause # there / T are e h much r e more t /ertices e c .hich : o are # n connect T / e% to a a e huge 0 r num0er l s of e i /ertices o on I n 7ree0as I e an% I t DCLA s i as h m sho.n o e in 7ig# . ?4e6# s o Since t f most h D e:isting e C graph L partitioni c A ng tools o re&uire n i similar / s input e formats, r a this s 0 result i o in%icate o u s that n t the @ graph ti # partitioni m ; ng1 e 0ase% h %istri0uti a o on n u techni&u
7ig# * sho.s, for %ifferent 0uil%ing 0loc"s, the ratio of the total num0er of triples of all generate % partition s to that of the original ,D7 %ata# large ratio means that many triples are replicate % across the cluster of machine s through the %ata partitioni ng# Aartition s using the >1 hop e:ten%e % out1 /erte: or in1/erte: 0loc"s ha/e >
as the ratio 0ecause each triple is assigne% to only one partition 0ase% on its su03ect or o03ect respecti/ ely# s .e mentione % a0o/e, the results sho. that the graph partitionin g10ase% %istri0utio n techni&ue can re%uce the num0er of replicate% triples consi%era D. 0ly, compare% to the hashing1 0ase% techni&ue # This is 0ecause the graph partitioner generates compone nts in .hich close /ertices, .hich ha/e high pro0a0ilit y to share some triples in their e:ten%e% /erte: 0loc"s, are assigne% to the same compone
nt# Dspecially , there are only a fe. replicate% triples for the <1hop e:ten%e% out1 /erte: 0loc"s# Bo.e/er, for the other e:ten%e% /erte: 0loc"s, the re%uction is not so significant if .e consi%er the o/erhea% of the graph partitionin g10ase% %istri0utio n techni&ue # Q u e r y P r o c e s s i n ! To e/aluate &uery processin g performan ce on the generate% partitions, .e choose %ifferent &ueries from one
0enchmar " %ataset 4L!C=<( ((6 an% one real %ataset 4DCpe%ia 6 as sho.n in Ta0le IG# Since L!C= pro/i%es a set of 0enchmar " &ueries, .e select three among them# +ote that Q>> an% Q>5 are slightly change% 0ecause our focus is not ,D7 reasoning # 2e create three &ueries for DCpe%ia 0ecause there is no represent ati/e 0enchmar " &ueries# 7ig# @ sho.s the &uery proces sing time of the &ueries for
su03ect an% +- 0et.een shoul% in# The the ()B- .//E SDLDCT 0e results hashing1 Ly /su0-rganization-f locate% clearly 0ase% ()B- ./0E SDLDCT L: /un%ergra%uateDegree7rom in the sho. an% the ()B- ./1E SDLDCT L: 2BD,D same that graph DB edia ./E using partitionin /Carac" -0amapartition DB edia .2E SDLDCT Lperson 2BD,D for inter1 g10ase% DB edia .0E SDLDCT Lsuc< Lsuc> Lperson 2BD,D intra1 G= %istri0utio Lsuc< /successor G= processi n %ifferent process ng is techni&ue 0uil%ing ing# -n slo.er s is not 0loc"s the than significan an% other using t for most %ifferent han%, intra1 cases %istri0uti for G= e/en on DCpe%i processi though techni& a Q> ng they ha/e ues# ll .hich is 0ecaus %ifferent 0uil%ing a e of the 0alance 0loc"s re$erse commu an% ensure star nication replicatio intra1 &uery cost n le/els G= .here across as sho.n processi all triple the in Ta0le I ng for patterns cluster an% 7ig# L!C= share of *# 2e Q>? the machin thin" that an% same es to this is DCpe%i o03ect, 3oin 0ecause a Q< random interme% each 0ecaus , +-out iate partition e they an% -results is are out generate efficiently simple nee%s % from store% retrie/al inter1 su0&ueri an% &ueries G= es# in%e:e% ha/ing process 7urther in the only ing# 7or more, ,D71 one L!C= each specific triple Q>> Ba%oop storage pattern# an% 3o0 system, 7or DCpe%i usually thus such L!C= a Q5 has le/el of Q>5, .here a0out %ifference random, the >( on the +-in an% o03ect secon%s 0alance --in of one of an% nee%s triple initializa replicatio inter1 pattern tion n %oes G= is lin"e% o/erhea not ha/e processi to the % a great ng su03ect regar%le effect on using of the ss of its &uery Ba%oop other input processin 0ecaus triple %ata g# The e Q>5 is pattern, size# In only a star inter1 terms of nota0le &uery G= the %ifference an% so process intra1 is .hen all ing is G= .e use triples re&uire processi --out for sharing % for ng, the partitionin the random %ifferen same , +-out ce
T CLD IGE Queries
g 0ecause the graph partitionin g10ase% techni&ue re%uces the num0er of replicate% triples consi%era 0ly# Dspecially , on DCpe%ia .here -out using the hashing1 0ase% %istri0utio n techni&ue generates many replicate% triples, -out using the graph partitionin g1 0ase% %istri0utio n techni&ue is three times faster than that using the hashing1 0ase% techni&ue for DCpe%ia Q5# In summary, .e conclu%e that the most important factor of ,D7 %ata partitionin g is to ensure the intra1 G= processin g
4 "eplication "atio
!ashing-8ased
3 Eraph partitioning8ased
/ 1 0
. "eplication "atio
4 Eraph partitioning-8ased
!ashing-8ased
1-out
1-in
1-8i
/-out
/-in
1-out
1-in
1-8i
/-out
/-in
1 0+. 0
(b)
SA<C1<((=
(c) DCLA
7ig# *E Triple ,eplication
(d)
7ree0ase
(a)
r a n i n 9 u o e u
DCpe%ia
1/-in
out
1-in
1-8i
/-out
(a)
(((
L!C=<
1 2 r y P r o c e s s i n g T i & e i n S e c 1 0 0
1 2 -
r a n i n o u
1 0
Coth are special cases of our SA partition er# 2e argue that our SA frame. or" present s a system atic approac h to graph partition ing an% is more general an% easily configur a0le .ith customi za0le perform ance guarant ee# GI# C-+CL!SI
-+
90
91 910 914 914 911 911 0 Aha shi ngB Agr ap hB Aha shi ngB Agr ap hB Aha shi ngB Agr ap hB
(a)
L!
2e ha/e presente % SA , an efficient an% customiz a0le ,D7 (b) %ata DCpe partitioni ng mo%el for %istri0ute %
p r
o c
essing of
for all 4or most6 gi/en SA ,QL &ueries 0ecause inter1G= processing using Ba%oop is usually much slo.er than intra1 G= processing# If all 4or most6 &ueries can 0e co/ere% 0y only outgoing e:tension, the graph partitioning10ase% %istri0ution techni&ue shoul% 0e preferre% 0ecause it can re%uce the num0er of replicate% triples consi%era0ly# Bo.e/er, for all the other cases, the hashing10ase% %istri0ution techni&ue is a 0etter %istri0ution metho% if .e consi%er the huge o/erhea% of the graph partitioning10ase% techni&ue# lso, it shoul% 0e note% that running e:isting graph partitioner can 0e infeasi0le for some ,D7 %atasets %ue to their comple: structure#
0ig ,D7 graph %ata in the Clou%# The main contri0ution is to %e/elop a selection of scala0le techni&ues to %istri0ute the partition 0loc"s across a cluster of compute no%es in a manner that minimizes inter1no%e communication cost 0y localizing most of the &ueries on %istri0ute% partitions# 2e sho. that the SA approach is not only efficient for partitioning an% %istri0uting 0ig ,D7 %atasets of %i/erse sizes an% structures 0ut also effecti/e for processing ,D7 &ueries of %ifferent types an% comple:ity# &cknowledgmentE This .or" is partially sponsore% 0y grants from +S7 CISD +etSD program, SaTC program, an% Intel ISTC on Clou% Computing# , D7D,D+CDS
8>9 H 0out 7acete%DCLA,I http.00d(lp'l1s'de0d(lp22'php# 8<9 HCTC <(>< Dataset,I http.00km'ai&('kit'edu0pro*ects0(tc--3+-0# 859 HDCpe%ia 5#) Do.nloa%s,I http.00wiki'd(pedia'or!0Downloads14# 8?9 H=DTIS,I http.00www'cs'umn'edu05metis# 8;9 H-pening up Go/ernment,I http.00data'!o$'uk0spar6l# 8*9 H,esource Description 7rame.or" 4,D76,I http.00www'w1'or!0RDF0# 8@9 HThe Data1go/ 2i"i,I http.00data-!o$'tw'rpi'edu0wiki# 8)9 H!K Legislation,I http.00www'le!islation'!o$'uk0de$eloper0&ormats0rd&# 8'9 S# Duan, Kementsietsi%is, Srini/as, an% !%rea, H pples an% orangesE a comparison of r%f 0enchmar"s an% real r%f %atasets,I in S7)M8D, <(>># 8>(9 C# 7ran"e an% et al#, HDistri0ute% Semantic 2e0 Data =anagement in BCase an% =ySQL Cluster,I in 7""" L89D, <(>># 8>>9 Y# Guo, Z# Aan, an% F# Beflin, HL!C=E 0enchmar" for -2L "no.le%ge 0ase systems,I :e( Semant', /ol# 5, no# <15, -ct# <((;# 8><9 T# Beath, C# Cizer, an% F# Ben%ler, Linked Data, >st e%#, <(>># 8>59 C# Ben%ric"son an% ,# Lelan%, H multile/el algorithm for partitioning graphs,I in Supercomputin!, >'';# 8>?9 F# Buang, D# F# 0a%i, an% K# ,en, HScala0le SA ,QL Querying of Large ,D7 Graphs,I AGLDC, ?4<>6, pp# >><5R>>5?, ugust <(>># 8>;9 G# Karypis an% G# Kumar, H 7ast an% Bigh Quality =ultile/el Scheme for Aartitioning Irregular Graphs,I S7AM ;' Sci' omput', >'')# 8>*9 # Kyrola, G# Clelloch, an% C# Guestrin, HGraphchiE large1scale graph computation on 3ust a pc,I in 8SD7# Cer"eley, C , !S E !SD+IN ssociation, <(><, pp# 5>R?*# 8>@9 G# =ale.icz, =# B# ustern, # F# Ci", F# C# Dehnert, I# Born, +# Leiser, an% G# Cza3"o.s"i, HAregelE a system for large1scale graph processing,I in S7)M8D# +e. Yor", +Y, !S E C=, <(>(, pp# >5;R>?*# 8>)9 T# +eumann an% G# 2ei"um, HThe ,D715N engine for scala0le man1 agement of ,D7 %ata,I <he VLD# ;ournal, /ol# >', no# >, 7e0# <(>(# 8>'9 K# ,ohloff an% ,# D# Schantz, HClause1iteration .ith =ap,e%uce to scala0ly &uery %atagraphs in the SB ,D graph1store,I in D7D , <(>># 8<(9 =# Schmi%t, T# Bornung, G# Lausen, an% C# Ain"el, HSA<CenchE SA ,QL Aerformance Cenchmar",I in 7 D", <(('# 8<>9 C# 2hite an% et al#, H n integrate% e:perimental en/ironment for %istri0ute% systems an% net.or"s,I in 8SD7, <((<#
V. , DL
TDD
2-,K
Graph partitioning has 0een e:tensi/ely stu%ie% in se/eral communities for se/eral %eca%es 8>59, 8>;9# ,ecently a num0er of techni&ues ha/e 0een propose% to process ,D7 &ueries on a cluster of machines# =ost of them utilize e:isting %istri0ute% file systems such as BD7S to store an% %istri0ute ,D7 %ata# SB ,D 8>'9 %irectly stores ,D7 triples in BD7S as flat te:t files an% runs one Ba%oop 3o0 for each clause 4triple pattern6 of a SA ,QL &uery# 8>(9 uses t.o BCase ta0les to store ,D7 triples an%, gi/en a &uery, iterati/ely e/aluates each triple pattern of the &uery on the t.o ta0les to generate final results# Cecause they use general %istri0ute% file systems .hich are not optimize% for ,D7 %ata, their &uery processing is much less efficient than the state1of1the1art ,D7 storage an% in%e:ing technology 8>)9# 8>?9 generates partitions using e:isting graph partitioning algorithms an% stores them on ,D71 5N# s .e reporte% in Sec# IG, running an e:isting graph partitioner for large ,D7 %ata has a huge amount of o/erhea% an% is not feasi0le for some real ,D7 %atasets# In a%%ition, t.o recent papers 8>@9, 8>*9 ha/e propose% t.o %ifferent approaches to general processing of 0ig graph %ata# 8>@9 promotes to store a /erte: an% all its outgoing e%ges together an% implemente% a %istri0ute% system to sho. the parallel processing opportunity of their approach# 8>*9 proposes to store a /erte: an% all its incoming e%ges together an% sho.s that this can spee% up some graph processing on a single ser/er, such as Aage,an"# In comparison, 8>@9 presents a system 0uilt using out1/erte: 0loc" for 0oth su03ect an% o03ect /ertices, .hereas 8>*9 presents a single ser/er solution using the in1/erte: 0loc" for 0oth su03ect an% o03ect /ertices#