Professional Documents
Culture Documents
SubGraphChallenge 2017-02-09
SubGraphChallenge 2017-02-09
SubGraphChallenge 2017-02-09
- draft -
http://GraphChallenge.org
Outline
• Introduction
• Data Sets
• Metrics
• Summary
Graph Challenge - 2
Introduction
Graph Challenge - 4
Static versus Streaming Mode
Graph Challenge - 5
Labels and Filtering
• Filtering on edge/vertex labels is often used when available to reduce the search
space
– Filtering can be applied at initialization, during intermediate steps, or at the vertex level
– Very problem dependent
• Some Graph Challenge data sets have labels and some data sets are unlabeled
– Some participants will want to filter on labels
Graph Challenge - 6
Outline
• Introduction
• Data Sets
• Metrics
• Summary
Graph Challenge - 7
Publicly Available Datasets
Initial Datasets in Blue
Stanford Large Network Dataset Collection snap.stanford.edu/data
• Social networks (10 data sets up to 4.8M vertices & 69M edges) • Road networks (3 data sets up to 1.9M vertices & 2.8M edges)
– Online social networks, edges represent interactions between people – Nodes represent intersections and edges roads connecting the
intersections
• Networks with ground-truth (6 data sets up to 66M vertices & 1.8B edges,
e.g. Friendster) • Autonomous systems (5 data sets up to 26K vertices & 106K edges)
– Ground-truth network communities in social and information networks – Graphs of the internet
• Communication networks (3 data sets up to 2.3M vertices & 5M edges) • Signed networks (10 data sets up to 4.8M vertices & 69M edges)
– Email communication networks with edges representing communication – Networks with positive and negative edges (friend/foe, trust/distrust)
• Citation networks (3 data sets up to 3.7M vertices & 16M edges) • Location-based online social networks (2 data sets up to 198K vertices
& 950K edges)
– Nodes represent papers, edges represent citations
– Social networks with geographic check-ins
• Collaboration networks (5 data sets up to 23K vertices & 198K edges)
• Wikipedia networks, articles, and metadata (7 data sets up to 3.5M
– Nodes represent scientists, edges represent collaborations
vertices & 250M edges)
• Web graphs (4 data sets up to 875K vertices & 5.1M edges)
– Talk, editing, voting, and article data from Wikipedia
– Nodes represent webpages and edges are hyperlinks
• Twitter and Memetracker (4 data sets up to 96M vertices & 476M edges)
• Amazon networks (5 data sets up to 548K vertices & 3.4M edges)
– Memetracker phrases, links and Tweets
– Nodes represent products and edges link commonly co-purchased
products • Online communities (3 data sets up to 2.3M images)
– Data from online communities such as Reddit and Flickr
• Internet networks (9 data sets up to 62K vertices & 147K edges)
– Nodes represent computers and edges communication • Online reviews (6 data sets up to 34M product reviews)
– Data from online review systems such as BeerAdvocate and Amazon
Graph Challenge - 8
Publicly Available Datasets
Initial Datasets in Blue
AWS Public Data Sets aws.amazon.com/public-data-sets Yahoo! Webscope Datasets webscope.sandbox.yahoo.com
• Astronomy (1 data set 180 GB) • Advertising and Market Data (4 data sets up to 3.7 GB)
– Sloan Digital Sky Survey SQL MDF files – Yahoo!'s auction-based platform for selling advertising space
• Biology (4 data sets up to 200 TB) • Competition Data (3 data sets up to 1.5 GB)
– Genome sequence data
– Data challenges run by Yahoo
• Climate (3 data sets size growing daily)
• Computing Systems Data (5 data sets up to 8.8 GB)
– Satellite imagery data
• Economics (10 data sets up to 220 GB) – Computer systems log data
– Census, transaction, and transportation data • Graph and Social Data (3 data sets up to 5 GB)
• Encyclopedic (10 data sets up to 541 TB - Common Crawl Corpus) – Graph data from search, groups, and webpages
– Various online encyclopedia data
• Image Data (3 data sets up to 14 GB)
• Geographic (3 data sets up to 125 GB)
– Flickr imagery and metadata
– Street maps
• Mathematics (1 data sets 160 GB) • Language Data (29 data sets up to 166 GB - Answers browsing behavior)
– University of Florida Sparse Matrix Collection – Wide range of question/answer data sets
• Ratings and Classification Data (10 data sets up to 83 GB — 1.5 TB
Graph500.org Data Generator dataset withdrawn)
– Community preferences and data
• Used to generate world’s largest power law graphs
• Can be modeled with Kronecker Product of a Recursive Tools to generate (at various scales and parameters)
MATrix (R-MAT): G⊗k = G⊗k–1 ⊗ G
– Where “⊗”denotes the Kronecker product of two matrices data sets will be provided as part of the challenge
Graph Challenge - 9 • R-MAT: A Recursive Model for Graph Mining, Chakrabarti, Zhan & Faloutsos (2004)
• Mathematically Tractable Graph Generation and Evolution, Leskovec, Chakrabarti, Kleinberg & Faloutsos, (2005)
GraphChallenge Data Formats
8
57
78
rds
es1 ang5
ha
ric
ctiv qu
y_
ete am
_ton
nts
ph
odey
ha
ey
erc
ha adyD
SS
DrKe
davidth
bb
moh
M
DomSpoerli
PO urj ed
nA
PG
a
ID ik
L
_y
lu
LinuxMisyone ri
nto
rin
am
re
Va
DayTTa
ire
im
Sinan
ate sl
je
ow
mD RrlaciS
c
ed
dn
ak
Jh tors
D
ris
e
Ra
6C
lsrl
bem
rm
_h
he
ajou
ch
on
linuxnotla
z ay
ch
Cana
9
lp
an
ah
elA yK
ek
terd
m
ya
hem
luqm
nM
nx
kO
nd
r
on
og nixnd
Ju yP
es
2n
n
ks
rwso
n ge inen
veet
ri
ey
hipa
dJa
l
ha
O
Ed
an T_
oist ola
an o karisaer
ita
rm gerD rsohlenmis
ig riK
Clic
psy
Pet i usiJaa
mou wor
kasimerkan rees
nD
stertpraetorianlabs
leillK
Me
u3re
so
g
eI U
RFairclo
kI
sya
G po bittaGin ndWod
ct3
_S us login
ea Aks A
Net
ntai ks
e haJHsv
esh
t t
Rerie ne es kki
Th C Gr_
in t
kt
thrertis
C
nresw8rbwpgri4
ckie vaH ez
ec
icky de
on
atC
altd LD
clear
Ti
tuA
AA phai CA
rism
gu d Pere PJ
nsI 4Vchinsi
Fauta
e g owie
re
G ns
VM
pan
an S
ughdom
cuer t ou u27 Te nfo
chmichto vikY
rFF ies7 gg
RN ne
ooha
Bfo
et Lara garin olic afta B
Info
m 107
rdD
Thai
ec
tofuS
Hop sh Cas d
vnik5287
glew
run
cFieose
rC
scottdware
arx etIO
m mK Fran
P n2te
gm
itc te av rso
icloha
mastalet
raM n
Ha
ar
D
nic
gu
hej
ud H el
ecJ joe
U kg 1
es
k__w
ile
ud ello _ linprot
sw In djdrdika
emoore Aio
lh0
eill
ku im
Sock
an s
_
ig
F ymza ta r wi
JayKyuu
uz
thecre
toaxgu
esu kle
anHa ob
josephredfern
nho
Arb
n
ota asgr bioS or
Ch1 coliccn
TheGutes
e
hyeti
he aHuhmpile
M
Gar m ala
Infola it eK rbyem ng
s in
willie x g3aT
Nftas
Ja
oherr
hing
eckm
dire
ikis rk Liv ecm rli inbimat Jasin anmegerCo heini
hm er6_6
llsttArt
s
Sureca are
ne
TACp cje teng
Cover
eri
m
pa ia D num
Hot uug th ays se onys Scec hester Ge ut no a the88 PauliAlti
tI
od
shca g se
Am
ulC k
ch Sec
pestnd GISg sal ian
Tural30
vid th sp Pa dd kjope tomihj jve
k rack Spo ee info Cynpsd an w b berze kra ilwain
ieIO
am da
ity
deerv t vo Ho pz cy junaxu mattmc atotoro
CertiVox
CurlyBe de UX lexisani io_ ox saunmobify
autys equarte d_o Bar Turm 15
Bluc
n oinB ELF
y BAES Andvgc reaerf thalmicTese AlexKara laGe Indie
msauk ko ano
ospad
nodesecu ystems
evktalo archer aggiora
giannim
ichetandhem rity _AI r00f nikmd23 hack_lu SofieHa gen ova
Apryl_P
Tiffani_B
bre bull3tp erry atseitlin
ibroadfo
minoad7 nolaforensix xandersh farahney
antkanan is RusiAlpo KPbewelldoc
_tty0
SexyLikeMeios
AcademiaObscura inboxbygmail mikk0j kev_mcdougall skriptuurus
defcoin
ustwogames anttitikkanen
willcanine SpikeBeeJo
l BitquarkBot twittersecurity NeustarCTO
martinbubertendart
romainle daavidhentunen marpichun
AbiWilks HopeFran
jerrychen k
bspector
miratim GPGTools chrisb jonasanbrow herrod
sitelutions certiv
kelly
nsporndly
TK5EPfujin_ i Reversity camicelli shittydogs Mbosinwa
ihmd mu ne ge
elizae
ox
stevewerby
Agsarr simonb
lab abhishe fallen jehova SoluM prcheachineney s
ksau
bradfitz
man and idcom kmdb jasontco
therealred wZeal
peoE i_FRCryp ayly soto phil hen mesosphe Boulevard_
dotG ngc
C
pra ponore leahk
YaiAou
H
PureNe cpu PekkaN
n
ijim illa gua uipers
an en sege
hilto
So
n0 out
nalytics riaNig q hN3Weikan
RISI
landonfulle
h_ sleemarkacigeeb l htD stin erikander
IO
ar hum angeTe e
Piqua kreno
l iya ws der
oursdemer
x0
m
RulerA 75
accessjames
iveStu
No vir abn
eliGN
avemd re st viP_ri mat eichp9
Daarchuser ble ro alb lorbo son
m
h ic
0 sb
xa
samhocevar Lucio
Ha aeFe
reNZtr n1dq Se ria ma ertl04
sV ne Fig dio
ers
an
ultweet
erdo
einm evetesh tbra1c 0olng
rs Cus
ini10
osxnt
_
Z ulliv nm
loogxii nox_ CAh arsmi reasona htThe ha
mo
An onO_Po cke mhausher
i b rafa
anarchival
nLucceitanm
eC om
Explo wo JR ga Gudad
mathpunk
pNattrs s es
Sa
tem m
orp mitS
odstar roag
ellla_
a
s
r
auditdi
er
o
jonoevheat Gajam
WOGofFARs ztehowe lessn lehtio eyCh snd jda sb
amA ns
ysh_ga er4 G su Ancks emmangoldstein DC s Br rebrowfest bernin
zfson
lana
i
deklano
rcasticlArchip
Ubad
rgy
irib b
To9
_ua ian
yami
ksid1HA ng r2 upha
bth
iusS oRn o
esaopera
er_P
erd9
fis ot top Uto Go d wski
ba herc
ga
alk
rlie mu
a t AS nt n
raffi
tre ea
2 ifeulne
ystem
SBhi
lly
MarineTckUK
dpKhal ud je an e es Keltoedy D ke lerthbpi Erik ule arOS
pdhalili
idEne
a Rno Colis jjcga
zar
flamsm
trcoipn mille
aark
ett ca rm
lent
dyR ouln rtw en nn t_is _S
PoliceVideo
Clo qm
Spac dk s_ed
r gusuEC
sh
Kolib
open
cijournalism
m Mw
Roc
LE_spckyfe an cTionk
litt
_girlveRo
H i sununteVer
Pron
gg
pS
bla Mic un utro
ip11 To ey
cu hAatrtifa_io an okjetaoscl felip ervic
eti
An
fallsinJe
Ro
ad ivlp ree
lo oP
uic A st ysSea
ra ti ele P
sleepyle UnmagicalMe
ace rreiraikam
V
iallu sktnwe LB e on
lec PerE
hae hahe
deray
tieurm
ck hqeirllee etbe er
es
ecli
wd Debod
theodoricm
hory
z
T
M
or db
eo
th no ainDeb
bct de
rech
avalo
The_SolarS
re
yawnbo
he 6ra6nd
TheMatja
z hersvm
Lilia
s ofic
Astero
d e Fo
op
rs oor anna
coracchri
cgoa nN coineaslp
alc din
fre pla
mile LxM C2D rtsra penos ap adrie
at
am Netm
tions
sonmic uardng
rg
Ph
le
ni
arkila langl
__agwa
sc
rtorseregu
lle
a_ma
ageis eb otr_im
ra e c ndr RCAn
tlu i diea alex n692
etnaS
Dtoistitzere
birds
olas flo
sd nsP langcKry P nruiz ajuC
uin eb 16
p
ro
_S
ulist sin
gda Fr
skim
ul
DebConf tonic
Ap
en wsom om RCte
jheb orba UN 90
on ver
JD nd
th
tealc
4b
pp lear
th
Jam
bz
l mfr
us532 sy ks kg
ca
n
u ep H o
eb
mur
lia
ut1 a
gir le hu
EightTons
Yo
urrier
ke Swdmro s ISorror
yipo
as age tocoIL_ ra
e201
sednawk ert ieli BF
spst ks
ge lesh n3
he
AC
keska
lComunityCube cssq
jlan
al
neahe
x
ca
scnd
O
cl
lly eieeidg_
_jo
KillYourTV_I2P Fund
one
cavatica
an ure ejomcin Fdan_sie
rshall_
ow 005
edo
ira ckstca
Cha
ktro anns
CSua
McN
n
lan _N
mie
casa
o1 ne
esR
tte
oo
Dog me
LUrlieS
eyer
lecy
M in huAPISHsllvim re_u
Ibasw
bo
rdluu
ana
be
HUBB be
__ nejoepie91
4
NT
s geln
ary
T
muO
te
raisen
ItalianFD
dPCcam
jim
ni
iamjohnoliver o
tie
nkI inm Flo nb
anto
na
sM
thhn
i2porignal
n
O kurs Kedecourl aks
ibNorC
Sklcrshrmtn
o
indie
lej
vin ayy11 mna tilla
ntoon Kad
im
eirsv
die
ichjo
plan
swad
IC
sl1pe Ja dSmta
CC
au
lu
_m
sander
ug
odry
mer
ans
MorH
ior
hremWasTaken
errls
ele
Tom
386_ 3Mch
te
Street_Dogglevudev
eSpto
enn
un is kayg
augru
4kmith
es_
al
les
Da
jim
Ke ust w
die
thle
au
mipham
ks ooks
Sp g _
avs
anked
l3 vy_Cre
g bHvied mm
TA
am
dyt
ParteiHH
nieM
att
l89 oks
Har
yaurlana
Nienor_
BB a
utar
3tnO
ff
ehes
SX
Yorgosh
iamsambee
RA
nrtG MsLods lRari
CrowdTRe
mon
ez
r ka
erg
phlast
opjk
boo_eb
oklCh Tire
diau
kayla_ebo
ss
JM
SP
supaheld
TL
rs eD
bv
books
ers
jcs om
esA
da
yG
les
Fl0range
illym
oad
ro
w
aJe
s sc13ns Franz wo eleben
va
erer
87
oks
bo
ez R ey rits
ndro799imJPion
diah
ee
i_erian
ica Numm
csdotc
ne
frod
a_enyc
2
samp
zm
de
Mauerfal
D3rB3obacht3r
ago
tin ds ijsF
nda
en 20 ric
cznweb
go
co
ganebo
wand
chaospira
m
G
dueri
Jede
elizabe
sfeed
torma
in
jcsjay
snob
m
OlafScholz
In 9e0jp e
rre
yla
r _L noF
lig
tesse
ari
ahn
tbw
Th fatim
tail
alld
sp
Tay
X
su thy
La rk ss_
m
ort
ther
esl
XX
Ne
eD uthe
on
nki o
SPIEG fibre
Ideen
atB
eD
mo
gnew
bike
ale ob
biozul
dP
erMa
to
iko Tim
de H
joshca
ky
iggs
a le
thusa
She
rd4U_
firmreadgoldtris
lo
za
MontrealSauce
nt uib
Be Be
nixs
moo
eie
tin
atr
Cyclin
Bundes_G
lla
gazin
getlychee
n
EL_
oe
ijs
th
sP
Pro ekh
traeum
euNetzros
Rits
Bro
vo
ap
Ata 234
Lo
ph
fraukeld
No
HL
fV uis
nXploit
em
p
nF
ca infil
diN
mr
ee
riF 2
matthich
a
liz
nde
ns
ov h
ale
de
e 00p
autis ssivau
i art
tra
enSek
philsweeney
pre
u ticho
ro
r75 tok
YourAno
ShoKN
Au
be
_bolte
Blit keel
ILiv
rt
e_av
zW y
eCing
tor
ya
of0fe0
tism
acmebench
in
e
usleb
billo
Uru
fm
Sho
en
ater
calsilcock
ials
Wikipedia Voting Wikipedia Adminship Road Network Twitter Data
Amazon cloud services will be used to host particularly large data sets
Graph Challenge - 10
The Matrix Market Exchange Formats: Initial Design, Boisvert, Pozo & Remington (1996)
Outline
• Introduction
• Data Sets
• Metrics
• Summary
Graph Challenge - 11
Static Subgraph Isomorphism Problem
Vertex-Based Array-Based
• Is there a 1-to-1 mapping of the vertices • A subgraph isomorphism exists iff there
in H to vertices in G such that every is a permutation array P where
edge in H is also in G ?
B = PT A P
Bold uppercase indicates a matrix, Bold lowercase indicates a vector, lowercase indicates scalars, sets, ...
Graph Challenge - 12
Scaling Observations
Graph Challenge - 13
Triangle Counting
Information Visualization
Paul Burkhardt
rkhardt Abstract
5
The increasing size of Big Data is often heralded but how data are transformed and represented is also pro-
foundly important to knowledge discovery, and this is exemplified in Big Graph analytics. Much attention has
been placed on the scale of the input graph but the product of a graph algorithm can be many times larger
hich every vertex is connected to every other vertex, Theorem 1. Given G P
and the Hadamard product
than the input. This is true for many graph problems, such as listing all triangles in a graph. Enabling scal-
able graph
2 s exploration for Big Graphs requires new 1 approaches 2 to algorithms, architectures, and visual analy-
d a cycle, i.e. a path with identical endpoints. P A), then D(v) =
(A (A A) and D(G) = s
tics. A brief tutorial is given to aid the argument for
2 thoughtful
v v of data in the context of graph
representation
1
analysis. 2 s algebraic method to reduce the arithmetic operations in counting and listing triangles in
Then a new
iangles in graphs are useful for complex network (A A) .
graphs ij
6 isijintroduced. Additionally, a scalable triangle listing algorithm in the MapReduce model will be pre-
sented followed by a description of the experiments with that algorithm that led to the current largest and
alysis based solely on structural, i.e. adjacency, fastest triangle listing benchmarks to date. Finally, a method for identifying triangles in new visual graph
Proof. The paths of length two, i.e. 2-paths, are iden-
exploration technologies is proposed.
ormation. Let D(v) be the local count of triangles
tified by A2 ; then element-wise multiplication A A2 s
2
C(i,j) = {i,x,y} iff
A(i,x) = A(i,y) = 1 and ET(x,j) = ET(y,j) = 1
0 1
0 1 0 1 ; ; ; ; ; ;
0 0 0 0 1 1 0 0 0 0 0 B C
B C B 0 1 0 0 1 0 C B ; ; ; ; ; {2, 4, 5} C
0 0 0 1 1 B C
C=AET ==BB@ C
B C
A B
A = B 0 0 0 1 1 C
C E B
T
B = B 0 0 1 1 0 0 C
C
; ; ; ; ; {3, 4, 5} C
@ 0 1 1 0 1 A @ 0 0 0 1 1 1 A ; {4, 2, 5} {4, 3, 5} ; ; ; A
1 1 1 1 0 1 1 1 0 0 1 ; ; ; {5, 3, 4} {5, 2, 4} ;
Example 3-truss
Graph Challenge - 17 Graphulo: Linear algebra graph kernels for NoSQL databases
Gadepally, Bolewski, Hook, Hutchison, Miller & Kepner, IPDPS GABB 2015
K Truss: Array Formulation
- Example Algorithm -
Algorithm
• If E is the unoriented incidence array
(rows are edges and columns are R=EA
vertices) of graph G, and A is the x = find( (R == 2 ) 1 < k − 2 )
adjacency array while x
Ex = E(x,∶)
• If G is a k-truss, the following must E = E(xc,∶)
be satisfied R = E(xc,∶) A
– (E A == 2) 1 > (k – 2)
R = R − E ( ExExT − diag(ExExT) )
x = find( (R == 2 ) 1 < k − 2 )
Graph Challenge - 18 Graphulo: Linear algebra graph kernels for NoSQL databases,
Gadepally, Bolewski, Hook, Hutchison, Miller & Kepner, IPDPS GABB 2015
Outline
• Introduction
• Data Sets
• Metrics
• Summary
Graph Challenge - 19
Metrics
• Correctness
– Number of triangles is exact for a given graph
– Enumerating all k-trusses is exact for a given graph
• Performance
– Total number of edges in graph (edges)
– Execution time (seconds)
– Rate (edges/second)
– Energy (Watts)
– Rate per energy (edges/second/Watt)
Graph Challenge - 20
Summary
• An implementation will perform these algorithms on provided data sets and report the
given metrics
• A submission to the IEEE HPEC Graph Challenge consists of a conference style paper
describing the approach, implementation, innovations, and results
Graph Challenge - 21