Csci5352 2017 L8
Assistant Professor of Computer Science
University of Colorado Boulder
External Faculty, Santa Fe Institute
[Figure: plant → herbivore → parasite]
hierarchical communities
• modules
• nested modules
• assortative modules
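[Figure: dendrogram with a probability $p_r$ at each internal node $r$, and a network instance with vertices $i$, $j$]
$\Pr(i, j \text{ connected}) = p_r = p(\text{lowest common ancestor of } i, j)$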
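$\Pr(A \mid D, \{p_r\}) = \prod_r p_r^{E_r} (1 - p_r)^{L_r R_r - E_r}$
where $L_r$ and $R_r$ are the numbers of leaves in the left and right subtrees of internal node $r$, and $E_r$ is the number of edges between those two subtrees
$\mathcal{L}(D, \{p_r\}) = \prod_r p_r^{E_r} (1 - p_r)^{L_r R_r - E_r}$
Clauset, Moore, Newman, Nature 453, 98-101 (2008)
Clauset, Moore, Newman, ICML (2006)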
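[Figure: example dendrogram with one internal node at $p_r = 1/9$ and the remaining internal nodes at $p_r = 1$]
$\mathcal{L} = \left(\tfrac{1}{9}\right)^{1} \left(\tfrac{8}{9}\right)^{8} \approx 0.0433$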
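The likelihood above can be checked numerically with a minimal sketch. It assumes each internal node is given as a triple (E_r, L_r, R_r) and that p_r is set to its maximum-likelihood value E_r / (L_r R_r). The function name `hrg_likelihood` and the six-vertex example dendrogram are hypothetical, chosen only so that the result matches the 0.0433 shown on the slide.

```python
def hrg_likelihood(internal_nodes):
    """Likelihood of a dendrogram given (E_r, L_r, R_r) for each internal node.

    Assumption: p_r is fixed at its maximum-likelihood value E_r / (L_r * R_r).
    """
    likelihood = 1.0
    for e_r, l_r, r_r in internal_nodes:
        pairs = l_r * r_r            # number of vertex pairs split at node r
        p_r = e_r / pairs            # MLE of the connection probability at r
        # Python evaluates 0.0 ** 0 as 1.0, so nodes with p_r = 1 contribute 1.
        likelihood *= (p_r ** e_r) * ((1.0 - p_r) ** (pairs - e_r))
    return likelihood


# Hypothetical dendrogram on 6 vertices: the root splits them 3 | 3 with a
# single edge across; each group of 3 is a fully connected triangle.
example = [
    (1, 3, 3),  # root: 1 edge among 9 pairs, p_r = 1/9
    (2, 1, 2),  # left group: one vertex vs. the other two, p_r = 1
    (1, 1, 1),  # left group: remaining pair, p_r = 1
    (2, 1, 2),  # right group, p_r = 1
    (1, 1, 1),  # right group, p_r = 1
]
print(round(hrg_likelihood(example), 4))
```

Only the root contributes a factor different from 1, which is why the product reduces to (1/9)(8/9)^8.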
[Figure: degree distribution. Fraction of vertices with degree k vs. degree k (log-log axes), original vs. resampled networks]
[Figure: density of triangles. Fraction of graphs with clustering coefficient c vs. clustering coefficient c, original vs. resampled networks]
[Figure: geodesic distances. Fraction of vertex-pairs at distance d vs. distance d (log vertical axis), original vs. resampled networks]
[Figure: MAP dendrogram and network layout, vertices numbered 1 to 34]
[Figure: MAP dendrogram and network layout, vertices numbered 0 to 114 and labeled with school names, e.g. BrighamYoung (0), PennState (6), Wisconsin (15), Vanderbilt (62)]
[Figure: link prediction. AUC vs. fraction of edges observed, k/m, including panel b, T. pallidum metabolic network. Methods compared: pure chance, common neighbors, Jaccard coefficient, degree product, shortest paths, hierarchical structure]
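To make the AUC comparison concrete, here is a minimal sketch of three of the baseline predictors named in the figure legend (common neighbors, Jaccard coefficient, degree product) and of one common way to compute AUC: the probability that a held-out true edge scores higher than a true non-edge, with ties counted as one half. The toy graph, the held-out pairs, and all function names are assumptions for illustration; this is not the hierarchical-structure predictor itself.

```python
def build_adjacency(n, edges):
    """Adjacency sets for an undirected graph on vertices 0..n-1."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def degree_product(adj, u, v):
    return len(adj[u]) * len(adj[v])


def auc(score, adj, held_out, non_edges):
    """Fraction of (held-out edge, non-edge) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for a, b in held_out:
        for c, d in non_edges:
            s_true, s_false = score(adj, a, b), score(adj, c, d)
            wins += 1.0 if s_true > s_false else 0.5 if s_true == s_false else 0.0
    return wins / (len(held_out) * len(non_edges))


# Toy graph (hypothetical): two 4-cliques joined by the bridge (3, 4).
clique_a, clique_b = [0, 1, 2, 3], [4, 5, 6, 7]
true_edges = {(u, v) for grp in (clique_a, clique_b) for u in grp for v in grp if u < v}
true_edges.add((3, 4))
held_out = [(0, 1), (5, 6)]                      # edges hidden from the predictor
observed = true_edges - set(held_out)
non_edges = [(u, v) for u in range(8) for v in range(u + 1, 8) if (u, v) not in true_edges]

adj = build_adjacency(8, observed)
for name, fn in [("common neighbors", common_neighbors),
                 ("Jaccard", jaccard),
                 ("degree product", degree_product)]:
    print(name, auc(fn, adj, held_out, non_edges))
```

A value of 0.5 corresponds to the "pure chance" baseline in the figure.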
other approaches
Peixoto, Phys. Rev. X 4, 011047 (2014)
"… very popular approach of modularity optimization, which lacks built-in statistical validation, but also for more principled methods based on statistical inference and model selection, which do incorporate statistical validation in a formally correct way. Here, we construct a nested generative model that, through a complete description of the entire network hierarchy at multiple scales, is capable of avoiding this limitation and enables the detection of modular structure at levels far beyond those possible with current approaches. Even with this increased resolution, the method is based on the principle of parsimony, and is capable of separating signal from noise, and thus will not lead to the identification of spurious modules even on sparse networks. Furthermore, it fully generalizes other approaches in that it is not restricted to purely assortative mixing patterns, directed or undirected graphs, and ad hoc hierarchical structures such as binary trees. Despite its general character, the approach is tractable and can be combined with advanced techniques of community detection to yield an efficient algorithm that scales well for very large networks."
DOI: 10.1103/PhysRevX.4.011047 Subject Areas: Complex Systems, Interdisciplinary Physics, Statistical Physics
I. INTRODUCTION
"The detection of communities and other large-scale …"
"The method that has perhaps gathered the most widespread use is called modularity optimization [10] and …"
[Figure: nested model. l = 0: observed network, N nodes, E edges. l = 1: B_0 nodes, E edges. l = 2: B_1 nodes, E edges. l = 3. Annotation: fit another SBM to these, repeat]
connection probabilities within and between the two groups:
$\begin{pmatrix} c_{in}/n & c_{out}/n \\ c_{out}/n & c_{in}/n \end{pmatrix}$
the limits of statistical inference
planted partition problem
• synthetic data with known communities
• 2 groups, equal sized
• mean degree c
• $\epsilon = c_{out}/c_{in}$
• 2nd order phase transition in detectability
• overlap goes to 0 for $\epsilon \ge \dfrac{c - \sqrt{c}}{c + \sqrt{c}\,(k - 1)}$
[Figure: overlap (accuracy) vs. ε = c_out/c_in, running from strong communities (small ε) to random graph (ε = 1), with regions marked easy to detect, hard to detect, and undetectable. Panels: q=2, c=3 and q=4, c=16. Curves: N=500k, BP; N=70k, MCMC; N=100k, BP; N=70k, MC; N=128, MC; N=128, full BP]
Excerpt shown on the slide (cropped at both margins): "… c_out/c_in > ε_c. In other words, in this region both BP and MCMC converge to the factorize… …inals contain no information about the original assignment. For c_out/c_in < ε_c, however, th… factorized fixed point is not the one to which BP or MCMC converge. …ht-hand side of Fig. 1 shows the case of q = 4 groups with average degree c = 16, correspondin… Newman and Girvan [9]. We show the large N results and also the overlap computed wit… 128 which is the commonly used size for this benchmark. Again, up to symmetry breakin… …es the best possible overlap that can be inferred from the graph by any algorithm. Therefor… …ested for performance, their results should be compared to Fig. 1 instead of to the common bu… …t the four groups are detectable for any ε < 1."
Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
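As a small worked check of the detectability threshold, the sketch below evaluates ε_c = (c − √c) / (c + √c (k − 1)) for the two parameter settings in the figure, and tests detectability through the equivalent condition |c_in − c_out| > k√c with c = (c_in + (k − 1) c_out) / k. It assumes the garbled expression on the slide reads as that ratio. The function names `epsilon_c` and `is_detectable` are hypothetical.

```python
from math import sqrt


def epsilon_c(c, k):
    """Critical ratio c_out / c_in at or above which the overlap goes to 0.

    c: mean degree, k: number of equal-sized groups.
    """
    return (c - sqrt(c)) / (c + sqrt(c) * (k - 1))


def is_detectable(c_in, c_out, k):
    """Equivalent form of the threshold: |c_in - c_out| > k * sqrt(c)."""
    c = (c_in + (k - 1) * c_out) / k      # mean degree for k equal-sized groups
    return abs(c_in - c_out) > k * sqrt(c)


print(epsilon_c(3, 2))    # setting q=2, c=3 from the figure
print(epsilon_c(16, 4))   # setting q=4, c=16 from the figure

# With k = 2 and c = 3: eps = 0.2 gives (c_in, c_out) = (5, 1),
# and eps = 0.5 gives (c_in, c_out) = (4, 2).
print(is_detectable(5, 1, 2), is_detectable(4, 2, 2))
```

The two forms agree: a ratio below ε_c corresponds to a gap c_in − c_out larger than k√c.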
limits of statistical inference
Decelle, Krzakala, Moore, & Zdeborová, Phys. Rev. Lett. 107, 065701 (2011)
Ghasemian et al., arxiv:1506.0679 (2015)
Peixoto, Phys. Rev. X 4, 011047 (2014)
Newman & Clauset, Nature Communications, to appear (2016)
the trouble with community detection
[Figure: two network layouts, one with vertices numbered 1 to 34 and one with vertices numbered 0 to 114]
[1] maximum NMI between any partition layer of the metadata partitions and any layer returned by the community detection method
but wait!
idea:
use metadata $x$ to help select a partition $P^* \in \{P\}$ that correlates with $x$,
from among the exponential number of plausible partitions
[Diagram: model Pr(G | θ) → generation → data G = (V, E); data → inference → model]
a metadata-aware stochastic block model
generation
given metadata $x = \{x_u\}$ and degree $d = \{d_u\}$ for each node $u$
• each node $u$ is assigned a community $s$ with probability $\gamma_{sx}$, where $x = x_u$
• thus, prior on community assignments is $P(s \mid \Gamma, x) = \prod_i \gamma_{s_i, x_i}$
• given assignments, place edges independently, each with probability:
$p_{uv} = d_u d_v \theta_{s_u, s_v}$
• where the $\theta_{st}$ are the stochastic block matrix parameters
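The generation steps above can be sketched as follows. This is a minimal illustration, assuming `gamma[s][x]` holds the probability of community s given metadata value x (each column summing to 1) and that d_u d_v θ is small enough to be a valid probability; the sketch clips it at 1 as a safeguard, which is not part of the model as stated. The function name `sample_network` and all parameter values are hypothetical.

```python
import random


def sample_network(x, d, gamma, theta, rng):
    """Draw community assignments and edges from the metadata-aware model.

    x:     metadata value (an integer index) for each node
    d:     degree parameter for each node
    gamma: gamma[s][x] = probability of community s given metadata x
    theta: theta[s][t] = block matrix parameter for communities s, t
    """
    n, k = len(x), len(gamma)
    # Step 1: assign each node u a community s with probability gamma[s][x_u].
    s = [rng.choices(range(k), weights=[gamma[r][x[u]] for r in range(k)])[0]
         for u in range(n)]
    # Step 2: place each edge independently with probability d_u d_v theta_{s_u s_v}.
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p_uv = min(1.0, d[u] * d[v] * theta[s[u]][s[v]])  # clipped (assumption)
            if rng.random() < p_uv:
                edges.append((u, v))
    return s, edges


# Hypothetical example: 6 nodes, 2 metadata values, 2 communities.
x = [0, 0, 0, 1, 1, 1]
d = [1.0] * 6
gamma = [[0.9, 0.2],   # P(s = 0 | x = 0), P(s = 0 | x = 1)
         [0.1, 0.8]]   # P(s = 1 | x = 0), P(s = 1 | x = 1)
theta = [[0.9, 0.05],
         [0.05, 0.9]]
s, edges = sample_network(x, d, gamma, theta, random.Random(0))
print(s, edges)
```

When gamma is close to the identity, metadata almost determines community; when its columns are equal, metadata carries no information about s.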
inference
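given observed network $A$ (adjacency matrix)
• the model likelihood is
$P(A \mid \Theta, \Gamma, x) = \sum_s P(A \mid \Theta, s)\, P(s \mid \Gamma, x) = \sum_s \prod_{u<v} p_{uv}^{A_{uv}} (1 - p_{uv})^{1 - A_{uv}} \prod_u \gamma_{s_u, x_u}$
(the factor $P(A \mid \Theta, s)$ is the network term; the factor $P(s \mid \Gamma, x)$ is the metadata term)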
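For very small networks the likelihood can be evaluated by brute force, which helps to see how the network term and the metadata term combine. The sketch below enumerates every assignment s, so it is only usable for a handful of nodes and is not a practical inference method. It assumes all γ entries are positive and all p_uv lie strictly between 0 and 1; function names and parameter values are hypothetical.

```python
from itertools import product
from math import exp, log


def joint_log_prob(A, s, x, d, gamma, theta):
    """log of P(A | Theta, s) * P(s | Gamma, x) for one assignment s."""
    n = len(A)
    total = 0.0
    for u in range(n):
        total += log(gamma[s[u]][x[u]])                  # metadata term
        for v in range(u + 1, n):
            p_uv = d[u] * d[v] * theta[s[u]][s[v]]
            total += log(p_uv) if A[u][v] else log(1.0 - p_uv)  # network term
    return total


def likelihood(A, x, d, gamma, theta):
    """P(A | Theta, Gamma, x), summing over all k**n assignments (tiny n only)."""
    n, k = len(A), len(gamma)
    return sum(exp(joint_log_prob(A, s, x, d, gamma, theta))
               for s in product(range(k), repeat=n))


# Hypothetical 3-node example.
x = [0, 1, 0]
d = [1.0, 2.0, 1.0]
gamma = [[0.8, 0.3],
         [0.2, 0.7]]
theta = [[0.3, 0.05],
         [0.05, 0.3]]
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
print(likelihood(A, x, d, gamma, theta))
```

Because the columns of gamma sum to 1 and edges are independent given s, the values of `likelihood` over all possible adjacency matrices sum to 1.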
networks with planted structure
[Figure: fraction of correctly assigned nodes (0.5 to 1) vs. c_in − c_out (0 to 15), from weaker to stronger structure, with the undetectable region marked; curves keyed 0.5, 0.6, 0.7, 0.8, 0.9]
[1] n = 10 000
… metadata, or
• metadata alone.
real-world networks
[1] Add Health network data, designed by Udry, Bearman & Harris
real-world networks
[Figure: probability of community membership (0 to 1) vs. mean body mass (g), 10^-12 to 10^9, for communities 1, 2, 3; node types: detritivore, carnivore, omnivore, herbivore, primary producer]
FIG. S4: Learned priors, as a function of body mass, for the three-community division of the Weddell Sea network shown …
[1] here, we're using a continuous metadata model
[2] Brose et al. (2005)
real-world networks
HVR6
[Figure: panels labeled C ≈ M and C ≠ M; horizontal axis: Year (None, 2000 to 2009)]
• NMI = 0.668
$\mathrm{NMI} = \dfrac{I(s; x)}{\min[H(s), H(x)]}$ (B4)
"… the number of metadata values, …"
"… normalized mutual information lies … that it has a symmetric defini- …"
[1] Traud, Mucha & Porter (2012)
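The normalized mutual information of Eq. (B4), I(s; x) / min[H(s), H(x)], can be computed directly from two label lists. The sketch below is a minimal illustration using empirical frequencies and natural logarithms. The function names are hypothetical, and returning 0 when either entropy is zero is a convention adopted here, not something stated on the slide.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Empirical entropy H of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(s, x):
    """Empirical mutual information I(s; x) between two label lists."""
    n = len(s)
    joint, p_s, p_x = Counter(zip(s, x)), Counter(s), Counter(x)
    return sum((c / n) * log(c * n / (p_s[a] * p_x[b]))
               for (a, b), c in joint.items())


def nmi_min(s, x):
    """I(s; x) / min[H(s), H(x)]; returns 0.0 if either entropy is zero (assumption)."""
    denom = min(entropy(s), entropy(x))
    return mutual_information(s, x) / denom if denom > 0 else 0.0


communities = [0, 0, 1, 1]
print(nmi_min(communities, ["a", "a", "b", "b"]))  # metadata matches communities
print(nmi_min(communities, ["a", "b", "a", "b"]))  # metadata independent of communities
```

With the minimum in the denominator, the measure reaches 1 whenever one labeling is fully determined by the other, even if the two have different numbers of values.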