Analysis of Large Graphs: Overlapping Communities: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
NCAA conferences
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4
Protein-Protein Interactions
Functional modules
Nodes: Proteins
Edges: Physical interactions
Facebook Network
Can we identify social communities?
[Figure: Facebook network with example communities such as Stanford (Basketball) and Stanford (Squash)]
Communities as “tiles”
Recap so far…
[Figure: generative model for networks, illustrated on a graph with nodes A to H]
Generative model $B(V, C, M, \{p_c\})$ for graphs:
Nodes $V$, Communities $C$, Memberships $M$
Each community $c$ has a single probability $p_c$
[Figure: community affiliations (model) over the nodes $V$, and the resulting network]
AGM generates the links:
For each pair of nodes in community $A$, we connect them with prob. $p_A$
The overall edge probability is:
$$P(u, v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c)$$
$M_u$ … set of communities node $u$ belongs to
If $u$, $v$ share no communities: $P(u, v) = \varepsilon$
Think of this as an “OR” function: If at least 1 community says “YES” we create an edge
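The edge rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy memberships, the community probabilities, and the `epsilon` background probability (defaulting to 0 here) are hypothetical values, not taken from the slides.

```python
import itertools
import math
import random

def agm_edge_prob(u, v, memberships, p, epsilon=0.0):
    """P(u, v) = 1 - prod over shared communities c of (1 - p[c])."""
    shared = memberships[u] & memberships[v]
    if not shared:
        return epsilon  # assumed background probability for unrelated nodes
    return 1.0 - math.prod(1.0 - p[c] for c in shared)

def sample_agm(memberships, p, epsilon=0.0, seed=0):
    """Flip one biased coin per node pair and return the sampled edge set."""
    rng = random.Random(seed)
    edges = set()
    for u, v in itertools.combinations(sorted(memberships), 2):
        if rng.random() < agm_edge_prob(u, v, memberships, p, epsilon):
            edges.add((u, v))
    return edges

# Hypothetical toy instance: two overlapping communities X and Y.
memberships = {"a": {"X"}, "b": {"X", "Y"}, "c": {"X", "Y"}, "d": {"Y"}}
p = {"X": 0.5, "Y": 0.8}

print(agm_edge_prob("b", "c", memberships, p))  # shares X and Y
print(agm_edge_prob("a", "d", memberships, p))  # shares nothing
print(sample_agm(memberships, p))
```

Nodes b and c share both communities, so either one can "say yes"; nodes a and d share none and fall back to the background probability.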
Recap: AGM networks
[Figure: the AGM model and the network it generates]
AGM: Flexibility
AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested
Given a graph $G$, find the model $B(V, C, M, \{p_c\})$:
1) Affiliation graph $M$
2) Number of communities $C$
3) Parameters $p_c$
Maximum Likelihood Estimation
Maximum Likelihood Principle (MLE):
Given: Data $X$
Assumption: Data is generated by some model $f(\Theta)$
$f$ … model
$\Theta$ … model parameters
Want to estimate $P_f(X \mid \Theta)$:
The probability that our model $f$ (with parameters $\Theta$) generated the data
Now let’s find the most likely model that could have generated the data: $\arg\max_{\Theta} P_f(X \mid \Theta)$
[Figure: likelihood $P_f(X \mid \Theta)$ plotted against $\Theta$]
What did we learn? Our data was most likely generated by a coin with the bias $\Theta$ that maximizes $P_f(X \mid \Theta)$
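A minimal sketch of the coin example follows. The flip sequence is hypothetical, and the grid search is just one simple way to locate the maximizer; for a coin the closed-form answer is the fraction of heads, which the sketch uses as a cross-check.

```python
import math

def bernoulli_log_likelihood(flips, theta):
    """log P(X | theta) for independent coin flips (1 = heads, 0 = tails)."""
    heads = sum(flips)
    tails = len(flips) - heads
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def mle_grid(flips, steps=1000):
    """Pick the theta on a grid in (0, 1) with the highest log-likelihood."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda t: bernoulli_log_likelihood(flips, t))

flips = [1, 0, 1, 1, 0]  # hypothetical data: 3 heads, 2 tails
theta_hat = mle_grid(flips)
closed_form = sum(flips) / len(flips)
print(theta_hat, closed_form)
```

Maximizing the log-likelihood rather than the likelihood gives the same $\Theta$, because the logarithm is monotone.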
MLE for Graphs
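How do we do MLE for graphs?
Model generates a probabilistic adjacency matrix
We then flip all the entries of the probabilistic matrix to obtain the binary adjacency matrix $A$
For every pair of nodes AGM gives the prob. of them being linked; flipping biased coins then gives $A$:
$$\begin{pmatrix} 0 & 0.10 & 0.10 & 0.04 \\ 0.10 & 0 & 0.02 & 0.06 \\ 0.10 & 0.02 & 0 & 0.06 \\ 0.04 & 0.06 & 0.06 & 0 \end{pmatrix} \;\xrightarrow{\text{flip biased coins}}\; A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}$$
Given a graph $G$ and $\Theta = B(V, C, M, \{p_c\})$, we calculate the likelihood $P(G \mid \Theta)$ that $\Theta$ generated $G$
Example: edge probabilities given by $\Theta$ (left) and the adjacency matrix of $G$ (right):
$$\begin{pmatrix} 0 & 0.9 & 0.9 & 0 \\ 0.9 & 0 & 0.9 & 0 \\ 0.9 & 0.9 & 0 & 0.9 \\ 0 & 0 & 0.9 & 0 \end{pmatrix} \qquad \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$
$$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} (1 - P(u, v))$$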
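The likelihood of a graph can be evaluated directly from a probabilistic adjacency matrix and a binary one. The sketch below uses the 4-node example with 0.9 entries from this slide; the function name and the choice to work in log space are assumptions made for illustration.

```python
import math

def graph_log_likelihood(P, A):
    """log P(G | Theta): sum log P(u,v) over edges and log(1 - P(u,v)) over non-edges.

    P is the symmetric matrix of edge probabilities, A the binary adjacency
    matrix. Each unordered pair u < v is counted once.
    """
    n = len(A)
    total = 0.0
    for u in range(n):
        for v in range(u + 1, n):
            prob = P[u][v] if A[u][v] else 1.0 - P[u][v]
            if prob == 0.0:
                return float("-inf")  # the model rules this graph out
            total += math.log(prob)
    return total

P = [
    [0.0, 0.9, 0.9, 0.0],
    [0.9, 0.0, 0.9, 0.0],
    [0.9, 0.9, 0.0, 0.9],
    [0.0, 0.0, 0.9, 0.0],
]
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

log_lik = graph_log_likelihood(P, A)
print(log_lik, math.exp(log_lik))
```

Here the graph has four edges, each with probability 0.9, and the two absent pairs have probability 0 of being linked, so the likelihood is the product of four factors of 0.9.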
MLE for Graphs
Our goal: Find $\Theta = B(V, C, M, \{p_c\})$ such that:
$$\arg\max_{\Theta} P(G \mid \Theta)$$
How do we find $B(V, C, M, \{p_c\})$ that maximizes the likelihood?
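$F$ … community membership strength matrix (one row per node, one column per community)
$F_u$ … vector of community membership strengths of $u$
$F_{uA}$ … strength of $u$’s membership to community $A$ ($F_{uA} = 0$: no membership)
Each community $A$ links nodes $u$, $v$ independently: $P_A(u, v) = 1 - \exp(-F_{uA} \cdot F_{vA})$
Probability of connection is proportional to the product of strengths $F_{uA}$, $F_{vA}$
Notice: If one node doesn’t belong to the community ($F_{uA} = 0$) then $P_A(u, v) = 0$
Prob. that at least one common community $C$ links the nodes:
$$P(u, v) = 1 - \prod_{C} (1 - P_C(u, v))$$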
From AGM to BigCLAM
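Community $A$ links nodes $u$, $v$ independently:
$$P_A(u, v) = 1 - \exp(-F_{uA} \cdot F_{vA})$$
Then the prob. that at least one common community $C$ links them:
$$P(u, v) = 1 - \prod_{C} (1 - P_C(u, v)) = 1 - \exp\Big(-\sum_{C} F_{uC} \cdot F_{vC}\Big) = 1 - \exp(-F_u \cdot F_v^T)$$
Example matrix $F$ (node community membership strengths):
$F_u$: 0 1.2 0 0.2
$F_v$: 0.5 0 0 0.8
$F_w$: 0 1.8 1 0
Then: $F_u \cdot F_v^T = 0.16$
And: $P(u, v) = 1 - \exp(-0.16) \approx 0.15$
But: $P(u, w) \approx 0.88$ and $P(v, w) = 0$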
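The example rows can be checked numerically. This sketch assumes the BigCLAM edge probability $P(u, v) = 1 - \exp(-F_u \cdot F_v^T)$ and also evaluates the equivalent "at least one community links them" form; the function names are illustrative only.

```python
import math

# Example membership strengths from the text: one row per node,
# one column per community.
F = {
    "u": [0.0, 1.2, 0.0, 0.2],
    "v": [0.5, 0.0, 0.0, 0.8],
    "w": [0.0, 1.8, 1.0, 0.0],
}

def edge_prob(Fu, Fv):
    """P(u, v) = 1 - exp(-Fu . Fv)."""
    dot = sum(a * b for a, b in zip(Fu, Fv))
    return 1.0 - math.exp(-dot)

def edge_prob_per_community(Fu, Fv):
    """Same quantity as an OR over communities: 1 - prod_C (1 - P_C(u, v))."""
    miss = 1.0
    for a, b in zip(Fu, Fv):
        miss *= math.exp(-a * b)  # 1 - P_C(u, v) for this community
    return 1.0 - miss

for x, y in [("u", "v"), ("u", "w"), ("v", "w")]:
    print(x, y, round(edge_prob(F[x], F[y]), 2))
```

Nodes v and w have no community in which both strengths are positive, so their dot product is 0 and the model assigns them zero link probability, while u and w share a strong community and are linked with probability close to 0.88.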
BigCLAM: How to find F
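Task:
Given a network $G(V, E)$, estimate $F$
Find $F$ that maximizes the likelihood:
$$\arg\max_{F} \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} (1 - P(u, v))$$
where: $P(u, v) = 1 - \exp(-F_u \cdot F_v^T)$
Many times we take the logarithm of the likelihood, and call it log-likelihood: $l(F) = \log P(G \mid F)$
Goal: Find $F$ that maximizes $l(F)$:
$$l(F) = \sum_{(u,v) \in E} \log\big(1 - \exp(-F_u F_v^T)\big) - \sum_{(u,v) \notin E} F_u F_v^T$$
Compute gradient of a single row $F_u$ of $F$:
$$\nabla l(F_u) = \sum_{v \in N(u)} F_v \, \frac{\exp(-F_u F_v^T)}{1 - \exp(-F_u F_v^T)} - \sum_{v \notin N(u)} F_v$$
The second sum runs over all non-neighbors of $u$, so computing it directly takes time linear in the number of nodes. However:
$$\sum_{v \notin N(u)} F_v = \sum_{v} F_v - F_u - \sum_{v \in N(u)} F_v$$
We cache $\sum_{v} F_v$
So, computing $\sum_{v \notin N(u)} F_v$ now takes linear time in the degree $|N(u)|$ of $u$
In networks the degree of a node is much smaller than the total number of nodes in the network, so this is a significant speedup!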
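The caching trick can be sketched as follows, assuming the gradient of the log-likelihood for a row $F_u$ is $\sum_{v \in N(u)} F_v \exp(-F_u F_v^T) / (1 - \exp(-F_u F_v^T)) - \sum_{v \notin N(u)} F_v$. The toy matrix, the neighbor lists, and the `tiny` guard against a zero dot product are hypothetical choices for illustration, not part of the slides.

```python
import math

def row_gradient(u, F, neighbors, col_sum, tiny=1e-12):
    """Gradient of the log-likelihood with respect to row F[u].

    Uses the cached column sums `col_sum`, so the non-neighbor term costs
    time proportional to the degree of u rather than the number of nodes.
    """
    k = len(F[u])
    grad = [0.0] * k
    nbr_sum = [0.0] * k
    for v in neighbors[u]:
        # Clamp the dot product away from 0 to avoid dividing by zero.
        dot = max(sum(a * b for a, b in zip(F[u], F[v])), tiny)
        weight = math.exp(-dot) / -math.expm1(-dot)  # exp(-x) / (1 - exp(-x))
        for c in range(k):
            grad[c] += F[v][c] * weight
            nbr_sum[c] += F[v][c]
    for c in range(k):
        # Sum over non-neighbors = sum over all nodes - F_u - sum over neighbors.
        grad[c] -= col_sum[c] - F[u][c] - nbr_sum[c]
    return grad

# Hypothetical toy instance: 4 nodes on a path, 2 communities.
F = [[1.0, 0.2], [0.8, 0.1], [0.1, 0.9], [0.3, 1.1]]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
col_sum = [sum(row[c] for row in F) for c in range(2)]  # cached once

print(row_gradient(1, F, neighbors, col_sum))
```

In an iterative optimizer the cached sums would have to be refreshed whenever a row of F changes, for example by subtracting the old row and adding the new one.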
BigCLAM: Scalability
Edge prob.:
Network log-likelihood:
Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014.