Analysis of Large Graphs: Overlapping Communities: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman

You might also like

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 39

Note to other teachers and users of these slides: We would be delighted if you found this our

material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org

Analysis of Large Graphs:


Overlapping Communities
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
Identifying Communities
Can we identify
node groups?
(communities,
modules, clusters)

Nodes: Football Teams


Edges: Games played
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 2
NCAA Football Network

NCAA conferences

Nodes: Football Teams


Edges: Games played
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 3
Protein-Protein Interactions
Can we identify
functional modules?

Nodes: Proteins
Edges: Physical interactions
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 4
Protein-Protein Interactions

Functional modules

Nodes: Proteins
Edges: Physical interactions
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 5
Facebook Network

Can we identify
social communities?

Nodes: Facebook Users


Edges: Friendships
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 6
Facebook Network
Social communities

High school Summer


internship

Stanford (Basketball)
Stanford (Squash)

Nodes: Facebook Users


Edges: Friendships
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 7
Overlapping Communities
 Non-overlapping vs. overlapping communities

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 8


Non-overlapping Communities

Nodes

Nodes Adjacency matrix


Network

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 9


Communities as Tiles!
 What is the structure of community overlaps:
Edge density in the overlaps is higher!

Communities as “tiles”
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 10
Recap so far…

Communities This is what we want!


in a network

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 11


Plan of attack
 1) Given a model, we generate the network:

B
Generative A
A
F

model for D E

networks C
G

 2) Given a network, find the “best” model

B
F
A
D E
Generative
C
G model for
H
networks

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 12


Model of networks
 Goal: Define a model that can generate
networks
 The model will have a set of “parameters” that we
will later want to estimate (and detect communities)

B
Generative A
F
F

model for D
D E

networks C
C
G

 Q: Given a set of nodes, how do communities


“generate” edges of the network?
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 13
Community-Affiliation Graph
Communities, C pA pB
Model
Memberships, M

Nodes, V

Model Network
 Generative model B(V, C, M, {pc}) for graphs:
 Nodes V, Communities C, Memberships M
 Each community c has a single probability pc

 Later we fit the model to networks to detect


communities
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 14
AGM: Generative Process
Communities, C pA pB
Model
Memberships, M

Nodes, V
Community Affiliations Network
 AGM
  generates the links: For each
 For each pair of nodes in community ,
we connect them with prob.
 The overall edge probability is:
P(u , v)  1   (1  p )
cM u  M v
c
  … set of communities
  If share no communities: node belongs to

Think of this as an “OR” function: If at least 1 community says “YES” we create an edge
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 15
Recap: AGM networks

Model

Network
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 16
AGM: Flexibility
 AGM can express a
variety of community
structures:
Non-overlapping,
Overlapping, Nested

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17


How do we detect
communities with AGM?
Detecting Communities
 Detecting communities with AGM:

B
F
A
D E
G
C

 
Given a Graph , find the Model
1) Affiliation graph M
2) Number of communities C
3) Parameters pc
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 19
Maximum Likelihood Estimation
  Maximum Likelihood Principle (MLE):
 Given: Data
 Assumption: Data is generated by some model
 … model
 … model parameters
 Want to estimate :
 The probability that our model (with parameters )
generated the data
 Now let’s find the most likely model that could have
generated the data:

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20


Example: MLE

 Imagine
  we are given a set of coin flips
 Task: Figure out the bias of a coin!
 Data: Sequence of coin flips:
 Model: return 1 with prob. else return 0
 What is ? Assuming coin flips are independent
 So,
 What is ? Simple,
 Then,
  ∗ =𝟑/𝟖
𝚯
 For example:

  𝒇 ( 𝑿|𝚯 )
𝑷
 What did we learn? Our data was most
likely generated by coin with bias
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 𝚯
  21
MLE for Graphs
  How do we do MLE for graphs?
 Model generates a probabilistic adjacency matrix
 We then flip all the entries of the probabilistic
matrix to obtain the binary adjacency matrix
  every pair
For Flip
biased
 
𝑨
of nodes AGM 0 0.10 0.10 0.04 0 1 0 0
gives the prob. coins
0.10 0 0.02 0.06 1 0 1 1
of them being 0.10 0.02 0 0.06 0 1 0 1
linked
0.04 0.06 0.06 0 0 1 1 0

 The likelihood of AGM generating


P(G | )   P(u , v)  (1 graph
P(u , vG:
))
( u , v )E ( u ,v )E
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 22
Graphs: Likelihood P(G|Θ)
 Given graph G(V,E) and Θ, we calculate
likelihood that Θ generated G: P(G|Θ) G

0 0.9 0.9 0 0 1 1 0
A B
0.9 0 0.9 0 1 0 1 0
0.9 0.9 0 0.9 1 1 0 1
0 0 0.9 0 0 0 1 0
Θ=B(V, C, M, {pc})
G
P(G|Θ)
P (G | )   P (u, v)  (1  P (u , v))
( u ,v )E ( u ,v )E
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 23
MLE for Graphs
  Our goal: Find such that:

P( 𝑮 |  )
 
AGM
arg max

 How do we find that maximizes the
likelihood?

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 24


MLE for AGM
  Our goal is to find such that:

 Problem: Finding B means finding the


bipartite affiliation network.
 There is no nice way to do this.
 Fitting is too hard,
let’s change the model (so it is easier to fit)!

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 25


From AGM to BigCLAM
  Relaxation: Memberships have strengths

  𝑭𝒖𝑨

u v

 The membership strength of node


to community (: no membership)
 Each community links nodes independently:

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 26


 Factor Matrix
  Community membership strength matrix
Communities  
Nodes

 Probability of connection is
proportional to the product of
strengths
 
𝑭=¿  Notice: If one node doesn’t belong to
the community () then
 Prob. that at least one common
j
community links the nodes:
  … vector of

  … community
strength of ’s membership
membership to strengths of
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 27
From AGM to BigCLAM
 Community
  links nodes independently:

 Then prob. at least one common links them:

 Example matrix:
𝑭 :
  ❑
𝒖 0 1.2 0 0.2   Then:
𝑭 :
  ❑ 0.5 0 0 0.8
And:
𝒗
But:
 𝑭❑ :
𝒘 0 1.8 1 0
Node community
membership strengths
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 28
BigCLAM: How to find F

Task:
  Given a network , estimate
 Find that maximizes the likelihood:

 where:
 Many times we take the logarithm of the likelihood,
and call it log-likelihood:
 Goal: Find that maximizes :

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 29


BigCLAM: V1.0

 Compute
  gradient of a single row of :

 Coordinate gradient ascent:  .. Set out


outgoing neighbors
 Iterate over the rows of :
 Compute gradient of row (while keeping others fixed)
 Update the row :
 Project back to a non-negative vector: If :
 This is slow! Computing takes linear time!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 30
BigCLAM: V2.0
  However, we notice:

 We cache
 So, computing now takes linear time
in the degree of
 In networks degree of a node is much smaller to the total
number of nodes in the network, so this is a significant
speedup!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 31
BigClam: Scalability

 BigCLAM takes 5 minutes for 300k node nets


 Other methods take 10 days
 Can process networks with 100M edges!
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 32
Extension:
Directed memberships
Extension: Beyond Clusters

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 34


Extension: Directed AGM
 Extension:
Make community membership edges directed!
 Outgoing membership: Nodes “sends” edges
 Incoming membership: Node “receives” edges

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 35


Example: Model and Network

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 36


Directed AGM
  Everything is almost the same except now
we have 2 matrices: and
 … out-going community memberships  𝑭  𝑯
 … in-coming community memberships

 Edge prob.:

 Network log-likelihood:

which we optimize the same way as before


J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 37
Predator-prey Communities

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 38


More details at…
 Overlapping Community Detection at Scale: A Nonnegative Matrix Factorizatio
n Approach
 by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data
Mining (WSDM), 2013.

 Detecting
Cohesive and 2-mode Communities in Directed and Undirected Networks by J.
Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search
and Data Mining (WSDM), 2014.

 Community Detection in Networks with Node Attributes by J. Yang, J. McAuley,


J. Leskovec. IEEE International Conference On Data Mining (ICDM), 2013.

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 39

You might also like