
YACDA: Yet Another Community Detection Algorithm

Samarth Shah
Computer Science Department
University at Albany, SUNY
sshah4@albany.edu

Jeel Patel
Computer Science Department
University at Albany, SUNY
jjpatel@albany.edu

ABSTRACT
Finding communities in static networks has applications in many
useful areas, such as identifying functional units in a biological
network, sets of tightly-bound people interacting on social networks,
or employees of a company communicating through an e-mail network.
But if we also consider the exact times at which such interactions
occur, we gain invaluable information. Including time in community
detection can bring out important aspects of a network and surface
hidden communities that may be difficult to identify otherwise. So
we consider timed interactions. For instance, in a student-teacher
network, relatively more e-mails are sent between the students and a
teacher around office hours or curricular milestones. We aim to find
such communities, which have more interactions within short time
spans. We propose an algorithm called YACDA to identify the top
clusters in such networks, and we compare our algorithm with
TopGC [15] and Greedy-Binary [1].

Keywords
Community; LSH; Network Mining; Clustering; Interactive
Network; Conductance; Dynamic Graphs

1. INTRODUCTION
It is common practice to portray large social networks as graphs.
We are interested in interactive networks, which include the time
dimension. Such graphs can often be divided into a number of dense
subgraphs, called communities. The classic definition of a community
is a group of nodes with more interactions amongst its members than
between its members and the rest of the network. The terms community,
cluster, and module are used interchangeably in this paper. It is
generally agreed that a subset S of vertices forms a good cluster if
the induced subgraph is dense but there are relatively few connections
from the included vertices to vertices in the rest of the graph [4,5,6].
There are various metrics with which one can define the goodness of a
community, i.e., how good or tight the community intuitively is. These
metrics are objective functions that quantify the internal connectivity
of a community as opposed to its external connectivity. Many objective
functions exist, each assessing the quality of a community differently,
and the right choice differs from application to application. Clustering
similar nodes of large networks has been shown to be of great importance
[4,5,9]: it can detect functional units in biological networks,
organizational units in social networks, etc. But if one ignores the
interaction dynamics, some communities remain hidden and tough to
discover. For example, a group of Twitter users interested in new
technology products may not know each other very well but follow each
other for regular updates; their interactions become very intense when
a new product is released.

So we can identify such community members only if we consider the
times of their interactions: the edges (interactions) occur more often,
and within relatively short intervals, during that particular time
span. We assume that we know the time-stamp of every interaction.
Examples of such interaction networks include social networks, e-mail
communication networks, call graphs in telecommunication networks,
et cetera.

2. RELATED WORK
Detecting clusters in interactive social networks is a non-trivial
task. Whenever exact computation is intractable, heuristic and
approximation methods are applied. Approximation algorithms are a
pragmatic approach to solving NP-complete or NP-hard problems. Since
optimizing the objective function is NP-hard [7], one typically
applies approximation algorithms to optimize that particular objective
function; this can still yield communities with good quality scores.
In a static graph, the community detection problem is similar to the
densest-subgraph problem, which is polynomial-time solvable [14]. But
Rozenshtein et al. [1] prove that finding the densest subgraph in a
dynamic network is NP-hard. The objective function they use is
density, i.e., they find a community that maximizes density. They
split the problem of finding communities in time into two subproblems:
(a) finding a time-interval set that maximizes density, given a set of
nodes, and (b) finding a good set of nodes that maximizes density,
given a set of time intervals. They place budget constraints k (number
of time intervals) and B (total time span) on the time-interval set.
They use the linear programming method proposed by Charikar [14] to
solve the densest-subgraph problem (b), and they solve (a) greedily
with their Greedy-Binary method. This divide-and-conquer approach has
a shortcoming: it induces the overhead of searching for the community
in time and over sets of nodes iteratively.
Leskovec et al. [4] use the Network Community Profile (NCP) [2] to
compare different objective functions. An NCP plots cluster quality
against cluster size. They compare 12 quality metrics, including
Conductance, Expansion [10], Internal Density [10], Cut Ratio [11],
etc. (but not density), and show that these exhibit qualitatively
similar behavior and give their best scores to similar clusters. In
fact, they show that in large networks, small communities have good
conductance scores. Since we look at large interactive networks in
our project, we use conductance, a tried and tested bi-criterion
function (it depends on internal as well as external edges) [4].
Informally, conductance is the ratio of the number of edges going
from members of the cluster to non-members, to the total number of
edges incident on the cluster's members [4].
We use the algorithm of Rozenshtein et al. [1] and run experiments on
a Facebook dataset; but instead of measuring density, we obtain the
conductance of the best community in time. We also show that
conductance behaves similarly to density.
Whether we need all the clusters in a given graph differs from
application to application. In the static setting, most clustering
algorithms find all possible clusters, even when some of them are not
useful; sometimes loosely connected nodes also form a cluster, which
is of little importance in most applications. Macropol et al. [15]
propose TopGC to probabilistically find only the top well-connected
clusters in a graph. Macropol's algorithm is inspired by Locality
Sensitive Hashing (LSH) [12] and extends the concept to graphs. The
logic rests on the assumption that nodes with similar neighborhoods
in a graph should cluster together.
One way to measure similarity between nodes is the distance-matrix
method, using distance measures between two nodes such as Euclidean
or Hamming distance. But this defines a relationship based on distance
rather than on similarity. Perhaps the most straightforward method of
determining whether two nodes are similar is to study their
neighborhood sets. Macropol et al. [15] find top communities in a
static graph; we use a similar algorithm to find communities, but
now in time.

3. PROPOSED WORK
Concepts similar to [15] are used in this paper, except that they are
tweaked to include timestamps. Prior to that, we revise some terms and
modify them as we go. The Jaccard index is a way to measure the
similarity of two nodes by comparing their neighborhood sets. Formally,
for two nodes pi and pj, the Jaccard index is defined as

J(pi, pj) = |Ni ∩ Nj| / |Ni ∪ Nj|,

whose range is between 0 and 1, where Ni and Nj are the neighborhood
sets of vertices pi and pj (each including the vertex itself). But
computing the Jaccard similarity for a million interacting nodes would
require on the order of n^2 pairwise neighborhood comparisons, which
is infeasible. Hence we need something more scalable that runs
efficiently on large graphs while giving results close to the Jaccard
similarity. So we use signatures of length l instead.
The first step is to generate m random permutations π1, π2, …, πm of
the nodes. Each node goes through these m permutations to generate m
minhash values mh1, …, mhm per node. The value of mhi for a vertex pj
is the element of its neighborhood Nj with the lowest ordering index
in πi. So at the end of this step, a node has m minhash values, and
each minhash value is one node from its neighborhood. It can be shown
that the probability that two nodes pi and pj agree on their value of
some mhk equals their Jaccard index [15].
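The following is a minimal Python sketch of this step, assuming
neighborhoods are given as a dict of sets and using hash functions of
the form (ax + b) mod n as stand-ins for true random permutations (an
assumption; any family of random orderings would do):

    import random

    def minhash_signatures(neighborhoods, num_nodes, m, seed=0):
        # neighborhoods: dict node -> set of neighbors (node included).
        rng = random.Random(seed)
        # One (a, b) pair per permutation pi_1 ... pi_m.
        coeffs = [(rng.randrange(1, num_nodes), rng.randrange(num_nodes))
                  for _ in range(m)]
        signatures = {}
        for node, nbrs in neighborhoods.items():
            # mh_i: the neighbor with the lowest index under permutation i.
            signatures[node] = [min(nbrs, key=lambda x: (a * x + b) % num_nodes)
                                for a, b in coeffs]
        return signatures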
We define an interaction network as G = (V, E), which consists of a
set of n nodes V and m time-stamped interactions E between the
vertices in V:

E = {(ui, vi, ti)}, i = 1, …, m, such that ui, vi ∈ V and ti ∈ R.

We define the neighborhood Ni of node i as

Ni = {j | dist(i, j) ≤ k} ∪ {i},

where k bounds the distance from node i to its neighborhood elements;
k = 1 means the immediate neighborhood of node i. Note that the
neighborhood of any node contains the node itself. Since we deal with
interactive networks, we also consider the neighborhood of a node in
time. We call such a neighborhood set a temporal neighborhood.
We define the temporal neighborhood of node i as

NiT = {j | distT(i, j) ≤ k},    (1)

where distT is the hop distance in the time-expanded graph described
below, in which copies of the same node at consecutive time steps are
connected. If we set k as 2, equation (1) finds the neighborhood of
node i at times t, t+1, and t-1. Since this is an initial idea, we are
not sure how much farther forward or backward in time to go than t,
so we assume that t-1 and t+1 give a fair picture of the temporal
neighborhood. For example, t might be one day, and t-1 and t+1 would
then represent the day before and after, respectively. So at any time
t, the neighborhood of node i contains its neighbors at the current
time, its neighbors at time t-1, and its neighbors at time t+1.
To realize equation (1), we perform some preprocessing on the dataset;
the preprocessing algorithm is explained in detail in the Appendix.
We define a time-set T as

T = {ti | span(ti) = 1 day}.

Now, using the time-set T, we change the old IDs into new IDs that
take ti into account. Every node's ID is changed according to the
function

temporal_id(ti) = old_id + (number of nodes) * ti.    (2)

We explain this formula with an example. Consider Fig 1. For
exposition, suppose a node set N consists of IDs {0, 1, 2, 3, 4} at
times T = {0, 1, 2}; these IDs could represent user IDs on Facebook.
We change these IDs into temporal IDs using (2). The new temporal_ids
are:
At time t = 0: {0, 1, 2, 3, 4}
At time t = 1: {5, 6, 7, 8, 9}
At time t = 2: {10, 11, 12, 13, 14}
That is, node 0 at time t = 0 is represented by nodes 5 and 10 at
times t = 1 and t = 2 respectively. An edge (0, 1) at time t = 1 is
represented as the edge (5, 6), since 5 and 6 represent 0 and 1
respectively at time t = 1. To express continuity in time, we add
temporal edges between copies of the same node at consecutive times;
so node 5 has an edge to nodes 0 and 10, and so on.
By default, we take k = 2. Using (1), the neighborhood set of node 6
at time t = 1 is
N6 = {0, 1, 2, 4, 5, 7, 8, 9, 10, 11, 12, 14}.
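A minimal sketch of this construction, using networkx and a
hypothetical list of time-stamped edges (the exact edge set of Fig 1
is not reproduced here):

    import networkx as nx

    def temporal_id(old_id, t, num_nodes):
        # Equation (2): shift each node ID by num_nodes per time step.
        return old_id + num_nodes * t

    def build_temporal_graph(edges, num_nodes, num_steps):
        # edges: iterable of (u, v, t) time-stamped interactions.
        G = nx.Graph()
        G.add_nodes_from(range(num_nodes * num_steps))
        for u, v, t in edges:
            G.add_edge(temporal_id(u, t, num_nodes),
                       temporal_id(v, t, num_nodes))
        # Temporal edges: link each node's copies at consecutive times.
        for u in range(num_nodes):
            for t in range(num_steps - 1):
                G.add_edge(temporal_id(u, t, num_nodes),
                           temporal_id(u, t + 1, num_nodes))
        return G

    def temporal_neighborhood(G, node, k=2):
        # Equation (1): k-hop neighborhood in the time-expanded graph.
        return set(nx.single_source_shortest_path_length(G, node, cutoff=k))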
The temporal neighborhood is a recursive function which stops when k
becomes 0. What this means is that we want to limit the spread of the
neighborhood of a node, because we don't want to include outliers or
far neighbors that are barely related to the node; we want a tight,
good community.
Interestingly, the spread of the temporal neighborhood of any node
makes a shape similar to a diamond (Fig 1). Visually, since we go 2
steps in the horizontal direction to the right and left of node 6 (as
k = 2), the neighborhood can spread out to the edge of the diamond
(only visually, not literally). The same holds for the upper and lower
corners, because k = 2. As for the remaining sides of the diamond,
since node 6 has to traverse through nodes 1 and 11 to jump to t-1 and
t+1 respectively in order to find neighbors, it has already paid the
price of the jump; with only one step left, it reaches only the
neighbors adjacent to 1 and 11.
After finding the neighborhood sets of all the nodes in G, we
initialize m random permutations π1, π2, …, πm of the form
(ax + b) mod |V|. The neighborhood of each node goes through each of
these m random permutations. After initializing the minhash table with
all the nodes as columns of size m, we enter the minhash values for
each node. The value of mhi for a vertex vj is the element of its
neighborhood Nj with the lowest ordering index in πi. Hence, after
entering the minhash values for all the nodes, each node has a minhash
signature of size m. Also note that each minhash value of a node is
always an element of its neighborhood set.
Typically, one may use 100 or more hash functions, making the minhash
signature of each node a large integer vector. Thus, we fix a size l
and create a signature matrix with each node as a column of size l.
To generate the signature of size l, we draw l random numbers
l1, l2, …, ll from 1 to m. This series of l random numbers is used as
indices into the minhash table, and the corresponding minhash values
of a node are entered in the signature matrix; the value mhli for a
node is the minhash entry at index li. Thus, the signature of each
node still consists of elements from its neighborhood. Since we now
have a signature of size l for each node, we hash this integer vector
of size l to buckets.
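A small sketch of this compression step; the shared list of random
indices (named index_choice here, an assumption) must be identical
for all nodes so that signatures remain comparable:

    import random

    def sample_indices(m, l, seed=1):
        # Draw l random positions l_1 ... l_l into the minhash table.
        rng = random.Random(seed)
        return [rng.randrange(m) for _ in range(l)]

    def compress_signature(minhash_sig, index_choice):
        # Keep only the minhash entries at the shared random positions.
        return tuple(minhash_sig[i] for i in index_choice)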
Rather than hashing the whole vector of size l at once (given that l
is still fairly large), we can make the similarity search more
accurate by using the banding method of [12]. We divide the integer
vector of length l into b bands, each containing r rows, such that
b*r = l. We then create b hash functions. For each band of r rows, we
hash the vector of size r; if two vectors hash to the same bucket,
their nodes are candidates for being included in the same community.
We do this for each of the b bands. If two vectors hash into one
bucket at least once, their nodes are considered similar. Since we
need similar signatures to collide, we keep the number of buckets
equal to the total number of nodes.
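A minimal sketch of the banding step, following [12]; Python's
built-in hash stands in for the b hash functions (an assumption), and
the band index keeps the bands' buckets separate:

    from collections import defaultdict

    def lsh_candidates(signatures, b, r):
        # signatures: dict node -> tuple of length l = b * r.
        buckets = defaultdict(set)
        for node, sig in signatures.items():
            for band in range(b):
                chunk = sig[band * r:(band + 1) * r]
                # Nodes whose band hashes collide are candidates
                # for the same community.
                buckets[(band, hash(chunk))].add(node)
        return [nodes for nodes in buckets.values() if len(nodes) > 1]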
Subsequently, after finding candidate sets in all the buckets, we try
to identify only those communities with really good conductance
scores.
The volume of a candidate set S ⊆ V of vertices is
vol(S) = Σu∈S du, the sum of the degrees du of all vertices u inside
S, and the volume of the entire graph is vol(V) = 2m, where m is the
number of edges in the graph.
The edge boundary of a set is defined as
∂(S) = {{x, y} ∈ E | x ∈ S, y ∈ V\S}.
As mentioned in [4], the conductance of a set S can be defined by


Φ(S) = |∂(S)| / min(vol(S), 2m − vol(S)).    (3)

The detailed description of the algorithm is given in the Appendix.

4. RESULTS AND FINDINGS


In this section, we run the baseline algorithm of Rozenshtein et al.
[1], which finds a community by iterating between the time-interval
set and the node set. Experiments were run on an Intel Core i5
processor with 6 GB of memory. The baseline algorithm has two budget
constraints, k and B, on the time-interval set. We fix k = 1 for all
experiments, since we only care about one time interval in the entire
timeline of the dataset. The experiments in [1] do not vary B. We
therefore vary B, i.e., the upper bound on the span of the time
interval, and observe how degree changes as a function of B. The
results are shown in Fig 2. As we can see, increasing B yields better
communities, but we can only increase B up to a certain level:
dropping the restriction on B defeats the entire purpose of finding
dynamic communities in a tight time span. We also modified the code
of [1] to save a copy of the underlying graph and of the community
graph found by its algorithm. The dataset we used is a subset of a
Facebook dataset covering 3 months of communication from 2006-05 to
2006-08, i.e., 104 days. The underlying graph consists of 5143 edges
and 4117 nodes; nodes represent users interacting with each other in
time.

Fig 2

The function we created to find the conductance of the discovered
community works in the following steps (see the sketch below):
(A): Create a graph copy G_S of the original graph G, and then remove
the nodes of the community set S (using the networkx package in
Python).
(B): Find the numbers of edges e1, e2, and e3 in the graphs G, S, and
G_S respectively.
(C): Number of boundary edges = e1 − e2 − e3.
(D): Conductance = ratio of boundary edges to the volume of the set S.
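A minimal sketch of these four steps, assuming G is a networkx graph
and S a collection of its nodes (note that, unlike equation (3), the
steps above use vol(S) directly rather than min(vol(S), 2m − vol(S))):

    import networkx as nx

    def community_conductance(G, S):
        # (A) Copy G and remove the community nodes to get G_S.
        G_S = G.copy()
        G_S.remove_nodes_from(S)
        # (B) Edge counts of G, the subgraph induced by S, and G_S.
        e1 = G.number_of_edges()
        e2 = G.subgraph(S).number_of_edges()
        e3 = G_S.number_of_edges()
        # (C) Boundary edges = total - internal to S - internal to rest.
        boundary = e1 - e2 - e3
        # (D) Conductance = boundary edges / volume of S.
        vol_S = sum(d for _, d in G.degree(S))
        return boundary / vol_S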
Then, we tune the value of B from 1 day to 10 days and plot how the
conductance changes for the same communities (Fig 4). Another
intuitive finding was that when B was set to 1 day, the number of
nodes in the community S was 18; when B was set to 5 days, the number
of nodes in the community increased to 38; and at 10 days, there were
61 nodes in the community. This is also visible in Fig 3, since the
algorithm seems to find better communities at B = 5 and 10 days.

Fig 3

Fig 4

For our proposed approach, we were able to perform the preprocessing
(Appendix). We converted the timestamps into a sequence of numbers,
each representing a span of 1 day.

5. KEY RISKS/UNKNOWNS


We do not yet know how to fix a budget on the time span as in [1].
Some constraint on the span is required to get solid communities, and
how far forward and backward to go in time may differ across
applications. The time and memory complexity of the algorithm is
unknown: we need to store the neighborhood sets of all the nodes along
with the m random orderings, which might take up a gigantic slice of
memory. Also, the depth of the neighborhood (k) in our experiments is
2; we are not sure what kind of communities we can fetch with higher
k, as it is yet to be experimented with.

6. ACKNOWLEDGMENTS
We thank Rozenshtein et al. [1] for providing us with the code to
evaluate the baselines, and Dr. Petko Bogdanov for directing us to
the idea of using LSH in our algorithm.

7. REFERENCES
[1] Rozenshtein, Polina, Nikolaj Tatti, and Aristides Gionis.
"Discovering dynamic communities in interaction networks." Machine
Learning and Knowledge Discovery in Databases. Springer Berlin
Heidelberg, 2014. 678-693.
[2] Network Community Profile. http://snap.stanford.edu/ncp/
[3] Enron data set. http://www.cs.cmu.edu/~./enron/
[4] Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. "Empirical
comparison of algorithms for network community detection."
Proceedings of the 19th International Conference on World Wide Web.
ACM, 2010; Schaeffer, Satu Elisa. "Graph clustering." Computer
Science Review 1.1 (2007): 27-64.
[5] Šíma, Jiří, and Satu Elisa Schaeffer. "On the NP-completeness of
some graph cluster measures." SOFSEM 2006: Theory and Practice of
Computer Science. Springer Berlin Heidelberg, 2006. 530-537.
[6] Leighton, Tom, and Satish Rao. "An approximate max-flow min-cut
theorem for uniform multicommodity flow problems with applications
to approximation algorithms." Foundations of Computer Science, 1988,
29th Annual Symposium on. IEEE, 1988.
[7] Karypis, George, and Vipin Kumar. "A fast and high quality
multilevel scheme for partitioning irregular graphs." SIAM Journal
on Scientific Computing 20.1 (1998): 359-392.
[8] Andersen, Reid, Fan Chung, and Kevin Lang. "Local graph
partitioning using pagerank vectors." Foundations of Computer
Science, 2006. FOCS'06. 47th Annual IEEE Symposium on. IEEE, 2006.
[9] Radicchi, Filippo, et al. "Defining and identifying communities
in networks." Proceedings of the National Academy of Sciences of the
United States of America 101.9 (2004): 2658-2663.
[10] Diesner, Jana, Terrill L. Frantz, and Kathleen M. Carley.
"Communication networks from the Enron email corpus 'It's always
about the people. Enron is no different'." Computational &
Mathematical Organization Theory 11.3 (2005): 201-228.
[11] Gleich, David F., and C. Seshadhri. "Vertex neighborhoods, low
conductance cuts, and good seeds for local community methods."
Proceedings of the 18th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. ACM, 2012.
[12] Rajaraman, Anand, and Jeffrey David Ullman. Mining of Massive
Datasets. Cambridge University Press, 2011.
[13] Bader, Gary D., and Christopher W.V. Hogue. "An automated method
for finding molecular complexes in large protein interaction
networks." BMC Bioinformatics 4.1 (2003): 2.
[14] Charikar, Moses. "Greedy approximation algorithms for finding
dense components in a graph." Approximation Algorithms for
Combinatorial Optimization. Springer Berlin Heidelberg, 2000. 84-95.
[15] Macropol, Kathy, and Ambuj Singh. "Scalable discovery of best
clusters on large graphs." Proceedings of the VLDB Endowment 3.1-2
(2010): 693-702.

APPENDIX
Preprocessing is a very important step in our algorithm. After
finding all the edges and their corresponding timestamps, we convert
all the node IDs into new temporal_ids. The new temporal_id, by
definition, removes the requirement of considering time at every
step. Hence, in the preprocessing step itself, we convert everything
into a set of timestamps T and a set of temporal nodes NT. After this
step, each node's neighborhood accounts for time, and if we are able
to perform some final experiments, then hashing these neighborhood
sets should give us communities in time similar to the Greedy-Binary
algorithm of [1].

Preprocessing for YACDA


1. Format the data as item = {yy-mm-dd hh-mm-ss id1,id2}.
   // Each record must have this format.
2. // Initialize the time-set T
   Strip mm-dd from the items and compute the total number of days
   in the set as total_days, so that T = [0, 1, 2, …, total_days-1].
3. Initialize edgeTS.
   Strip id1 and id2 from the item set of step 1.
   // id1 and id2 are the original node IDs.
   // After extraction, edgeTS has the form:
   edgeTS = {(ti, [id1, id2])},
   where ti ∈ [0, total_days-1] and id1, id2 ∈ N.
4. // Collect all nodes having edges and sort them
   sorted_nodeSet = sorted(nodeSet)
   Also, n = |N| (total number of nodes in N).
5. Initialize the new-ID process.
   At time t = 0:
   temporal_id(t0) ∈ [0, 1, 2, …, j, …, n-1], where j ∈ [0, n)
   temporal_id(ti) = j + n * ti
6. // Initialize nodeDictionary to keep records
   // (mapping between old ID and new ID)
   key = sorted_nodeSet
   value = temporal_idSet
7. // Update edgeTS with the new temporal_ids
   for edge in edgeTS:
       replace (id1, id2) with the corresponding
       (temporal_id1, temporal_id2) using nodeDictionary
   end for
   for i in range(len(edgeTS)):
       edgeSet.add(edgeTS[i][1])
   end for
   // This replaces old IDs with new IDs in edgeTS.
8. // Add temporal edges
   // (edges between the same node at different times)
   for node in NT:
       for i = 0 to total_days-2:
           edgeSet.add((temporal_node(ti), temporal_node(ti+1)))
       end for
   end for
------------
After this preprocessing, we have:
N = original node set
NT = temporal node set
T = [0, 1, …, total_days-1], where ti ∈ T
new_n = |NT| = n * total_days  // new total number of nodes
-------------
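A compact Python sketch of steps 1-8 above, assuming input lines of
the form "yy-mm-dd hh-mm-ss id1,id2":

    from datetime import datetime

    def preprocess(lines):
        items = []
        for line in lines:
            stamp, ids = line.rsplit(" ", 1)              # step 1
            t = datetime.strptime(stamp, "%y-%m-%d %H-%M-%S")
            u, v = (int(x) for x in ids.split(","))
            items.append((t, u, v))
        day0 = min(t for t, _, _ in items).date()
        # Step 2: map timestamps to day indices T = [0, total_days-1].
        edge_ts = [((t.date() - day0).days, u, v) for t, u, v in items]
        total_days = max(ti for ti, _, _ in edge_ts) + 1
        # Steps 3-4: collect and sort the original node IDs.
        nodes = sorted({u for _, u, _ in edge_ts} |
                       {v for _, _, v in edge_ts})
        n = len(nodes)
        old_to_new = {old: i for i, old in enumerate(nodes)}
        # Steps 5-7: equation (2) and the remapped edge set.
        temporal = lambda old, ti: old_to_new[old] + n * ti
        edge_set = {(temporal(u, ti), temporal(v, ti))
                    for ti, u, v in edge_ts}
        # Step 8: temporal edges between copies of a node on
        # consecutive days.
        for old in nodes:
            for ti in range(total_days - 1):
                edge_set.add((temporal(old, ti), temporal(old, ti + 1)))
        return edge_set, n, total_days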

YACDA Algorithm
1. Begin:
2. for i = 1 to m:
       initialize mhi() with random a and b such that
       mhi(x) = (ax + b) mod |N|
3. Initialize s[i][j]  // signature matrix
4. // permutation-index matrix
   for i = 1 to new_n:
       for j = 1 to l:
           s[i][j] ← rand(m)  // a random number from 1 to m
5. Initialize hashtable H
   for i = 1 to new_n:
       for j = 1 to m:
           minhash[j] = mhj(Ni)
           // Ni is the temporal neighborhood of node i
       end for
   end for
   // minhash[j] of a node is always an element of its temporal
   // neighborhood.
6. Initialize Sig  // signature of a node
   for j = 1 to l:
       Sig ← Sig + minhash[s[i][j]]
       // append the entry at position s[i][j] in the minhash
       // table of node i
   end for
7. Store (n, Sig) in H  // store each node with its signature
8. Initialize set C
9. Initialize dictionary record{}  // to store conductance records
10. for bucket b in H:
        vol(b) = sum of the degrees of the nodes in b
        conductance(b) = (# edges on the boundary of b) / vol(b)
        record[b] = conductance(b)
11. sort(record.values())  // lowest conductance first
12. delete record[k+1:]  // keep the first k buckets only
13. add the k keys of record to C  // the top-k clusters
14. return C
15. End
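For concreteness, a sketch of steps 10-14 in Python, assuming G is a
networkx graph over the temporal nodes and buckets is the list of
candidate node sets produced by the LSH step:

    def top_k_communities(G, buckets, k):
        scored = []
        for nodes in buckets:
            vol = sum(d for _, d in G.degree(nodes))    # step 10: volume
            if vol == 0:
                continue
            internal = 2 * G.subgraph(nodes).number_of_edges()
            boundary = vol - internal                   # edges leaving the set
            scored.append((boundary / vol, frozenset(nodes)))
        scored.sort(key=lambda pair: pair[0])           # step 11: lowest first
        return [nodes for _, nodes in scored[:k]]       # steps 12-14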
