Core Community Detection Algorithm based on Edge Removal Learning
M. C. Nait-Hamoud
Department of Computer Sciences, University Abou Bekr Belkaid, P.O. Box 230, Tlemcen, Algeria
University Larbi Tebessi, P.O. Box 12000, Tebessa, Algeria
mc_naithamoud@hotmail.com

F. Didi
Department of Computer Sciences, University Abou Bekr Belkaid, P.O. Box 230, Tlemcen, Algeria
fedouadidi@yahoo.fr

Y. Boualleg
Department of Mathematics and Science computing, University Larbi Tebessi, P.O. Box 12000, Tebessa, Algeria
yaakoub.boualleg@gmail.com

ABSTRACT

In this work, we contribute to solving the community detection problem by proposing an algorithm for the detection of disjoint communities’ cores in unweighted and undirected social graphs. The proposed algorithm is based on the removal of non-essential edges and of the isolated nodes this removal induces in networks (graphs). To this end, we have built a model for predicting edge removal using Weighted Support Vector Machines (WSVM) trained on real-life social network datasets. The training phase relies on a proposed heuristic to label edges and on discriminating features extracted from both the edges’ end-point nodes and the network structure characteristics. Our algorithm shows promising results for communities’ cores detection.

CCS CONCEPTS
• Information systems~Data mining

KEYWORDS
Communities’ cores detection, social networks, classification, edges removal.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. To copy otherwise, distribute, republish, or post, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MedPRAI-2016, November 22-23, 2016, Tebessa, Algeria
© 2016 ACM. ISBN 978-1-4503-4876-8/16/11…$15.00
DOI: http://dx.doi.org/10.1145/3038884.3038902

1. INTRODUCTION

The community detection problem has been extensively studied because of its many real-life applications, such as network evolution [1], recommendation systems [2], and expert detection [3]. It consists of partitioning a graph (network) into a set of regions that exhibit dense connections between their vertices (members) and sparse links with the rest of the graph (network) [4]. Formally, consider a graph $G = (V, E)$, where $V$ is a set of vertices and $E$ stands for the edges of the graph. Community detection seeks a partition of the graph such that each subset of nodes of $V$ that is tightly connected internally and sparsely connected to the rest of the graph forms a community.

Several community detection algorithms were designed to optimize objective functions based on quality metrics that assess how well the obtained communities comply with a given definition. In [5], Newman and Girvan defined the so-called modularity (see Eq. 1); the idea is to detect regions that enclose more links than expected at random, which establishes that there is a structure in the network. The author of [6] proposed a greedy algorithm that optimizes the modularity and detects communities. In [7], the authors defined a metric called WCC, for Weighted Clustering Coefficient. This metric is defined for a node relative to a community (see Eq. 2), for a community (see Eq. 3), and for a partition (see Eq. 4).

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(C_i, C_j) \qquad (1)$$

where $m$ represents the number of edges of the network, $A_{ij} = 1$ if node $i$ is linked to node $j$ and 0 otherwise, $k_i$ is the degree of node $i$, and $\delta(C_i, C_j) = 1$ if nodes $i$ and $j$ are in the same community and 0 otherwise.

$$WCC(x, S) = \begin{cases} \dfrac{t(x, S)}{t(x, V)} \cdot \dfrac{vt(x, V)}{|S \setminus \{x\}| + vt(x, V \setminus S)} & \text{if } t(x, V) \neq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $t(x, S)$ is the number of triangles that node $x$ closes in community $S$, $V$ stands for the set of nodes of the whole graph, and $vt(x, V)$ represents the number of nodes of $V$ that close at least one triangle with $x$.
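As an illustration of Eq. 1 and Eq. 2, the following minimal Python sketch evaluates the modularity of a partition and the node-level WCC on a toy graph. The adjacency representation (a dictionary of neighbor sets), the function names, and the toy graph are assumptions made for this example and do not come from the cited works.

```python
from itertools import combinations


def modularity(adj, community):
    """Eq. 1 for an undirected graph given as {node: set(neighbours)}."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # 2m = sum of degrees
    total = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:  # delta(C_i, C_j) = 1
                a_ij = 1 if j in adj[i] else 0
                total += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return total / two_m


def t(adj, x, nodes):
    """Triangles closed by x whose two other vertices both lie in `nodes`."""
    nbrs = [u for u in adj[x] if u in nodes]
    return sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])


def vt(adj, x, nodes):
    """Vertices of `nodes` that close at least one triangle with x."""
    return sum(1 for u in adj[x] if u in nodes and adj[x] & adj[u])


def wcc_node(adj, x, S):
    """Eq. 2: cohesion of node x with respect to community S (a set)."""
    V = set(adj)
    t_all = t(adj, x, V)
    if t_all == 0:
        return 0.0
    denom = len(S - {x}) + vt(adj, x, V - S)
    return t(adj, x, S) / t_all * vt(adj, x, V) / denom


# Hypothetical toy graph: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(adj, community)
w = wcc_node(adj, 2, {0, 1, 2})
```

On this toy graph, node 2 closes its only triangle inside its own community, so its WCC with respect to {0, 1, 2} is maximal, whereas it drops to 0 with respect to {2, 3}, which contains none of its triangles.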
$$WCC(S) = \frac{1}{|S|} \sum_{x \in S} WCC(x, S) \qquad (3)$$

$$WCC(P) = \frac{1}{|V|} \sum_{i=1}^{n} |S_i| \cdot WCC(S_i) \qquad (4)$$

where $|S_i|$ stands for the number of nodes of community $S_i$.

The WCC metric is based on the idea that members of a community should have more common neighbors; in other words, nodes of a community tend to close more triangles together. The authors of [8] proposed the SCD algorithm, based on WCC, to discover the best partition of a graph into disjoint communities (i.e., the partition with the maximal WCC). They compared their algorithm to the most relevant works on community detection and outperformed their results. Edge removal has also been investigated in the literature. In [9], Girvan and Newman proposed removing edges progressively from the network, using the notion of edge betweenness to find the edges that are most likely situated between communities. In [10], the authors proposed the removal of edges as a pre-processing phase for a specific algorithm, namely CNM [11], to enhance its performance; they defined a threshold based on the ratio of the common neighbors of the edge end-point nodes to the sum of their respective degrees. Given the end-point nodes $i$ and $j$ of an edge $e_{ij}$, the threshold is formally defined in Eq. 5, and the edge is removed if the computed value is below a given cut-off. Setting such a threshold is not obvious and depends on the characteristics of the network, i.e., networks with dense or sparse configurations may require different threshold settings.

$$Th = \frac{A_{ij} + \Gamma_{ij}}{deg_i + deg_j} \qquad (5)$$

where $\Gamma_{ij}$ stands for the number of common neighbors of nodes $i$ and $j$, $deg_i$ represents the degree of node $i$, and $A_{ij}$ indicates whether nodes $i$ and $j$ are neighbors ($A_{ij} = 1$) or not ($A_{ij} = 0$).

In [12], the authors investigated edge removal on synthetic graphs, using a linear weighted support vector machine and local edge features for the purpose of detecting communities. Instead of using score functions to rank edges, such as the fraction of possible triangles that contain the edge defined by Radicchi in [13], or the Jaccard similarity between the adjacency lists of the end-point nodes of a given edge as in [14], they proposed to learn a threshold based on selected features, namely the degrees of the edge end-point nodes, the number of triangles containing the edge, and the mean values of these features. They stated that their work achieves mixed performance on real-life networks.

In [15], the authors proposed the FCD algorithm, which exploits local information of graph nodes to detect communities. The key idea of this algorithm rests on two main assumptions: the first is that nodes tend to follow nodes with high degrees, and the second is that nodes with more common neighbors are more likely to belong to the same community. In terms of quality metrics, FCD outperforms several state-of-the-art algorithms, namely InfoMap [16], Walktrap [17], and GN [9]. To assess the performance of their algorithm, in addition to the Modularity and WCC metrics, the authors used conductance and internal density. The conductance is defined as:

$$conductance(S) = \frac{c_S}{2 m_S + c_S} \qquad (6)$$

where $c_S$ is the number of edges with one end-point in the community $S$ and the other one outside, and $m_S$ is the number of edges with both end-point nodes in $S$. Internal Density is formally given by:

$$InternalDensity(S) = \frac{m_S}{n_S (n_S - 1) / 2} \qquad (7)$$

where $n_S$ stands for the number of nodes of the set of vertices $S$.

In this work, we propose a new algorithm for the detection of communities’ cores. For this purpose, we have trained a WSVM classifier on real-life social network datasets to remove noisy edges and the resulting isolated nodes. The training phase is based on the threshold defined in [10]. Moreover, the best threshold is chosen with a proposed heuristic that considers the average clustering coefficient of networks.

The rest of this paper is organized as follows: in section 2, we introduce our contribution and discuss the proposed algorithm. Section 3 presents the experimental results including a discussion. Finally, section 4 concludes this paper.

2. CONTRIBUTION

The subjective definition of a community given in section 1 leads us to an important question: should we assign a community to each node in the graph, that is, to each member of a social network? Does a new or inactive social network member belong to a community simply because its followee or friend belongs to it? The purpose of the clustering problem is to assign each sample to a cluster or group. However, considering the common definition of a community, it is sensible not to confuse this problem with clustering. From the aforementioned remarks, it is more accurate to consider this problem as communities’ cores detection. Besides, nodes that are not involved in any triangle within the community do not contribute positively to the WCC metric of the community. Consequently, their removal
from the graph will enhance the results of the SCD algorithm and improve the modularity of the community. Additionally, nodes that form a linear structure connected to a dense region share this property and cannot be considered as a community according to our interpretation; their removal will improve results based on the WCC metric and enhance the modularity of the partition. In what follows, we refer to nodes that are candidates for removal as isolated nodes. Detecting communities’ cores brings an interesting level of granularity for social network analysis applications. Nodes and communities are respectively the lowest and highest levels of granularity considered in this domain. Core community detection offers an intermediary level, namely sub-communities, that could be exploited in network analysis applications such as investigating information diffusion and refining customer targeting in marketing campaigns.

The key idea of the designed core community detection algorithm is to reduce the graph representing the network structure by removing non-essential edges and the resulting isolated nodes. To this end, the edge removal problem is mapped to a pair-wise classification, where the positive class represents edges to be removed and the negative class those to be kept. The learning phase of the classification was conducted using Weighted Support Vector Machines (WSVM) with an RBF kernel function. Given a set of real-life social network datasets, a training dataset was built using the threshold defined in equation 5, considering only the neighborhood of nodes $i$ and $j$. Edges with a value below the threshold are labeled as positive and those above it as negative. The choice of the threshold is not obvious. It is based on the observation that the best threshold value is correlated with the average clustering coefficient of the reduced network, weighted by the ratio of the number of nodes of the reduced graph to that of the original graph. We therefore gradually increase the threshold value, seeking the first local maximum of this weighted average clustering coefficient. It is important to stop the process as soon as this first maximum is obtained, to preserve essential edges from removal. Afterwards, local information about the end-point nodes of each edge was collected to build a training dataset. The information collected for each edge, together with global information inherent to the characteristics of the social graph, constitutes the selected features. To this end, the degree of each end-point node, its clustering coefficient, the number of common neighbors of the end-point nodes, and the class of the edge (removed, maintained) were recorded, in addition to two measures related to the characteristics of the network, namely the average degree of the graph and its average clustering coefficient. The first two components $f_i^{(1)}$ and $f_j^{(2)}$ of the retained feature vector for a given edge $e_{ij}$ represent the ratio of each end-point node degree to the average degree of the network (see Eq. 8). The third feature $f_{ij}^{(3)}$ is the ratio of the number of common neighbors of the edge end-point nodes to the approximated mean number of connected neighbors of the whole network (see Eq. 9). Finally, the last two features $f_i^{(4)}$ and $f_j^{(5)}$ are the ratios of each edge end-point node clustering coefficient to the network average clustering coefficient (see Eq. 10).

$$f_i^{(1)} = \frac{deg_i}{\bar{n}} \qquad (8)$$

$$f_{ij}^{(3)} = \frac{\Gamma_{ij}}{\dfrac{\bar{n}(\bar{n} - 1)}{2} \cdot ACC} \qquad (9)$$

$$f_i^{(4)} = \frac{CC_i}{ACC} \qquad (10)$$

In the above equations, $deg_i$ represents the degree of an end-point node $i$ of a given edge $e_{ij}$, $\bar{n}$ refers to the average degree of the network (graph), $\Gamma_{ij}$ stands for the number of common neighbors of the edge end-point nodes, $CC_i$ is the clustering coefficient of an end-point node, and $ACC$ is the average clustering coefficient of the network.

Our algorithm, described below (see Algorithm 1), starts by computing the degree and the clustering coefficient of each node. As a second step, the selected features are computed using equations 8, 9 and 10 and the results of the previous step. These features are then submitted to the model to predict whether to remove the edge or not. It should be pointed out that the removal of non-essential edges can induce standalone nodes, which are considered as noise and are in turn erased from the reduced graph. To achieve community detection and ensure the removal of communities with a tree-like or linear structure, as a final step we remove from the reduced graph all nodes with a clustering coefficient equal to 0. This step is based on our aforementioned interpretation of a community, which regards these structures as non-essential and leads us to the detection of communities’ cores. The clustering coefficients of the edges’ end-point nodes in the reduced graph are not recomputed but updated, in order to reduce the computational complexity. To this end, for each removed edge we update the clustering coefficients of its end-point nodes $i$ and $j$ obtained from step 1 using equation 11 below:

$$CC_i^{up} = \frac{n_i (n_i - 1)\, CC_i - J_i^j}{(n_i - 1)(n_i - 2)} \qquad (11)$$

where $CC_i^{up}$ stands for the updated clustering coefficient of node $i$, $n_i$ its degree, $CC_i$ its clustering coefficient, and $J_i^j$ the common neighbors of end-point nodes $i$ and $j$.
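The incremental update of the end-point clustering coefficients can be sketched as follows. This is a minimal Python illustration rather than the authors' implementation: it assumes the standard definition $CC_i = 2T_i / (n_i(n_i - 1))$, with $T_i$ the number of linked pairs of neighbors of $i$, under which each common neighbor lost with the edge contributes 2 to the numerator. That factor is an assumption of this sketch and should be checked against Eq. 11 as printed in the original paper. The function names and the toy graph are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(adj, i):
    """Fraction of pairs of neighbours of i that are themselves linked."""
    k = len(adj[i])
    if k < 2:
        return 0.0
    links = sum(1 for u, w in combinations(adj[i], 2) if w in adj[u])
    return 2 * links / (k * (k - 1))


def updated_cc(cc_i, n_i, common):
    """Clustering coefficient of node i after removing one incident edge (i, j).

    n_i is the degree of i before the removal and `common` the number of
    neighbours shared by i and j. The factor 2 follows from the standard
    definition above and is an assumption of this sketch.
    """
    if n_i < 3:  # fewer than two neighbours remain, so no triangle is possible
        return 0.0
    return (n_i * (n_i - 1) * cc_i - 2 * common) / ((n_i - 1) * (n_i - 2))


def remove_edge(adj, cc, i, j):
    """Remove edge (i, j) and update cc[i] and cc[j] without recomputation."""
    common = len(adj[i] & adj[j])
    for v in (i, j):
        cc[v] = updated_cc(cc[v], len(adj[v]), common)
    adj[i].discard(j)
    adj[j].discard(i)


# Hypothetical toy graph: a 4-clique on nodes 0..3 with a pendant node 4.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
cc = {v: clustering_coefficient(adj, v) for v in adj}
remove_edge(adj, cc, 0, 3)
```

After the call, `cc[0]` and `cc[3]` agree with a full recomputation on the reduced graph. Note that removing an edge also changes the clustering coefficients of the common neighbors of its end points; the sketch follows the text above and updates the end-point nodes only.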

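The five edge features of section 2 (Eq. 8, 9 and 10) can be computed from an adjacency structure as in the minimal Python sketch below. The dictionary-of-sets representation, the function names, and the toy graph are assumptions made for illustration; the sketch also assumes a graph whose average degree exceeds 1 and whose average clustering coefficient is non-zero, since both appear in denominators.

```python
from itertools import combinations


def clustering_coefficient(adj, i):
    """Fraction of pairs of neighbours of i that are themselves linked."""
    k = len(adj[i])
    if k < 2:
        return 0.0
    links = sum(1 for u, w in combinations(adj[i], 2) if w in adj[u])
    return 2 * links / (k * (k - 1))


def edge_features(adj, i, j):
    """Return the five features of edge (i, j) following Eq. 8, 9 and 10."""
    n_nodes = len(adj)
    avg_degree = sum(len(nbrs) for nbrs in adj.values()) / n_nodes
    cc = {v: clustering_coefficient(adj, v) for v in adj}
    acc = sum(cc.values()) / n_nodes  # average clustering coefficient (ACC)
    common = len(adj[i] & adj[j])  # common neighbours of the end points
    # Approximated mean number of connected neighbour pairs (denominator of Eq. 9).
    mean_connected = avg_degree * (avg_degree - 1) / 2 * acc
    return [
        len(adj[i]) / avg_degree,  # Eq. 8 for node i
        len(adj[j]) / avg_degree,  # Eq. 8 for node j
        common / mean_connected,   # Eq. 9
        cc[i] / acc,               # Eq. 10 for node i
        cc[j] / acc,               # Eq. 10 for node j
    ]


# Hypothetical toy graph: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
inside = edge_features(adj, 0, 1)  # edge inside a triangle
bridge = edge_features(adj, 2, 3)  # edge joining the two triangles
```

In this toy graph the bridge 2-3 has no common neighbors, so its third feature is 0 and its end points have below-average clustering coefficients, which is the kind of profile the classifier is trained to associate with removable edges.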
Algorithm 1
Input: Adjacency list
Initialization: set average clustering coefficient and average degree to 0
Step 1
    For each node of the adjacency list
        Compute the degree and the clustering coefficient
        Update average clustering coefficient and average degree of the network
    end
Step 2
    For each node of the adjacency list
        Compute the features of the end-point nodes using the results of step 1 and equations 8, 9, and 10
        Submit the features to the model obtained from the training phase to predict the class of the edge
        if the predicted class is 1 then
            Remove edge
            Update clustering coefficient of end-point nodes i and j using Eq. 11
        end
    end
Step 3
    Remove nodes with updated clustering coefficient equal to 0

3. Experiments and results

For the purpose of communities’ cores detection, we have selected four real-world social network datasets, namely Karate club [18], American Football College [19], Email-URV [20], and Dolphins [21]. Figures for communities’ visualization were produced using Gephi [22].

3.1 Datasets description

- Karate club: this benchmark originates from the work of Zachary [18]; it represents a network of members of a karate club that split into two groups.
- American Football College: collected by Girvan and Newman [19], this dataset concerns the football games played between American colleges in a given season. Nodes of the graph represent teams, and an edge means that its end-point teams have played a match. The conference membership of each team is available; teams in the same conference played more games against each other than against other teams.
- Email-URV: this network represents emails sent and received between Rovira i Virgili University graduate students in Spain [20].
- Dolphins: a network of frequent associations between 62 dolphins living off Doubtful Sound in New Zealand, compiled by David Lusseau [21].

3.2 Threshold settings

In what follows, we illustrate the proposed threshold-setting heuristic on the American Football College network. As shown in Table 1 below, we have progressively increased the threshold and picked the value that induces the first local maximum of the weighted clustering coefficient of the reduced graph (0.83 in the example below). Note that the threshold zero corresponds to the original network.

Table 1: American Football College dataset threshold setting using our heuristic

| Threshold settings | #Nodes | #Edges | Average degree | Weighted average clustering coefficient |
| 0    | 115 | 1232 | 10.7 | 0.40 |
| 0.05 | 115 | 918  | 7.98 | 0.72 |
| 0.1  | 113 | 820  | 7.25 | 0.83 |
| 0.15 | 107 | 784  | 7.32 | 0.80 |
| 0.2  | 103 | 656  | 6.36 | 0.82 |

Fig. 1 (a) below depicts the original graph, whereas Fig. 1 (b) shows the obtained reduced graph corresponding to the best threshold.

Figure 1: (a) original network, (b) reduced network with threshold = 0.10

3.3 Learning phase

To build the predictive model for edge removal and generalize the selection of the threshold, we have alternately selected three datasets to conduct the training phase and one dataset for the testing phase. Based on the proposed heuristic, we have selected the threshold as described in section 3.2 for each dataset and recorded the class of each edge (1 for removed, 0 for retained). Afterwards, we have collected information about the end-point nodes of each edge (degree, common neighbors of the involved nodes, and their respective clustering coefficients), as well as global information about each network (average degree and average clustering coefficient), and built a training dataset. Each record of the latter represents the five features computed
according to equations 8, 9 and 10 mentioned in section 2. To preserve essential edges from removal, we gave this class more importance, as in [12], and hence used WSVM with an RBF kernel. Finally, we have conducted a 5-fold cross-validation to assess the generalization ability of the predictive model and fine-tuned its user-defined parameters. The best accuracy obtained was 98.27%.

3.4 Results and discussion

Fig. 2 illustrates the results of step 2 of our algorithm obtained using Email URV as a testing network. As shown in Fig. 2 (b), this step can leave communities with a tree-like or linear structure. According to our interpretation of the community detection problem, the latter are considered as non-essential since they do not represent communities’ cores.

Figure 2: (a) original Email URV graph, (b) corresponding reduced graph result of step 2

The final result for the Email URV network after applying step 3 of the proposed algorithm is depicted in Fig. 3 below.

Figure 3: Communities’ cores detected from Email URV network.

The results of our algorithm using the Karate club network as a testing network are depicted in Fig. 4 below.

Figure 4: Result of our communities’ cores detection for Karate club network. (a) original network, (b) detected communities’ cores

It should be noted that, compared to the ground truth communities (see Fig. 5), our algorithm splits the first community (in blue) into two cores, since this community exhibits some weak ties for nodes 25, 26 and 32 with nodes 24, 28 and 29, which were considered as isolated nodes and removed from the graph. We do not consider this result as a drawback, since it brings an interesting level of granularity for social network analysis applications. Nodes and communities are respectively the lowest and highest levels of granularity considered in this domain. Detection of communities’ cores offers an intermediary level, namely sub-communities, that could be exploited in network analysis applications such as investigating information diffusion and refining customer targeting in marketing campaigns.

Figure 5: Karate club network ground truth [23]

For the sake of a comparative study, we have focused on the FCD algorithm described in section 1; our motivation for this choice is that, similarly to our algorithm, the latter exploits local information relative to nodes of the social graph. To this end, we have considered the aforementioned quality metrics: Modularity, Conductance, Internal Density, and finally WCC.
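For reference, the conductance (Eq. 6) and internal density (Eq. 7) of a single community can be evaluated as in the minimal Python sketch below. The dictionary-of-sets graph representation, the function names, and the toy graph are assumptions made for this illustration, and the sketch assumes a community with at least two nodes and at least one incident edge.

```python
def community_edges(adj, S):
    """Return (m_S, c_S): edges inside S and edges with exactly one end point in S."""
    # Each internal edge is seen from both end points, hence the division by 2.
    inside = sum(1 for u in S for w in adj[u] if w in S) // 2
    boundary = sum(1 for u in S for w in adj[u] if w not in S)
    return inside, boundary


def conductance(adj, S):
    """Eq. 6: share of the community's edge end points that leave the community."""
    m_s, c_s = community_edges(adj, S)
    return c_s / (2 * m_s + c_s)


def internal_density(adj, S):
    """Eq. 7: internal edges divided by the number of possible internal edges."""
    m_s, _ = community_edges(adj, S)
    n_s = len(S)
    return m_s / (n_s * (n_s - 1) / 2)


# Hypothetical toy graph: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
S = {0, 1, 2}
cond = conductance(adj, S)
dens = internal_density(adj, S)
```

Here the community {0, 1, 2} is a complete triangle with a single outgoing edge, so its internal density is maximal and its conductance is low, which is the combination the comparison below treats as high quality.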

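The threshold-selection heuristic of section 3.2 can be sketched in a few lines of Python. The sketch takes the weighted average clustering coefficients as given, stops at the first value that no longer increases, and is applied to the last column of Table 1. The function names are hypothetical, and the stopping rule is one plausible reading of "first local maximum".

```python
def weighted_acc(acc_reduced, n_reduced, n_original):
    """Average clustering coefficient of the reduced graph, weighted by the
    ratio of the reduced graph's node count to the original graph's."""
    return acc_reduced * n_reduced / n_original


def first_local_maximum(thresholds, scores):
    """Return the threshold at which `scores` first stops increasing."""
    best = 0
    for k in range(1, len(scores)):
        if scores[k] <= scores[best]:
            break  # stop early to preserve essential edges from removal
        best = k
    return thresholds[best]


# Values taken from Table 1 (American Football College dataset).
thresholds = [0, 0.05, 0.1, 0.15, 0.2]
weighted = [0.40, 0.72, 0.83, 0.80, 0.82]
chosen = first_local_maximum(thresholds, weighted)
```

With the Table 1 values the search stops at 0.15, where the weighted coefficient drops from 0.83 to 0.80, and returns the threshold 0.1, matching the setting used for Fig. 1 (b). If the score never improves on the original network, the function returns the first threshold, i.e. no edge is removed.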
Table 2 below shows the results obtained using our algorithm in terms of quality metrics.

Table 2: Obtained results using our algorithm

| Quality metrics    | Karate Club | Dolphins | Football | Email URV |
| Modularity         | 0.59 | 0.77 | 0.91 | 0.92 |
| Internal Density   | 0.55 | 0.84 | 0.92 | 0.82 |
| WCC                | 0.39 | 0.77 | 0.90 | 0.70 |
| # Core communities | 3    | 5    | 13   | 23   |

As can be seen from the comparison of Table 2 and Table 3 below, our algorithm outperforms the FCD algorithm, and consequently InfoMap, Walktrap, and GN, on the quality metrics reported in both tables (modularity, internal density, and WCC). The higher the modularity, the internal density and the WCC, the better the communities found, whereas the lower the conductance, the higher the quality of the detected community.

Table 3: Obtained results with FCD algorithm

| Quality metrics  | Karate Club | Dolphins | Football | Email-URV |
| Modularity       | 0.41 | 0.51 | 0.49 | 0.008 |
| Conductance      | 0.13 | 0.32 | 0.37 | 0.6   |
| Internal Density | 0.24 | 0.42 | 0.50 | 0.36  |
| WCC              | 0.19 | 0.16 | 0.17 | 0.007 |
| # Communities    | 2    | 6    | 13   | 31    |

4. Conclusion

In this paper, we have introduced a new algorithm for the detection of communities’ cores and essential communities based on edge removal learning. We have used real-life social network datasets to conduct the training phase with a weighted support vector machine. To this end, we have proposed a heuristic to determine the best threshold to label the edges. It should be pointed out that the selected features aggregate local end-point node information and global network characteristics, which we consider an important factor. Our experiments show promising results compared to the FCD algorithm and consequently to InfoMap, Walktrap and GN.

REFERENCES
[1] E.M. Jin, M. Girvan and M. E. J. Newman. 2001. Structure of growing social networks. Phys. Rev. E, 64, 4 (2001), 046132.
[2] P. K. Reddy, M. Kitsuregawa, P. Sreekanth and S. Srivinivasa Rao. 2002. A graph based approach to extract a neighborhood customer community for collaborative filtering. DNIS 2002, 188-200.
[3] K. Balog, L. Azzopardi and M. de Rijke. 2006. Formal models for expert finding in enterprise corpora. Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR, (2006), 43-50.
[4] M. E. J. Newman. 2006. Modularity and Community Structure in Networks. Proc. Natl. Acad. Sci. USA, 103 (2006), 8577–8582.
[5] M. E. J. Newman and M. Girvan. 2004. Finding and evaluating community structure in networks. Phys. Rev. E, 69, 2 (2004), DOI:10.1103/PhysRevE.69.02
[6] M. E. J. Newman. 2004. Fast algorithm for detecting community structure in networks. Phys. Rev. E, 69, 6 (2004), 066133. DOI: 10.1103/PhysRevE.69.066133.
[7] A. Prat-Pérez, D. Dominguez-Sal, J.M. Brunat and J-L. Larriba-Pey. 2012. Shaping communities out of triangles. Proceedings of the 21st ACM CIKM, 1677–1681.
[8] A. Prat-Pérez, D. Dominguez-Sal and J-L. Larriba-Pey. 2014. High Quality, Scalable and Parallel Community Detection for Large Real Graphs. Proceedings of the 23rd international conference on World wide web (WWW'14), ACM, Korea, 225-236. DOI: http://dx.doi.org/10.1145/2566486.2568010
[9] M. Girvan and M.E.J. Newman. 2002. Community structure in social and biological networks. PNAS, 99, 12 (2002), 7821–7826.
[10] K. Alfalahi, Y. Atif and S. Harous. 2013. Community Detection in Social Networks through Similarity Virtual Networks. Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Ontario, Canada, 1116-1123.
[11] A. Clauset, M. E. J. Newman and C. Moore. 2004. Finding community structure in very large networks. Physical Review E, 70, 6 (2004), 066111. DOI: 10.1103/PhysRevE.70.066111
[12] T. V. Laarhoven and E. Marchiori. 2013. Network community detection with edge classifiers trained on LFR graphs. Proceedings of the 21st European Symposium on Artificial Neural Networks, ESANN, Bruges, Belgium, (2013).
[13] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto and D. Parisi. 2004. Defining and identifying communities in networks. Proc. Natl. Acad. Sci. USA, 101, 9 (2004), 2658-2663.
[14] Y.Y. Ahn, J.P. Bagrow and S. Lehmann. 2010. Link communities reveal multiscale complexity in networks. Nature, 466, 7307 (2010), 761-764.
[15] Y. Song and S. Bressan. 2013. Fast community detection. DEXA, 1, 404-418.
[16] M. Rosvall and C.T. Bergstrom. 2008. Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. USA, 105, 4 (2008), 1118–1123.
[17] P. Pons and M. Latapy. 2006. Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications, 10, 2 (2006), 191–218.
[18] W. W. Zachary. 1977. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33, 452-473.
[19] M. Girvan and M.E.J. Newman. 2002. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 99, 7821-7826.
[20] R. Guimera, L. Danon, A. Diaz-Guilera, F. Giralt and A. Arenas. 2003. Self-similar community structure in a network of human interactions. Physical Review E, 68, 6 (2003), 065103.
[21] D. Lusseau, K. Schneider, O.J. Boisseau, P. Haase, E. Slooten and S.M. Dawson. 2003. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology, 54, 4 (2003), 396-405.
[22] M. Bastian, S. Heymann and M. Jacomy. 2009. Gephi: an open source software for exploring and manipulating networks. Proceedings of the 3rd International ICWSM Conference, 361-362.
[23] M. Chen, K. Kuzmin and B.K. Szymanski. 2014. Community Detection via Maximization of Modularity and Its Variants. IEEE Trans. Computational Social Systems, 1, 1 (2014), 46-65.
