This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2020.3041472, IEEE Access

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

Digital Object Identifier 10.1109/ACCESS.2017.DOI Number

Overlapping Community Detection Method Based on Network Representation Learning and Density Peaks

HONGTAO LIU1 AND GEGE LI2
Chongqing University of Posts and Telecommunications, Chongqing 400000, China

Corresponding author: GEGE LI2 (e-mail: S180201032@stu.cqupt.edu.cn).

This work was supported in part by the National Social Science Fund of China 18BGL266 and 20BSH076.

ABSTRACT Research on complex social networks has attracted extensive attention from scholars, and community detection is an important research direction in the study of network structure. Network data is often high-dimensional and very large, which makes it difficult to process. It is therefore of great significance for community detection to represent network structure with low-dimensional vectors. In addition, many real-world social networks contain overlapping communities. In this paper, we propose an overlapping community detection method based on network representation learning and density peaks, called NRLDP. First, it uses network representation learning to represent an unweighted or weighted network with low-dimensional vectors. Then, it applies the density peaks clustering algorithm to overlapping community detection, uses cosine similarity to calculate the distance between nodes, and improves the local density calculation. Finally, it selects the core nodes according to relative distance and local density, and allocates the remaining nodes to achieve overlapping community detection on unweighted or weighted networks. Compared with related community detection methods on real-world social networks and on synthetic networks generated by the LFR benchmark, the experimental results show that the proposed approach is effective and accurate.

INDEX TERMS Network representation learning, density peaks, overlapping community detection

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION
In the late 1990s, Duncan J. Watts and Steven H. Strogatz published "Collective dynamics of 'small-world' networks" in Nature [1], followed by A. L. Barabasi et al., who published "Emergence of Scaling in Random Networks" in Science [2]. These two articles mark the birth of small-world networks and scale-free networks, which are closer to the real world, and opened a new era of complex network research. A social network is essentially a special kind of complex network. The nodes in the network represent individual users or groups, and the edges represent the relationships formed between nodes through interaction. Mining useful information from the network is very important for scientific research and applications. Early researchers proposed that studying social groups is more meaningful for discovering hidden laws in the network than studying individual users alone. Researchers have used community detection algorithms to find homogeneous community structure in complex social networks [3]. Individuals in a social group often exhibit homogeneity; for example, individuals in groups with the same interests have a high probability of becoming friends. In addition, many complex social networks exhibit a strong social effect, which manifests as the formation of a variety of closely connected groups: contacts between individuals within a group are relatively frequent, while contact with individuals outside the group is much less frequent. If an individual is assigned to multiple groups, the task is overlapping community detection, and such individuals are overlapping nodes. Overlapping nodes are common in real networks, so overlapping community detection has important research significance.
With the development of social networks, network data is often high-dimensional, very large, and complex, making it difficult to process. Traditional community detection methods mainly obtain community information from the adjacency matrix representation of the network, but the adjacency matrix can only represent the direct connections between nodes, which often leads to
the curse of dimensionality and excessive complexity. Network Representation Learning (NRL), by contrast, can represent the information in the network with low-dimensional vectors, which not only expresses the network structure but also reduces the computational complexity.
In this paper, we propose an overlapping community detection method (NRLDP) based on network representation learning and density peaks, which addresses both the problem of network data representation and the problem of irregular community structure in real networks.
The rest of this paper is organized as follows. Section II summarizes related work, and Section III describes the proposed NRLDP algorithm in detail. Section IV presents the experimental results of NRLDP on synthetic and real network data sets and compares them with other algorithms. Conclusions and suggestions for future work are presented in Section V.

II. RELATED WORK
A. CLASSICAL OVERLAPPING COMMUNITY DETECTION METHODS
Classical overlapping community detection methods can be roughly divided into the following categories: clique percolation based methods, graph partitioning based methods, local expansion and optimization based methods, and label propagation based methods.
In 2005, Palla et al. proposed the Clique Percolation Method (CPM) [4] to detect overlapping communities. Its main idea is to detect overlapping communities based on k-cliques. A k-clique is a fully connected subgraph with k nodes. If two k-cliques share k-1 nodes, they are called adjacent k-cliques, and a set of k-cliques connected through such adjacencies forms an overlapping community detected by the CPM algorithm. In 2009, Shen et al. proposed the EAGLE algorithm [5], which combines the ideas of hierarchical clustering and maximal cliques. It introduces a modularity function EQ to evaluate the quality of detected overlapping communities, and it considers both the overlapping structure of communities and the hierarchical structure between them. In addition, Farkas et al. [6] extended the CPM algorithm to weighted networks and proposed the CPMw algorithm, but the algorithm stipulates that only k-cliques whose internal density exceeds a given threshold can become a community, which has certain limitations.
There are two main families of methods based on graph partitioning. One uses non-negative matrix factorization (NMF). Zhang et al. [7] first used the NMF model in overlapping community detection, but it requires the number of communities to be specified. Subsequently, many scholars proposed improved methods, such as SNMF [8]. In addition, NMFOSC [9] solves this problem through feature matrix preprocessing and ranking optimization, thereby discovering the structure of networks with an unknown number of communities. In 2018, Li et al. [10] proposed an overlapping community detection algorithm based on semi-supervised matrix factorization and random walk. The other family is spectral clustering, which partitions the graph mainly based on the eigenvectors of the network's adjacency matrix. Some researchers prioritize local information [11], [12], but the optimization is still carried out in the entire feature space.
Researchers have also considered the local structure of the network. The local expansion algorithm [13] is mainly based on the idea of community growth: a community seed is locally expanded and optimized through a custom local expansion function until it becomes the community with the greatest benefit, which yields a community structure. Lancichinetti et al. [14] proposed the LFM algorithm, which randomly selects a node as the initial seed and then expands from the seed node to build a community until the fitness function reaches a local optimum. Liu et al. [15] proposed a locally optimal extended hierarchical clustering algorithm. Yu et al. [16] proposed SEOCO, a seed expansion overlapping community detection algorithm based on random walk. Bhatia [17] proposed a hierarchical method that uses an autoencoder to initialize candidate seed nodes and determines the number of communities by considering the network structure. The disadvantage of these methods is that the quality of the partition depends on the selection of seeds, which leads to unstable results.
Label propagation methods also consider the local structure of the network. The idea is to initialize a label for each node in the network and then update each label according to the labels of neighboring nodes until the labels no longer change. COPRA [18] and SLPA [19] are two classic algorithms. Both extend the LPA algorithm [20] to detect overlapping communities. COPRA is based on multiple labels and is the first fuzzy overlapping community detection algorithm based on label propagation. It assigns a series of labels to each node and uses the parameter v to control the length of the series, that is, the maximum number of labels a node can hold and hence the maximum number of communities a node can belong to. SLPA is a speaker-listener label propagation algorithm, which propagates labels between nodes according to speaker-listener interaction rules. Label propagation methods are simple and efficient, but the community structure they detect is highly uncertain.
In 2014, Rodriguez et al. [21] proposed the density peak clustering algorithm. Compared with other clustering algorithms, it clusters non-spherical data sets well. In real complex networks, the relationships between nodes are often intricate and the community structure is irregular. Therefore, an improved density peak algorithm used in community detection can effectively divide
the irregular community structure. Deng et al. [22] improved the density peak algorithm for community detection. Their method combines Jaccard similarity and shortest paths into a composite similarity and derives the distance used as the algorithm's input from this node similarity, but it cannot detect overlapping communities.

B. NETWORK REPRESENTATION LEARNING
The main task of network representation learning is to represent the structural features of any node in the network with low-dimensional vectors for better data mining. Recently, research on word embedding in the field of natural language processing has provided new ideas for the feature representation of network nodes. Perozzi et al. introduced word embedding techniques into network representation learning and proposed Deepwalk [23]. This result triggered a wave of research on network representation learning.
Deepwalk combines two models from different fields. One is the random walk, which is used to generate a large number of node sequences; in network representation learning these play the role of sentences in a natural language corpus. The other is the Skipgram model proposed by Mikolov et al. in word2vec, which takes the obtained node sequences as input and learns the vector representation of the nodes in the network. The framework of Deepwalk is shown in Figure 1.

FIGURE 1. The framework diagram of Deepwalk.

C. DENSITY PEAKS CLUSTERING
Density Peak Clustering (DPC) is a density-based clustering algorithm proposed by Rodriguez et al. in Science [21]. Based on the characteristics of the data distribution, DPC assumes that cluster centers have a large local density and that the relative distance between different cluster centers is relatively large. Given the center points, it then assigns each remaining sample point to the cluster of a nearby point with greater local density. The algorithm includes the following steps.
First, the DPC algorithm defines the local density $\rho_i$ of node $i$ by Equations (1) and (2):

$$ \rho_i = \sum_j \chi(d_{ij} - d_c) \tag{1} $$

$$ \chi(x) = \begin{cases} 1, & x < 0 \\ 0, & \text{otherwise} \end{cases} \tag{2} $$

where $\rho_i$ is the local density of node $i$, $d_{ij}$ is the distance between node $i$ and node $j$, and $d_c$ is the cutoff distance.
Second, the relative distance $\delta_i$ of node $i$ is defined by Equation (3):

$$ \delta_i = \begin{cases} \max_j (d_{ij}), & \rho_i = \max_k (\rho_k) \\ \min_{j:\rho_j > \rho_i} (d_{ij}), & \text{otherwise} \end{cases} \tag{3} $$

Then, the points with relatively large $\rho_i$ and $\delta_i$ are selected as the cluster centers. Finally, each remaining sample point is assigned to the cluster of its nearest point with greater local density.

III. NRLDP ALGORITHM
This section details the framework of the proposed NRLDP algorithm. The method mainly includes four steps. First, we use Deepwalk to represent the structural information of each node in the network with a low-dimensional, continuous vector, and calculate the distance between nodes from the similarity of their vector representations. Second, we use the degree of the node and the Local Clustering Coefficient (LCC) [24] to measure the local density of each node, and calculate the relative distance of each node. Then, we choose the points with higher local density and relatively large relative distance as the core points of the communities. Finally, we allocate the remaining nodes according to their degree of belonging in order to detect overlapping communities. The flow chart of the NRLDP algorithm is shown in Figure 2.

A. RELATIVE DISTANCE CALCULATION
The relationship between nodes in a complex social network is often represented by an adjacency matrix. The elements of the adjacency matrix of an unweighted network are 0 or 1: 0 means that there is no connection between two nodes, and 1 means that the nodes are directly connected. In a weighted network, connected nodes are represented by their weights. In a real-world social network, most individuals have direct connections with only a very limited number of other individuals, so only a small amount of inter-node connection information can be obtained, and the relationship between nodes cannot be well represented. To solve this problem, this paper uses the Deepwalk algorithm to preprocess complex social network data, and each
node is represented by a low-dimensional vector. The node vector represents the structural information of the node in the network. Therefore, in the vector space, the greater the similarity of two node vectors, the closer the nodes; conversely, the smaller the similarity, the farther apart they are.

FIGURE 2. The flow chart of NRLDP algorithm.

Usually a network with $n$ nodes can be regarded as a graph $G = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the node set and $E = \{e_1, e_2, \ldots, e_m\}$ is the edge set. The set of node vector representations obtained by network representation learning is $X = \{x_{v_1}, x_{v_2}, \ldots, x_{v_n}\}$, which can represent the nodes of either an unweighted or a weighted network.
Given the vector representations of nodes $i$ and $j$, the cosine similarity is used to calculate the similarity $S(i, j)$ between them, as shown in Equation (4):

$$ S(i, j) = \frac{\sum_{k=1}^{n} x_{v_i,k} \, x_{v_j,k}}{\sqrt{\sum_{k=1}^{n} x_{v_i,k}^2} \times \sqrt{\sum_{k=1}^{n} x_{v_j,k}^2}} \tag{4} $$

where $n$ here denotes the dimension of the node vectors and $x_{v_i,k}$ is the $k$-th component of $x_{v_i}$.
Then, from the similarity between nodes $i$ and $j$, we calculate the distance $D(i, j)$ from node $i$ to node $j$ as shown in Equation (5):

$$ D(i, j) = 1 - S(i, j) \tag{5} $$

B. LOCAL DENSITY CALCULATION
In the density peak clustering algorithm, according to Equations (1) and (2), the local density of node $i$ is the number of nodes within the cutoff distance of node $i$. However, in a network topology, nodes do not exist independently; they are connected by the edges of the network. A direct connection between two nodes means a closer connection, while an indirect connection means a weaker one. Therefore, the local density of a node cannot simply be calculated from the number of nodes around it.
As shown in Figure 3 (a) and (d), node A has 3 neighbors in both cases. However, it can be seen from the figure that the local density of node A in (a) is obviously greater than in (d). Therefore, when calculating the node density, not only the number of neighbors of the node but also the tightness of the connections around the node must be considered. This paper therefore uses the degree and the local clustering coefficient to calculate the local density of nodes, as shown in Equation (6):

$$ \rho_i = k_i + \frac{2 \times \sum_{j,h} a_{jh} a_{ij} a_{ih} a_{ji} a_{hi}}{k_i (k_i - 1)} \tag{6} $$

where $k_i$ is the degree of node $i$ and $a_{ij}$ is the adjacency matrix element for nodes $i$ and $j$. As the worked values below show, the sum counts each pair of neighbors $j$, $h$ of node $i$ once.

FIGURE 3. Four different structure network diagrams with node A degree of 3.

Applying Equation (6) to node A in the different network structures of Figure 3 gives the following local densities: in (a), $3 + (2 \times 3)/(3 \times 2) = 4$; in (b), $3 + (2 \times 2)/(3 \times 2) \approx 3.67$; in (c), $3 + (2 \times 1)/(3 \times 2) \approx 3.33$; and in (d), $3 + (2 \times 0)/(3 \times 2) = 3$.
According to Equation (3), if node $i$ is the point of maximum local density, its relative distance $\delta_i$ is the distance between node $i$ and the node $j$ farthest from it. Otherwise, the relative distance $\delta_i$ is the distance between node $i$ and the closest node $j$ whose local density is higher than that of node $i$.

C. SELECTION OF THE CORES
From the previous sections, the local density $\rho$ and relative distance $\delta$ of each node in the network can be obtained. In the DPC algorithm, the key step is to select the points with larger $\rho$ and $\delta$ as cluster centers according to the decision graph. However, only a few points have obviously larger $\rho$ and $\delta$, and in real networks community sizes vary widely, so the centers of some communities have a relatively small density. Such points are not prominent in the decision graph. Two situations may therefore arise: in one, both $\rho$ and $\delta$ are large; in the other, one of them is relatively large and the other is relatively small. In order to select the center points of the communities more accurately, this paper selects them in three steps.
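As a sanity check on the local density that feeds the selection steps, the following minimal Python sketch evaluates Equation (6) for node A under four neighborhood layouts. The layouts are assumptions chosen to match the worked values given for Figure 3 (3, 2, 1, and 0 edges among A's three neighbors), since the figure itself is not reproduced here. The helper names `make_graph` and `local_density` are illustrative and do not come from the paper, and treating the clustering term as 0 for nodes of degree below 2 is also an assumption.

```python
from itertools import combinations


def make_graph(edges):
    """Build an undirected adjacency structure: node -> set of neighbors."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_density(adj, i):
    """Equation (6): degree of node i plus its local clustering coefficient."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        # Assumption: the clustering term is taken as 0 when it is undefined.
        return float(k)
    # Count edges between neighbors of i, each unordered pair once.
    links = sum(1 for j, h in combinations(sorted(neighbors), 2) if h in adj[j])
    return k + 2 * links / (k * (k - 1))


# Node A always has neighbors B, C, D; only the edges among them differ.
star = [("A", "B"), ("A", "C"), ("A", "D")]
variants = {
    "a": star + [("B", "C"), ("C", "D"), ("B", "D")],
    "b": star + [("B", "C"), ("C", "D")],
    "c": star + [("B", "C")],
    "d": star,
}
densities = {
    name: round(local_density(make_graph(edges), "A"), 2)
    for name, edges in variants.items()
}
print(densities)  # {'a': 4.0, 'b': 3.67, 'c': 3.33, 'd': 3.0}
```

The printed values agree with the worked values 4, 3.67, 3.33, and 3 stated for node A, which suggests that the sum in Equation (6) counts each connected pair of neighbors once.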

First, according to the decision graph, we select as center points the points whose local density $\rho$ and relative distance $\delta$ are both significantly large.
Next, to handle the second situation, the NRLDP algorithm selects community center points using the product of $\rho$ and $\delta$, which is called the center value of the node and denoted by $\gamma$. Before calculating $\gamma$, $\rho$ and $\delta$ must be normalized so that they lie in the same range, as shown in Equations (7) and (8):

$$ \rho_i^* = \frac{\rho_i - \min(\rho)}{\max(\rho) - \min(\rho)} \tag{7} $$

$$ \delta_i^* = \frac{\delta_i - \min(\delta)}{\max(\delta) - \min(\delta)} \tag{8} $$

Then, we use Equation (9) to calculate the center value of node $i$. Sorting the center values in ascending order shows how the center value changes from small to large. The larger the center value, the more likely node $i$ is a center point.

$$ \gamma_i = \rho_i^* \times \delta_i^* \tag{9} $$

Finally, clusters in the DPC algorithm are disjoint, whereas networks often contain overlapping nodes, which are generally located at the edges of communities. For such a node, the relative distance to the nearest node with higher local density is relatively large, and if it belongs to multiple communities it may also have a relatively high local density, so it is easily selected as a center point in the center list. The neighbors of an overlapping node, however, lie closer to one particular community and are more tightly connected, so their local density is higher than that of the overlapping node. To exclude overlapping nodes from the center points, this paper compares the local density of each candidate with the average local density of its neighbor nodes and deletes from the center list the candidates whose local density is lower than that average.

D. OVERLAPPING ALLOCATION OF REMAINING NODES
Given the center points selected by the NRLDP algorithm in the previous section, the remaining nodes are allocated so that overlapping communities are detected. In the density peak clustering algorithm, each remaining sample point is assigned to the cluster of its nearest point with greater local density, so each remaining sample point belongs to only one cluster. This paper therefore modifies the method to allow overlapping allocation of the remaining nodes. The specific allocation steps are as follows.
First, assuming that the number of community center points obtained is $m$, we set the community labels as $c = \{c_1, c_2, \ldots, c_m\}$.
Second, we calculate the attribution degree of node $i$, represented by $p_{i,c} = \{p_{i,c_1}, p_{i,c_2}, \ldots, p_{i,c_m}\}$, with values in the range 0 to 1. The greater the degree of belonging of a node to a certain community, the greater the probability that the node is assigned to that community. We first set the attribution degree of each center point to its own community to 1, and its attribution degree to all other communities to 0. For example, if the community label corresponding to center point 3 is $c_2$, then $p_{3,c_2}$ is 1 and the rest are 0, that is, $p_{3,c} = \{0, 1, 0, \ldots, 0\}$. Then, we use Equation (10) to calculate the degree of belonging of the remaining nodes:

$$ p_{i,c} = \sum_{j \in neigh} \frac{S(i, j)}{\sum_{k \in neigh} S(i, k)} \, p_{j,c} \tag{10} $$

where $S(i, j)$ is the similarity between nodes $i$ and $j$, and $neigh$ is the set of $N$ neighbor nodes with greater local density than node $i$.
Most people in social networks are influenced by their friends and tend to stay with them, so they are more likely to be in the same community as their friends. When calculating node affiliation, this paper therefore considers, on the one hand, the similarity between node $i$ and its $N$ neighbor nodes with greater local density and, on the other hand, the affiliation degrees of those $N$ neighbors. Equation (10) expresses that the greater the similarity between a neighbor node $j$ and node $i$, and the greater the degree of belonging of $j$ to a certain community, the greater the degree of belonging of node $i$ to that community.
Third, if the ratio of the degree of belonging of node $i$ to community $c_r$ to its degree of belonging to community $c_t$ reaches the threshold $\sigma$, that is, $p_{i,c_r} / p_{i,c_t} \geq \sigma$, then node $i$ is allocated to both communities $c_r$ and $c_t$. In other words, the degrees of belonging of node $i$ to $c_r$ and $c_t$ are not much different, so node $i$ is an overlapping node that belongs to both communities.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
To verify the feasibility and effectiveness of the proposed method, we compare it with algorithms from recent years on real network data sets and artificial synthetic network data sets. The experiments were run on a PC (Windows 10 64-bit, Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz, 8GB RAM).

A. EXPERIMENTAL DATASETS
1) REAL WORLD NETWORK
Karate Club Network [25]. This data set describes the friendships between members of a karate club at a university in the United States.
Dolphin Network [26]. This data set describes the relationships between 62 bottlenose dolphins in New Zealand. An edge in the network indicates that two dolphins often move together.
Football is an American college football game network [27]. This data set describes the games played between 115 college teams, divided into 12 leagues, in a football league in the United States.
Polbooks is a network of American political books [28]. This data set describes the sales network of American political books on Amazon during the 2004 US presidential election.
Lesmis is a network of characters in a novel [29]. This data set describes the network of characters in Hugo's famous novel "Les Miserables".
Polblogs is an American political blog network [30]. This data set describes the citations between blogs of different political orientations during the 2004 US presidential election.
The specific information of the above real public data sets is shown in Table 1.

TABLE 1. Description of real world networks.

Data sets    Vertex number    Edge number    Average degree
Karate       34               78             4.6
Dolphin      62               159            5.1
Football     115              613            10.6
Polbooks     105              441            8.4
Lesmis       77               254            6.6
Polblogs     1224             19022          27.3

2) SYNTHETIC NETWORK
The experiments in this paper use the widely used LFR Benchmark program [31] to generate artificial synthetic network data sets. The networks generated by this program exhibit clear community structure and also simulate real networks well. The parameters of the LFR benchmark program are adjusted to generate synthetic networks of different scales and connection strengths. The parameters are described in Table 2. Among them, the value of $\mu$ ranges from 0 to 1. The larger the value of $\mu$, the weaker the connections within communities, the more complex the network structure, and the more difficult it is to detect the community structure in the network.
This paper uses the LFR benchmark program to generate 5 groups of artificial networks of different scales. The basic parameter settings of the experiment are shown in Table 3.

TABLE 2. Description of LFR benchmark parameters.

Parameters    Description
N             Number of nodes
k             Average degree
maxk          Maximum degree
μ             Mixing parameter
t1            Minus exponent for the degree sequence
t2            Minus exponent for the community size distribution
minc          Minimum for the community sizes
maxc          Maximum for the community sizes
On            Number of overlapping nodes
Om            Number of memberships of the overlapping nodes
C             Average clustering coefficient

TABLE 3. Parameter setting information of synthetic network.

Parameters    N1           N2          N3          N4          N5
N             1000         2000        3000        4000        5000
k             {5,10}       {5,10}      {5,10}      {5,10}      {5,10}
maxk          {20,50}      {20,50}     {20,50}     {20,50}     {20,50}
μ             {0.1,0.3}    0.1         0.1         0.1         0.1
minc          {10,20}      {10,20}     {10,20}     {10,20}     {10,20}
maxc          {50,100}     {50,100}    {50,100}    {50,100}    {50,100}
On            {100,500}    200         300         400         500
Om            2-8          2           2           2           2

B. EXPERIMENTAL METHOD AND EVALUATION INDICATOR
1) EXPERIMENTAL METHOD
In order to verify the effectiveness and feasibility of the proposed method, we compare NRLDP with related recent algorithms and classic algorithms. The selected comparison algorithms are OCDRDD [32], DCN [33], CDRS [34], LDC [35], Multiscale [36], and COPRA [37]. We use different evaluation indicators to compare these algorithms on the real and artificial network data sets and thereby evaluate the accuracy of the NRLDP algorithm.

2) EVALUATION INDICATOR
The evaluation indicators used in this paper are EQ [38] and Normalized Mutual Information (NMI) [39], which are commonly used for overlapping community detection algorithms.
The overlapping modularity (EQ) is used to evaluate the quality of the overlapping community structure. The closer the EQ value is to 1, the better the quality of the overlapping community structure found by the algorithm. EQ is defined as follows:

$$ EQ = \frac{1}{2m} \sum_{l} \sum_{i \in c_l, j \in c_l} \frac{1}{O(i) O(j)} \left( A(i, j) - \frac{k(i) k(j)}{2m} \right) \tag{11} $$
where $m$ is the number of edges in the network and $c_l$ is the $l$-th community. $O(i)$ and $O(j)$ are the numbers of communities to which nodes $i$ and $j$ belong, respectively. $A(i, j)$ is the element of the network's adjacency matrix for nodes $i$ and $j$, and $k(i)$ and $k(j)$ are the degrees of nodes $i$ and $j$, respectively.
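The overlapping modularity of Equation (11) can be computed directly from an adjacency structure and a list of (possibly overlapping) node sets. The sketch below is a minimal, unoptimized Python illustration under the assumption of an undirected, unweighted graph stored as a dict of neighbor sets. The function name `overlapping_modularity` and the toy graph (two triangles joined by one edge) are hypothetical and are not taken from the paper's experiments.

```python
def overlapping_modularity(adj, communities):
    """Equation (11) for an undirected, unweighted graph.

    adj: dict mapping each node to the set of its neighbors.
    communities: list of node sets; a node may appear in several sets.
    """
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    # O(i): number of communities that contain node i.
    count = {v: 0 for v in adj}
    for community in communities:
        for v in community:
            count[v] += 1
    total = 0.0
    for community in communities:
        for i in community:
            for j in community:  # ordered pairs, including i == j
                a_ij = 1.0 if j in adj[i] else 0.0
                expected = len(adj[i]) * len(adj[j]) / (2 * m)
                total += (a_ij - expected) / (count[i] * count[j])
    return total / (2 * m)


# Toy graph (assumption): triangles {0,1,2} and {3,4,5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

disjoint = [{0, 1, 2}, {3, 4, 5}]
overlapping = [{0, 1, 2, 3}, {2, 3, 4, 5}]
print(round(overlapping_modularity(adj, disjoint), 4))  # 0.3571 (= 5/14)
print(overlapping_modularity(adj, overlapping))
```

When no node is shared, every $O(i)$ equals 1 and the expression reduces to the usual modularity, which is why the disjoint split of the toy graph gives 5/14. Placing all nodes in a single community gives 0.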
Normalized mutual information (NMI) is an information-theoretic measure of the similarity between two divisions of the same node set. It is used to compare the community division produced by the algorithm with the real division. The closer the NMI value is to 1, the closer the detected division is to the real division. NMI is defined as follows.
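$$NMI(A, B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} N_{ij} \log\left( \frac{N_{ij} \cdot N}{N_i \cdot N_j} \right)}{\sum_{i=1}^{C_A} N_i \log\left( \frac{N_i}{N} \right) + \sum_{j=1}^{C_B} N_j \log\left( \frac{N_j}{N} \right)} \tag{12}$$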
where $C_A$ is the standard division result, $C_B$ is the division result of the algorithm, $N_{ij}$ is the number of nodes shared by the $i$-th community in $C_A$ and the $j$-th community in $C_B$, $N$ is the total number of nodes in the network, $N_i$ is the number of nodes in the $i$-th community of $C_A$, and $N_j$ is the number of nodes in the $j$-th community of $C_B$.

C. PARAMETER EXPERIMENT
1) THRESHOLD IN NRLDP ALGORITHM
In order to verify how the threshold 𝜎 used by the NRLDP algorithm when assigning nodes influences the detection results on different networks, we carried out experiments on 5 different networks, 𝑁1, 𝑁2, 𝑁3, 𝑁4, and 𝑁5.

FIGURE 4. Threshold 𝜎 experiment.

As shown in Figure 4, for different 𝜎 values, the EQ and NMI values differ little on data sets of different scales. The threshold 𝜎 used by the NRLDP algorithm when assigning nodes therefore has little effect on the detection results, and 𝜎 = 0.9 is used in the following experiments.

2) PARAMETERS IN SYNTHETIC NETWORK
Different network structures may result in different algorithm performance. In order to test which network parameters affect the NRLDP algorithm, this experiment is mainly carried out from the following aspects.
First, we test the influence of the node degree, the scale of the network, and the scale of the communities in the network on the algorithm. The parameter k is the average node degree, and the other parameters are held the same. k is set to 5 and 10, with 𝑚𝑎𝑥𝑘 set to 20 and 50 respectively. The community size is set to (20, 100) for large communities and (10, 50) for small communities. Experiments were carried out on 5 networks of different sizes, 𝑁1, 𝑁2, 𝑁3, 𝑁4, and 𝑁5. As shown in Figure 5(a), on data sets of different network sizes, the NMI value with an average node degree of 10 is generally larger. Because the average degree of most large-scale real social networks is around 10, the experimental results are in line with the actual situation. According to Figure 4 and Figure 5(a) and (b), most of the curves change little as the network scale increases, so the network size has little effect on the NRLDP algorithm. Therefore, the next experiment uses the network size 𝑁1 = 1000 to test the influence of the internal connection strength and the overlap on the algorithm. The parameter 𝜇 is the internal connection strength coefficient of the community, which is set to 0.1 and 0.3 respectively. On is set to 100 and 500 respectively, and Om is set to 2, 4, 6, and 8. In Figure 5(c), the larger the 𝜇 value, the more complex the community structure in the network, and the more


difficult it is to detect. In Figure 5(d), the degree of community overlap also has a certain impact on the algorithm: the greater the degree of overlap, the lower the algorithm performance.

FIGURE 5. Experiment of parameters on synthetic network.

D. EXPERIMENTAL RESULTS ON REAL WORLD NETWORK
In order to verify the method proposed in this paper, this section runs experiments on 6 real network data sets, uses EQ as the measurement index, and compares NRLDP with 6 community detection algorithms. The results are shown in Table 4.

TABLE 4. The EQ value of different algorithms on six networks.

Network    NRLDP    OCDRDD   DCN      CDRS     LDC      Multiscale   COPRA
Karate     0.3986   0.3715   0.3715   0.3434   0.2469   0.2599       0.1654
Dolphin    0.3874   0.3801   0.3787   0.3717   0.3693   0.327        0.3759
Football   0.5505   0.3585   0.3534   0.551    0.2415   0.384        0.5863
Polbooks   0.4956   0.4993   0.4456   0.3777   0.4339   0.2185       0.4802
Lesmis     0.4197   0.483    0.1102   0.4386   0.1597   0.3866       0.4997
Polblogs   0.4029   0.2967   0.1269   0.14     0.078    -            0.3159
Avg        0.4425   0.3981   0.2977   0.3704   0.2548   0.3152       0.4039

According to the EQ values in Table 4, the EQ value of the NRLDP algorithm proposed in this paper is greater than that of the DCN, LDC, and Multiscale algorithms on the 6 data sets. It is also greater than that of the OCDRDD algorithm except on the Polbooks data set, where it is slightly smaller, and on the Lesmis data set (0.4197 versus 0.483 in Table 4). On the Football and Lesmis data sets, the CDRS and COPRA algorithms achieve better results, but overall the average EQ value of the NRLDP algorithm is greater than that of all 6 comparison algorithms. Therefore, it can be said that on average the NRLDP algorithm performs better than the OCDRDD, DCN, CDRS, LDC, Multiscale, and COPRA algorithms on the real-world network data sets.

E. EXPERIMENTAL RESULTS ON SYNTHETIC NETWORK
To further verify the effectiveness and feasibility of the NRLDP algorithm, the experiment in this section is run on the same artificial synthetic network: the number of network nodes is 1000, k = 10, μ = 0.1, the community size is 20 to 100, On = 10%, and normalized mutual information (NMI) is selected as the measurement indicator. The selected comparison algorithms are the DCN, CDRS, LDC, Multiscale, and COPRA algorithms, and the results are shown in Figure 6.

FIGURE 6. The NMI value of different algorithms on synthetic network.

It can be seen from Figure 6 that although the NMI value of the NRLDP algorithm decreases as the number of communities Om to which overlapping nodes belong increases, its NMI value before Om reaches 4 is greater than that of the


other 6 algorithms. In most real-world cases, the number of communities to which an overlapping node belongs does not exceed 5. Therefore, in this sense, the NRLDP algorithm has certain advantages.

V. CONCLUSION
This paper proposes an overlapping community detection method (NRLDP) based on network representation learning and density peaks. In order to deal with high-dimensional and complex network data, NRLDP first uses network representation learning technology to represent unweighted or weighted networks with low-dimensional vectors. Then it uses cosine similarity to calculate the distance between nodes and improves the local density calculation method. The core nodes are selected according to the relative distance and local density, and the remaining nodes are then allocated to detect the overlapping community structure of the unweighted or weighted network. The experimental results on both synthetic and real networks demonstrate that our method is accurate and effective for overlapping community detection. For future work, we will consider using low-dimensional vectors to represent networks with node and edge attribute information, and further enhance the algorithm to detect more complex network community structures.

REFERENCES
[1] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks,” vol. 393, 1998, pp. 3.
[2] A.-L. Barabási and R. Albert, “Emergence of Scaling in Random Networks,” Science, vol. 286, no. 5439, pp. 509–512, Oct. 1999, DOI: 10.1126/science.286.5439.509.
[3] “Researchers from Purdue University Provide Details of New Studies and Findings in the Area of Expert Systems.” https://www.zhangqiaokeyan.com/academic-journal-foreign_other_thesis/020411135651.html (accessed Nov. 05, 2020).
[4] G. Palla, I. Derényi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, no. 7043, pp. 814–818, Jun. 2005, DOI: 10.1038/nature03607.
[5] F. Wei, W. Qian, C. Wang, and A. Zhou, “Detecting Overlapping Community Structures in Networks,” World Wide Web, vol. 12, no. 2, pp. 235–261, Jun. 2009, DOI: 10.1007/s11280-009-0060-x.
[6] I. Farkas, D. Ábel, G. Palla, and T. Vicsek, “Weighted network modules,” New J. Phys., vol. 9, no. 6, pp. 180–180, Jun. 2007, DOI: 10.1088/1367-2630/9/6/180.
[7] S. Zhang, R.-S. Wang, and X.-S. Zhang, “Uncovering fuzzy community structure in complex networks,” Phys. Rev. E, vol. 76, no. 4, p. 046103, Oct. 2007, DOI: 10.1103/PhysRevE.76.046103.
[8] F. Wang, T. Li, X. Wang, S. Zhu, and C. Ding, “Community discovery using nonnegative matrix factorization,” Data Min Knowl Disc, vol. 22, no. 3, pp. 493–521, May 2011, DOI: 10.1007/s10618-010-0181-y.
[9] N. Chen, Y. Liu, and H.-C. Chao, “Overlapping Community Detection Using Non-Negative Matrix Factorization With Orthogonal and Sparseness Constraints,” IEEE Access, vol. 6, pp. 21266–21274, 2018, DOI: 10.1109/ACCESS.2017.2783542.
[10] W. Li, J. Xie, M. Xin, and J. Mo, “An overlapping network community partition algorithm based on semi-supervised matrix factorization and random walk,” Expert Systems with Applications, vol. 91, pp. 277–285, Jan. 2018, DOI: 10.1016/j.eswa.2017.09.007.
[11] Y. Li, K. He, D. Bindel, and J. E. Hopcroft, “Uncovering the Small Community Structure in Large Networks: A Local Spectral Approach,” in Proceedings of the 24th International Conference on World Wide Web - WWW ’15, Florence, Italy, 2015, pp. 658–668, DOI: 10.1145/2736277.2741676.
[12] K. He, Y. Sun, D. Bindel, J. Hopcroft, and Y. Li, “Detecting Overlapping Communities from Local Spectral Subspaces,” in 2015 IEEE International Conference on Data Mining, Atlantic City, NJ, USA, Nov. 2015, pp. 769–774, DOI: 10.1109/ICDM.2015.89.
[13] Z.Y. Zhu and C. Yuan, “Local Extension Class Overlapping Community Discovery Algorithm with H-index,” Journal of Chinese Computer Systems, vol. 40, no. 01, 2019, pp. 20–25.
[14] A. Lancichinetti, S. Fortunato, and J. Kertész, “Detecting the overlapping and hierarchical community structure in complex networks,” New J. Phys., vol. 11, no. 3, p. 033015, Mar. 2009, DOI: 10.1088/1367-2630/11/3/033015.
[15] H. Liu, L. Fen, J. Jian, and L. Chen, “Overlapping Community Discovery Algorithm Based on Hierarchical Agglomerative Clustering,” Int. J. Patt. Recogn. Artif. Intell., vol. 32, no. 03, p. 1850008, Mar. 2018, DOI: 10.1142/S0218001418500088.
[16] Yu Z., Chen J., Quo K., Chen Y., and Xu Q., “Overlapping Community Detection Based on Random Walk and Seeds Extension,” in Proceedings of the 12th Chinese Conference on Computer Supported Cooperative Work and Social Computing - ChineseCSCW ’17, Chongqing, China, 2017, pp. 18–24, DOI: 10.1145/3127404.3127412.
[17] V. Bhatia and R. Rani, “A distributed overlapping community detection model for large graphs using autoencoder,” Future Generation Computer Systems, vol. 94, pp. 16–26, May 2019, DOI: 10.1016/j.future.2018.10.045.
[18] S. Gregory, “Finding overlapping communities in networks by label propagation,” New J. Phys., vol. 12, no. 10, p. 103018, Oct. 2010, DOI: 10.1088/1367-2630/12/10/103018.
[19] J. Xie and B. K. Szymanski, “Towards Linear Time Overlapping Community Detection in Social Networks,” in Advances in Knowledge Discovery and Data Mining, vol. 7302, P.-N. Tan, S. Chawla, C. K. Ho, and J. Bailey, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 25–36.
[20] U. N. Raghavan, R. Albert, and S. Kumara, “Near linear time algorithm to detect community structures in large-scale networks,” Phys. Rev. E, vol. 76, no. 3, p. 036106, Sep. 2007, DOI: 10.1103/PhysRevE.76.036106.
[21] A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” p. 6.
[22] Z.-H. Deng, H.-H. Qiao, M.-Y. Gao, Q. Song, and L. Gao, “Complex network community detection method by improved density peaks model,” Physica A: Statistical Mechanics and its Applications, vol. 526, p. 121070, Jul. 2019, DOI: 10.1016/j.physa.2019.121070.
[23] B. Perozzi, R. Al-Rfou, and S. Skiena, “DeepWalk: online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’14, New York, New York, USA, 2014, pp. 701–710, DOI: 10.1145/2623330.2623732.
[24] G. Bello-Orgaz, S. Salcedo-Sanz, and D. Camacho, “A Multi-Objective Genetic Algorithm for overlapping community detection based on edge encoding,” Information Sciences, vol. 462, pp. 290–314, Sep. 2018, DOI: 10.1016/j.ins.2018.06.015.
[25] W. W. Zachary, “An Information Flow Model for Conflict and Fission in Small Groups,” Journal of Anthropological Research, vol. 33, no. 4, pp. 452–473, Dec. 1977, DOI: 10.1086/jar.33.4.3629752.
[26] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, “The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations,” Behavioral Ecology and Sociobiology, vol. 54, no. 4, pp. 396–405, Sep. 2003, DOI: 10.1007/s00265-003-0651-y.
[27] M. Girvan and M. E. J. Newman, “Community Structure in Social and Biological Networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.
[28] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Phys. Rev. E, vol. 69, no. 2, p. 026113, Feb. 2004, DOI: 10.1103/PhysRevE.69.026113.
[29] D. Knuth, The Stanford GraphBase. A platform for combinatorial computing. 1993.


[30] L. A. Adamic and N. Glance, “The Political Blogosphere and the 2004 US Election: Divided They Blog,” Proceedings of the 3rd International Workshop on Link Discovery, Chicago, 21-24 August 2005.
[31] A. Lancichinetti, S. Fortunato, and F. Radicchi, “Benchmark graphs for testing community detection algorithms,” Phys. Rev. E, vol. 78, no. 4, p. 046110, Oct. 2008, DOI: 10.1103/PhysRevE.78.046110.
[32] Q. Zhang, H. M. Chen, and Y. F. Feng, “Overlapping Community Detection Method Based on Rough Sets and Distance Dynamic Model,” Journal of Computer Science, vol. 47, no. 10, pp. 75–82, 2020.
[33] J. Ding, X. He, J. Yuan, Y. Chen, and B. Jiang, “Community detection by propagating the label of center,” Physica A: Statistical Mechanics and its Applications, vol. 503, pp. 675–686, Aug. 2018, DOI: 10.1016/j.physa.2018.02.174.
[34] Y. Zhang, B. Wu, and L. Yu, “A Novel Community Detection Method Based on Rough Set K-Means,” Journal of Electronics & Information Technology, vol. 39, no. 4, pp. 770–777, 2017.
[35] L. Huang, G. Wang, Y. Wang, W. Pang, and Q. Ma, “A link density clustering algorithm based on automatically selecting density peaks for overlapping community detection,” Int. J. Mod. Phys. B, vol. 30, no. 24, p. 1650167, Sep. 2016, DOI: 10.1142/S0217979216501678.
[36] M. Brutz and F. G. Meyer, “A flexible multiscale approach to overlapping community detection,” Soc. Netw. Anal. Min., vol. 5, no. 1, p. 23, Dec. 2015, DOI: 10.1007/s13278-015-0259-z.
[37] S. Gregory, “Finding overlapping communities in networks by label propagation,” New J. Phys., vol. 12, no. 10, p. 103018, Oct. 2010, DOI: 10.1088/1367-2630/12/10/103018.
[38] H. Shen, X. Cheng, K. Cai, and M.-B. Hu, “Detect overlapping and hierarchical community structure in networks,” Physica A: Statistical Mechanics and its Applications, vol. 388, no. 8, pp. 1706–1712, Apr. 2009, DOI: 10.1016/j.physa.2008.12.021.
[39] P. Zhang, “Evaluating accuracy of community detection using the relative normalized mutual information,” J. Stat. Mech., vol. 2015, no. 11, p. P11006, Nov. 2015, DOI: 10.1088/1742-5468/2015/11/P11006.

HONGTAO LIU received his Master's degree in Computer Application Technology from Chongqing China Normal University in 2001, and his Doctor's degree in artificial intelligence from Chongqing China Southwest University in 2004. His research interests include natural language processing, social networking and swarm intelligence. He is now an associate professor in the School of Computer Science, Chongqing University of Posts and Telecommunications.

GEGE LI received a B.S. degree in computer science and technology in 2018. She is currently a graduate student at Chongqing University of Posts and Telecommunications. Her research interests include machine learning and social networks.
