Physica A 570 (2021) 125783


Biased random walk with restart for link prediction with graph
embedding method

Yinzuo Zhou , Chencheng Wu, Lulu Tan
Alibaba Business School Research Center for Complexity Sciences, Hangzhou Normal University, Hangzhou 311121, China

Article history: Received 12 January 2020; Received in revised form 12 November 2020; Available online 27 January 2021.

Keywords: Link prediction; Graph embedding method; Random walk with restart; Biased transferring probability

Abstract

Link prediction is an important problem in the study of complex networks, with many practical applications such as information retrieval and marketing analysis. Strategies based on random walks are commonly used to address this problem. In the common practice of a random walk, a link predictor moves from one node to one of its neighbors with uniform transferring probability, regardless of the characteristics of the local structure around that node, which may nevertheless contain useful information for a successful prediction. In this paper, we propose a refined random walk approach that incorporates a graph embedding method. This approach provides biased transferring probabilities for the random walk so as to further exploit the topological properties embedded in the network structure. The performance of the proposed method is examined by comparison with other commonly used indexes. Results show that our method outperforms all these indexes in prediction accuracy.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

Complex networks, as an emerging interdisciplinary subject, act as an increasingly important means of tackling various problems in real-world systems [1,2]. Among these, a valuable application of complex network theory is link prediction, which shows great power in restoring missing information and mining potential structure in networks [3,4]. Traditional link prediction can be regarded as a machine learning problem, in which finding a solution is converted into a classification problem [5,6]. However, the classification method relies largely on the attributes of nodes, and attribute values may be difficult to obtain, which in turn damages the effectiveness of the method. To avoid this issue, link prediction methods based mainly on network structure have been developed. Network structure has the advantage of being relatively easy to access, and the results can be more reliable than those derived from node attributes, since the former only needs a binary score while the latter requires accurate values in a continuous range, easily smeared by noise.
Link prediction algorithms based on network structure fall into three dominant types: (i) similarity indexes based on local information (LOCAL) [7–10]; (ii) similarity indexes based on paths (PATH) [11,12]; and (iii) similarity indexes based on random walks (RW) [13,14]. LOCAL refers to similarity indexes calculated from local information of nodes (e.g., degree, number of common neighbors, etc.). The computational complexity of LOCAL is relatively low, but at the cost of low prediction accuracy. PATH refers to similarity indexes calculated from the path information between the pair of nodes to be predicted (e.g., the number of paths between the nodes, information on intermediate nodes along the paths, etc.), which entails high computational complexity since this approach concerns global information of the network. RW is defined

∗ Corresponding author.
E-mail address: zhouyinzuo@163.com (Y. Zhou).

https://doi.org/10.1016/j.physa.2021.125783
0378-4371/© 2021 Elsevier B.V. All rights reserved.

based on the process of a random walk: a particle/predictor starts from an initial node and walks randomly to neighboring nodes with certain probabilities. This process continues until a stable probability distribution of the particle over the nodes is reached. The RW index offers a good balance between computational complexity and prediction quality, and has therefore been widely applied to recommender systems and community identification [15].
This advantage allows random walks to serve as a primary method for link prediction, and many achievements have been obtained accordingly. A typical example is the PageRank algorithm [16], in which the random walk method plays the key role. Moreover, Li et al. proposed a link prediction algorithm based on a maximum entropy random walk [17]. Liu et al. used the node representation vectors obtained by DeepWalk to define nodal similarity via the Euclidean distance between them, and then proposed a link prediction algorithm based on the representation vectors [18]. Jin et al. proposed a supervised and extended random walk with restart algorithm, in which each node has its own restart probability [19]. Lu et al. proposed a biased random walk whose restart probability is obtained from the nodal degree [20]. Curado introduced a return random walk method to address the situation in which inter-class links are rare events [21].
Most random-walk-based methods use a uniform distribution to define the transferring probability when a particle moves to a neighboring node [14]. This setting considers only the nodal degree and ignores the subtle structure of the local region. A better prediction can thus be expected when the local characteristics of the network are taken into consideration. Indeed, it has been shown that certain key nodes can play a special role in link prediction [22]. By considering the local structure of the network, the uniformity of the transferring probability may be broken, making the random walk a biased one. In this paper, we propose a biased random walk with restart for the problem of link prediction. The performance of the approach is systematically examined by comparison with other commonly used methods; the results show that it outperforms the others with higher accuracy. Moreover, since local structure can be interpreted from different angles or with different preferences, Graph Embedding Methods (GEMs) [23–27] are applied within our approach. A GEM can process the local structure according to different purposes and transfer the structural information into vector form, convenient for further processing.
To be specific, in this paper we propose an approach based on a biased random walk with restart to achieve high link prediction accuracy. The approach targets networks in which nodes tend to connect to others with similar structure. It is optimized with a GEM that transfers the structure of the original network into compact representation vectors upon which the random walk is implemented. The combination of these two efforts produces our target method, referred to as Graph Embedding Biased Random Walk with Restart (GEBRWR).
The remainder of this paper is organized as follows: in Section 2, related works and methods are reviewed, including classical algorithms and common methods for link prediction; in Section 3, the GEBRWR is introduced; in Section 4, the performance of the GEBRWR is examined on empirical datasets, along with other popular methods for comparison; in Section 5, conclusions and perspectives on future work are provided.

2. Related works

In this section, five classical link prediction indexes, which will be used for comparison in later sections, as well as the graph embedding method, are introduced. We note that all of the indexes below are measures of the similarity between pairs of nodes.

2.1. Classical indexes of link prediction

• Common Neighbors (CN) [7]


For a node x in the network, the set of its neighbors is denoted Γ(x). The similarity between two nodes x and y is then defined as the number of their common neighbors,

S_{xy} = |\Gamma(x) \cap \Gamma(y)|.  (1)

This definition can be extended to weighted networks, producing the weighted CN index

S_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{w_{xz} + w_{zy}}{2},  (2)

where node z is a common neighbor of nodes x and y, and w_{xz} (w_{zy}) is the weight of the link connecting node x (y) and z.
• Adamic–Adar (AA) [8]
The AA index considers that the impact of a common neighbor with smaller degree is greater than that of one with larger degree. The impact of each common neighbor is weighted by the inverse of the logarithm of its degree, and the index reads

S_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k_z}.  (3)

For weighted networks, the AA index can be extended as

S_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{w_{xz} + w_{zy}}{2 \log(1 + s_z)},  (4)

where s_z, the strength of node z, is the sum of the weights of the links connected to it, and w_{xz} and w_{zy} are defined as in CN.
• Resource Allocation (RA) [9]
The RA index considers the situation in which two nodes x and y, not directly connected, transfer resources through their common neighbors. Suppose each common neighbor holds a unit of resource obtained from x and distributes it equally among all its neighbors; the amount of resource received by y then defines the similarity between x and y,

S_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}.  (5)

For weighted networks, the RA index becomes

S_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{w_{xz} + w_{zy}}{2 s_z},  (6)

with s_z, w_{xz} and w_{zy} defined as above.
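To make the three local indexes concrete, the sketch below computes CN, AA and RA (Eqs. (1), (3) and (5)) on a small hypothetical unweighted graph; the adjacency dict and node labels are illustrative only, and the AA index assumes every common neighbor has degree at least 2 so that log k_z > 0.

```python
import math

# Toy undirected network as an adjacency dict (hypothetical example graph).
adj = {
    1: {2, 3, 4},
    2: {1, 3},
    3: {1, 2, 4},
    4: {1, 3},
}

def common_neighbors(x, y):
    return adj[x] & adj[y]

def cn_index(x, y):          # Eq. (1): number of common neighbors
    return len(common_neighbors(x, y))

def aa_index(x, y):          # Eq. (3): inverse-log-degree weighting
    return sum(1.0 / math.log(len(adj[z])) for z in common_neighbors(x, y))

def ra_index(x, y):          # Eq. (5): resource spread equally over k_z neighbors
    return sum(1.0 / len(adj[z]) for z in common_neighbors(x, y))

print(cn_index(2, 4))        # nodes 2 and 4 share neighbors {1, 3} -> 2
```

On this toy graph the three indexes rank the pair (2, 4) by the same common-neighbor set but weight its members differently, which is the whole distinction between the three definitions.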


• Preferential Attachment (PA) [10]
For scale-free networks generated by the preferential attachment mechanism, the probability that a newly added link connects to an existing node is proportional to the degree of that node. Thus, the similarity between two nodes in this scenario can be defined as

S_{xy} = k_x k_y,  (7)

where k_x (k_y) is the degree of node x (y). For weighted networks, the weighted PA index is defined as

S_{xy} = \sum_{i \in \Gamma(x)} w_{ix} \times \sum_{j \in \Gamma(y)} w_{jy},  (8)

where w_{ix} (w_{jy}) is the weight of the link connecting node i (j) to node x (y).
• Random Walk with Restart (RWR) [14]
This index is developed from the PageRank algorithm. Here, ''restart'' means that a particle performing a random walk may return to its initial position with a certain probability at each step. The Markovian transferring matrix P of a random walk can be expressed as

p_{xy} = \frac{a_{xy}}{k_x},  (9)

where p_{xy} and a_{xy} are elements of the transferring matrix P and the adjacency matrix A, and k_x is the degree of node x. Suppose the return probability equals 1 − α and a particle starts at node x. The evolution equation of the probability vector π_x(t) of this particle, whose elements give the probability that the particle starting from x appears at each node of the network at time t, is

\pi_x(t+1) = \alpha P^T \pi_x(t) + (1-\alpha) e_x,  (10)

where the vector e_x describes the initial state, with the xth element equal to 1 and all others zero. The steady-state solution is

\pi_x = (1-\alpha)(I - \alpha P^T)^{-1} e_x,  (11)

where π_x is the final stable probability vector, i.e., π_x ≡ π_x(t)|_{t→∞}. The similarity between nodes under this index is then defined as

S^{RWR}_{xy} = \pi_{xy} + \pi_{yx},  (12)

where π_{xy} denotes the yth element of π_x.

For weighted networks, the elements of the Markovian transferring matrix of a weighted random walk can be written as

p_{xy} = \frac{w_{xy}}{s_x},  (13)

and the index for weighted networks is then defined analogously.

Fig. 1. CBOW model (left), which predicts the middle words of a context from the remaining words; and Skip-Gram model (right), which takes the middle words as input to predict the remainder of the context.

2.2. Evaluation index

To verify the effectiveness of an algorithm, the known link set E is divided into a training set E^T and a test set E^P. The training set E^T is used to calculate the similarity scores between nodes. The test set E^P acts as the part to be predicted and is used to evaluate the quality of prediction. Here, E = E^T ∪ E^P and E^T ∩ E^P = ∅. Let U be the complete set composed of N(N − 1)/2 pairs of nodes; the links that belong to U but not to E are called nonexistent links, and the links that belong to U but not to E^T are called unknown links.
The area under the receiver operating characteristic curve (AUC) [28] is the most commonly used index for measuring the accuracy of a link prediction approach. It is the probability that the score of a randomly selected link in the test set is higher than that of a randomly selected nonexistent link. In an experiment, one link is randomly selected from the test set and another from the nonexistent links each time; if the score of the test link is greater than that of the nonexistent link, 1 point is added, and if the two scores are equal, 0.5 point is added. In this way, n independent comparisons are made. If the test-set link scores higher n′ times and the two scores are equal n′′ times, the AUC index is defined as

AUC = \frac{n' + 0.5\, n''}{n}.  (14)

Generally, the closer the AUC is to 1, the more accurate the algorithm.
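The sampling procedure behind Eq. (14) can be sketched as follows; the score dictionary and link lists are hypothetical placeholders for a real predictor's output.

```python
import random

def auc_score(score, test_links, nonexistent_links, n=10000, seed=0):
    """Monte-Carlo estimate of Eq. (14): in each of n trials, draw one
    test link and one nonexistent link and compare their scores."""
    rng = random.Random(seed)
    hits = 0.0
    for _ in range(n):
        s_test = score[rng.choice(test_links)]
        s_none = score[rng.choice(nonexistent_links)]
        if s_test > s_none:
            hits += 1.0          # the n' case
        elif s_test == s_none:
            hits += 0.5          # the n'' case
    return hits / n

# Hypothetical scores: every test link outscores every nonexistent link,
# so the estimate is exactly 1.
score = {("a", "b"): 0.9, ("c", "d"): 0.8, ("e", "f"): 0.1, ("g", "h"): 0.1}
print(auc_score(score, [("a", "b"), ("c", "d")], [("e", "f"), ("g", "h")]))  # -> 1.0
```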

2.3. Graph embedding method

In this part, three graph embedding methods will be introduced.

• Word2vec
Word2vec was developed by Google in 2013 to train word vectors; it includes the Continuous Bag of Words (CBOW) model and the Skip-Gram model, trained with SGD [29]. The former predicts the middle words of a sentence from the remaining words, while the latter, on the contrary, takes the middle words as input to predict the remainder of the context. Each model is a three-layer neural network with an input layer, a projection layer and an output layer, as shown in Fig. 1.
• Node2vec
Node2vec combines Depth First Search (DFS) and Breadth First Search (BFS). Given that a particle is currently at node v, the probability of it moving to node x is

P(c_i = x \mid c_{i-1} = v) = \begin{cases} \pi_{vx}/Z & \text{if } (v, x) \in E \\ 0 & \text{otherwise} \end{cases},  (15)

where π_{vx} is the unnormalized transferring probability between nodes v and x, and Z is the normalization factor. Node2vec has two parameters, p and q, that control the random walk. Suppose the particle has just moved from node t to node v through the link (t, v) in the last time step; to determine the next step, let π_{vx} = α_{pq}(t, x) · w_{vx}, with w_{vx} the weight of the link connecting nodes v and x, and

\alpha_{pq}(t, x) = \begin{cases} 1/p & \text{if } d_{tx} = 0 \\ 1 & \text{if } d_{tx} = 1 \\ 1/q & \text{if } d_{tx} = 2 \end{cases},  (16)

where d_{tx} is the shortest distance between nodes t and x. When d_{tx} = 0, nodes t and x coincide; when d_{tx} = 1, they are neighbors; and when d_{tx} = 2, they are not directly connected but share a common neighbor. The parameter p controls the probability of returning to node t, the node the particle just came from: a larger p corresponds to a lower return probability. The parameter q controls whether the random walk moves inward or outward with respect to node t. Since d_{tx} = 2 means t and x are not connected, moving to x takes the particle farther from (therefore outward of) node t. Hence, when q > 1 the particle tends to walk inward and visit nodes close to node t, which favors BFS; when q < 1 it tends to walk outward, away from node t, which favors DFS.
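A minimal sketch of the second-order bias of Eqs. (15) and (16), assuming an unweighted graph (so w_{vx} = 1) stored as an adjacency dict; the node labels and parameter values are illustrative.

```python
def alpha_pq(adj, t, x, p, q):
    """The bias of Eq. (16) for a candidate next node x,
    given that the walk came from node t."""
    if x == t:                   # d_tx = 0: step back to the previous node
        return 1.0 / p
    if x in adj[t]:              # d_tx = 1: x is also a neighbor of t
        return 1.0
    return 1.0 / q               # d_tx = 2: x moves the walk away from t

def transition_probs(adj, t, v, p, q):
    """Unnormalized pi_vx = alpha_pq(t, x) for each neighbor x of the
    current node v, normalized as in Eq. (15)."""
    w = {x: alpha_pq(adj, t, x, p, q) for x in adj[v]}
    z = sum(w.values())
    return {x: wx / z for x, wx in w.items()}

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}
# The walk just moved 1 -> 2; biases for the next step from node 2:
print(transition_probs(adj, t=1, v=2, p=2.0, q=0.5))
```

With p = 2 and q = 0.5 the walk is discouraged from returning to node 1 and encouraged to push outward to node 4, illustrating the DFS-leaning regime q < 1.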
• Struc2vec
The key part of struc2vec is a biased random walk based on local structural similarity. The biased random walk in this method can even account for nodes that are neither connected to nor share common neighbors with the objective node; in other words, the impact of nodes far away from the objective node can also be evaluated through the similarity of their respective local topologies. The biased random walk in struc2vec is realized as follows. Denote by R_k(u) the set of nodes whose shortest distance to node u equals k; hence R_1(u) is the set of neighbors of node u. Denote by s(S) the ordered degree sequence (say, in ascending order) of the nodes in a set S. The local topology of a node u can then be described by the ordered degree sequences of the nodes at each shortest distance from it, i.e., s(R_k(u)) with k = 1, ..., k*, where k* is the diameter of the network. The similarity of the local topologies of nodes u and v is described by f_k(u, v), which satisfies the recurrence

f_k(u, v) = f_{k-1}(u, v) + g(s(R_k(u)), s(R_k(v))),  (17)

with f_{-1} = 0. The function g scores the comparison of two sequences via Dynamic Time Warping (DTW) [30]. The key step in DTW is measuring the distance between corresponding pairs of elements of the two sequences, defined as

d(a, b) = \frac{\max(a, b)}{\min(a, b)} - 1,  (18)

where a and b are degrees from the sequences s(R_k(u)) and s(R_k(v)), respectively. We remark that a higher similarity of the two sequences gives a smaller score g, with g = 0 when the two sequences match exactly. Thus, a smaller f indicates higher similarity.
After obtaining the similarity of local structure at each distance k between all pairs of nodes, a multilayer weighted network is constructed in which the weight of the link between u and v in layer k is

w_k(u, v) = e^{-f_k(u, v)},  (19)

with k = 1, ..., k*. Note that in this multilayer network each layer is a complete graph connected by weighted links. It is equivalent to a multiplex network in which each node of the original network has a replica in every layer. Clearly, the upper bound of the link weights in the multilayer network is 1, reached only when the corresponding degree sequences of the two nodes match perfectly. The weights defined in Eq. (19) directly determine a biased random walk within each layer. Inter-layer random walks are allowed as well, but only between the replicas of the same original node in consecutive layers. The weights of the inter-layer links (connecting the replicas of the same node) are defined as

w(u_k, u_{k+1}) = \log(\Gamma_k(u) + 1), \quad w(u_k, u_{k-1}) = 1,  (20)

where k = 1, ..., k* − 1 and Γ_k(u) is the number of links connecting node u within layer k whose weight is greater than the average link weight of that layer. Eq. (20) indicates that the links connecting the replicas of a node in consecutive layers are bidirectional: those pointing from layer k to layer k − 1 have unit weight, while the weight of those pointing from layer k to layer k + 1 depends on the link weights in layer k. With the inter-layer links weighted in this way, a biased random walk between layers is also established.
With the intra- and inter-layer weighted links defined in Eqs. (19) and (20), a biased random walk can be defined accordingly. For an intra-layer step, the probability of moving from node u to node v in layer k is

p_k(u, v) = \frac{e^{-f_k(u, v)}}{Z_k(u)},  (21)

with Z_k(u) = \sum_{v \in V, v \neq u} e^{-f_k(u, v)} the normalization factor. For an inter-layer step, the probability that node u in layer k moves to its replica in layer k + 1 is

p_k(u_k, u_{k+1}) = \frac{w(u_k, u_{k+1})}{w(u_k, u_{k+1}) + w(u_k, u_{k-1})},  (22)

and the probability of moving to its replica in layer k − 1 is p_k(u_k, u_{k-1}) = 1 − p_k(u_k, u_{k+1}).
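The recurrence of Eq. (17) with the DTW element cost of Eq. (18) can be sketched as below. This is a plain dynamic-programming DTW rather than the optimized variant of the struc2vec reference implementation, and the ring degree sequences are hypothetical.

```python
def dtw(seq_a, seq_b):
    """Classic DTW with the element cost d(a, b) = max(a, b)/min(a, b) - 1
    of Eq. (18); degrees are positive, so min(a, b) > 0."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = max(seq_a[i - 1], seq_b[j - 1]) / min(seq_a[i - 1], seq_b[j - 1]) - 1
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def structural_distance(ring_degrees_u, ring_degrees_v):
    """Accumulate f_k over rings k (Eq. (17)); ring_degrees_*[k] is the
    sorted degree sequence s(R_k(.)) of the corresponding node."""
    f, fs = 0.0, []
    for s_u, s_v in zip(ring_degrees_u, ring_degrees_v):
        f += dtw(s_u, s_v)      # g(s(R_k(u)), s(R_k(v)))
        fs.append(f)
    return fs

# Identical ring degree sequences give f_k = 0 at every k, hence the
# maximal layer weight w_k = e^0 = 1 of Eq. (19).
print(structural_distance([[2, 3], [1, 4]], [[2, 3], [1, 4]]))  # -> [0.0, 0.0]
```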

Algorithm 1 Realization of BRWR with input from a GEM

Input: adjacency matrix of the network M = [a_{ij}], restart factor α, strength of random walk γ
Output: similarity matrix of nodes S = [s_{ij}]
1: Initialize the transferring probability matrix P and the similarity matrix S;
2: Calculate the transferring probability between each pair of nodes and update P;
3: for x = 1 to N do
4:     iterate π_x(t + 1) = αP^T π_x(t) + (1 − α)e_x until convergence, i.e., π_x = (1 − α)(I − αP^T)^{-1} e_x;
5:     update S with s_{xy} = π_{xy} + π_{yx};
6: return S.

As mentioned above, the node sequences sampled from the biased random walk serve as the input training data to be learned by models such as CBOW and Skip-Gram. The output of these models is the set of representation vectors of the nodes, which contain digested information about their local structure and are further processed in the next step, introduced in the next section.

3. Biased random walk with restart based on graph embedding

In this approach, an appropriate graph embedding method (introduced in the last section) is applied: the nodes and edges of the original network are fed in to train the model, yielding a representation vector of the nodal information (of low dimension compared to the original representation). During training, the local topological features of the nodes are refined continuously to obtain the optimal vector representation. The resulting vectors have the property that similarity of the nodes in the vector space well represents the similarity of their local structure in the original network.
Suppose the vector of node x (y) obtained from a graph embedding method is φ(x) = [x_1, x_2, ..., x_d] (φ(y) = [y_1, y_2, ..., y_d]). We adopt the common cosine similarity as a preliminary index to quantify the similarity between the two node vectors, which may suggest the potential topological similarity between the nodes. Specifically, the topological similarity between nodes x and y is defined as

CosSim(x, y) = w_{xy} \cdot \cos(\varphi(x), \varphi(y)) = w_{xy} \cdot \frac{\varphi(x) \cdot \varphi(y)}{\|\varphi(x)\| \times \|\varphi(y)\|},  (23)

where w_{xy} is the weight of the link connecting x and y in the original network. For an unweighted network, w_{xy} degenerates to a binary value, 0 or 1, depending on whether the connection exists. In the following, we improve this index with a biased random walk with restart (BRWR), and the improved index will show its superiority over other commonly used baseline methods. The key to the BRWR is its transferring probability matrix P, defined with the help of the cosine similarity score (23) as

p_{xy} = \gamma \cdot \frac{CosSim(x, y)}{\sum_{z \in \Gamma(x)} CosSim(x, z)},  (24)

where Γ(x) is the set of neighbors of node x, the coefficient γ is a rate controlling the strength of the random walk, such that at each time step a particle takes a step with probability γ and rests on its current node otherwise, and p_{xy} is an element of the matrix P. After obtaining P, the rest of the BRWR follows the standard procedure of Eq. (10). Details of the implementation are provided in Algorithm 1, and an outline of the main procedure for acquiring the transferring probability matrix P is shown in Fig. 2.
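Putting Eqs. (23), (24) and (10) together, a sketch of the pipeline after the embedding step might look as follows. This is not the authors' implementation: the toy graph and embedding vectors are hypothetical placeholders for GEM output, the cosine similarities are assumed non-negative, and resting with probability 1 − γ is realized as a self-loop so that each row of P remains a probability distribution.

```python
import numpy as np

def gebrwr_similarity(A, phi, alpha=0.9, gamma=0.8, iters=200):
    """Biased RWR similarity from embedding vectors (sketch)."""
    n = A.shape[0]
    norms = np.linalg.norm(phi, axis=1)
    cos = (phi @ phi.T) / np.outer(norms, norms)     # cos(phi(x), phi(y))
    cos_sim = A * cos                                # Eq. (23), w_xy in {0, 1}
    row = cos_sim.sum(axis=1, keepdims=True)
    P = gamma * cos_sim / row                        # Eq. (24): walk w.p. gamma
    P += np.eye(n) * (1 - gamma)                     # rest on the node w.p. 1 - gamma
    Pi = np.zeros((n, n))
    E = np.eye(n)
    for _ in range(iters):                           # Eq. (10), iterated to convergence
        Pi = alpha * P.T @ Pi + (1 - alpha) * E      # column x of Pi -> pi_x
    return Pi + Pi.T                                 # Eq. (12): S_xy = pi_xy + pi_yx

# Hypothetical 4-node graph and 2-d embeddings (placeholders for GEM output).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
phi = np.array([[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.1, 0.9]])
S = gebrwr_similarity(A, phi)
```

After iteration, unconnected node pairs with high scores in S are the candidates predicted to form links.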

4. Results and analysis

In this section, we will validate our algorithm with real empirical datasets and compare the results with the baseline
methods introduced above.

4.1. Experimental datasets

Empirical datasets (available at the websites below1,2) are divided into two groups: unweighted datasets and weighted datasets.
Unweighted datasets include:

1 http://konect.uni-koblenz.de.
2 https://github.com/GinToCi/ComplexNetworkDataSet.


Fig. 2. Diagram of main steps of GEBRWR.

• Jazz musicians partnership network (Jazz)
Generated from collaborations between jazz musicians, where nodes represent musicians and links connect pairs of musicians who have collaborated;
• Florida food chain network (FWFW)
The food chain network of the Florida Gulf during the rainy season, where nodes represent organisms and links represent predatory relationships;
• Metabolic network of nematodes (Metabolic)
Nodes represent the metabolites of Caenorhabditis elegans, and links represent the enzyme-catalyzed biochemical reactions between metabolites;
• US preferred route network (ATC)
Constructed from the preferred route database of the National Flight Data Center (NFDC) of the Federal Aviation Administration (FAA). Nodes represent airports or service centers, and links represent preferred routes.

Weighted datasets include:

• US Airline Network (USAir)
Each node corresponds to an airport, each link indicates a direct flight between two airports, and the weight indicates the flight frequency between them;
• Adolescent Health (Health)
The network was created based on a 1994/95 survey in which each student was asked to list five of his or her best female and male friends. Each node represents a student, each link indicates that one student selected another as a friend, and the weight indicates the strength of the relation between the two students;

Table 1
Topological characteristics of empirical networks.
Networks N M ⟨k⟩ ⟨c ⟩ D
Jazz 198 2742 27.7 0.618 6
FWFW 128 2137 16.7 0.335 2.9
Metabolic 453 2025 8.94 0.647 7
ATC 1226 2615 4.2659 0.639 17
USAir 332 2126 12.81 0.749 6
Health 2539 12 969 10.216 0.142 10
TCM 185 1446 15.41 0.7458 5

Table 2
Performance of GEBRWR with Node2vec and Struc2vec, denoted as GEBRWR_N2V and GEBRWR_S2V, respectively as well as five
baseline indexes.
Networks CN AA RA PA RWR GEBRWR_N2V GEBRWR_S2V
Jazz 0.952275 0.95922 0.96927 0.769495 0.93684 0.9642 0.95227
FWFW 0.61021 0.611245 0.620545 0.7252 0.74592 0.75312 0.75101
Metabolic 0.933855 0.96194 0.967735 0.827745 0.96252 0.97125 0.97418
ATC 0.62969 0.629495 0.631475 0.698655 0.91672 0.91931 0.91119
USAir 0.919015 0.937815 0.94706 0.89887 0.93153 0.95351 0.94801
Health 0.774105 0.77445 0.77129 0.60944 0.90297 0.90818 0.90533
TCM 0.902855 0.924915 0.948725 0.84108 0.88328 0.95316 0.94335

• Traditional Chinese Medicinal Formula Network (TCM)
Generated from traditional Chinese medicines collected from two books, the Classic Treatise on Febrile and Miscellaneous Diseases and the Synopsis of the Golden Chamber. Nodes represent medicinal materials, links represent the compatibility of two medicinal materials, and weights represent the frequency with which the two materials are used in the same formula.

Table 1 lists the network topology features of the datasets, where N represents the number of nodes, M the number
of connected links, ⟨k⟩ the average degree, ⟨c ⟩ the average clustering coefficient, and D the network diameter.

4.2. Experimental procedure and results

The seven networks from different fields listed in Table 1 are each divided into two parts, a training set and a test set, in the ratio 9 : 1. Our algorithm and the baseline methods are applied to the training sets, and the corresponding similarity scores are obtained. To evaluate an approach, we first calculate the similarity scores between nodes and then use the AUC to quantify the accuracy of the method for link prediction. Since our approach uses a GEM to transfer the structural information of the nodes of the original network into compact representation vectors, we adopt two classical GEMs to pre-process the original networks, namely Node2vec and Struc2vec, as introduced above. With the representation vectors generated by the GEMs, the similarity score between each pair of nodes is calculated with our GEBRWR approach and the link prediction is made accordingly. In applying GEBRWR, we set the restart coefficient α = 0.9, a typical choice in random walk methods [14]. As to the choice of γ, we tested its impact over the range from 0.1 to 1, as shown in Figs. 3 and 4. The results show that the optimal γ occurs at different values for different networks; the optimal γ collected from these tests is then used as the input for the random walk in our approach. Table 2 presents the performance of our approach with the two GEMs (Node2vec and Struc2vec), denoted GEBRWR_N2V and GEBRWR_S2V, along with the five baseline methods.
The GEBRWR algorithm clearly performs best, as manifested by its leading scores over all the datasets. While each of the five baseline methods may score close to our approach on certain networks, they also show evident deficits on others. This demonstrates the effectiveness of our approach on empirical networks, with a consistent advantage over a wide spectrum of networks, whereas the baseline methods perform well only on specific ones. Furthermore, we remark that although the scores of GEBRWR_N2V are higher than those of GEBRWR_S2V in most cases, their differences are consistently less than 1%, suggesting that GEBRWR does not depend strongly on the specific choice of GEM. Thus, our method benefits from the efficiency of GEMs while not being restricted to any particular one.
As mentioned above, a biased random walk approach is expected to outperform a pure random walk when the information of the local structure is properly utilized. Here, we validate this claim by comparing their performance over the datasets. Table 3 presents the improvement in AUC score of our GEBRWR method over the RWR method [14]. The results clearly exhibit the superiority of GEBRWR, with the improvement on TCM reaching about 7%. This implies that our method effectively exploits the information embedded in the local structure of networks.

Table 3
The percentage of improvement of GEBRWR as compared to RWR.
Networks Jazz FWFW Metabolic ATC USAir Health TCM Average
GEBRWR% 2.74% 0.72% 0.87% 0.26% 2.20% 0.52% 6.99% 2.04%

Fig. 3. Influence of dynamic coefficient γ on AUC in unweighted networks.

Fig. 4. Influence of dynamic coefficient γ on AUC in weighted networks.

5. Conclusions

In link prediction approaches based on random walks, the pure random walk scheme, in which a particle moves to each of its neighbors with equal probability, has been dominant in previous works. This scheme, however, ignores the detailed structure of the network, leaving the relevant algorithms significant room for improvement. In this paper, by considering the local structure of networks, we propose a biased random walk for link prediction. Moreover, we incorporate a graph embedding method into our approach, which transfers the local structure of the original network into compact vector representations for further processing. Experiments on seven empirical datasets, ranging from biological to social networks and covering both weighted and unweighted connections, show that the proposed method has a decisive advantage over five common baseline methods on these datasets. Our work demonstrates that the combination of graph embedding methods and biased random walks has great potential for link prediction. We hope our approach will inspire future works to exploit the information embedded in the local structure of networks and thereby provide better solutions for link prediction.

CRediT authorship contribution statement

Yinzuo Zhou: Conceptualization, Supervision, Methodology, Investigation, Writing - original draft. Chencheng Wu:
Data curation, Software, Investigation, Writing - review & editing. Lulu Tan: Data curation, Software.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Entrepreneurship and Innovation Project of High Level Returned Overseas Scholars in
Hangzhou, China.

References

[1] L. Lu, T. Zhou, Link prediction in complex networks: a survey, Physica A 390 (6) (2011) 1150–1170.
[2] Q. Zhang, M. Shang, L. Lu, Similarity-based classification in partially labeled networks, Int. J. Modern Phys. C 21 (6) (2010) 813.
[3] M. Ahn, W. Jung, Accuracy test for link prediction in terms of similarity index: the case of WS and BA models, Physica A 429 (2015) 177–183.
[4] M. Hoffman, D. Steinley, M. Brusco, A note on using the adjusted rand index for link prediction in networks, Social Networks 42 (2015) 72–79.
[5] R. Sarukkai, Link prediction and path analysis using Markov chains, Comput. Netw. 33 (1) (2000) 377–386.
[6] A. Popescul, L. Ungar, Statistical relational learning for link prediction, in: Proceedings of the Workshop on Learning Statistical Models from
Relational Data at IJCAI-2003, 2003, pp. 81–87.
[7] M. Newman, Clustering and preferential attachment in growing networks, Phys. Rev. E 64 (2 Pt 2) (2001) 025102.
[8] L. Adamic, E. Adar, Friends and neighbors on the web, Social Networks 25 (3) (2003) 211–230.
[9] T. Zhou, L. Lu, Y. Zhang, Predicting missing links via local information, Eur. Phys. J. B 71 (4) (2009) 623–630.
[10] A. Barabasi, R. Albert, Emergence of scaling in random networks, Science 286 (5439) (1999) 509–512.
[11] L. Lu, C. Jin, T. Zhou, Similarity index based on local paths for link prediction of complex networks, Phys. Rev. E 80 (4) (2009) 046122.
[12] L. Katz, A new status index derived from sociometric analysis, Psychometrika 18 (1) (1953) 39–43.
[13] D. Klein, M. Randic, Resistance distance, J. Math. Chem. 12 (1) (1993) 81–95.
[14] H. Tong, C. Faloutsos, J. Pan, Fast random walk with restart and its applications, in: Proceedings of the 6th International Conference on Data
Mining, IEEE, Piscataway, NJ, 2006, pp. 613–622.
[15] X. Fu, C. Wang, Z. Wang, Z. Ming, Scalable community discovery based on threshold random walk, J. Comput. Inf. Syst. 8 (21) (2012) 8953–8960.
[16] H. Nassar, A. Benson, D. Gleich, Neighborhood and pagerank methods for pairwise link prediction, Soc. Netw. Anal. Min. 10 (1) (2020).
[17] R. Li, J. Yu, J. Liu, Link prediction: the power of maximal entropy random walk, in: Proceedings of the 20th ACM Conference on Information
and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October 24-28, ACM, 2011, pp. 24–28.
[18] L. Liu, H. Liu, Q. Chen, C. He, Prediction algorithm based on network representation learning and random walk, J. Comput. Appl. 37 (8) (2017)
2234–2239.
[19] W. Jin, J. Jung, U. Kang, Supervised and extended restart in random walks for ranking and link prediction in networks, PLoS One 14 (3) (2019).
[20] Y. Lu, H. Han, C. Jia, Q. Qu, Link prediction algorithm based on biased restart random walk, Complex Syst. Complexity Sci. 15 (4) (2018) 17–24.
[21] M. Curado, Return random walks for link prediction, Inform. Sci. 510 (2020) 99–107.
[22] W. Liu, L. Lu, Link prediction based on local random walk, Europhys. Lett. 89 (5) (2010) 58007.
[23] T. Mikolov, I. Sutskever, K. Chen, et al., Distributed representations of words and phrases and their compositionality, in: Proceedings of the
26th International Conference on Neural Information Processing Systems, Vol. 2, NIPS ’13, Curran Associates, North Miami Beach, FL, 2013, pp.
3111–3119.
[24] B. Perozzi, R. Alrfou, S. Skiena, Deepwalk: online learning of social representations, in: Proceedings of the 20th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, ACMs, New York, 2014, pp. 701–710.
[25] J. Tang, M. Qu, M. Wang, et al., LINE: large-scale information network embedding, in: WWW ’15: Proceedings of the 24th International
Conference on World Wide Web, International World Wide Web Conferences Steering Committee, Geneva, Switzerland, 2015, pp. 1067–1077.
[26] A. Grover, J. Leskovec, Node2vec: scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, KDD ’16, ACM, New York, 2016, pp. 855–864.
[27] L. Ribeiro, P. Savarese, D. Figueiredo, Struc2vec: learning node representations from structural identity, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, ACM, New York, 2017, pp. 385–394.
[28] J. Hanley, B. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143 (1) (1982) 29–36.
[29] M. Curado, F. Escolano, M.A. Lozano, E.R. Hancock, net4Lap: Neural Laplacian regularization for ranking and re-ranking, in: 2018 24th
International Conference on Pattern Recognition, ICPR, IEEE, 2018, pp. 1366–1371.
[30] M. Muller, Dynamic time warping, in: Information Retrieval for Music and Motion, 2007, pp. 69–84.