
Fast Network Community Detection with Profile-Pseudo Likelihood Methods

Jiangzhou Wang^{1,2}, Jingfei Zhang^{3}, Binghui Liu^{1}, Ji Zhu^{4}, and Jianhua Guo^{1}

^{1} School of Mathematics and Statistics & KLAS, Northeast Normal University, Jilin, 130024, China.
^{2} Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, 518055, China.
^{3} Department of Management Science, University of Miami, Coral Gables, FL, 33146, USA.
^{4} Department of Statistics, University of Michigan, Ann Arbor, MI, 48109, USA.

arXiv:2011.00647v3 [stat.ME] 29 Aug 2021

Abstract

The stochastic block model is one of the most studied network models for community detection, and fitting its likelihood function on large-scale networks is known to be challenging. One prominent work that overcomes this computational challenge is Amini et al. (2013), which proposed a fast pseudo-likelihood approach for fitting stochastic block models to large sparse networks. However, this approach does not have a convergence guarantee and may not be well suited for small and medium scale networks. In this article, we propose a novel likelihood-based approach that decouples row and column labels in the likelihood function, enabling a fast alternating maximization. This new method is computationally efficient, performs well for both small and large scale networks, and has a provable convergence guarantee. We show that our method provides strongly consistent estimates of the communities in a stochastic block model. We further consider extensions of our proposed method to handle networks with degree heterogeneity and bipartite properties.

Keywords: network analysis, profile likelihood, pseudo likelihood, stochastic block model, strong consistency.

The first three authors contributed equally to this work. For correspondence, please contact Jianhua Guo and Ji Zhu.

1 Introduction

One of the fundamental problems in network data analysis is community detection, which aims to divide the nodes in a network into several communities such that nodes within the same community are densely connected, while nodes from different communities are relatively sparsely connected. Identifying such communities can provide important insights into the organization of a network. For example, in social networks, communities may correspond to groups of individuals with common interests (Moody and White, 2003); in protein interaction networks, communities may correspond to proteins that are involved in the same cellular functions (Spirin and Mirny, 2003). There is a vast literature on network community detection contributed by different scientific communities, such as computer science, physics, social science and statistics. We refer to Fortunato (2010); Fortunato and Hric (2016); Zhao (2017) for comprehensive reviews on this topic.

In the statistics literature, the majority of community detection methods are model-based; they postulate and fit a probabilistic model that characterizes networks with community structures (Holland et al., 1983; Airoldi et al., 2008; Karrer and Newman, 2011). Within this family, the stochastic block model (SBM; Holland et al., 1983) is perhaps the best studied and most commonly used. The SBM is a generative model in which the nodes are divided into blocks, or communities, and the probability of an edge between two nodes depends only on which communities they belong to; edges are independent given the community assignment. Several extensions of the SBM have been considered, notably the mixed membership model (Airoldi et al., 2008), which allows each node to be associated with multiple clusters, and the degree-corrected stochastic block model (DCSBM; Karrer and Newman, 2011), which accommodates degree heterogeneity by including additional degree parameters. Due to rapidly increasing interest, the statistical literature on community detection in SBMs is fast growing, with great advances in algorithmic solutions (Snijders and Nowicki, 1997; Nowicki and Snijders, 2001; Daudin et al., 2008; Karrer and Newman, 2011; Decelle et al., 2011; Amini et al., 2013; Bickel et al., 2013, among others) and in the theoretical understanding of consistency and detection thresholds (Bickel and Chen, 2009; Rohe et al., 2011; Zhao et al., 2012; Lei and Rinaldo, 2015; Abbe, 2017; Gao et al., 2017; Gao et al., 2018; Su et al., 2019; Abbe et al., 2020, among others).

It is well known that fitting the block model (i.e., SBM and DCSBM) likelihood functions is a nontrivial task; in principle, optimizing over all possible community assignments is an NP-hard problem (Bickel and Chen, 2009). Many works have considered spectral clustering for community detection in SBMs, which is computationally efficient and ensures weak consistency, that is, the proportion of misclassified nodes tends to zero as the network size increases, under certain regularity conditions (Rohe et al., 2011; Lei and Rinaldo, 2015; Joseph et al., 2016). As such, spectral clustering is often used to produce initializations for methods that aim to achieve strong consistency (Gao et al., 2017), that is, the probability that the estimated labels equal the true labels converges to one as the network size grows, and for methods that aim to directly maximize the nonconvex SBM and DCSBM likelihood functions (Amini et al., 2013; Bickel et al., 2013).

To overcome the computational challenge in fitting the SBM likelihood, Amini et al. (2013) proposed a novel pseudo likelihood approach that approximates the row sums within blocks using Poisson random variables, and simplifies the likelihood function by lifting the symmetry constraint on the adjacency matrix. This leads to a fast approximation to the block model likelihood, which in turn enables an efficient maximization that can easily handle up to millions of nodes. Additionally, the maximum pseudo-likelihood estimator is shown to achieve (weak) community detection consistency in the case of a sparse SBM with two communities. This pioneering work makes the SBM an attractive approach for network community detection, owing to its computational scalability and theoretical properties such as community detection consistency. However, the method may have two drawbacks. First, in the examples presented in Amini et al. (2013), the authors found that the pseudo-likelihood maximization algorithm empirically converged fast.

Figure 1: An illustrative example comparing the pseudo likelihood method of Amini et al. (2013) and the proposed profile-pseudo likelihood method. Details of the simulation setting are described in Section 5.1.

It is, however, not guaranteed that the algorithm will converge in general (see the example in Figure 1). Convergence is a critical property, as it guarantees that the final estimator exists, and it is therefore important both computationally and statistically. Second, the pseudo likelihood approach may not be suitable for small and medium scale networks, as the Poisson approximation may incur non-negligible approximation errors in such cases. In the case of the DCSBM, cleverly employing the observation that the conditional distribution (on node degrees) of the Poisson variables is multinomial, Amini et al. (2013) proposed a conditional pseudo likelihood approach that permits fast estimation and adapts to both small and large scale networks. However, that algorithm still does not have convergence guarantees.

Motivated by the pseudo likelihood approach, in this work we propose a new SBM likelihood fitting method that decouples the membership labels of the rows and columns in the likelihood function, treating the row labels as a vector of latent variables and the column labels as a vector of unknown parameters. Correspondingly, the likelihood can be maximized in an alternating fashion over the block model parameters and over the column labels, where the maximization now involves a tractable sum over the distribution of the latent row labels. Furthermore, we consider a profile-pseudo likelihood that adopts a hybrid framework of the profile likelihood and the pseudo likelihood, where the symmetry constraint on the adjacency matrix is also lifted. Our proposed method retains and improves on the computational efficiency of the pseudo likelihood method, performs well for both small and large scale networks, and has a provable convergence guarantee. We show that the community labels (i.e., column labels) estimated from our proposed method enjoy strong consistency, as long as the initial labels have an overlap with the truth beyond that of random guessing. We further consider two extensions of the proposed method, to the DCSBM and to the bipartite stochastic block model (BiSBM; Larremore et al., 2014).

Our work is closely related to a recent and growing literature on the pursuit of strong consistency (or exact recovery) in community detection (see, for example, Abbe et al., 2015; Lei and Zhu, 2017; Gao et al., 2017; Gao et al., 2018). The strong consistency property may be more desirable than weak consistency, as it enables establishing the asymptotic normality of the SBM plug-in estimators (Amini et al., 2013) and performing goodness of fit tests (Lei, 2016; Hu et al., 2020b). To achieve strong consistency, these methods usually apply a refinement step after obtaining an initial label, which is assumed to be weakly consistent. For example, in Gao et al. (2017), a majority voting algorithm is applied to the clustering labels obtained from spectral clustering. Similarly, our proposed profile-pseudo likelihood estimation can be viewed as a refinement of the initial labels that achieves strong consistency. As with other refinement algorithms, the scalability of our proposed method depends on the initialization step. While spectral clustering is used to produce initial solutions in our work, other initialization methods can be considered as well (see Section 7).

The rest of the paper is organized as follows. Section 2 introduces the profile-pseudo likelihood function and an efficient algorithm for its maximization, and discusses the convergence guarantee of the algorithm. Section 3 establishes the strong consistency of the community labels estimated by the proposed algorithm. Section 4 considers two important extensions of the proposed method. Section 5 demonstrates the efficacy of the proposed method through comparative simulation studies. Section 6 presents analyses of two real-world networks with communities. A discussion section concludes the paper.

2 Profile-Pseudo Likelihood

Let $G(V, E)$ denote a network, where $V = \{1, 2, \ldots, n\}$ is the set of $n$ nodes and $E$ is the set of edges between the nodes. The network $G(V, E)$ can be uniquely represented by the corresponding $n \times n$ adjacency matrix $A$, where $A_{ij} = 1$ if there is an edge $(i, j) \in E$ from node $i$ to node $j$ and $A_{ij} = 0$ otherwise. In our work, we focus on unweighted and undirected networks, and thus $A$ is a binary symmetric matrix. Under the stochastic block model, there are $K$ communities (or blocks) and each node belongs to exactly one of the communities. Let $c = (c_1, c_2, \ldots, c_n) \in \{1, 2, \ldots, K\}^n$ denote the true community labels of the nodes, and assume that the $c_i$'s are i.i.d. categorical variables with parameter vector $\pi = (\pi_1, \ldots, \pi_K)$, where $\sum_k \pi_k = 1$. Conditional on the community labels, the edge variables $A_{ij}$ are independent Bernoulli variables with $E(A_{ij} \mid c) = P_{c_i c_j}$, where $P \in [0,1]^{K \times K}$ is the symmetric edge-probability matrix whose $kl$-th entry $P_{kl}$ characterizes the probability of connection between nodes in communities $k$ and $l$. Let $\Omega = (\pi, P)$. Our objective is to estimate the unknown community labels $c$ given the observed adjacency matrix $A$.
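To make the generative model concrete, a network from an SBM can be sampled as in the following minimal sketch (NumPy; function and variable names are ours):

```python
import numpy as np

def sample_sbm(n, pi, P, seed=None):
    """Sample an undirected SBM: labels c_i ~ Categorical(pi) i.i.d.,
    and A_ij ~ Bernoulli(P[c_i, c_j]) independently for i < j."""
    rng = np.random.default_rng(seed)
    c = rng.choice(len(pi), size=n, p=pi)          # community labels
    U = rng.random((n, n)) < P[c[:, None], c]      # independent Bernoulli draws
    A = np.triu(U, 1)                              # keep i < j, zero diagonal
    return (A | A.T).astype(int), c                # symmetrize

# Example: two balanced communities, denser within than between.
P = np.array([[0.20, 0.07],
              [0.07, 0.20]])
A, c = sample_sbm(200, pi=[0.5, 0.5], P=P, seed=0)
```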

Denote the rows of $A$ as $a_i = (A_{i1}, A_{i2}, \ldots, A_{in})$, $1 \le i \le n$, and let $e = (e_1, e_2, \ldots, e_n) \in \{1, 2, \ldots, K\}^n$ denote the column labeling vector. Define the pseudo likelihood function as
$$L_{\mathrm{PL}}(\Omega, e; \{a_i\}) = \prod_{i=1}^n \left\{ \sum_{l=1}^K \pi_l \prod_{j=1}^n P_{l e_j}^{A_{ij}} \left(1 - P_{l e_j}\right)^{1 - A_{ij}} \right\}, \tag{1}$$
with its logarithm
$$\ell_{\mathrm{PL}}(\Omega, e; \{a_i\}) = \sum_{i=1}^n \log \left\{ \sum_{l=1}^K \pi_l \prod_{j=1}^n P_{l e_j}^{A_{ij}} \left(1 - P_{l e_j}\right)^{1 - A_{ij}} \right\}.$$

We make a few remarks on the objective function defined in (1). First, in (1), we treat the row labels as a vector of latent variables and the column labels $e$ as a vector of unknown model parameters. That is, given $e_j$, each $A_{ij}$ is considered a mixture of $K$ Bernoulli random variables with means $P_{l e_j}$, $1 \le l \le K$. This formulation decouples the row and column labels, and allows us to derive a tractable sum when optimizing over the column labels $e$ and the block model parameter $\Omega$. Second, the objective function $L_{\mathrm{PL}}(\Omega, e; \{a_i\})$ is calculated while lifting the symmetry constraint on the adjacency matrix $A$, or equivalently, ignoring the dependence among the rows $a_i$. Hence, we refer to (1) as the pseudo likelihood function, which can be considered an approximation to the SBM likelihood function.

We consider an iterative algorithm that alternates between updating $e$ and updating $\Omega$. In each iteration, the estimation is carried out by first profiling out the nuisance parameter $\Omega$ using $\max_\Omega L_{\mathrm{PL}}(\Omega, e; \{a_i\})$ given the current estimate of $e$, and then maximizing the profile likelihood with respect to $e$. We refer to this as the profile-pseudo likelihood method. We show in Theorem 1 the convergence guarantee of this efficient algorithm, and establish in Theorem 2 the strong consistency of the estimated column labels $e$.

The estimation procedure proceeds in detail as follows. First, given the current $\hat{e}$ and treating the row labels as a vector of latent variables, $L_{\mathrm{PL}}(\Omega, \hat{e}; \{a_i\})$ can be viewed as the likelihood of a mixture model with i.i.d. observations $\{a_i\}$ and parameter $\Omega$. Consequently, $L_{\mathrm{PL}}(\Omega, \hat{e}; \{a_i\})$ can be maximized over $\Omega$ using an expectation-maximization (EM) algorithm, where both the E-step and M-step updates have closed-form expressions. Next, given the estimated $\widehat{\Omega}$, we update $e$, treating $L_{\mathrm{PL}}(\widehat{\Omega}, e; \{a_i\})$ as the objective function. In this step, finding the maximizer of $L_{\mathrm{PL}}(\widehat{\Omega}, e; \{a_i\})$ with respect to $e$ is an NP-hard problem since, in principle, it requires searching over all possible label assignments. As an alternative, we propose a fast updating rule that leads to a non-decreasing objective function $L_{\mathrm{PL}}(\widehat{\Omega}, e; \{a_i\})$ (although it is not necessarily maximized), which ensures the desirable ascent property of the iterative algorithm. This algorithm is summarized in Algorithm 1.

In what follows, we discuss the profile-pseudo likelihood algorithm in detail. We refer to the iterations between updating $e$ and $\Omega$ as the outer iterations, and the iterations in the EM algorithm used to update $\Omega$ as the inner iterations. Specifically, in the $(t+1)$-th step of the EM (inner) iteration, given $e^{(s)}$ and the parameter estimate from the previous EM update $\Omega^{(s,t)} = (\pi^{(s,t)}, P^{(s,t)})$, we let
$$\tau_{ik}^{(s,t+1)} = \frac{\pi_k^{(s,t)} \prod_{j=1}^n \Big(P^{(s,t)}_{k e_j^{(s)}}\Big)^{A_{ij}} \Big(1 - P^{(s,t)}_{k e_j^{(s)}}\Big)^{1 - A_{ij}}}{\sum_{l=1}^K \pi_l^{(s,t)} \prod_{j=1}^n \Big(P^{(s,t)}_{l e_j^{(s)}}\Big)^{A_{ij}} \Big(1 - P^{(s,t)}_{l e_j^{(s)}}\Big)^{1 - A_{ij}}} \tag{2}$$
for each $1 \le i \le n$ and $1 \le k \le K$, which calculates the conditional probability that the row label of node $i$ equals $k$ at the $(t+1)$-th step of the EM iteration. Next, we define
$$Q(\Omega \mid \Omega^{(s,t)}, e^{(s)}) = E_{z \mid \{a_i\};\, \Omega^{(s,t)}, e^{(s)}} \left[ \log f\left(\{a_i\}, z; \Omega, e^{(s)}\right) \right],$$
where $z$ denotes the latent row labels and
$$f(\{a_i\}, z; \Omega, e^{(s)}) = \prod_{i=1}^n \pi_{z_i} \left\{ \prod_{j=1}^n P_{z_i e_j^{(s)}}^{A_{ij}} \Big(1 - P_{z_i e_j^{(s)}}\Big)^{1 - A_{ij}} \right\}.$$

In the M-step, $\Omega^{(s,t+1)}$ is updated by
$$\Omega^{(s,t+1)} = \arg\max_{\Omega}\, Q(\Omega \mid \Omega^{(s,t)}, e^{(s)}),$$
which has closed-form solutions as follows:
$$\pi_k^{(s,t+1)} = \frac{1}{n} \sum_{i=1}^n \tau_{ik}^{(s,t+1)}, \qquad P_{kl}^{(s,t+1)} = \frac{\sum_{i=1}^n \sum_{j=1}^n A_{ij}\, \tau_{ik}^{(s,t+1)} I(e_j^{(s)} = l)}{\sum_{i=1}^n \sum_{j=1}^n \tau_{ik}^{(s,t+1)} I(e_j^{(s)} = l)}, \tag{3}$$
for $1 \le k, l \le K$. Once the EM algorithm has converged, we let $\Omega^{(s+1)}$ and $\{\tau_{il}^{(s+1)}\}$ take the values from the last EM update. Next, given $\Omega^{(s+1)}$, we propose to update $e$ as follows:
$$e_j^{(s+1)} = \arg\max_{k \in \{1, 2, \ldots, K\}} \sum_{i=1}^n \sum_{l=1}^K \tau_{il}^{(s+1)} \left\{ A_{ij} \log P_{lk}^{(s+1)} + (1 - A_{ij}) \log\Big(1 - P_{lk}^{(s+1)}\Big) \right\}. \tag{4}$$

The update for $e^{(s+1)}$ is obtained separately for each node, and can therefore be carried out efficiently. As we discussed earlier, this update is not guaranteed to maximize the pseudo likelihood function $L_{\mathrm{PL}}(\Omega^{(s+1)}, e; \{a_i\})$, which is in fact an intractable problem. Nevertheless, it can be shown that the update in (4) leads to a non-negative increment in the pseudo likelihood. This gives the desirable ascent property, which we formally state in the following theorem.
Algorithm 1 Profile-Pseudo Likelihood Maximization Algorithm.
Step 1: Initialize $e^{(0)}$ using spectral clustering with permutations (SCP).
Step 2: Calculate $\Omega^{(0)} = (\pi^{(0)}, P^{(0)})$. That is, for $1 \le l, k \le K$,
$$\pi_k^{(0)} = \frac{1}{n} \sum_{i=1}^n I(e_i^{(0)} = k), \qquad P_{kl}^{(0)} = \frac{\sum_{i=1}^n \sum_{j=1}^n A_{ij} I(e_i^{(0)} = k) I(e_j^{(0)} = l)}{\sum_{i=1}^n \sum_{j=1}^n I(e_i^{(0)} = k) I(e_j^{(0)} = l)}.$$
Step 3: Initialize $\Omega^{(0,0)} = (\pi^{(0,0)}, P^{(0,0)}) = (\pi^{(0)}, P^{(0)})$.
repeat
  repeat
    Step 4: E-step: compute $\tau_{ik}^{(s,t+1)}$ using (2) for $1 \le k \le K$ and $1 \le i \le n$.
    Step 5: M-step: compute $\pi_k^{(s,t+1)}$ and $P_{kl}^{(s,t+1)}$ using (3) for $1 \le k, l \le K$.
  until the EM algorithm converges.
  Step 6: Set $\Omega^{(s+1)}$ and $\{\tau_{ik}^{(s+1)}\}$ to be the final EM update.
  Step 7: Given $\Omega^{(s+1)}$ and $\{\tau_{ik}^{(s+1)}\}$, update $e_j^{(s+1)}$, $1 \le j \le n$, using (4).
until the profile-pseudo likelihood converges.
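The authors' experiments are implemented in Matlab; purely for illustration, here is a minimal NumPy sketch of Algorithm 1 under our own naming, with labels coded $0, \ldots, K-1$, dense matrix algebra, and clipping of $P$ away from 0 and 1 (a scalable implementation would use sparse matrices and the SCP initializer):

```python
import numpy as np
from scipy.special import logsumexp

def ppl_sbm(A, e, K, n_outer=60, n_inner=20, tol=1e-6):
    """Minimal sketch of Algorithm 1: alternate the inner EM updates
    (2)-(3) for Omega = (pi, P) with the column-label update (4)."""
    n = A.shape[0]
    E = np.eye(K)[e]                          # one-hot column labels, (n, K)
    Nk = E.sum(0)                             # column-block sizes
    pi = Nk / n
    P = np.clip(E.T @ A @ E / np.maximum(np.outer(Nk, Nk), 1.0),
                1e-10, 1 - 1e-10)
    old_ll = -np.inf
    for _ in range(n_outer):
        for _ in range(n_inner):              # inner EM over Omega
            B = A @ E                         # B[i, l] = # neighbours of i in block l
            logw = (np.log(pi) + B @ np.log(P).T
                    + (Nk - B) @ np.log(1 - P).T)        # mixture log-weights
            tau = np.exp(logw - logsumexp(logw, axis=1, keepdims=True))  # E-step (2)
            pi = tau.mean(0)                                             # M-step (3)
            P = np.clip(tau.T @ A @ E
                        / np.maximum(np.outer(tau.sum(0), Nk), 1e-12),
                        1e-10, 1 - 1e-10)
        # node-wise label update (4), computed for all columns j at once
        S = np.log(P).T @ (tau.T @ A) + np.log(1 - P).T @ (tau.T @ (1 - A))
        e = S.argmax(0)
        E = np.eye(K)[e]
        Nk = E.sum(0)
        ll = logsumexp(logw, axis=1).sum()    # pseudo log-likelihood (1)
        if abs(ll - old_ll) < tol:
            break
        old_ll = ll
    return e, pi, P
```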

Theorem 1. For a given initial labeling vector $e^{(0)}$, Algorithm 1 generates a sequence $\{\Omega^{(s)}, e^{(s)}\}$ such that
$$L_{\mathrm{PL}}(\Omega^{(s)}, e^{(s)}; \{a_i\}) \le L_{\mathrm{PL}}(\Omega^{(s+1)}, e^{(s+1)}; \{a_i\}).$$

The proof of Theorem 1 is provided in the supplemental material. Theorem 1 guarantees that the pseudo likelihood function is non-decreasing at each iteration of Algorithm 1. Assuming that the parameter space for $\Omega$ is compact, we conclude that $L_{\mathrm{PL}}(\Omega^{(s)}, e^{(s)}; \{a_i\})$ converges as the number of iterations $s$ increases. This is a desirable property that guarantees the stability of the proposed algorithm. Since the pseudo likelihood function is not concave, Algorithm 1 is not guaranteed to converge to the global optimum; whether it converges to a global or local solution depends on the initial value. In practice, we find that the initialization procedure in Algorithm 1 shows good performance, in that we are able to achieve high clustering accuracy in our simulation studies. To avoid local solutions in real data applications, we recommend considering multiple random initializations in addition to the initialization in Algorithm 1.

Finally, we summarize the differences between our proposal and the method in Amini et al. (2013). Both methods iterate between two parameter updating steps, namely, the step that updates the block model parameter $\Omega$ using EM and the step that updates the membership labels. However, the likelihood function is treated very differently in the two methods. As the row and column labels are enforced to be the same in Amini et al. (2013), a Poisson approximation is needed in the pseudo likelihood calculation. The label $e$ in Amini et al. (2013) is treated as an initial value in the EM estimation, and its value is assigned heuristically in each iteration. As such, the resulting procedure is not guaranteed to converge, as seen in Figure 1. In comparison, our method decouples the row and column labels (i.e., $z$ and $e$), and does not require a Poisson approximation in the pseudo likelihood calculation. When updating the column labels $e$, we use $L_{\mathrm{PL}}(\widehat{\Omega}, e; \{a_i\})$ as the objective function that guides our updating routine. The proposed node-wise update enjoys the ascent property, which in turn guarantees the convergence of the algorithm (see Theorem 1). We also remark that, due to the differences in our problem formulation, our theoretical analysis is nontrivial and new technical tools are needed.

3 Consistency Results

In this section, we investigate the strong consistency of the estimator obtained from one outer loop iteration (i.e., one update of the column labels $e$) of Algorithm 1, denoted as $\hat{c}\{e^{(0)}\}$, where $e^{(0)}$ is an initial value for Algorithm 1. We first consider strong consistency in the case of SBMs with two balanced communities, and then extend our strong consistency result to SBMs with $K$ communities.

We first present the consistency result for directed SBMs with two communities, fitted to directed networks, and then modify the result to handle the more challenging case of undirected SBMs, fitted to undirected networks. To separate the cases of directed and undirected SBMs, we adopt different notations for the corresponding adjacency matrices and edge-probability matrices. First, for a directed SBM, we denote the adjacency matrix as $\tilde{A}$ and assume that its entries $\tilde{A}_{ij}$ are mutually independent given $c$, that is,
$$\text{(directed)} \qquad \tilde{A}_{ij} \mid c \sim \mathrm{Bernoulli}(\tilde{P}_{c_i c_j}), \quad \text{for } 1 \le i, j \le n. \tag{5}$$
For an undirected SBM, we denote the adjacency matrix as $A$ and assume that its entries $A_{ij}$, $i \le j$, are mutually independent given $c$, that is,
$$\text{(undirected)} \qquad A_{ij} \mid c \sim \mathrm{Bernoulli}(P_{c_i c_j}) \ \text{ and } \ A_{ij} = A_{ji}, \quad \text{for } 1 \le i \le j \le n. \tag{6}$$
Furthermore, we assume that the edge-probability matrix of the directed SBM has the form
$$\tilde{P} = \frac{1}{m} \begin{pmatrix} a & b \\ b & a \end{pmatrix}, \tag{7}$$
while that of the undirected SBM has the form
$$P = \frac{2}{m} \begin{pmatrix} a & b \\ b & a \end{pmatrix} - \frac{1}{m^2} \begin{pmatrix} a^2 & b^2 \\ b^2 & a^2 \end{pmatrix}. \tag{8}$$
Note that each entry of (8) satisfies $2\tilde{P}_{kl} - \tilde{P}_{kl}^2 = 1 - (1 - \tilde{P}_{kl})^2$, the edge probability obtained by symmetrizing two independent directed edges. Such a coupling between the directed and undirected models makes it possible to extend the consistency result of the directed SBM to the undirected case.

Given an initial labeling vector $e^{(0)}$ and estimates $\hat{a}$, $\hat{b}$ and $(\hat\pi_1, \hat\pi_2)$, the estimator $\hat{c}\{e^{(0)}\}$ can be written as
$$\hat{c}_j\{e^{(0)}\} = \arg\max_{k \in \{1,2\}} \sum_{i=1}^n \sum_{l=1}^2 \hat\tau_{il} \left\{ A_{ij} \log \hat{P}_{lk} + (1 - A_{ij}) \log(1 - \hat{P}_{lk}) \right\}, \tag{9}$$
where $\hat\tau_{il}$ is defined as in (2), and $\hat{P}$ is defined as in (7) for directed SBMs and as in (8) for undirected SBMs, with $a$ and $b$ replaced by $\hat{a}$ and $\hat{b}$, respectively. Here the estimates $\hat{a}$, $\hat{b}$ and $(\hat\pi_1, \hat\pi_2)$ are outputs from the inner loop (i.e., EM) iterations, and are in effect initial values for the outer loop calculation. Consistency of the inner loop (i.e., EM) outputs $\hat{a}$, $\hat{b}$ and $(\hat\pi_1, \hat\pi_2)$ can be established using the result in Amini et al. (2013). In our theoretical analysis, we focus our efforts on establishing strong consistency of the column labels $e$ estimated in the outer loop, given that the outer loop initial values satisfy $(\hat{a}, \hat{b}) \in \mathcal{P}^{\delta}_{a,b}$ in (10) and $\hat\pi_1 = \hat\pi_2 = 1/2$.

For SBMs with two balanced communities, we make the following assumption:

(A) Each community contains $m = n/2$ nodes and $\hat\pi_1 = \hat\pi_2 = 1/2$.

The assumption that $\hat\pi_1 = \hat\pi_2 = 1/2$ is reasonable, as the inner loop outputs $(\hat\pi_1, \hat\pi_2)$ are consistent estimators of $(\pi_1, \pi_2) = (1/2, 1/2)$, as shown in Amini et al. (2013). Without loss of generality, let $c_i = 1$ for $i \in \{1, \ldots, m\}$ and $c_i = 2$ for $i \in \{m+1, \ldots, n\}$. Assume that $e^{(0)} \in \{1,2\}^n$ assigns equal numbers of nodes to the two communities, i.e., the initial labeling vector is balanced. Let $e^{(0)}$ match the truth on $\gamma m$ labels in each of the two communities for some $\gamma \in (0, 1)$; we assume $\gamma m$ to be an integer. Next, let $\mathcal{E}^\gamma$ denote the set that collects all such initial labeling vectors, i.e.,
$$\mathcal{E}^\gamma = \left\{ e^{(0)} \in \{1,2\}^n : \sum_{i=1}^m I(e_i^{(0)} = 1) = \gamma m, \ \sum_{i=m+1}^n I(e_i^{(0)} = 2) = \gamma m \right\}.$$

Note that $\gamma = 1/2$ corresponds to "no correlation" between $e^{(0)}$ and $c$, whereas $\gamma = 0$ and $\gamma = 1$ both correspond to perfect correlation. In our analysis, we do not require knowing the value of $\gamma$, or knowing which labels are matched. In Theorem 2, we show that the amount of overlap $\gamma$ can take any value, as long as $\gamma \ne 1/2$. Our goal is to establish strong consistency for $\hat{c}\{e^{(0)}\}$. For a constant $\delta > 1$, we define $\mathcal{P}^\delta_{a,b}$ as follows:
$$\mathcal{P}^\delta_{a,b} = \left\{ (\hat{a}, \hat{b}) : \frac{\hat{a}}{\hat{b}}\, I(a > b) + \frac{\hat{b}}{\hat{a}}\, I(a < b) \ge \delta \right\}. \tag{10}$$
The set $\mathcal{P}^\delta_{a,b}$ specifies that $(\hat{a}, \hat{b})$ has the same ordering as $(a, b)$, and that the relative difference between the estimates $\hat{a}$ and $\hat{b}$ is bounded below. Our next theorem considers the collection of estimates $(\hat{a}, \hat{b})$ in $\mathcal{P}^\delta_{a,b}$.

Theorem 2. Assume (A) holds, $\delta > 1$, $\gamma \in (0,1) \setminus \{\tfrac12\}$ and $\frac{(a-b)^2}{a+b} \ge C \log n$ for a sufficiently large constant $C > 0$. For a directed SBM in (5) with the edge probabilities given by (7) with $a \ne b$, we have that for any $\epsilon > 0$ there exists $N > 0$ such that for all $n \ge N$, the following holds:
$$P\left\{ \bigcap_{(\hat{a}, \hat{b}) \in \mathcal{P}^\delta_{a,b}} \hat{c}\{e^{(0)}\} = c \right\} \ge 1 - \left( n\, e^{-\frac{(a-b)^2 - 4\epsilon(a-b) + 4\epsilon^2}{4(a+b)}} + n(n+2)\, e^{-\frac{(2\gamma-1)^2 (a-b)^2}{8(a+b)}} \right),$$
for any $e^{(0)} \in \mathcal{E}^\gamma$, where $\hat{c}\{e^{(0)}\} = c$ means that they belong to the same equivalence class of label permutations.

The proof of Theorem 2 is provided in the supplemental material. It can be seen from Theorem 2 that the one-step estimate $\hat{c}\{e^{(0)}\}$ for a directed SBM is a strongly consistent estimate of $c$ for any $e^{(0)} \in \mathcal{E}^\gamma$. Note that weak consistency was established in Amini et al. (2013) under the assumption that $\frac{(a-b)^2}{a+b} \to \infty$. In comparison, our result requires $\frac{(a-b)^2}{a+b} \ge C \log n$ to establish strong consistency. In the existing literature on strong consistency, the condition $\frac{\lambda_n}{\log n} \to \infty$ is commonly imposed (Bickel and Chen, 2009; Zhao et al., 2012), where $\lambda_n$ denotes the average network degree. Specifically, under the SBM setting considered in Bickel and Chen (2009) and Zhao et al. (2012), we have that $a - b \asymp \lambda_n$ and $a + b \asymp \lambda_n$, where $\asymp$ denotes that the two quantities on both sides are of the same order. In this case, $\frac{\lambda_n}{\log n} \to \infty$ implies $\frac{(a-b)^2}{a+b} \ge C \log n$ for any constant $C > 0$.

Theorem 2 guarantees strong consistency for any fixed $e^{(0)} \in \mathcal{E}^\gamma$. In comparison, the weak consistency in Amini et al. (2013) holds uniformly for all $e^{(0)} \in \mathcal{E}^\gamma$, even if $e^{(0)}$ is derived from the data. Indeed, $e^{(0)}$ is usually derived from the data using initialization procedures such as spectral clustering. For the strong consistency result to apply, one may consider a data splitting strategy following the method in Li et al. (2020). Specifically, we may sample a proportion of the node pairs to produce an initial value $e^{(0)}$ and estimate $\hat{c}(e^{(0)})$ using the rest of the node pairs. In this case, $e^{(0)}$ is independent of the data used for community detection, and the result in Theorem 2 can be used to ensure strong consistency of $\hat{c}(e^{(0)})$. In our numerical studies, for simplicity we did not use data splitting, and the simulation results show that the proposed method still performs well. We also note that Theorem 2 can be adapted to hold uniformly for all $e^{(0)} \in \mathcal{E}^\gamma$ if stronger conditions are placed on $\gamma$ and $a, b$. Specifically, if the misclassification ratio of $e^{(0)}$ is, for example, $O(1/(a+b))$ and the condition on $a, b$ is strengthened to $(a - b) \gtrsim n \log n$ (i.e., the average degree is at least of order $n \log n$), then the strong consistency in Theorem 2 holds uniformly for all such $e^{(0)}$, even if $e^{(0)}$ is derived from the data. This can be shown by combining the union bound argument with a Stirling approximation that gives $\log \binom{n}{n_\gamma} \le n_\gamma \log(en/n_\gamma)$, where $n_\gamma$ is the number of misclassified nodes. The misclassification ratio of $O(1/(a+b))$ imposed above is known to hold with high probability for spectral clustering (see, for example, Corollary 3.2 in Lei and Rinaldo (2015)).
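As an illustration of the data splitting strategy just described, the following sketch (our naming; Li et al. (2020) develop a more general edge cross-validation scheme) randomly assigns each node pair to one of two edge-disjoint halves, one used for initialization and the other for refinement:

```python
import numpy as np

def split_node_pairs(A, frac=0.5, seed=None):
    """Randomly assign each node pair {i, j} to one of two edge-disjoint
    halves: A1 (used to compute the initial labels e^(0)) and A2 (used
    for refinement), so that e^(0) is independent of the refinement data."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    mask = np.triu(rng.random((n, n)) < frac, 1)
    mask = mask | mask.T                   # symmetric pair assignment
    return A * mask, A * ~mask             # diagonal of A is zero anyway
```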

Next, we consider the case of undirected SBMs. Let $a_\gamma = \left\{(1-\gamma)a + \gamma b\right\} I(\gamma > \tfrac12) + \left\{\gamma a + (1-\gamma)b\right\} I(\gamma < \tfrac12)$. We have the following result on the strong consistency of $\hat{c}\{e^{(0)}\}$.

Theorem 3. Assume (A) holds, $\delta > 1$, $\gamma \in (0,1) \setminus \{\tfrac12\}$ and $\frac{(a-b)^2}{a+b} \ge C \log n$ for a sufficiently large constant $C > 0$. For an undirected SBM in (6) with the edge probabilities given by (8) with $2(1+\epsilon) a_\gamma \le |(1 - 2\gamma)(a - b)|$ for some $\epsilon \in (0,1)$, there exist $\rho \in (0,1)$ and $N > 0$ such that for all $n \ge N$, the following holds:
$$P\left\{ \bigcap_{(\hat{a}, \hat{b}) \in \mathcal{P}^\delta_{a,b}} \hat{c}\{e^{(0)}\} = c \right\} \ge 1 - \left[ 3n\, e^{-\frac{\left(\frac{1-\rho}{4}\right)^2 (a-b)^2}{4(a+b)}} + n(n+2) \left\{ e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2 (2\gamma-1)^2 (a-b)^2}{4(a+b)}} + 2\, e^{-\frac{\epsilon^2/2}{1+\epsilon/2}\, a_\gamma} \right\} \right],$$
for any $e^{(0)} \in \mathcal{E}^\gamma$, where $\hat{c}\{e^{(0)}\} = c$ means that they belong to the same equivalence class of label permutations.

The proof of Theorem 3 is provided in the supplemental material. It can be seen that the one-step estimate $\hat{c}\{e^{(0)}\}$ for an undirected SBM is a strongly consistent estimate of $c$ for any $e^{(0)} \in \mathcal{E}^\gamma$. Given $\epsilon$ and $\gamma$, the condition $2(1+\epsilon) a_\gamma \le |(1 - 2\gamma)(a - b)|$ places an upper bound on $b/a$. For example, for $\epsilon = \tfrac13$ and $\gamma < \tfrac{1}{10}$, the above condition is satisfied if $b/a \le (1 - 10\gamma)/(9 - 10\gamma)$.

Strong consistency can be more desirable than weak consistency, as it enables normal distribution based inference and goodness of fit tests (see the numerical studies in Section 5.2). For example, consider an SBM with $K = 2$, $\pi = (\pi_1, \pi_2)$ and true community labels $c = (c_1, c_2, \ldots, c_n)$. Suppose we can construct a label vector $\hat{c}^{(w)}$ such that $\{\hat{c}_i^{(w)}\}_{i=1}^n$ are independent with $P(\hat{c}_i^{(w)} \ne c_i) = 2p_n$ for $c_i = 1$ and $P(\hat{c}_i^{(w)} \ne c_i) = p_n$ for $c_i = 2$, where $p_n = 1/\log n$. Then it can be shown that $\hat{c}^{(w)}$ is weakly consistent, with a misclassification ratio of $O_p(1/\log n)$, but not strongly consistent for $c$. Let $\hat\pi_1^w = \sum_{i=1}^n I(\hat{c}_i^{(w)} = 1)/n$. It holds that $\sqrt{n}\left\{\hat\pi_1^w - \pi_1 - \frac{1 - 3\pi_1}{\log n}\right\} \xrightarrow{d} N\{0, \pi_1(1-\pi_1)\}$ (see the proof in the supplemental material). Thus the bias term of $\hat\pi_1^w$ is $O(1/\log n)$, which can be non-negligible for inference. On the other hand, for a strongly consistent estimator $\hat{c}^{(s)} = (\hat{c}_1^{(s)}, \hat{c}_2^{(s)}, \ldots, \hat{c}_n^{(s)})$, letting $\hat\pi_1^s = \sum_{i=1}^n I(\hat{c}_i^{(s)} = 1)/n$, it holds that $\sqrt{n}\{\hat\pi_1^s - \pi_1\} \xrightarrow{d} N\{0, \pi_1(1-\pi_1)\}$.
Next, we consider the more general case of directed and undirected SBMs with $K$ communities. Similar to Assumption (A), we make the following assumption:

(B) Each community contains $m = n/K$ nodes and $\hat\pi_k = 1/K$ for $k = 1, \ldots, K$.

Let the edge-probability matrix of the directed SBM be
$$\tilde{P}_{kl} = \frac{a}{m}\, 1(k = l) + \frac{b}{m}\, 1(k \ne l), \tag{11}$$
and that of the undirected SBM be
$$P_{kl} = \left( \frac{2a}{m} - \frac{a^2}{m^2} \right) 1(k = l) + \left( \frac{2b}{m} - \frac{b^2}{m^2} \right) 1(k \ne l), \tag{12}$$
for $k, l = 1, \ldots, K$. Without loss of generality, let $c_i = k$ for $i \in \{(k-1)m + 1, \ldots, km\}$, $k = 1, \ldots, K$. Let $\mathcal{E}^\gamma$ denote the set that collects all initial labeling vectors such that
$$\mathcal{E}^\gamma = \left\{ e^{(0)} \in \{1, \ldots, K\}^n : \sum_{i=(k-1)m+1}^{km} I(e_i^{(0)} = k) = \gamma_k m, \ \sum_{i=1}^n I(e_i^{(0)} = k) = m, \ k = 1, \ldots, K \right\},$$
where $\gamma = (\gamma_1, \ldots, \gamma_K)$. Corollaries 1 and 2 establish the strong consistency of the profile-pseudo likelihood estimators for directed and undirected SBMs, respectively.

Corollary 1. Assume (B) holds, $\delta > 1$, $\min\{\gamma_1, \gamma_2, \ldots, \gamma_K\} \in (\tfrac12, 1)$ and $\frac{(a-b)^2}{a+b} \ge C \log n$ for a sufficiently large constant $C > 0$. For a directed SBM in (5) with the edge probabilities given by (11) with $a \ne b$, we have that for each $\epsilon > 0$ there exists $N > 0$ such that for all $n \ge N$, the following holds:
$$P\left\{ \bigcap_{(\hat{a}, \hat{b}) \in \mathcal{P}^\delta_{a,b}} \hat{c}\{e^{(0)}\} = c \right\} \ge 1 - \left\{ (K-1)\, n\, e^{-\frac{(a-b)^2 - 4\epsilon(a-b) + 4\epsilon^2}{4(a+b)}} + \frac{(10K - 8)n}{K} \sum_{k=1}^K \sum_{l=1}^K e^{-\frac{(\gamma_k + \gamma_l - 1)^2 (a-b)^2}{8(a+b)}} \right\}, \tag{13}$$
for any $e^{(0)} \in \mathcal{E}^\gamma$, where $\hat{c}\{e^{(0)}\} = c$ means that they belong to the same equivalence class of label permutations.

Corollary 2. Assume (B) holds, $\delta > 1$, $\min\{\gamma_1, \gamma_2, \ldots, \gamma_K\} \in (\tfrac12, 1)$ and $\frac{(a-b)^2}{a+b} \ge C \log n$ for a sufficiently large constant $C > 0$. For an undirected SBM in (6) with the edge probabilities given by (12) with $2(1+\epsilon) a_{\gamma_k} \le (\gamma_k + \gamma_l - 1)(a - b)$ for all $1 \le k, l \le K$ and some $\epsilon \in (0,1)$, where $a_{\gamma_k} = (1 - \gamma_k)a + \gamma_k b$, there exist $\rho \in (0,1)$ and $N > 0$ such that for all $n \ge N$, the following holds:
$$P\left\{ \bigcap_{(\hat{a}, \hat{b}) \in \mathcal{P}^\delta_{a,b}} \hat{c}\{e^{(0)}\} = c \right\} \ge 1 - \left[ 3(K-1)\, n\, e^{-\frac{\left(\frac{1-\rho}{4}\right)^2 (a-b)^2}{2(a+b)}} + \frac{(10K - 8)n^2}{K} \sum_{k=1}^K \sum_{l=1}^K \left\{ e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2 (\gamma_k + \gamma_l - 1)^2 (a-b)^2}{6(a+b)}} + 2\, e^{-\frac{3\epsilon^2 a_{\gamma_k}}{8(4+\epsilon)}} \right\} \right],$$
for any $e^{(0)} \in \mathcal{E}^\gamma$, where $\hat{c}\{e^{(0)}\} = c$ means that they belong to the same equivalence class of label permutations.

The proofs of Corollaries 1 and 2 follow steps very similar to those in the proofs of Theorems 2 and 3, respectively, and we omit the details.

4 Extensions

In this section, we study two useful extensions of the proposed method. First, we consider fitting the degree-corrected stochastic block model with the proposed profile-pseudo likelihood method. Second, we consider fitting the bipartite stochastic block model (see Section A5 in the supplemental material).

It has often been observed that real-world networks exhibit high degree heterogeneity, with a few nodes having a large number of connections and the majority of the rest having a small number of connections. The stochastic block model, however, cannot accommodate such degree heterogeneity. To incorporate degree heterogeneity in community detection, Karrer and Newman (2011) proposed the degree-corrected SBM. Specifically, conditional on the label vector $c$, it is assumed that the edge variables $A_{ij}$ for all $i \le j$ are mutually independent Poisson variables with
$$E[A_{ij} \mid c] = \theta_i \theta_j \lambda_{c_i c_j},$$
where $\Lambda = [\lambda_{kl}]$ is a $K \times K$ symmetric matrix and $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$ is a degree parameter vector, with the additional constraint $\sum_{i=1}^n \theta_i / n = 1$ that ensures identifiability (Zhao et al., 2012).

Define $\Omega = (\pi, \Lambda, \theta)$. To fit the DCSBM to an observed adjacency matrix $A$, we define the following log pseudo likelihood function:
$$\ell^{\mathrm{DC}}_{\mathrm{PL}}(\Omega, e; \{a_i\}) = \sum_{i=1}^n \log \left\{ \sum_{l=1}^K \pi_l \prod_{j=1}^n e^{-\theta_i \theta_j \lambda_{l e_j}} \left( \theta_i \theta_j \lambda_{l e_j} \right)^{A_{ij}} \right\}.$$
Let $d_i = \sum_{j=1}^n A_{ij}$, $1 \le i \le n$. A profile-pseudo likelihood algorithm that maximizes $\ell^{\mathrm{DC}}_{\mathrm{PL}}(\Omega, e; \{a_i\})$ is described in Algorithm 2. At step 4, we update the conditional probabilities for the row labels by
$$\tau_{ik}^{(s,t+1)} = \frac{\pi_k^{(s,t)} \prod_{j=1}^n e^{-\theta_i^{(s,t)} \theta_j^{(s,t)} \lambda^{(s,t)}_{k e_j^{(s)}}} \Big( \theta_i^{(s,t)} \theta_j^{(s,t)} \lambda^{(s,t)}_{k e_j^{(s)}} \Big)^{A_{ij}}}{\sum_{l=1}^K \pi_l^{(s,t)} \prod_{j=1}^n e^{-\theta_i^{(s,t)} \theta_j^{(s,t)} \lambda^{(s,t)}_{l e_j^{(s)}}} \Big( \theta_i^{(s,t)} \theta_j^{(s,t)} \lambda^{(s,t)}_{l e_j^{(s)}} \Big)^{A_{ij}}}. \tag{14}$$

Algorithm 2 DCSBM Profile-Pseudo Likelihood Maximization Algorithm.
Step 1: Initialize $e^{(0)}$ using spectral clustering with permutations (SCP).
Step 2: Calculate $\Omega^{(0)} = (\pi^{(0)}, \Lambda^{(0)}, \theta^{(0)})$. That is, for $1 \le l, k \le K$ and $1 \le i \le n$,
$$\pi_k^{(0)} = \frac{1}{n} \sum_{i=1}^n I(e_i^{(0)} = k), \quad \theta_i^{(0)} \propto d_i, \quad \lambda_{kl}^{(0)} = \frac{\sum_{i=1}^n \sum_{j=1}^n A_{ij} I(e_i^{(0)} = k) I(e_j^{(0)} = l)}{\sum_{i=1}^n \sum_{j=1}^n I(e_i^{(0)} = k) I(e_j^{(0)} = l)\, \theta_i^{(0)} \theta_j^{(0)}}.$$
Step 3: Initialize $\Omega^{(0,0)} = (\pi^{(0,0)}, \Lambda^{(0,0)}, \theta^{(0,0)}) = (\pi^{(0)}, \Lambda^{(0)}, \theta^{(0)})$.
repeat
  repeat
    Step 4: E-step: compute $\tau_{ik}^{(s,t+1)}$ using (14) for $1 \le k \le K$ and $1 \le i \le n$.
    Step 5: CM-step: compute $\pi^{(s,t+1)}$, $\Lambda^{(s,t+1)}$ and $\theta^{(s,t+1)}$. For $1 \le k, l \le K$, set
    $$\pi_k^{(s,t+1)} = \frac{1}{n} \sum_{i=1}^n \tau_{ik}^{(s,t+1)}, \qquad \lambda_{kl}^{(s,t+1)} = \frac{\sum_{i=1}^n \sum_{j=1}^n \tau_{ik}^{(s,t+1)} I(e_j^{(s)} = l) A_{ij}}{\sum_{i=1}^n \sum_{j=1}^n \tau_{ik}^{(s,t+1)} I(e_j^{(s)} = l)\, \theta_i^{(s,t)} \theta_j^{(s,t)}}.$$
    Letting $g_{ij}^{(s,t+1)} = \sum_{k,l=1}^K \tau_{ik}^{(s,t+1)} I(e_j^{(s)} = l)\, \lambda_{kl}^{(s,t+1)}$, for $1 \le i \le n$, set
    $$\theta_i^{(s,t+1)} = \left\{ -h_i^{(s,t+1)} + \sqrt{\big(h_i^{(s,t+1)}\big)^2 + 8 d_i\, g_{ii}^{(s,t+1)}} \right\} \Big/ \left( 4\, g_{ii}^{(s,t+1)} \right),$$
    where $h_i^{(s,t+1)} = \sum_{j=1}^{i-1} \theta_j^{(s,t+1)} g_{ij}^{(s,t+1)} + \sum_{j=i+1}^{n} \theta_j^{(s,t)} g_{ij}^{(s,t+1)}$.
  until the ECM algorithm converges.
  Step 6: Set $\Omega^{(s+1)}$ to be the final ECM update.
  Step 7: Given $\Omega^{(s+1)}$, update $e_j^{(s+1)}$, $1 \le j \le n$, using
  $$e_j^{(s+1)} = \arg\max_{k \in \{1,2,\ldots,K\}} \sum_{i=1}^n \sum_{l=1}^K \left\{ -\theta_i^{(s+1)} \theta_j^{(s+1)} \lambda_{lk}^{(s+1)} + A_{ij} \log \lambda_{lk}^{(s+1)} \right\} \tau_{il}^{(s+1)}.$$
until the profile-pseudo likelihood converges.

At step 5, we update the parameters by sequentially solving the following optimization problems:
$$(\pi^{(s,t+1)}, \Lambda^{(s,t+1)}) = \arg\max_{(\pi, \Lambda)} Q(\pi, \Lambda, \theta^{(s,t)} \mid \Omega^{(s,t)}, e^{(s)}),$$
$$\theta_i^{(s,t+1)} = \arg\max_{\theta_i} Q(\pi^{(s,t+1)}, \Lambda^{(s,t+1)}, \theta_1^{(s,t+1)}, \ldots, \theta_{i-1}^{(s,t+1)}, \theta_i, \theta_{i+1}^{(s,t)}, \ldots, \theta_n^{(s,t)} \mid \Omega^{(s,t)}, e^{(s)}).$$

Here, the objective function $Q(\Omega \mid \Omega^{(s,t)}, e^{(s)})$ is defined as
$$Q(\Omega \mid \Omega^{(s,t)}, e^{(s)}) = E_{z \mid \{a_i\};\, \Omega^{(s,t)}, e^{(s)}} \left[ \log f\left( \{a_i\}, z; \Omega, e^{(s)} \right) \right],$$
where $z = (z_1, \ldots, z_n)^\top$ denotes the row label vector and
$$f(\{a_i\}, z; \Omega, e^{(s)}) = \prod_{i=1}^n \pi_{z_i} \left\{ \prod_{j=1}^n e^{-\theta_i \theta_j \lambda_{z_i e_j^{(s)}}} \frac{\Big( \theta_i \theta_j \lambda_{z_i e_j^{(s)}} \Big)^{A_{ij}}}{A_{ij}!} \right\}.$$

The inner loop of Algorithm 2, i.e., steps 4 and 5, is different from that in Algorithm 1, as it considers a conditional EM (ECM) update. Specifically, the objective function $Q(\Omega \mid \Omega^{(s,t)}, e^{(s)})$ in the M-step, i.e., step 5, which solves for the block parameters $\lambda_{kl}$ and the degree parameters $\theta_i$, is nonconvex and does not have closed-form solutions. Hence, directly optimizing it using numerical techniques can be computationally costly and is not guaranteed to find the global optimum. The ECM algorithm replaces the challenging optimization problem in the M-step with a sequence of alternating updates, each of which has a closed-form solution. It is easy to implement and enjoys the desirable ascent property (Meng and Rubin, 1993). Consequently, Algorithm 2 has convergence guarantees, which improves over Amini et al. (2013).
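To make the CM-step concrete, the following is a small sketch (ours, not the authors' Matlab implementation) of the coordinate update for the degree parameters in Step 5 of Algorithm 2, assuming the degrees $d_i$ and the matrix $G = [g_{ij}^{(s,t+1)}]$ have already been formed; updating $\theta$ in place reproduces the convention that $h_i$ uses the newest values for $j < i$:

```python
import numpy as np

def update_theta(theta, d, G):
    """One CM sweep over the degree parameters (Step 5 of Algorithm 2).
    Each coordinate maximizer solves 2*g_ii*t^2 + h_i*t - d_i = 0 in
    t = theta_i and takes the positive root."""
    n = len(theta)
    for i in range(n):
        h = theta @ G[i] - theta[i] * G[i, i]  # h_i: sum over j != i
        theta[i] = (-h + np.sqrt(h * h + 8.0 * d[i] * G[i, i])) / (4.0 * G[i, i])
    return theta
```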

We also note that in our profile-pseudo likelihood approach, while the conditional distribution (on node degrees) of the Poisson variables is multinomial, the multinomial coefficient (i.e., the factorial term $\frac{d_i!}{b_{i1}!\, b_{i2}! \cdots b_{iK}!}$) in the density function involves the column labels (through the $b_{ik}$'s). As such, optimizing for the column labels in the outer loop becomes highly challenging. In Algorithm 2, we work with the pseudo likelihood without conditioning on node degrees, which requires estimating the degree parameters in the M-step. This is different from Amini et al. (2013).

5 Simulation Studies

In this section, we carry out simulation studies to investigate the finite sample performance of our proposed profile-pseudo likelihood method (referred to as PPL), and to compare it with existing solutions, including spectral clustering with permutations (referred to as SCP) and the pseudo likelihood method (referred to as PL) proposed in Amini et al. (2013). Both SCP and PL are implemented using the code provided by Amini et al. (2013). We also compare with the strongly consistent majority voting method proposed in Gao et al. (2017) (see Section A6 in the supplemental material).

We consider two evaluation criteria. The first is the normalized mutual information (NMI), which measures the agreement between the true labeling vector and an estimated labeling vector. The NMI takes values between 0 and 1, and a larger value implies higher accuracy. The second is the CPU running time, which measures the computational cost. Note that the reported running times do not include the initialization step (see Section A6 in the supplemental material and the discussion in Section 7). All methods are implemented in Matlab and run on a single processor of an Intel(R) Core(TM) i7-4790 CPU 3.60 GHz PC.
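The NMI can be computed from two label vectors as in the following sketch; note that several normalizations of mutual information are in use, and this one, $2I(X;Y)/\{H(X)+H(Y)\}$, may differ from the exact variant used in our Matlab experiments. Labels are assumed to be 0-based integer NumPy arrays.

```python
import numpy as np

def nmi(x, y):
    """Normalized mutual information between two label vectors:
    NMI = 2 I(X;Y) / {H(X) + H(Y)}; equals 1 for matching partitions."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1.0)              # contingency counts
    joint /= len(x)                            # joint distribution
    px, py = joint.sum(1), joint.sum(0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return 2 * mi / (hx + hy)
```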

5.1 SBM

In this section, we simulate networks from SBMs. Three different settings are considered. In Setting 1, we evaluate the convergence of PPL and PL; in Setting 2, we compare the performance of PPL, SCP and PL when the networks are small and dense; and in Setting 3, we compare the three methods when the networks are large and sparse.

Setting 1: In this simulation, we evaluate the convergence performance of PPL and PL with varying initial labeling vectors. We simulate from SBMs with $n = 500$ nodes that are divided into $K$ equal sized communities, and the within/between community connecting probabilities are $P_{kl} = p_1 + p_2 \times 1(k = l)$, $k, l = 1, \ldots, K$. We consider $(K, p_1, p_2) = (2, 0.13, 0.07)$ and $(K, p_1, p_2) = (5, 0.10, 0.13)$. Both the PPL and PL algorithms are considered to have converged if the change of the latest update (relative to the previous one) is less than $10^{-6}$ or if the number of outer iterations exceeds 60. We let the NMI of the initial labeling vector vary from 0.1 to 0.5. All simulations are repeated 100 times.


Figure 2: Proportion of convergence of PPL and PL with initial labels of varying NMI.

Figure 3: NMI and computing time of PPL and PL with varying network size n.

The proportion of convergence for PPL and PL is presented in Figure 2. It is seen that PL does not have a satisfactory convergence performance. One example (in the case of $K = 2$) of the convergence of PPL and the non-convergence of PL is shown in Figure 1, where it is observed that the PL algorithm did not converge, and the final estimate has a smaller log pseudo likelihood than the initial value.

Setting 2: In this simulation, we compare the performance of SCP, PL, and PPL on small-scale and dense networks. The PL method is not expected to perform well in this setting due to the relatively large Poisson approximation error. We acknowledge that many networks in real applications are large and/or sparse, and we note that here we use simulated examples to investigate a limitation of the PL method. We simulate from SBMs with $n$ nodes that are divided into $K = 2$ equal sized communities, and the within/between community connecting probabilities are $P_{kl} = p_1 + p_2 \times 1(k = l)$, $k, l = 1, \ldots, K$. We consider $(p_1, p_2) = (0.84, 0.06)$. Both PPL and PL are initialized by SCP. Figure 3 reports the NMI from the three methods based on 100 replications. It is seen that PPL outperforms PL both in terms of community detection accuracy (when $n < 1000$) and computational efficiency. The unsatisfactory performance of the PL method when $n < 1000$ is due to the errors from approximating binomial random variables with Poisson random variables; this approximation is not expected to work well when $p_1$ (or $p_2$) is large and when $n$ is small (Hodges and Le Cam, 1960). Also note that the PL method may perform worse than the initial labels, as its iterations do not enjoy the ascent property. It can also be seen that, as $n$ increases, the performance of PL improves notably.

Setting 3: In this simulation, we compare the performance of SCP, PL, and PPL on large-scale and sparse networks. We consider simulation settings similar to those in Amini et al. (2013). As in Decelle et al. (2011), the edge-probability matrix $P$ is controlled by two parameters: the "out-in-ratio" $\beta$, varying from 0 to 0.2, and the weight vector $\omega$, which determines the relative degrees within communities. We set $\omega = (1, 1, 1)$. When $\beta = 0$, $P^*$ is set to be the diagonal matrix $\mathrm{diag}(\omega)$; otherwise, we set the diagonal elements of $P^*$ to $\beta^{-1} \omega$ and set all the off-diagonal elements to 1. Then, the overall expected network degree is set to $\lambda$, which varies from 3 to 5. Finally, we re-scale $P^*$ to obtain this expected degree, giving the resulting $P$ as
$$P = \frac{\lambda}{(n-1)(\pi^T P^* \pi)}\, P^*, \tag{15}$$

which generates sparse networks, since $P_{kl} = O(1/n)$. In this simulation study, both PL and PPL are initialized by SCP. We let $K = 3$ and $\pi = (0.2, 0.3, 0.5)$. We consider three scenarios: 1) varying $\beta$ while setting $\lambda = 5$ and $n = 4000$; 2) varying $\lambda$ while setting $\beta = 0.05$ and $n = 4000$; and 3) varying $n$ while setting $\lambda = 5$ and $\beta = 0.05$. Figure 4 reports the NMI from the three methods and the computing time from PPL and PL, based on 100 replications.
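For concreteness, the construction of $P$ in this setting can be sketched as follows (function and variable names are ours); it builds $P^*$ from $\beta$ and $\omega$ and applies the rescaling in (15):

```python
import numpy as np

def edge_prob_matrix(beta, omega, lam, n, pi):
    """Construct P for Setting 3: P* = diag(omega) when beta = 0, otherwise
    diagonal beta^{-1} * omega with all off-diagonal entries 1; then rescale
    P* via (15) so that the overall expected degree equals lam."""
    omega, pi = np.asarray(omega, float), np.asarray(pi, float)
    K = len(omega)
    if beta == 0:
        P_star = np.diag(omega)
    else:
        P_star = np.ones((K, K))
        np.fill_diagonal(P_star, omega / beta)
    return lam * P_star / ((n - 1) * (pi @ P_star @ pi))

P = edge_prob_matrix(beta=0.05, omega=(1, 1, 1), lam=5, n=4000,
                     pi=(0.2, 0.3, 0.5))
```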


Figure 4: Comparisons of the NMI and computing time from SCP, PL and PPL under different settings. The three rows correspond to the following three scenarios, respectively: 1) varying β while setting λ = 5 and n = 4000; 2) varying λ while setting β = 0.05 and n = 4000; and 3) varying n while setting λ = 5 and β = 0.05.

We note that the reported running times for PPL and PL do not include the initialization step. For comparison, when $\lambda = 5$, $\beta = 0.05$ and $n = 10^6$, the SCP initialization step takes less than 100 seconds (see Section A6 in the supplemental material). It is seen that PPL outperforms both SCP and PL in terms of community detection accuracy. Moreover, PPL consistently outperforms PL in terms of computational efficiency.

5.2 Goodness of fit test and normality of plug-in estimators

To evaluate goodness of fit, we consider the maximum entry-wise deviation based testing procedure in Hu et al. (2020b).

Figure 5: Null densities of the test statistic with n = 600 (left panel) and n = 1200 (right panel). The blue dashed lines, red dash-dotted lines and black solid lines show the densities under SCP, PPL and the theoretical limit, respectively.

The authors showed that the distribution of the test statistic, denoted by $T_n$ and calculated with a strongly consistent community label, converges to a Gumbel distribution. In this simulation study, we consider an SBM with $K = 3$, $\pi = (0.2, 0.3, 0.5)$ and $P_{kl} = 0.12 + 0.08 \times I(k = l)$, and investigate the distribution of $T_n$ calculated using estimates from PPL and SCP, respectively. The results over 1000 replications are shown in Figure 5. It is seen that the sample null distribution of $T_n$ calculated with PPL is very close to the limiting distribution, while that calculated with SCP deviates from the limit considerably. This is because $T_n$ in Hu et al. (2020b) is based on the maximum entry-wise deviation and, as such, the misclassified nodes in SCP, albeit not many, may greatly inflate the test statistic. With the refinement of PPL, the test statistic has a sample null distribution close to the theoretical limit, ensuring a well-controlled test size.

To examine the normality of plug-in estimators, we consider an SBM with $K = 3$, $\pi = (0.2, 0.3, 0.5)$, $P_{kl} = 0.12 + 0.08 \times I(k = l)$ and $n = 800$. We consider the empirical distributions of $\hat\pi_1$, $\hat\pi_2$ and $\hat\pi_3$ calculated using labels produced by PPL and SCP, respectively. The results over 1000 replications are shown in Figure 6. It is seen that the empirical distributions calculated with PPL are very close to the limiting distributions, while those calculated with SCP deviate from the theoretical limits, especially for $\hat\pi_1$ and $\hat\pi_3$.


Figure 6: Empirical distributions of π̂1, π̂2 and π̂3. The blue dashed lines, red dash-dotted lines and black solid lines show the densities under SCP, PPL and the theoretical limit, respectively.


5.3 DCSBM

In this section, we evaluate the performance of the profile-pseudo likelihood method under the DCSBM, referred to as DC-PPL. We fix $K = 3$, $n = 1200$, $\pi = (0.2, 0.3, 0.5)$ and let $P = 10^{-2} \times [J_{K,K} + \mathrm{diag}(2, 3, 4)]$, where $J_{K,K}$ is a $K \times K$ matrix with every element equal to one. The degree parameters $\{\theta_i\}_{i=1}^n$ are generated as in Zhao et al. (2012), i.e.,
$$P(\theta_i = mx) = P(\theta_i = x) = 1/2, \quad \text{with } x = \frac{2}{m+1},$$
which ensures that $E(\theta_i) = 1$. We consider $m = 2, 4, 6$. Given $c$ and $\theta$, the edge variables $A_{ij}$ are independently generated from a Bernoulli distribution with parameters $\theta_i \theta_j P_{c_i c_j}$, $1 \le i \le j \le n$.
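The two-point degree distribution above can be sampled in a couple of lines (a sketch; the function name is ours):

```python
import numpy as np

def sample_theta(n, m, seed=None):
    """Two-point degree parameters: theta_i = x or m*x with probability
    1/2 each, where x = 2/(m+1), so that E(theta_i) = (x + m*x)/2 = 1."""
    rng = np.random.default_rng(seed)
    x = 2.0 / (m + 1)
    return rng.choice([x, m * x], size=n)
```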

We compare DC-PPL with SCP, as well as with CPL, an extension of PL proposed for networks with degree heterogeneity in Amini et al. (2013). The results, based on 100 replications, are summarized in Figure 7. Both DC-PPL and CPL outperform SCP, and DC-PPL performs better than CPL in terms of community detection accuracy.

Figure 7: Comparison of SCP, CPL, DC-PPL under DCSBM with varying m.

6 Real-world Data Examples


6.1 Political blogs data

In this subsection, we apply our proposed method to the network of political blogs collected by Adamic and Glance (2005). The nodes in this network are blogs on US politics and the edges are hyperlinks between these blogs, with directions removed. This data set was collected right after the 2004 presidential election and demonstrates strong divisions. In Adamic and Glance (2005), all the blogs were manually labeled as liberal or conservative, and we take these labels as the ground truth. As in Zhao et al. (2012), we focus on the largest connected component of the original network, which contains 1,222 nodes and 16,714 edges and has an average degree of approximately 27.

To perform community detection, we consider five different methods, namely, PL, PPL, SCP, CPL, and DC-PPL. We compute the NMI between the estimated community labels and the so-called ground truth labels. Figure 8 shows the community detection results from the five methods. It is seen that PPL and PL divide the nodes into two communities consisting of low degree and high degree nodes, respectively. Both the PPL and PL estimates have NMI close to zero, as neither of these two methods takes the degree heterogeneity into consideration. The partition obtained using SCP has NMI = 0.653, while that from CPL has NMI = 0.722 and that from DC-PPL has NMI = 0.727. Both CPL and DC-PPL achieve good performance in this application.

Figure 8: Community detection on the political blogs data: panels (a) True, (b) PL, (c) PPL, (d) SCP, (e) CPL, and (f) DC-PPL. The size of each node is proportional to its degree, and the color corresponds to the community label.

6.2 International trade data

In this subsection, we apply our proposed method to the network of international trade. The data contain yearly international trade among $n = 58$ countries from 1981 to 2000 (Westveld and Hoff, 2011). Each node in the network corresponds to a country, and an edge $(i, j)$ measures the amount of exports from country $i$ to country $j$ in a given year; see Westveld and Hoff (2011) for details. Following Saldana et al. (2017), we focus on the international trade network in 1995 and transform the directed, weighted adjacency matrix into an undirected binary network. Specifically, let $W_{ij} = \mathrm{Trade}_{ij} + \mathrm{Trade}_{ji}$, where $\mathrm{Trade}_{ij}$ records the amount of exports from country $i$ to country $j$, and set $A_{ij} = 1$ if $W_{ij} \ge W_{0.5}$ and $A_{ij} = 0$ otherwise, where $W_{0.5}$ denotes the 50th percentile of $\{W_{ij}\}_{1 \le i < j \le n}$.
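A sketch of this preprocessing step, assuming a dense matrix `trade` of export amounts (names are ours):

```python
import numpy as np

def binarize_trade(trade, q=50):
    """Symmetrize a weighted trade matrix and threshold at the q-th
    percentile of the upper-triangular weights: W = Trade + Trade^T,
    A_ij = 1 iff W_ij >= W_{0.5} (with q = 50), zero diagonal."""
    W = trade + trade.T
    iu = np.triu_indices(W.shape[0], k=1)
    cutoff = np.percentile(W[iu], q)
    A = (W >= cutoff).astype(int)
    np.fill_diagonal(A, 0)
    return A
```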
Group | Countries
1 | Algeria, Barbados, Bolivia, Costa Rica, Cyprus, Ecuador, El Salvador, Guatemala, Honduras, Iceland, Jamaica, Mauritius, Nepal, Oman, Panama, Paraguay, Peru, Trinidad and Tobago, Tunisia, Uruguay, Venezuela
2 | Belgium, Brazil, Canada, France, Germany, Italy, Japan, South Korea, Mexico, Netherlands, Spain, Switzerland, United Kingdom, United States
3 | Argentina, Australia, Austria, Chile, Colombia, Denmark, Egypt, Finland, Greece, India, Indonesia, Ireland, Israel, Malaysia, Morocco, New Zealand, Norway, Philippines, Portugal, Singapore, Sweden, Thailand, Turkey

Table 1: Community detection result on the international trade data using PPL with K = 3.

Using different model selection procedures, both Saldana et al. (2017) and Hu et al. (2020a) selected the number of SBM communities for this data set to be $K = 3$. Saldana et al. (2017) suggested that larger community numbers such as $K = 7$ are also reasonable and tend to provide finer solutions. We apply PPL to this network with $K = 3$, and the community detection result is summarized in Table 1. It is seen that the three communities mostly correspond to developing countries in South America with low GDPs, countries with high GDPs, and industrialized European and Asian countries with medium-level GDPs, respectively.

To evaluate goodness of fit, we consider the maximum entry-wise deviation based testing procedure (Hu et al., 2020b) that we investigated in Section 5.2. The community labels identified using SCP under $K = 3$ give a test statistic value of 52.13 with a p-value less than $10^{-10}$, suggesting a lack of fit. On the other hand, the community labels identified by PPL, initialized using SCP under $K = 3$, give a test statistic of 4.59 with a p-value of 0.03. Therefore, the goodness of fit test for PPL under $K = 3$ is not rejected at the significance level of 0.01. It is also worth noting that with $K = 4$, PPL gives a test statistic of 2.38 with a p-value of 0.08, while SCP gives a p-value less than $10^{-3}$. This data example shows that refinement of the initial clustering solution can be useful in inferential tasks such as the goodness of fit test.

7 Discussion

In this paper, we propose a new profile-pseudo likelihood method for fitting SBMs to large networks. Specifically, we consider a novel approach that decouples the membership labels of the rows and columns in the likelihood function and treats the row labels as a vector of latent variables. Correspondingly, the likelihood can be maximized in an alternating fashion over the block model parameters and over the column community labels. Our proposed method retains and improves on the computational efficiency of the pseudo likelihood method, performs well for both small and large scale networks, and has a provable convergence guarantee. We show that the community labels (i.e., column labels) estimated from our proposed method enjoy strong consistency, as long as the initial labels have an overlap with the truth beyond that of random guessing.

In our approach, we consider spectral clustering as the initialization method, which requires computing the $K$ leading eigenvectors. In real-world applications, many implementations of eigen-decomposition are scalable, such as the PageRank algorithm adopted in Google search (Page et al., 1999). We also note that our method need not limit the initialization algorithm to spectral clustering. For large-scale networks, one may consider the FastGreedy method of Clauset et al. (2004), which has a complexity of $O(n \log^2 n)$, or the Louvain algorithm of Blondel et al. (2008), which has a complexity of $O(n \log n)$ (Yang et al., 2016). These fast algorithms, to the best of our knowledge, may not have theoretical guarantees on their performance. However, they have been validated empirically across various fields (Yang et al., 2016) and can be considered as initialization methods when spectral clustering is not feasible.

Although we focus on SBMs and DCSBMs in this work, we envision that the idea of simplifying the block model likelihoods by decoupling the membership labels of rows and columns can be applied to other network block model problems, such as mixed membership SBMs (Airoldi et al., 2008), block models with additional node features (Zhang et al., 2016) and SBMs with dependent edges (Yuan and Qu, 2018). We plan to investigate these directions in future work.

The code is publicly available on GitHub (https://github.com/WangJiangzhou/Fast-Network-Community-Detection-with-Profile-Pseudo-Likelihood-Methods).

Acknowledgment

Wang, Liu and Guo's research is supported by NSFC grants 11690012 and 11571068, the Fundamental Research Funds for the Central Universities grant 2412017BJ002, the Key Laboratory of Applied Statistics of MOE (KLAS) grants 130026507 and 130028612, and the Special Fund for Key Laboratories of Jilin Province, China, grant 20190201285JC. Zhang's research is supported by NSF DMS-2015190, and Zhu's research is supported by NSF DMS-1821243.

References
Abbe, E. (2017), “Community detection and stochastic block models: recent developments,”
The Journal of Machine Learning Research, 18, 6446–6531.

Abbe, E., Bandeira, A. S. and Hall, G. (2015), “Exact recovery in the stochastic block
model,” IEEE Transactions on Information Theory, 62, 471–487.

Abbe, E., Fan, J., Wang, K., Zhong, Y., et al. (2020), “Entrywise eigenvector analysis of
random matrices with low expected rank,” Annals of Statistics, 48, 1452–1474.

Adamic, L. A. and Glance, N. (2005), "The political blogosphere and the 2004 U.S. election: divided they blog," in International Workshop on Link Discovery, pp. 36–43.

Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008), “Mixed membership
stochastic block models,” Journal of Machine Learning Research, 9, 1981–2014.

Amini, A. A., Chen, A., Bickel, P. J., and Levina, E. (2013), “Pseudo-likelihood methods for
community detection in large sparse networks,” The Annals of Statistics, 41, 2097–2122.

Bickel, P., Choi, D., Chang, X., and Zhang, H. (2013), “Asymptotic normality of maximum
likelihood and its variational approximation for stochastic block models,” The Annals of
Statistics, 1922–1943.

Bickel, P. J. and Chen, A. (2009), “A nonparametric view of network models and Newman–
Girvan and other modularities,” Proceedings of the National Academy of Sciences, 106,
21068–21073.

Bisson, G. and Hussain, F. (2008), “Chi-sim: A new similarity measure for the co-clustering
task,” in Machine Learning and Applications, 2008. ICMLA’08. Seventh International
Conference on, IEEE, pp. 211–217.

Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008), “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, 2008, P10008.

Clauset, A., Newman, M. E. and Moore, C. (2004), “Finding community structure in very large networks,” Physical Review E, 70, 066111.

Daudin, J.-J., Picard, F. and Robin, S. (2008), “A mixture model for random graphs,”
Statistics and Computing, 18, 173–183.

Decelle, A., Krzakala, F., Moore, C., and Zdeborová, L. (2011), “Asymptotic analysis of the
stochastic block model for modular networks and its algorithmic applications,” Physical
Review E, 84, 066106.

Fortunato, S. (2010), “Community detection in graphs,” Physics Reports, 486, 75–174.

Fortunato, S. and Hric, D. (2016), “Community detection in networks: A user guide,” Physics
Reports, 659, 1–44.

Gao, C., Ma, Z., Zhang, A. Y., and Zhou, H. H. (2017), “Achieving Optimal Misclassification
Proportion in Stochastic Block Models,” Journal of Machine Learning Research, 18, 1–45.

Gao, C., Ma, Z., Zhang, A. Y., and Zhou, H. H. (2018), “Community detection in degree-
corrected block models,” Annals of Statistics, 46, 2153–2185.

Hodges, J. L. and Le Cam, L. (1960), “The Poisson approximation to the Poisson binomial
distribution,” The Annals of Mathematical Statistics, 31, 737–740.

Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983), “Stochastic block models: First
steps,” Social Networks, 5, 109–137.

Hu, J., Qin, H., Yan, T., and Zhao, Y. (2020a), “Corrected Bayesian information criterion for
stochastic block models,” Journal of the American Statistical Association, 115, 1771–1783.

Hu, J., Zhang, J., Qin, H., Yan, T., and Zhu, J. (2020b), “Using Maximum Entry-Wise De-
viation to Test the Goodness of Fit for Stochastic Block Models,” Journal of the American
Statistical Association, 1–10.

Joseph, A. and Yu, B. (2016), “Impact of regularization on spectral clustering,” Annals of Statistics, 44, 1765–1791.

Karrer, B. and Newman, M. E. (2011), “Stochastic block models and community structure
in networks,” Physical Review E, 83, 016107.

Larremore, D. B., Clauset, A. and Jacobs, A. Z. (2014), “Efficiently inferring community structure in bipartite networks,” Physical Review E, 90, 012805.

Lei, J. (2016), “A goodness-of-fit test for stochastic block models,” The Annals of Statistics,
44, 401–424.

Lei, J. and Rinaldo, A. (2015), “Consistency of spectral clustering in stochastic block mod-
els,” The Annals of Statistics, 43, 215–237.

Lei, J. and Zhu, L. (2017), “Generic Sample Splitting for Refined Community Recovery in
Degree Corrected Stochastic Block Models,” Statistica Sinica, 1639–1659.

Li, T., Levina, E. and Zhu, J. (2020), “Network cross-validation by edge sampling,”
Biometrika, 107, 257–276.

Madeira, S. C., Teixeira, M. C., Sa-Correia, I., and Oliveira, A. L. (2010), “Identification
of regulatory modules in time series gene expression data using a linear time bicluster-
ing algorithm,” IEEE/ACM Transactions on Computational Biology and Bioinformatics
(TCBB), 7, 153–165.

Meng, X.-L. and Rubin, D. B. (1993), “Maximum likelihood estimation via the ECM algo-
rithm: A general framework,” Biometrika, 80, 267–278.

Moody, J. and White, D. R. (2003), “Structural cohesion and embeddedness: A hierarchical concept of social groups,” American Sociological Review, 103–127.

Nowicki, K. and Snijders, T. A. B. (2001), “Estimation and prediction for stochastic block-
structures,” Journal of the American Statistical Association, 96, 1077–1087.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1999), “The PageRank citation ranking:
Bringing order to the web.” Tech. rep., Stanford InfoLab.

Rohe, K., Chatterjee, S. and Yu, B. (2011), “Spectral clustering and the high-dimensional
stochastic blockmodel,” The Annals of Statistics, 39, 1878–1915.

Rohe, K., Qin, T. and Yu, B. (2012), “Co-clustering for directed graphs: the Stochastic
co-Blockmodel and spectral algorithm Di-Sim,” arXiv preprint arXiv:1204.2296.

Saldana, D. F., Yu, Y. and Feng, Y. (2017), “How many communities are there?” Journal
of Computational and Graphical Statistics, 26, 171–181.

Sarkar, S. and Dong, A. (2011), “Community detection in graphs using singular value decom-
position,” Physical Review E Statistical Nonlinear and Soft Matter Physics, 83, 046114.

Snijders, T. A. and Nowicki, K. (1997), “Estimation and prediction for stochastic block
models for graphs with latent block structure,” Journal of Classification, 14, 75–100.

Spirin, V. and Mirny, L. A. (2003), “Protein complexes and functional modules in molecular
networks,” Proceedings of the National Academy of Sciences, 100, 12123–12128.

Su, L., Wang, W. and Zhang, Y. (2019), “Strong consistency of spectral clustering for
stochastic block models,” IEEE Transactions on Information Theory, 66, 324–338.

Westveld, A. H. and Hoff, P. D. (2011), “A mixed effects model for longitudinal relational
and network data, with applications to international trade and conflict,” The Annals of
Applied Statistics, 5, 843–872.

Wu, C. J. (1983), “On the convergence properties of the EM algorithm,” The Annals of Statistics, 11, 95–103.

Yang, Z., Algesheimer, R. and Tessone, C. J. (2016), “A comparative analysis of community detection algorithms on artificial networks,” Scientific Reports, 6, 1–18.

Yuan, Y. and Qu, A. (2018), “Community Detection with Dependent Connectivity,” arXiv
preprint arXiv:1812.06406.

Zhang, J. and Chen, Y. (2018), “Modularity based community detection in heterogeneous networks,” arXiv preprint arXiv:1803.07961.

Zhang, Y., Levina, E. and Zhu, J. (2016), “Community detection in networks with node
features,” Electronic Journal of Statistics, 10, 3153–3178.

Zhao, Y. (2017), “A survey on theoretical advances of community detection in networks,” Wiley Interdisciplinary Reviews: Computational Statistics, 9, e1403.

Zhao, Y., Levina, E. and Zhu, J. (2012), “Consistency of community detection in networks
under degree-corrected stochastic block models,” The Annals of Statistics, 40, 2266–2292.

Supplementary Materials
Fast Network Community Detection with Profile-Pseudo
Likelihood Methods

Jiangzhou Wang, Jingfei Zhang, Binghui Liu, Ji Zhu, and Jianhua Guo

A1 Proof of Theorem 1

To prove Theorem 1, it suffices to show

\[
\mathcal{L}_{\mathrm{PL}}\big(\Omega^{(s)}, e^{(s)}; \{a_i\}\big) \le \mathcal{L}_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s)}; \{a_i\}\big), \tag{S1}
\]
\[
\mathcal{L}_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s)}; \{a_i\}\big) \le \mathcal{L}_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s+1)}; \{a_i\}\big). \tag{S2}
\]

Consider (S1). The updating procedure from {Ω^(s), e^(s)} to {Ω^(s+1), e^(s)} can be seen as fitting a mixture model; thus, inequality (S1) holds by the ascent property of the EM algorithm (Wu, 1983).

Consider (S2). It is equivalent to

\[
\ell_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s)}; \{a_i\}\big) \le \ell_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s+1)}; \{a_i\}\big). \tag{S3}
\]
We have, writing τ_{il}^{(s+1)} for the E-step posterior probabilities computed at {Ω^{(s+1)}, e^{(s)}},
\[
\begin{aligned}
&\ell_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s+1)}; \{a_i\}\big) - \ell_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s)}; \{a_i\}\big)\\
&= \sum_{i=1}^{n}\log\Bigg[\sum_{l=1}^{K}\pi_l^{(s+1)}\prod_{j=1}^{n}\Big\{P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{1-A_{ij}}\Bigg]
-\sum_{i=1}^{n}\log\Bigg[\sum_{l=1}^{K}\pi_l^{(s+1)}\prod_{j=1}^{n}\Big\{P^{(s+1)}_{l e_j^{(s)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s)}}\Big\}^{1-A_{ij}}\Bigg]\\
&= \sum_{i=1}^{n}\log\Bigg[\sum_{l=1}^{K}\tau_{il}^{(s+1)}\,
\frac{\prod_{j=1}^{n}\Big\{P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{1-A_{ij}}}{\prod_{j=1}^{n}\Big\{P^{(s+1)}_{l e_j^{(s)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s)}}\Big\}^{1-A_{ij}}}\Bigg]\\
&\ge \sum_{i=1}^{n}\sum_{l=1}^{K}\tau_{il}^{(s+1)}\log
\frac{\prod_{j=1}^{n}\Big\{P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{1-A_{ij}}}{\prod_{j=1}^{n}\Big\{P^{(s+1)}_{l e_j^{(s)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s)}}\Big\}^{1-A_{ij}}}\\
&= \sum_{j=1}^{n}\Bigg[\sum_{i=1}^{n}\sum_{l=1}^{K}\tau_{il}^{(s+1)}\log\Big[\Big\{P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{1-A_{ij}}\Big]
-\sum_{i=1}^{n}\sum_{l=1}^{K}\tau_{il}^{(s+1)}\log\Big[\Big\{P^{(s+1)}_{l e_j^{(s)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s)}}\Big\}^{1-A_{ij}}\Big]\Bigg]\\
&\ge 0,
\end{aligned}
\]
where the first inequality is due to Jensen’s inequality, and the second inequality is due to

the update strategy for e(s) in Algorithm 1. The proof is completed.

A2 Proof of Theorem 2

We focus on the case of γ ∈ (1/2, 1) and a > b. For the remaining three cases of (i) γ ∈ (1/2, 1), a < b, (ii) γ ∈ (0, 1/2), a > b, and (iii) γ ∈ (0, 1/2), a < b, the proofs are similar.
For any (â, b̂) ∈ P^δ_{a,b}, we have â > b̂. The PPL estimate can be written as follows:
\[
\hat{c}_j\{e^{(0)}\} = \arg\max_{k\in\{1,2\}}\sum_{i=1}^{n}\sum_{l=1}^{2}\log\Big[\big(\hat{P}_{lk}\big)^{\tilde{A}_{ij}}\big(1-\hat{P}_{lk}\big)^{1-\tilde{A}_{ij}}\Big]\,\hat{\tau}_{il}\{e^{(0)}\}.
\]

Consider j ∈ {1, 2, . . . , m}. Then ĉ_j{e^{(0)}} = 1 if
\[
\sum_{i=1}^{n}\sum_{l=1}^{2}\log\Big[\big(\hat{P}_{l1}\big)^{\tilde{A}_{ij}}\big(1-\hat{P}_{l1}\big)^{1-\tilde{A}_{ij}}\Big]\,\hat{\tau}_{il}\{e^{(0)}\} > \sum_{i=1}^{n}\sum_{l=1}^{2}\log\Big[\big(\hat{P}_{l2}\big)^{\tilde{A}_{ij}}\big(1-\hat{P}_{l2}\big)^{1-\tilde{A}_{ij}}\Big]\,\hat{\tau}_{il}\{e^{(0)}\},
\]

which is equivalent to
\[
\sum_{l=1}^{2}\Bigg\{\sum_{i=1}^{n}\tilde{A}_{ij}\,\hat{\tau}_{il}\{e^{(0)}\}\log\hat{P}_{l1}+\sum_{i=1}^{n}\big(1-\tilde{A}_{ij}\big)\hat{\tau}_{il}\{e^{(0)}\}\log\big(1-\hat{P}_{l1}\big)\Bigg\} >
\sum_{l=1}^{2}\Bigg\{\sum_{i=1}^{n}\tilde{A}_{ij}\,\hat{\tau}_{il}\{e^{(0)}\}\log\hat{P}_{l2}+\sum_{i=1}^{n}\big(1-\tilde{A}_{ij}\big)\hat{\tau}_{il}\{e^{(0)}\}\log\big(1-\hat{P}_{l2}\big)\Bigg\}. \tag{S4}
\]
We let B̃^0_{lj} ≜ Σ_{i=1}^{n} Ã_{ij} τ̂_{il}{e^{(0)}} and n^0_l ≜ Σ_{i=1}^{n} τ̂_{il}{e^{(0)}} for all j = 1, 2, . . . , n and l = 1, 2, and recall
\[
\hat{P} = \begin{pmatrix}\hat{P}_{11} & \hat{P}_{12}\\ \hat{P}_{21} & \hat{P}_{22}\end{pmatrix} = \frac{1}{m}\begin{pmatrix}\hat{a} & \hat{b}\\ \hat{b} & \hat{a}\end{pmatrix}. \tag{S5}
\]

By simplifying (S4), we can restate that ĉ_j{e^{(0)}} = 1 if
\[
\big(\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\big)\log\frac{\hat{P}_{11}}{\hat{P}_{12}}+\Big\{\tilde{B}^0_{1j}-\tilde{B}^0_{2j}-(n^0_1-n^0_2)\Big\}\log\frac{1-\hat{P}_{12}}{1-\hat{P}_{11}} > 0. \tag{S6}
\]

Since â > b̂, we have P̂_{11} > P̂_{12}. Thus by (S6), we have
\[
\begin{aligned}
P\big[\hat{c}_j\{e^{(0)}\}\neq 1\big]
&\le P\Big[\big\{\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le 0\big\}\cup\big\{\tilde{B}^0_{1j}-\tilde{B}^0_{2j}-(n^0_1-n^0_2)\le 0\big\}\Big]\\
&\le P\Big[\big\{\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le\epsilon\big\}\cup\big\{|n^0_1-n^0_2|\ge\epsilon\big\}\Big]\\
&\le P\big[\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le\epsilon\big]+P\big[|n^0_1-n^0_2|\ge\epsilon\big]. \tag{S7}
\end{aligned}
\]
Next, we upper bound the two terms P[B̃^0_{1j} − B̃^0_{2j} ≤ ε] and P[|n^0_1 − n^0_2| ≥ ε] separately.

Firstly, we have
\[
\begin{aligned}
P\big[\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le\epsilon\big]
&= P\Bigg[\sum_{i=1}^{n}\tilde{A}_{ij}\hat{\tau}_{i1}\{e^{(0)}\}-\sum_{i=1}^{n}\tilde{A}_{ij}\hat{\tau}_{i2}\{e^{(0)}\}\le\epsilon\Bigg]\\
&\le P\Bigg[\sum_{i=1}^{n}\tilde{A}_{ij}I(c_i=1)-\sum_{i=1}^{n}\tilde{A}_{ij}I(c_i=2)+2\sum_{i=1}^{m}\tilde{A}_{ij}\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\le\epsilon\Bigg]\\
&\le P\Bigg[\sum_{i=1}^{n}\tilde{A}_{ij}I(c_i=1)-\sum_{i=1}^{n}\tilde{A}_{ij}I(c_i=2)\le 2\epsilon\Bigg]+\sum_{i=1}^{m}P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big]. \tag{S8}
\end{aligned}
\]
Next, we have
\[
\begin{aligned}
P\big[|n^0_1-n^0_2|\ge\epsilon\big]
&\le P\Big[\Big\{n^0_1\le\frac{n}{2}-\frac{\epsilon}{2}\Big\}\cup\Big\{n^0_2\le\frac{n}{2}-\frac{\epsilon}{2}\Big\}\Big]
\le P\Big[n^0_1\le\frac{n}{2}-\frac{\epsilon}{2}\Big]+P\Big[n^0_2\le\frac{n}{2}-\frac{\epsilon}{2}\Big]\\
&\le \sum_{i=1}^{m}P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big]+\sum_{i=m+1}^{n}P\Big[\big|\hat{\tau}_{i2}\{e^{(0)}\}-I(c_i=2)\big|\ge\frac{\epsilon}{n}\Big]. \tag{S9}
\end{aligned}
\]
Similar to Lemma 1 in Amini et al. (2013), we can upper bound the term P[Σ_{i=1}^{n} Ã_{ij} I(c_i=1) − Σ_{i=1}^{n} Ã_{ij} I(c_i=2) ≤ 2ε] as follows. Let
\[
\tilde{\eta}_j\big(\sigma(c)\big) = \sum_{i=1}^{n}\tilde{A}_{ij}I(c_i=1)-\sum_{i=1}^{n}\tilde{A}_{ij}I(c_i=2) \triangleq \sum_{i=1}^{n}\tilde{A}_{ij}\sigma_i(c),
\]
where σ_i(c) = 1 if c_i = 1 and σ_i(c) = −1 if c_i = 2, and σ(c) = (σ_1(c), σ_2(c), . . . , σ_n(c)). Let α̃_{ij} = E[Ã_{ij}]. Since |Ã_{ij}σ_i(c) − E[Ã_{ij}σ_i(c)]| ≤ max{α̃_{ij}, 1−α̃_{ij}} ≤ 1, we have, for j = 1, 2, . . . , m,
\[
E\big[\tilde{\eta}_j(\sigma(c))\big] = m\cdot\frac{a}{m}-m\cdot\frac{b}{m} = a-b,
\]
\[
\upsilon = \mathrm{Var}\big(-\tilde{\eta}_j(\sigma(c))\big) = \sum_{i=1}^{n}\mathrm{Var}\big(\tilde{A}_{ij}\big) \le \sum_{i=1}^{n}E\big[\tilde{A}_{ij}^2\big] = \sum_{i=1}^{n}E\big[\tilde{A}_{ij}\big] = a+b.
\]

Then by applying the Bernstein inequality to −η̃_j(σ(c)), we have
\[
P\Big[\tilde{\eta}_j\big(\sigma(c)\big)\le E\big[\tilde{\eta}_j(\sigma(c))\big]-t\Big] = P\Big[-\tilde{\eta}_j\big(\sigma(c)\big)\ge -E\big[\tilde{\eta}_j(\sigma(c))\big]+t\Big] \le e^{-\frac{t^2}{2(\upsilon+t/3)}}, \quad \forall t\ge 0. \tag{S10}
\]
Note that for t ∈ [0, 3(a+b)], we have 2(υ + t/3) ≤ 4(a+b). It follows from (S10) that
\[
P\Big[\tilde{\eta}_j\big(\sigma(c)\big)\le (a-b)-t\Big] \le e^{-\frac{t^2}{4(a+b)}}, \quad \forall t\in[0,3(a+b)]. \tag{S11}
\]
In order to bound P[η̃_j(σ(c)) ≤ 2ε], we take t = (a−b) − 2ε. Then t ∈ [0, 3(a+b)] when n is large enough, as (a−b)²/(a+b) ≥ C log n for a sufficiently large C. Thus we have
\[
P\big[\tilde{\eta}_j(\sigma(c))\le 2\epsilon\big] \le e^{-\frac{\{(a-b)-2\epsilon\}^2}{4(a+b)}} = e^{-\frac{(a-b)^2-4\epsilon(a-b)+4\epsilon^2}{4(a+b)}}. \tag{S12}
\]
To obtain upper bounds of (S8) and (S9), we need to upper bound P[|τ̂_{i1}{e^{(0)}} − I(c_i=1)| ≥ ε/n] for all i ∈ {1, 2, . . . , m} and P[|τ̂_{i2}{e^{(0)}} − I(c_i=2)| ≥ ε/n] for all i ∈ {m+1, m+2, . . . , n}. Firstly, we consider the case of i ∈ {1, 2, . . . , m}. With (â, b̂) ∈ P^δ_{a,b} and (S5), we have
\[
\frac{\hat{P}_{11}/(1-\hat{P}_{11})}{\hat{P}_{12}/(1-\hat{P}_{12})} = \frac{\hat{a}/(m-\hat{a})}{\hat{b}/(m-\hat{b})} \ge \frac{\hat{a}/(m-\hat{b})}{\hat{b}/(m-\hat{b})} = \frac{\hat{a}}{\hat{b}} \ge \delta.
\]
Let B̃_{ik} = Σ_{j=1}^{n} Ã_{ij} I(e_j=k) and n_k = Σ_{i=1}^{n} I(e_i=k); we then have
\[
\begin{aligned}
P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big]
&= P\Bigg[\frac{\hat{\tau}_{i1}\{e^{(0)}\}}{\hat{\tau}_{i2}\{e^{(0)}\}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]\\
&= P\Bigg[\frac{(\hat{P}_{11})^{\tilde{B}_{i1}}(\hat{P}_{12})^{\tilde{B}_{i2}}(1-\hat{P}_{11})^{n_1-\tilde{B}_{i1}}(1-\hat{P}_{12})^{n_2-\tilde{B}_{i2}}}{(\hat{P}_{21})^{\tilde{B}_{i1}}(\hat{P}_{22})^{\tilde{B}_{i2}}(1-\hat{P}_{21})^{n_1-\tilde{B}_{i1}}(1-\hat{P}_{22})^{n_2-\tilde{B}_{i2}}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]\\
&= P\Bigg[\Bigg\{\frac{\hat{P}_{11}/(1-\hat{P}_{11})}{\hat{P}_{12}/(1-\hat{P}_{12})}\Bigg\}^{\tilde{B}_{i1}-\tilde{B}_{i2}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]\\
&\le P\Bigg[\delta^{\tilde{B}_{i1}-\tilde{B}_{i2}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]+P\big[\tilde{B}_{i1}-\tilde{B}_{i2}<0\big]\\
&\le 2P\Bigg[\tilde{B}_{i1}-\tilde{B}_{i2}\le\frac{1}{\log\delta}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg]. \tag{S13}
\end{aligned}
\]
Let
\[
\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big) = \tilde{B}_{i1}-\tilde{B}_{i2} = \sum_{j=1}^{n}\tilde{A}_{ij}\sigma_j\{e^{(0)}\},
\]
where σ_j{e^{(0)}} = 1 if e_j = 1 and σ_j{e^{(0)}} = −1 if e_j = 2, and σ{e^{(0)}} = (σ_1{e^{(0)}}, σ_2{e^{(0)}}, . . . , σ_n{e^{(0)}}). Note that |Ã_{ij}σ_j{e^{(0)}} − E[Ã_{ij}σ_j{e^{(0)}}]| ≤ max{α̃_{ij}, 1−α̃_{ij}} ≤ 1. For i ∈ {1, 2, . . . , m}, we have
\[
E\Big[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\Big] = \Big\{\gamma m\cdot\frac{a}{m}+(1-\gamma)m\cdot\frac{b}{m}\Big\}-\Big\{(1-\gamma)m\cdot\frac{a}{m}+\gamma m\cdot\frac{b}{m}\Big\} = (2\gamma-1)(a-b),
\]
\[
\upsilon = \mathrm{Var}\Big(-\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\Big) = \sum_{j=1}^{n}\mathrm{Var}\big(\tilde{A}_{ij}\big) \le \sum_{j=1}^{n}E\big[\tilde{A}_{ij}^2\big] = \sum_{j=1}^{n}E\big[\tilde{A}_{ij}\big] = a+b.
\]

Then by applying the Bernstein inequality to −ξ̃_i(σ{e^{(0)}}), we have
\[
P\Big[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\le E\big[\tilde{\xi}_i(\sigma\{e^{(0)}\})\big]-t\Big] \le e^{-\frac{t^2}{2(\upsilon+t/3)}}, \quad \forall t\ge 0. \tag{S14}
\]
Note that for t ∈ [0, 3(a+b)], we have 2(υ + t/3) ≤ 4(a+b). It follows from (S14) that
\[
P\Big[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\le (2\gamma-1)(a-b)-t\Big] \le e^{-\frac{t^2}{4(a+b)}}, \quad \forall t\in[0,3(a+b)]. \tag{S15}
\]
In order to bound P[ξ̃_i(σ{e^{(0)}}) ≤ (1/log δ) log{(1−ε/n)/(ε/n)}], we take t = (2γ−1)(a−b) − (1/log δ) log{(1−ε/n)/(ε/n)}. Then t ∈ [0, 3(a+b)] when n is large enough. Thus, we have
\[
\begin{aligned}
P\Bigg[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\le\frac{1}{\log\delta}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg]
&\le \exp\Bigg[-\frac{\Big\{(2\gamma-1)(a-b)-\frac{1}{\log\delta}\log\big(\frac{1-\epsilon/n}{\epsilon/n}\big)\Big\}^2}{4(a+b)}\Bigg]\\
&\le e^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}} \quad \text{(when } n \text{ is large enough)}.
\end{aligned}
\]
It follows from (S13) that (when n is large enough)
\[
P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big] \le 2e^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}, \quad \forall i\in\{1,2,\dots,m\}. \tag{S16}
\]
Similar results for P[|τ̂_{i2}{e^{(0)}} − I(c_i=2)| ≥ ε/n] can be obtained by using similar arguments. Specifically, we have
\[
P\Big[\big|\hat{\tau}_{i2}\{e^{(0)}\}-I(c_i=2)\big|\ge\frac{\epsilon}{n}\Big] \le 2e^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}, \quad \forall i\in\{m+1,m+2,\dots,n\}. \tag{S17}
\]
Thus by (S8), (S12), (S16) and (S17), for j = 1, 2, . . . , m, we have
\[
P\big[\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le\epsilon\big] \le e^{-\frac{(a-b)^2-4\epsilon(a-b)+4\epsilon^2}{4(a+b)}}+ne^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}. \tag{S18}
\]
For j = m+1, m+2, . . . , n, the term P[B̃^0_{2j} − B̃^0_{1j} ≤ ε] can be bounded as follows,
\[
P\big[\tilde{B}^0_{2j}-\tilde{B}^0_{1j}\le\epsilon\big] \le e^{-\frac{(a-b)^2-4\epsilon(a-b)+4\epsilon^2}{4(a+b)}}+ne^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}. \tag{S19}
\]
According to (S9), (S16), and (S17), we have
\[
P\big[|n^0_1-n^0_2|\ge\epsilon\big] \le 2ne^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}. \tag{S20}
\]

Finally, by (S18), (S19), and (S20), we have
\[
\begin{aligned}
P\big[\hat{c}\{e^{(0)}\}\neq c\big]
&= P\Bigg[\bigcup_{j\in\{1,2,\dots,n\}}\big\{\hat{c}_j\{e^{(0)}\}\neq c_j\big\}\Bigg]\\
&\le P\Bigg[\bigcup_{j\in\{1,\dots,m\}}\big\{\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le\epsilon\big\}\cup\bigcup_{j\in\{m+1,\dots,n\}}\big\{\tilde{B}^0_{2j}-\tilde{B}^0_{1j}\le\epsilon\big\}\cup\big\{|n^0_1-n^0_2|\ge\epsilon\big\}\Bigg]\\
&\le \sum_{j=1}^{m}P\big[\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le\epsilon\big]+\sum_{j=m+1}^{n}P\big[\tilde{B}^0_{2j}-\tilde{B}^0_{1j}\le\epsilon\big]+P\big[|n^0_1-n^0_2|\ge\epsilon\big]\\
&= ne^{-\frac{(a-b)^2-4\epsilon(a-b)+4\epsilon^2}{4(a+b)}}+n(n+2)e^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}.
\end{aligned}
\]
Therefore, we have that
\[
P\big[\hat{c}\{e^{(0)}\}=c\big] = 1-P\big[\hat{c}\{e^{(0)}\}\neq c\big] \ge 1-\Big\{ne^{-\frac{(a-b)^2-4\epsilon(a-b)+4\epsilon^2}{4(a+b)}}+n(n+2)e^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}\Big\}.
\]

A3 Proof of Theorem 3
Recall that A and Ã are the adjacency matrices of the undirected and directed networks, respectively. Similar to the technique in Amini et al. (2013), we introduce a deterministic coupling between A and Ã, which allows us to carry over the results from the directed SBM. Let
\[
A = T\big(\tilde{A}\big), \qquad \big[T\big(\tilde{A}\big)\big]_{ij} = \begin{cases} 0, & \tilde{A}_{ij}=\tilde{A}_{ji}=0,\\ 1, & \text{otherwise}. \end{cases} \tag{S21}
\]
That is, the graph of A is obtained from that of Ã by removing directions. Note that
\[
P_{kl} = P(A_{ij}=1) = 1-P\big(\tilde{A}_{ij}=0\big)P\big(\tilde{A}_{ji}=0\big) = 2\tilde{P}_{kl}-\big(\tilde{P}_{kl}\big)^2,
\]
which matches the relationship between (7) and (8). From (S21), it is not difficult to see that A_{ij} ≥ Ã_{ij} for all i, j ∈ {1, 2, . . . , n}.
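As a remark, this coupling is easy to check numerically; the following minimal sketch (Python, with illustrative parameter values) verifies both the dominance A_ij ≥ Ã_ij and the edge-probability identity above:

import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 0.01
A_tilde = rng.binomial(1, p, size=(n, n))   # directed network: independent Bernoulli edges
np.fill_diagonal(A_tilde, 0)

# T removes directions: A_ij = 0 only when both A~_ij = 0 and A~_ji = 0, as in (S21)
A = ((A_tilde + A_tilde.T) > 0).astype(int)

assert np.all(A >= A_tilde)                  # A_ij >= A~_ij for all i, j
iu = np.triu_indices(n, 1)
print(A[iu].mean(), 2 * p - p ** 2)          # empirical edge rate vs 2*P~ - P~^2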

We focus on the case of γ ∈ (1/2, 1) and a > b. For the remaining three cases of (i) γ ∈ (1/2, 1), a < b, (ii) γ ∈ (0, 1/2), a > b, and (iii) γ ∈ (0, 1/2), a < b, the proofs are similar. For any (â, b̂) ∈ P^δ_{a,b}, we have â > b̂. The PPL estimate can be written as
\[
\hat{c}_j\{e^{(0)}\} = \arg\max_{k\in\{1,2\}}\sum_{i=1}^{n}\sum_{l=1}^{2}\log\Big[\big(\hat{P}_{lk}\big)^{A_{ij}}\big(1-\hat{P}_{lk}\big)^{1-A_{ij}}\Big]\,\hat{\tau}_{il}\{e^{(0)}\}. \tag{S22}
\]

We first consider j ∈ {1, 2, . . . , m}. Then ĉ_j{e^{(0)}} = 1 if
\[
\sum_{i=1}^{n}\sum_{l=1}^{2}\log\Big[\big(\hat{P}_{l1}\big)^{A_{ij}}\big(1-\hat{P}_{l1}\big)^{1-A_{ij}}\Big]\,\hat{\tau}_{il}\{e^{(0)}\} > \sum_{i=1}^{n}\sum_{l=1}^{2}\log\Big[\big(\hat{P}_{l2}\big)^{A_{ij}}\big(1-\hat{P}_{l2}\big)^{1-A_{ij}}\Big]\,\hat{\tau}_{il}\{e^{(0)}\},
\]
which is equivalent to
\[
\sum_{l=1}^{2}\Bigg\{\sum_{i=1}^{n}A_{ij}\,\hat{\tau}_{il}\{e^{(0)}\}\log\hat{P}_{l1}+\sum_{i=1}^{n}\big(1-A_{ij}\big)\hat{\tau}_{il}\{e^{(0)}\}\log\big(1-\hat{P}_{l1}\big)\Bigg\} >
\sum_{l=1}^{2}\Bigg\{\sum_{i=1}^{n}A_{ij}\,\hat{\tau}_{il}\{e^{(0)}\}\log\hat{P}_{l2}+\sum_{i=1}^{n}\big(1-A_{ij}\big)\hat{\tau}_{il}\{e^{(0)}\}\log\big(1-\hat{P}_{l2}\big)\Bigg\}. \tag{S23}
\]
Let B^0_{lj} = Σ_{i=1}^{n} A_{ij} τ̂_{il}{e^{(0)}} and n^0_l = Σ_{i=1}^{n} τ̂_{il}{e^{(0)}} for all j ∈ {1, 2, . . . , n} and l ∈ {1, 2}. We have
\[
\hat{P} = \begin{pmatrix}\hat{P}_{11} & \hat{P}_{12}\\ \hat{P}_{21} & \hat{P}_{22}\end{pmatrix} = \frac{2}{m}\begin{pmatrix}\hat{a} & \hat{b}\\ \hat{b} & \hat{a}\end{pmatrix}-\frac{1}{m^2}\begin{pmatrix}\hat{a}^2 & \hat{b}^2\\ \hat{b}^2 & \hat{a}^2\end{pmatrix}. \tag{S24}
\]

By simplifying (S23), we can restate that ĉ_j{e^{(0)}} = 1 if
\[
\big(B^0_{1j}-B^0_{2j}\big)\log\frac{\hat{P}_{11}}{\hat{P}_{12}}+\Big\{B^0_{1j}-B^0_{2j}-(n^0_1-n^0_2)\Big\}\log\frac{1-\hat{P}_{12}}{1-\hat{P}_{11}} > 0. \tag{S25}
\]

Since â > b̂, we have P̂_{11} > P̂_{12}. Thus by (S25), we have
\[
\begin{aligned}
P\big[\hat{c}_j\{e^{(0)}\}\neq 1\big]
&\le P\Big[\big\{B^0_{1j}-B^0_{2j}\le 0\big\}\cup\big\{B^0_{1j}-B^0_{2j}-(n^0_1-n^0_2)\le 0\big\}\Big]\\
&\le P\Big[\big\{B^0_{1j}-B^0_{2j}\le\epsilon\big\}\cup\big\{|n^0_1-n^0_2|\ge\epsilon\big\}\Big]\\
&\le P\big[B^0_{1j}-B^0_{2j}\le\epsilon\big]+P\big[|n^0_1-n^0_2|\ge\epsilon\big]. \tag{S26}
\end{aligned}
\]
Now we bound P[B^0_{1j} − B^0_{2j} ≤ ε] and P[|n^0_1 − n^0_2| ≥ ε] separately. Firstly,
\[
\begin{aligned}
P\big[B^0_{1j}-B^0_{2j}\le\epsilon\big]
&= P\Bigg[\sum_{i=1}^{n}A_{ij}\hat{\tau}_{i1}\{e^{(0)}\}-\sum_{i=1}^{n}A_{ij}\hat{\tau}_{i2}\{e^{(0)}\}\le\epsilon\Bigg]\\
&\le P\Bigg[\sum_{i=1}^{n}A_{ij}I(c_i=1)-\sum_{i=1}^{n}A_{ij}I(c_i=2)+2\sum_{i=1}^{m}A_{ij}\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\le\epsilon\Bigg]\\
&\le P\Bigg[\sum_{i=1}^{n}A_{ij}I(c_i=1)-\sum_{i=1}^{n}A_{ij}I(c_i=2)\le 2\epsilon\Bigg]+\sum_{i=1}^{m}P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big]. \tag{S27}
\end{aligned}
\]
Secondly,
\[
\begin{aligned}
P\big[|n^0_1-n^0_2|\ge\epsilon\big]
&\le P\Big[n^0_1\le\frac{n}{2}-\frac{\epsilon}{2}\Big]+P\Big[n^0_2\le\frac{n}{2}-\frac{\epsilon}{2}\Big]\\
&\le \sum_{i=1}^{m}P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big]+\sum_{i=m+1}^{n}P\Big[\big|\hat{\tau}_{i2}\{e^{(0)}\}-I(c_i=2)\big|\ge\frac{\epsilon}{n}\Big]. \tag{S28}
\end{aligned}
\]

Similar to Lemma 1 in Amini et al. (2013), we upper bound P[|τ̂_{i1}{e^{(0)}} − I(c_i=1)| ≥ ε/n] for all i ∈ {1, 2, . . . , m} and P[|τ̂_{i2}{e^{(0)}} − I(c_i=2)| ≥ ε/n] for all i ∈ {m+1, m+2, . . . , n} as follows. With (S24), it can be deduced that
\[
\frac{\hat{P}_{11}/(1-\hat{P}_{11})}{\hat{P}_{12}/(1-\hat{P}_{12})} \ge \frac{\delta^2}{1-\delta} > 1.
\]
Let B_{ik} = Σ_{j=1}^{n} A_{ij} I(e_j=k), n_k = Σ_{i=1}^{n} I(e_i=k) and δ̃ = δ²/(1−δ); we have
\[
\begin{aligned}
P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big]
&= P\Bigg[\frac{\hat{\tau}_{i1}\{e^{(0)}\}}{\hat{\tau}_{i2}\{e^{(0)}\}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]\\
&= P\Bigg[\frac{(\hat{P}_{11})^{B_{i1}}(\hat{P}_{12})^{B_{i2}}(1-\hat{P}_{11})^{n_1-B_{i1}}(1-\hat{P}_{12})^{n_2-B_{i2}}}{(\hat{P}_{21})^{B_{i1}}(\hat{P}_{22})^{B_{i2}}(1-\hat{P}_{21})^{n_1-B_{i1}}(1-\hat{P}_{22})^{n_2-B_{i2}}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]\\
&\le P\Bigg[\tilde{\delta}^{\,B_{i1}-B_{i2}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]+P\big[B_{i1}-B_{i2}<0\big]\\
&\le 2P\Bigg[B_{i1}-B_{i2}\le\frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg]. \tag{S29}
\end{aligned}
\]
Let ξ_i(σ{e^{(0)}}) = B_{i1} − B_{i2}, and recall that ξ̃_i(σ{e^{(0)}}) = B̃_{i1} − B̃_{i2}. Then we have
\[
\begin{aligned}
\xi_i\big(\sigma\{e^{(0)}\}\big)-\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)
&= \big(B_{i1}-\tilde{B}_{i1}\big)-\big(B_{i2}-\tilde{B}_{i2}\big)\\
&= \sum_{j=1}^{n}\big(A_{ij}-\tilde{A}_{ij}\big)I(e_j=1)-\sum_{j=1}^{n}\big(A_{ij}-\tilde{A}_{ij}\big)I(e_j=2)\\
&\ge -\sum_{j=1}^{n}\big(A_{ij}-\tilde{A}_{ij}\big)I(e_j=2)\\
&\ge -\sum_{j=1}^{n}\big(\tilde{A}_{ij}+\tilde{A}_{ji}\big)I(e_j=2) \qquad \big(\text{by } A_{ij}-\tilde{A}_{ij}\le\tilde{A}_{ij}+\tilde{A}_{ji}\big).
\end{aligned}
\]
Thus, we have shown that
\[
\xi_i\big(\sigma\{e^{(0)}\}\big) \ge \tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)-\sum_{j=1}^{n}\tilde{A}_{ij}I(e_j=2)-\sum_{j=1}^{n}\tilde{A}_{ji}I(e_j=2).
\]

Consequently, we have
\[
\begin{aligned}
&P\Bigg[\xi_i\big(\sigma\{e^{(0)}\}\big)\le\frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg]\\
&\quad\le P\Bigg[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)-\sum_{j=1}^{n}\tilde{A}_{ij}I(e_j=2)-\sum_{j=1}^{n}\tilde{A}_{ji}I(e_j=2)\le\frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg]\\
&\quad\le P\Bigg[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\le 2(1+\epsilon)a_\gamma+\frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg]
+P\Bigg[\sum_{j=1}^{n}\tilde{A}_{ij}I(e_j=2)\ge(1+\epsilon)a_\gamma\Bigg]\\
&\qquad+P\Bigg[\sum_{j=1}^{n}\tilde{A}_{ji}I(e_j=2)\ge(1+\epsilon)a_\gamma\Bigg]. \tag{S30}
\end{aligned}
\]
Now we consider the term P[ξ̃_i(σ{e^{(0)}}) ≤ 2(1+ε)a_γ + (1/log δ̃) log{(1−ε/n)/(ε/n)}]. Recall that
\[
\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big) = \tilde{B}_{i1}-\tilde{B}_{i2} = \sum_{j=1}^{n}\tilde{A}_{ij}\sigma_j\{e^{(0)}\}, \qquad \text{where } \sigma_j\{e^{(0)}\} = \begin{cases} 1, & e_j=1,\\ -1, & e_j=2. \end{cases}
\]
We have shown in (S15) that
\[
P\Big[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\le(2\gamma-1)(a-b)-t\Big] \le e^{-\frac{t^2}{4(a+b)}}, \quad \forall t\in[0,3(a+b)]. \tag{S31}
\]
Take t = (2γ−1)(a−b) − [2(1+ε)a_γ + (1/log δ̃) log{(1−ε/n)/(ε/n)}]. Recall a_γ = (1−γ)a + γb and (a−b) → ∞ as n → ∞. Then, when n is large enough, we have
\[
\frac{1-\epsilon}{2}(2\gamma-1)(a-b) > \frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big). \tag{S32}
\]
With the assumption that ε(2γ−1)(a−b) ≥ 2(1+ε)a_γ and (S32), we have
\[
2(1+\epsilon)a_\gamma+\frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big) \le \frac{1+\epsilon}{2}(2\gamma-1)(a-b).
\]
Thus, we have
\[
0 < \frac{1-\epsilon}{2}(2\gamma-1)(a-b) \le t \le (2\gamma-1)(a-b) \le 3(a+b). \tag{S33}
\]
By plugging t = (2γ−1)(a−b) − [2(1+ε)a_γ + (1/log δ̃) log{(1−ε/n)/(ε/n)}] into (S31), it follows that
\[
P\Bigg[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\le 2(1+\epsilon)a_\gamma+\frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg] \le e^{-\frac{t^2}{4(a+b)}} \le e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}. \tag{S34}
\]
Next, we consider the terms P[Σ_{j=1}^{n} Ã_{ij} I(e_j=2) ≥ (1+ε)a_γ] and P[Σ_{j=1}^{n} Ã_{ji} I(e_j=2) ≥ (1+ε)a_γ]. Let Ã_{i∗}{e^{(0)}} = Σ_{j=1}^{n} Ã_{ij} I(e_j=2) and Ã_{∗i}{e^{(0)}} = Σ_{j=1}^{n} Ã_{ji} I(e_j=2). By symmetry, we have that
\[
P\Big[\tilde{A}_{i*}\{e^{(0)}\}\ge(1+\epsilon)a_\gamma\Big] = P\Big[\tilde{A}_{*i}\{e^{(0)}\}\ge(1+\epsilon)a_\gamma\Big].
\]

Note that since both Ã_{i∗}{e^{(0)}} and Ã_{∗i}{e^{(0)}} are sums of independent bounded random variables, we can apply the Bernstein inequality to obtain upper bounds. For Ã_{i∗}{e^{(0)}}, we have |Ã_{ij}I(e_j=2) − E[Ã_{ij}I(e_j=2)]| ≤ 1, and
\[
E\Big[\tilde{A}_{i*}\{e^{(0)}\}\Big] = (1-\gamma)m\cdot\frac{a}{m}+\gamma m\cdot\frac{b}{m} = (1-\gamma)a+\gamma b = a_\gamma,
\]
\[
\upsilon = \mathrm{Var}\Big(\tilde{A}_{i*}\{e^{(0)}\}\Big) = \sum_{j=1}^{n}\mathrm{Var}\big(\tilde{A}_{ij}\big)I(e_j=2) \le \sum_{j=1}^{n}E\big[\tilde{A}_{ij}^2 I(e_j=2)\big] = a_\gamma.
\]
Then by applying the Bernstein inequality to Ã_{i∗}{e^{(0)}}, we have
\[
P\Big[\tilde{A}_{i*}\{e^{(0)}\}\ge E\big[\tilde{A}_{i*}\{e^{(0)}\}\big]+t\Big] \le e^{-\frac{t^2/2}{\upsilon+t/3}}, \quad \forall t\ge 0. \tag{S35}
\]
Let t = εa_γ ≥ 0 in (S35) and by noting that υ ≤ a_γ, we have
\[
P\Big[\tilde{A}_{*i}\{e^{(0)}\}\ge(1+\epsilon)a_\gamma\Big] = P\Big[\tilde{A}_{i*}\{e^{(0)}\}\ge(1+\epsilon)a_\gamma\Big] \le e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}. \tag{S36}
\]

Thus, by (S29), (S30), (S34) and (S36), it follows that for i = 1, 2, . . . , m,
\[
P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big] \le 2\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}. \tag{S37}
\]
Similarly, we can also obtain, for i = m+1, m+2, . . . , n,
\[
P\Big[\big|\hat{\tau}_{i2}\{e^{(0)}\}-I(c_i=2)\big|\ge\frac{\epsilon}{n}\Big] \le 2\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}. \tag{S38}
\]
By (S28), (S37) and (S38), we have
\[
P\big[|n^0_1-n^0_2|\ge\epsilon\big] \le 2n\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}. \tag{S39}
\]
According to (S27), we still need to obtain the upper bound of P[Σ_{i=1}^{n} A_{ij} I(c_i=1) − Σ_{i=1}^{n} A_{ij} I(c_i=2) ≤ 2ε]. Let η_j(σ(c)) = Σ_{i=1}^{n} A_{ij} I(c_i=1) − Σ_{i=1}^{n} A_{ij} I(c_i=2). We then have
\[
\begin{aligned}
\eta_j\big(\sigma(c)\big)-\tilde{\eta}_j\big(\sigma(c)\big)
&= \sum_{i=1}^{n}\big(A_{ij}-\tilde{A}_{ij}\big)I(c_i=1)-\sum_{i=1}^{n}\big(A_{ij}-\tilde{A}_{ij}\big)I(c_i=2)\\
&\ge -\sum_{i=1}^{n}\big(A_{ij}-\tilde{A}_{ij}\big)I(c_i=2)\\
&\ge -\sum_{i=1}^{n}\big(\tilde{A}_{ij}+\tilde{A}_{ji}\big)I(c_i=2) \qquad \big(\text{by } A_{ij}-\tilde{A}_{ij}\le\tilde{A}_{ij}+\tilde{A}_{ji}\big).
\end{aligned}
\]
Let Ã_{∗j}(c) = Σ_{i=1}^{n} Ã_{ij} I(c_i=2) and Ã_{j∗}(c) = Σ_{i=1}^{n} Ã_{ji} I(c_i=2). We have
\[
\eta_j\big(\sigma(c)\big) \ge \tilde{\eta}_j\big(\sigma(c)\big)-\tilde{A}_{*j}(c)-\tilde{A}_{j*}(c).
\]

From the assumption that ε(2γ−1)(a−b) ≥ 2(1+ε)a_γ, γ ∈ (1/2, 1) and a > b, we can get that
\[
a \ge \frac{2(1+\epsilon)\gamma+\epsilon(2\gamma-1)}{\epsilon(2\gamma-1)-2(1+\epsilon)(1-\gamma)}\,b > b.
\]
It is not difficult to check that there exists ρ ∈ (0, 1) such that
\[
\rho(a-b)-2(1+\epsilon)b > 0. \tag{S40}
\]

Then, we have
\[
\begin{aligned}
P\big[\eta_j(\sigma(c))\le 2\epsilon\big]
&\le P\Big[\tilde{\eta}_j\big(\sigma(c)\big)-\tilde{A}_{*j}(c)-\tilde{A}_{j*}(c)\le 2\epsilon\Big]\\
&\le P\Big[\tilde{\eta}_j\big(\sigma(c)\big)\le\frac{1-\rho}{2}(a-b)+2(1+\epsilon)b+2\epsilon\Big]
+P\Big[\tilde{A}_{*j}(c)\ge(1+\epsilon)b+\frac{1-\rho}{4}(a-b)\Big]\\
&\quad+P\Big[\tilde{A}_{j*}(c)\ge(1+\epsilon)b+\frac{1-\rho}{4}(a-b)\Big]. \tag{S41}
\end{aligned}
\]
Consider the term P[η̃_j(σ(c)) ≤ (1−ρ)(a−b)/2 + 2(1+ε)b + 2ε]. Recall that in (S11), we have shown
\[
P\big[\tilde{\eta}_j(\sigma(c))\le(a-b)-t\big] \le e^{-\frac{t^2}{4(a+b)}}, \quad \forall t\in[0,3(a+b)]. \tag{S42}
\]
Then we can take
\[
t = (a-b)-\Big\{\frac{1-\rho}{2}(a-b)+2(1+\epsilon)b+2\epsilon\Big\} = \frac{1-\rho}{2}(a-b)-2\epsilon+\big\{\rho(a-b)-2(1+\epsilon)b\big\}. \tag{S43}
\]
With (S40), (S43) and (a−b) → ∞ as n → ∞, it follows that when n is large enough we have
\[
0 < \frac{1-\rho}{4}(a-b) \le t \le 3(a+b). \tag{S44}
\]
With (S42), (S43), and (S44), we get (when n is large enough)
\[
P\Big[\tilde{\eta}_j\big(\sigma(c)\big)\le\frac{1-\rho}{2}(a-b)+2(1+\epsilon)b+2\epsilon\Big] \le e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}. \tag{S45}
\]
To bound the term P[Ã_{∗j}(c) ≥ (1+ε)b + (1−ρ)(a−b)/4], first recall that Ã_{∗j}(c) = Σ_{i=1}^{n} Ã_{ij} I(c_i=2) is a sum of independent random variables with |Ã_{ij}I(c_i=2) − E[Ã_{ij}I(c_i=2)]| ≤ 1. Therefore we can also apply the Bernstein inequality. We have
\[
E\big[\tilde{A}_{*j}(c)\big] = m\cdot\frac{b}{m} = b, \qquad
v = \mathrm{Var}\big(\tilde{A}_{*j}(c)\big) = \sum_{i=1}^{n}\mathrm{Var}\big(\tilde{A}_{ij}\big)I(c_i=2) \le \sum_{i=1}^{n}E\big[\tilde{A}_{ij}^2 I(c_i=2)\big] = b. \tag{S46}
\]

Thus, by applying the Bernstein inequality to Ã_{∗j}(c), we have
\[
P\Big[\tilde{A}_{*j}(c)\ge E\big[\tilde{A}_{*j}(c)\big]+t\Big] \le e^{-\frac{t^2/2}{v+t/3}} \le e^{-\frac{t^2/2}{b+t/3}}, \quad \forall t\ge 0. \tag{S47}
\]

Take t = εb + (1−ρ)(a−b)/4. With (S47) and (S46), we have
\[
P\Big[\tilde{A}_{*j}(c)\ge(1+\epsilon)b+\frac{1-\rho}{4}(a-b)\Big] \le e^{-\frac{\frac{1}{2}\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{b+2a/3}} \le e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{2(a+b)}}. \tag{S48}
\]
By symmetry, we also have
\[
P\Big[\tilde{A}_{j*}(c)\ge(1+\epsilon)b+\frac{1-\rho}{4}(a-b)\Big] \le e^{-\frac{\frac{1}{2}\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{b+2a/3}} \le e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{2(a+b)}}. \tag{S49}
\]
Therefore, with (S41), (S45), (S48) and (S49), it follows that
\[
P\big[\eta_j(\sigma(c))\le 2\epsilon\big] \le e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{2(a+b)}} \le 3e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}. \tag{S50}
\]
With (S27), (S37) and (S50), we can get that, for j = 1, 2, . . . , m,
\[
P\big[B^0_{1j}-B^0_{2j}\le\epsilon\big] \le 3e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}+n\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}. \tag{S51}
\]
Similarly, with the same arguments, we can get that for j = m+1, m+2, . . . , n,
\[
P\big[B^0_{2j}-B^0_{1j}\le\epsilon\big] \le 3e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}+n\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}. \tag{S52}
\]

Finally, with (S39), (S51) and (S52), it follows that when n is large enough, we have
\[
\begin{aligned}
P\big[\hat{c}\{e^{(0)}\}\neq c\big]
&= P\Bigg[\bigcup_{j\in\{1,2,\dots,n\}}\big\{\hat{c}_j\{e^{(0)}\}\neq c_j\big\}\Bigg]\\
&\le \sum_{j=1}^{m}P\big[B^0_{1j}-B^0_{2j}\le\epsilon\big]+\sum_{j=m+1}^{n}P\big[B^0_{2j}-B^0_{1j}\le\epsilon\big]+P\big[|n^0_1-n^0_2|\ge\epsilon\big]\\
&= 3ne^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}+n(n+2)\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}.
\end{aligned}
\]
Therefore, we have
\[
P\big[\hat{c}\{e^{(0)}\}=c\big] = 1-P\big[\hat{c}\{e^{(0)}\}\neq c\big]
\ge 1-\Bigg[3ne^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}+n(n+2)\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}\Bigg].
\]

Thus we complete the proof of Theorem 3.

A4 Distributions of ĉ(s) and ĉ(w)


We first show that ĉ^(w) is weakly consistent for c. Let X_i ≜ 1(ĉ_i^{(w)} ≠ c_i) − P(ĉ_i^{(w)} ≠ c_i), where P(ĉ_i^{(w)} ≠ c_i) = (1+π_1)p_n with p_n = 1/log n. Then, it can be seen that
\[
|X_i|\le 1 \text{ and } EX_i = 0 \text{ for all } i=1,2,\dots,n, \qquad \sum_{i=1}^{n}EX_i^2 = n\big[(1+\pi_1)p_n\{1-(1+\pi_1)p_n\}\big].
\]
Thus, by applying the Bernstein inequality to Σ_{i=1}^{n} X_i, we can get that
\[
P\Bigg[\sum_{i=1}^{n}X_i\ge t\Bigg] \le \exp\Bigg(-\frac{t^2/2}{n\big[(1+\pi_1)p_n\{1-(1+\pi_1)p_n\}\big]+t/3}\Bigg), \quad \forall t\ge 0. \tag{S53}
\]
Recall X_i ≜ 1(ĉ_i^{(w)} ≠ c_i) − P(ĉ_i^{(w)} ≠ c_i). We plug t = nε − Σ_{i=1}^{n} P(ĉ_i^{(w)} ≠ c_i) (which is nonnegative when n is sufficiently large) into (S53) and get that
\[
P\Bigg[\frac{1}{n}\sum_{i=1}^{n}1\big(\hat{c}_i^{(w)}\neq c_i\big)\ge\epsilon\Bigg]
= P\Bigg[\sum_{i=1}^{n}1\big(\hat{c}_i^{(w)}\neq c_i\big)-\sum_{i=1}^{n}P\big(\hat{c}_i^{(w)}\neq c_i\big)\ge n\epsilon-\sum_{i=1}^{n}P\big(\hat{c}_i^{(w)}\neq c_i\big)\Bigg] \tag{S54}
\]
\[
\le \exp\Bigg(-\frac{\{n\epsilon-n(1+\pi_1)p_n\}^2/2}{n\big[(1+\pi_1)p_n\{1-(1+\pi_1)p_n\}\big]+\{n\epsilon-n(1+\pi_1)p_n\}/3}\Bigg). \tag{S55}
\]

Thus, ĉ^(w) is weakly consistent for c. Next, we show that ĉ^(w) is not strongly consistent for c. Specifically, we have
\[
P\big(\hat{c}^{(w)}=c\big) = \prod_{i=1}^{n}P\big(\hat{c}_i^{(w)}=c_i\big) \le \prod_{i=1}^{n}\Big(1-\frac{1}{\log n}\Big) = \Bigg\{\Big(1-\frac{1}{\log n}\Big)^{-\log n}\Bigg\}^{-\frac{n}{\log n}}. \tag{S56}
\]
Thus by (S56), we know that ĉ^(w) is not strongly consistent for c.


By the classical central limit theorem for independent and identically distributed random variables, we have
\[
\sqrt{n}\Bigg\{\frac{1}{n}\sum_{i=1}^{n}1(c_i=1)-\pi_1\Bigg\} \xrightarrow{d} N\big\{0,\pi_1(1-\pi_1)\big\}. \tag{S57}
\]

We also have that
\[
\sqrt{n}\Bigg\{\frac{1}{n}\sum_{i=1}^{n}1\big(\hat{c}_i^{(s)}=1\big)-\pi_1\Bigg\}-\sqrt{n}\Bigg\{\frac{1}{n}\sum_{i=1}^{n}1(c_i=1)-\pi_1\Bigg\}
= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\Big\{1\big(\hat{c}_i^{(s)}=1\big)-1(c_i=1)\Big\} = o_p(1),
\]
which is based on the fact that, for all ε > 0,
\[
P\Bigg[\Bigg|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\Big\{1\big(\hat{c}_i^{(s)}=1\big)-1(c_i=1)\Big\}\Bigg|\ge\epsilon\Bigg] \le P\big(\hat{c}^{(s)}\neq c\big) = o(1).
\]
Thus, √n{(1/n) Σ_{i=1}^{n} 1(ĉ_i^{(s)}=1) − π_1} has the same limiting distribution as √n{(1/n) Σ_{i=1}^{n} 1(c_i=1) − π_1}.
Finally, we show that
\[
\sqrt{n}\Bigg\{\frac{1}{n}\sum_{i=1}^{n}1\big(\hat{c}_i^{(w)}=1\big)-\Big(\pi_1+\frac{1-3\pi_1}{\log n}\Big)\Bigg\} \xrightarrow{d} N\big\{0,\pi_1(1-\pi_1)\big\}.
\]
Let X_{ni} ≜ 1(ĉ_i^{(w)}=1) − P(ĉ_i^{(w)}=1). We have
\[
EX_{ni} = 0, \qquad s_n^2 = \frac{1}{n}\sum_{i=1}^{n}EX_{ni}^2 = (\pi_1-\pi_1^2)-O(p_n) \to s^2 = \pi_1(1-\pi_1) \neq 0, \quad \text{as } n\to\infty.
\]
We verify the following Lindeberg condition. Specifically, note that P(ĉ_i^{(w)}=1) = π_1 + (1−3π_1)/log n; then for every ε > 0, we have
\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}E\Big[X_{ni}^2\,1\big\{|X_{ni}|\ge\epsilon\sqrt{n}\big\}\Big]
&= E\Big[\big|1\big(\hat{c}_i^{(w)}=1\big)-P\big(\hat{c}_i^{(w)}=1\big)\big|^2\,1\Big\{\big|1\big(\hat{c}_i^{(w)}=1\big)-P\big(\hat{c}_i^{(w)}=1\big)\big|\ge\epsilon\sqrt{n}\Big\}\Big]\\
&\le P\Big[\big|1\big(\hat{c}_i^{(w)}=1\big)-P\big(\hat{c}_i^{(w)}=1\big)\big|\ge\epsilon\sqrt{n}\Big]\\
&= P\Bigg[\Big|1\big(\hat{c}_i^{(w)}=1\big)-\Big(\pi_1+\frac{1-3\pi_1}{\log n}\Big)\Big|\ge\epsilon\sqrt{n}\Bigg]. \tag{S58}
\end{aligned}
\]

Also note that
\[
P\Bigg[\Big|1\big(\hat{c}_i^{(w)}=1\big)-\Big(\pi_1+\frac{1-3\pi_1}{\log n}\Big)\Big|\ge\epsilon\sqrt{n}\Bigg] \le P\big[1\ge\epsilon\sqrt{n}\big] \to 0, \quad \text{as } n\to\infty. \tag{S59}
\]

Thus, putting (S58) and (S59) together yields
\[
\frac{1}{n}\sum_{i=1}^{n}E\Big[X_{ni}^2\,1\big\{|X_{ni}|\ge\epsilon\sqrt{n}\big\}\Big] \to 0, \quad \text{as } n\to\infty. \tag{S60}
\]

By the Lindeberg–Feller central limit theorem, we can get that
\[
\sqrt{n}\Bigg(\frac{1}{n}\sum_{i=1}^{n}X_{ni}\Bigg) \xrightarrow{d} N(0,s^2),
\]
which is equivalent to
\[
\sqrt{n}\Bigg\{\frac{1}{n}\sum_{i=1}^{n}1\big(\hat{c}_i^{(w)}=1\big)-\Big(\pi_1+\frac{1-3\pi_1}{\log n}\Big)\Bigg\} \xrightarrow{d} N\big\{0,\pi_1(1-\pi_1)\big\}. \tag{S61}
\]
A5 Extension to the Bipartite SBM

Bipartite networks are a ubiquitous class of networks, in which nodes are of two disjoint types and edges are only formed between nodes of different types. Bipartite networks can be used to characterize many real-world systems, such as authorship of papers and people attending events (Zhang and Chen, 2018). Community detection in bipartite networks has been studied in many scientific fields, such as text mining (Bisson and Hussain, 2008), physics (Larremore et al., 2014), and genetic studies (Madeira et al., 2010). In this section, we extend the proposed profile-pseudo likelihood method to the case of bipartite stochastic block models (BiSBM).

Let G(V1 , V2 , E) denote a bipartite network, where V1 = {1, . . . , m} and V2 = {1, . . . , n}

are node sets of the two different types of nodes, respectively, and E is the set of edges

between nodes in V1 and V2 . The network G(V1 , V2 , E) can be uniquely represented by the

corresponding m × n bi-adjacency matrix A = [Aij ], where Aij = 1 if there is an edge from

node i of type 1 to node j of type 2 and Aij = 0 otherwise. Under the BiSBM, nodes in V1

form K1 blocks and nodes in V2 form K2 blocks. Specifically, for nodes in V1 , the labels c1 =

(c11 , c12 , . . . , c1m ) are drawn independently from a multinomial distribution with parameters

π1 = (π11 , π12 , . . . , π1K1 ), and for nodes in V2 , the labels c2 = (c21 , c22 , . . . , c2n ) are drawn

independently from a multinomial distribution with parameters π2 = (π21 , π22 , . . . , π2K2 ).

Conditional on c_1 and c_2, the edges A_ij are independent Bernoulli variables with
\[
E[A_{ij}\,|\,c_1,c_2] = P_{c_{1i}c_{2j}},
\]
where P = [P_kl] is a K_1 × K_2 matrix. The goal of community detection is then to estimate the node labels c_1 and c_2 from the bi-adjacency matrix A.
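For concreteness, a BiSBM can be sampled as follows (a minimal sketch in Python; the function name and parameter values are illustrative):

import numpy as np

def sample_bisbm(m, n, pi1, pi2, P, seed=0):
    # Draw type-1 and type-2 labels, then independent edges A_ij ~ Bernoulli(P[c1_i, c2_j])
    rng = np.random.default_rng(seed)
    c1 = rng.choice(len(pi1), size=m, p=pi1)
    c2 = rng.choice(len(pi2), size=n, p=pi2)
    A = rng.binomial(1, P[np.ix_(c1, c2)])
    return A, c1, c2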

Define Ω = (π_1, P) and e_2 = (e_{21}, e_{22}, . . . , e_{2n}). To estimate the node labels c_2 from the bi-adjacency matrix A, we define the following log pseudo-likelihood function
\[
\ell^{B}_{\mathrm{PL}}(\Omega, e_2; \{a_i\}) = \sum_{i=1}^{m}\log\Bigg\{\sum_{k=1}^{K_1}\pi_{1k}\prod_{j=1}^{n}P_{k e_{2j}}^{A_{ij}}\big(1-P_{k e_{2j}}\big)^{1-A_{ij}}\Bigg\}.
\]
Algorithm 3 BiSBM Profile-Pseudo Likelihood Maximization Algorithm.
Step 1: Initialize e_1^{(0)} and e_2^{(0)} by applying SCP to AA^⊤ and A^⊤A, respectively.
Step 2: Calculate Ω^{(0)} = (π_1^{(0)}, P^{(0)}). That is, for 1 ≤ k ≤ K_1 and 1 ≤ l ≤ K_2,
\[
\pi_{1k}^{(0)} = \frac{1}{m}\sum_{i=1}^{m}I\big(e_{1i}^{(0)}=k\big), \qquad
P_{kl}^{(0)} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}\,I\big(e_{1i}^{(0)}=k\big)I\big(e_{2j}^{(0)}=l\big)}{\sum_{i=1}^{m}\sum_{j=1}^{n}I\big(e_{1i}^{(0)}=k\big)I\big(e_{2j}^{(0)}=l\big)}.
\]
Step 3: Initialize Ω^{(0,0)} = (π_1^{(0,0)}, P^{(0,0)}) = (π_1^{(0)}, P^{(0)}).
repeat
    repeat
        Step 4: E-step: compute τ_{ik}^{(s,t+1)}. That is, for 1 ≤ k ≤ K_1 and 1 ≤ i ≤ m,
\[
\tau_{ik}^{(s,t+1)} = \frac{\pi_{1k}^{(s,t)}\prod_{j=1}^{n}\Big\{P^{(s,t)}_{k e_{2j}^{(s)}}\Big\}^{A_{ij}}\Big\{1-P^{(s,t)}_{k e_{2j}^{(s)}}\Big\}^{1-A_{ij}}}{\sum_{l=1}^{K_1}\pi_{1l}^{(s,t)}\prod_{j=1}^{n}\Big\{P^{(s,t)}_{l e_{2j}^{(s)}}\Big\}^{A_{ij}}\Big\{1-P^{(s,t)}_{l e_{2j}^{(s)}}\Big\}^{1-A_{ij}}}.
\]
        Step 5: M-step: compute π_1^{(s,t+1)} and P^{(s,t+1)}. That is, for 1 ≤ k ≤ K_1 and 1 ≤ l ≤ K_2,
\[
\pi_{1k}^{(s,t+1)} = \frac{1}{m}\sum_{i=1}^{m}\tau_{ik}^{(s,t+1)}, \qquad
P_{kl}^{(s,t+1)} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}\,\tau_{ik}^{(s,t+1)}I\big(e_{2j}^{(s)}=l\big)}{\sum_{i=1}^{m}\sum_{j=1}^{n}\tau_{ik}^{(s,t+1)}I\big(e_{2j}^{(s)}=l\big)}.
\]
    until the EM algorithm converges.
    Step 6: Set Ω^{(s+1)} to be the final EM update.
    Step 7: Given Ω^{(s+1)}, update e_{2j}^{(s+1)}, 1 ≤ j ≤ n, using
\[
e_{2j}^{(s+1)} = \arg\max_{l\in\{1,2,\dots,K_2\}}\sum_{i=1}^{m}\sum_{k=1}^{K_1}\tau_{ik}^{(s+1)}\Big\{A_{ij}\log P_{kl}^{(s+1)}+(1-A_{ij})\log\big(1-P_{kl}^{(s+1)}\big)\Big\}.
\]
until the profile-pseudo likelihood converges.
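The following is a compact sketch of Algorithm 3 in Python (not the authors' released code; it assumes a dense binary bi-adjacency matrix and computes the E-step in the log domain to avoid underflow in the products over j):

import numpy as np

def ppl_bisbm(A, K1, K2, e2, n_outer=20, n_em=50, eps=1e-10):
    # A: m x n binary bi-adjacency matrix; e2: initial column labels in {0, ..., K2-1}
    m, n = A.shape
    pi1 = np.full(K1, 1.0 / K1)
    P = np.full((K1, K2), A.mean())
    tau = np.full((m, K1), 1.0 / K1)
    for _ in range(n_outer):
        # Sufficient statistics for the current column labels e2
        S = np.stack([A[:, e2 == l].sum(axis=1) for l in range(K2)], axis=1)  # m x K2
        N = np.bincount(e2, minlength=K2)                                     # block sizes
        for _ in range(n_em):
            logP, log1P = np.log(P + eps), np.log(1 - P + eps)
            # Step 4 (E-step): posterior row memberships tau_ik
            ll = S @ logP.T + (N[None, :] - S) @ log1P.T + np.log(pi1 + eps)  # m x K1
            tau = np.exp(ll - ll.max(axis=1, keepdims=True))
            tau /= tau.sum(axis=1, keepdims=True)
            # Step 5 (M-step): update pi1 and P
            pi1 = tau.mean(axis=0)
            P = (tau.T @ S) / (np.outer(tau.sum(axis=0), N) + eps)
        # Step 7: update each column label given the fitted (pi1, P) and tau
        M = tau.T @ A                                                         # K1 x n
        scores = M.T @ np.log(P + eps) + (tau.sum(axis=0)[None, :] - M.T) @ np.log(1 - P + eps)
        e2 = scores.argmax(axis=1)
    return tau.argmax(axis=1), e2  # estimated row labels (c1) and column labels (c2)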

A profile-pseudo likelihood algorithm that maximizes ℓ^B_PL(Ω, e_2; {a_i}) is described in Algorithm 3. Note that c_1 can be estimated similarly to c_2, and we omit the details.

We investigate the performance of the proposed profile-pseudo likelihood method for the BiSBM. We fix m = n = 1200, K_1 = K_2 = 2, π_1 = (1/2, 1/2), π_2 = (1/2, 1/2), and edge probability P_kl = 0.1{1.2 + 0.4 × 1(k = l)} between communities k and l, for all k, l = 1, 2. We compare PPL with two other clustering methods, namely SCP and SVD (Rohe et al., 2012; Sarkar and Dong, 2011).

Figure 9: Left: comparison of PPL, SVD and SCP for estimating c1 in BiSBM; right: comparison of PPL, SVD and SCP for estimating c2 in BiSBM.

As for SCP, to deal with bipartite networks, we apply it to AA^⊤ to obtain an estimate of c_1, and to A^⊤A to obtain an estimate of c_2. The results are summarized in Figure 9, based on 100 replications. It is seen that PPL outperforms both SCP and SVD for community detection in bipartite networks.

A6 Additional Numerical Results


A6.1 Running time for SCP

We report the computing time for SCP in Setting 3 of Section 5.1. Specifically, we set K = 3, π = (0.2, 0.3, 0.5), λ = 5, β = 0.05 and vary the network size n from 10^{2.5} to 10^6. The results from 100 data replicates are reported in Figure 10. It is seen that SCP takes less than 100 seconds when the network has one million nodes. This efficiency is largely due to the eigs() function in Matlab, which computes eigensystems of large sparse matrices iteratively using ARPACK. We note that the computational efficiency of eigs() can decrease when the network density and the number of communities K increase.
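For readers working in Python rather than Matlab, an analogous timing experiment can be sketched with scipy's eigsh, which also wraps ARPACK (illustrative only; the random sparse matrix below is a stand-in for an SBM draw with expected degree λ):

import time
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import eigsh

n, K, lam = 10 ** 5, 3, 5.0
A = sparse.random(n, n, density=lam / n, format='csr', data_rvs=lambda s: np.ones(s))
A = ((A + A.T) > 0).astype(float)        # symmetrize to get an undirected network

start = time.time()
vals, vecs = eigsh(A, k=K, which='LA')   # K leading eigenpairs via ARPACK
print(f"n = {n}: {time.time() - start:.2f} seconds")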


Figure 10: Computing time from SCP for large-scale and sparse networks under the SBM
with K = 3, π = (0.2, 0.3, 0.5), λ = 5, β = 0.05 and varying n.

A6.2 Comparison with Gao et al. (2017)

In this simulation study, we compare the performance of SCP, PPL and the majority voting

method proposed in Gao et al. (2017) (referred to as MV) on networks simulated from the

SBM. Specifically, we consider the simulation Setting 3 in Section 5.1, where the parameter

β controls the “out-in-ratio” and λ controls the overall expected network degree. We set

K = 3 and π = (0.2, 0.3, 0.5), and we consider three scenarios, 1) varying β while λ = 5

and n = 1200, 2) varying λ while β = 0.05 and n = 1200, and 3) varying n while λ = 5

and β = 0.05. Figure 11 reports the NMI from the three methods and the computing time

from PPL and MV, based on 100 replications. The running time for PPL does not include

the initialization step, which takes no more than a few seconds. Both PPL and MV use

SCP as the initial clustering method. It is seen that PPL and MV have comparable clustering

accuracies and they both outperform SCP in terms of NMI. Moreover, PPL is computationally more efficient than MV, as it need not repeatedly perform the leave-one-node-out spectral clustering.


Figure 11: Comparisons of the NMI and computing time from SCP, MV and PPL under different
settings.

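For reference, the NMI reported throughout can be computed as follows (a sketch with scikit-learn; its default normalization may differ slightly from the one used in our experiments):

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

c_true = np.repeat([0, 1, 2], [200, 300, 500])      # true communities, n = 1000
c_hat = c_true.copy()
c_hat[:50] = (c_hat[:50] + 1) % 3                   # perturb 5% of the labels
print(normalized_mutual_info_score(c_true, c_hat))  # NMI is invariant to label permutations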

