
Fast Network Community Detection with Profile-Pseudo Likelihood Methods

Jiangzhou Wang^{1,2}, Jingfei Zhang^{3}, Binghui Liu^{1}, Ji Zhu^{4}, and Jianhua Guo^{1}

^{1} School of Mathematics and Statistics & KLAS, Northeast Normal University, Jilin, 130024, China.
^{2} Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, 518055, China.
^{3} Department of Management Science, University of Miami, Coral Gables, FL, 33146, USA.
^{4} Department of Statistics, University of Michigan, Ann Arbor, MI, 48109, USA.

arXiv:2011.00647v3 [stat.ME] 29 Aug 2021

Abstract

The stochastic block model is one of the most studied network models for community detection, and fitting its likelihood function on large-scale networks is known to be challenging. One prominent work that overcomes this computational challenge is Amini et al. (2013), which proposed a fast pseudo-likelihood approach for fitting stochastic block models to large sparse networks. However, this approach does not have a convergence guarantee and may not be well suited for small and medium scale networks. In this article, we propose a novel likelihood-based approach that decouples row and column labels in the likelihood function, enabling a fast alternating maximization. This new method is computationally efficient, performs well for both small and large scale networks, and has a provable convergence guarantee. We show that our method provides strongly consistent estimates of the communities in a stochastic block model. We further consider extensions of our proposed method to handle networks with degree heterogeneity and bipartite properties.

Keywords: network analysis, profile likelihood, pseudo likelihood, stochastic block model, strong consistency.

The first three authors contributed equally to this work. For correspondence, please contact Jianhua Guo and Ji Zhu.

1 Introduction

One of the fundamental problems in network data analysis is community detection, which aims to divide the nodes in a network into several communities such that nodes within the same community are densely connected, while nodes from different communities are relatively sparsely connected. Identifying such communities can provide important insights into the organization of a network. For example, in social networks, communities may correspond to groups of individuals with common interests (Moody and White, 2003); in protein interaction networks, communities may correspond to proteins that are involved in the same cellular functions (Spirin and Mirny, 2003). There is a vast literature on network community detection contributed by different scientific communities, such as computer science, physics, social science and statistics. We refer to Fortunato (2010); Fortunato and Hric (2016); Zhao (2017) for comprehensive reviews on this topic.

In the statistics literature, the majority of community detection methods are model-based; they postulate and fit a probabilistic model that characterizes networks with community structures (Holland et al., 1983; Airoldi et al., 2008; Karrer and Newman, 2011). Within this family, the stochastic block model (SBM; Holland et al., 1983) is perhaps the best studied and most commonly used. The SBM is a generative model in which the nodes are divided into blocks, or communities, and the probability of an edge between two nodes depends only on which communities they belong to; edges are independent given the community assignment. Several extensions of the SBM have been considered, notably the mixed membership model (Airoldi et al., 2008), which allows each node to be associated with multiple clusters, and the degree-corrected stochastic block model (DCSBM; Karrer and Newman, 2011), which accommodates degree heterogeneity by including additional degree parameters. Due to rapidly increasing interest, the statistical literature on community detection in SBMs is fast growing, with great advances in algorithmic solutions (Snijders and Nowicki, 1997; Nowicki and Snijders, 2001; Daudin et al., 2008; Karrer and Newman, 2011; Decelle et al., 2011; Amini et al., 2013; Bickel et al., 2013, among others) and in the theoretical understanding of consistency and detection thresholds (Bickel and Chen, 2009; Rohe et al., 2011; Zhao et al., 2012; Lei and Rinaldo, 2015; Abbe, 2017; Gao et al., 2017; Gao et al., 2018; Su et al., 2019; Abbe et al., 2020, among others).

It is well known that fitting the block model (i.e., SBM and DCSBM) likelihood functions is a nontrivial task; in principle, optimizing over all possible community assignments is an NP-hard problem (Bickel and Chen, 2009). Many works have considered spectral clustering for community detection in SBMs, which is computationally efficient and ensures weak consistency, that is, the proportion of misclassified nodes tends to zero as the network size increases, under certain regularity conditions (Rohe et al., 2011; Lei and Rinaldo, 2015; Joseph et al., 2016). As such, spectral clustering is often used to produce initializations for methods that aim to achieve strong consistency (Gao et al., 2017), that is, the probability that the estimated labels equal the true labels converges to one as the network size grows, and for methods that aim to directly maximize the nonconvex SBM and DCSBM likelihood functions (Amini et al., 2013; Bickel et al., 2013).

To overcome the computational challenge in fitting the SBM likelihood, Amini et al. (2013) proposed a novel pseudo likelihood approach that approximates the row sums within blocks using Poisson random variables, and simplifies the likelihood function by lifting the symmetry constraint on the adjacency matrix. This leads to a fast approximation to the block model likelihood, which in turn enables an efficient maximization that can easily handle up to millions of nodes. Additionally, the maximum pseudo-likelihood estimator is shown to achieve (weak) community detection consistency in the case of a sparse SBM with two communities. This pioneering work makes the SBM an attractive approach for network community detection, owing to its computational scalability and theoretical properties such as community detection consistency. However, the method may have two drawbacks. First, in the examples presented in Amini et al. (2013), the authors found that the pseudo-likelihood maximization algorithm empirically converged fast.

Figure 1: An illustrative example comparing the pseudo likelihood method of Amini et al. (2013) and the proposed profile-pseudo likelihood method. Details of the simulation setting are described in Section 5.1.

It is, however, not guaranteed that the algorithm will converge in general (see the example in Figure 1). Convergence is a critical property, as it guarantees that the final estimator exists, and it is therefore important both computationally and statistically. Second, the pseudo likelihood approach may not be suitable for small and medium scale networks, as the Poisson approximation may incur non-negligible approximation errors in such cases. In the case of the DCSBM, cleverly employing the observation that the conditional distribution (on node degrees) of the Poisson variables is multinomial, Amini et al. (2013) proposed a conditional pseudo likelihood approach that permits fast estimation and adapts to both small and large scale networks. However, that algorithm still does not have convergence guarantees.

Motivated by the pseudo likelihood approach, in this work we propose a new SBM likelihood fitting method that decouples the membership labels of the rows and columns in the likelihood function, treating the row labels as a vector of latent variables and the column labels as a vector of unknown parameters. Correspondingly, the likelihood can be maximized in an alternating fashion over the block model parameters and over the column labels, where the maximization now involves a tractable sum over the distribution of the latent row labels. Furthermore, we consider a profile-pseudo likelihood that adopts a hybrid framework of the profile likelihood and the pseudo likelihood, where the symmetry constraint on the adjacency matrix is also lifted. Our proposed method retains and improves on the computational efficiency of the pseudo likelihood method, performs well for both small and large scale networks, and has a provable convergence guarantee. We show that the community labels (i.e., column labels) estimated from our proposed method enjoy strong consistency, as long as the initial labels have an overlap with the truth beyond that of random guessing. We further consider two extensions of the proposed method, to the DCSBM and to the bipartite stochastic block model (BiSBM; Larremore et al., 2014).

Our work is closely related to a recent and growing literature on the pursuit of strong consistency (or exact recovery) in community detection (see, for example, Abbe et al., 2015; Lei and Zhu, 2017; Gao et al., 2017; Gao et al., 2018). The strong consistency property may be more desirable than weak consistency, as it enables establishing the asymptotic normality of the SBM plug-in estimators (Amini et al., 2013) and performing goodness of fit tests (Lei, 2016; Hu et al., 2020b). To achieve strong consistency, these methods usually apply a refinement step after obtaining an initial label, which is assumed to be weakly consistent. For example, in Gao et al. (2017), a majority voting algorithm is applied to the clustering labels obtained from spectral clustering. Similarly, our proposed profile-pseudo likelihood estimation can be viewed as a refinement of the initial labels that achieves strong consistency. As with other refinement algorithms, the scalability of our proposed method depends on the initialization step. While spectral clustering is used to produce initial solutions in our work, other initialization methods can be considered as well (see Section 7).

The rest of the paper is organized as follows. Section 2 introduces the profile-pseudo likelihood function and an efficient algorithm for its maximization, and discusses the convergence guarantee of the algorithm. Section 3 establishes the strong consistency of the community labels estimated by the proposed algorithm. Section 4 considers two important extensions of the proposed method. Section 5 demonstrates the efficacy of the proposed method through comparative simulation studies. Section 6 presents analyses of two real-world networks with communities. A discussion section concludes the paper.

2 Profile-Pseudo Likelihood

Let $G(V, E)$ denote a network, where $V = \{1, 2, \ldots, n\}$ is the set of $n$ nodes and $E$ is the set of edges between the nodes. The network $G(V, E)$ can be uniquely represented by the corresponding $n \times n$ adjacency matrix $A$, where $A_{ij} = 1$ if there is an edge $(i, j) \in E$ from node $i$ to node $j$ and $A_{ij} = 0$ otherwise. In our work, we focus on unweighted and undirected networks, and thus $A$ is a binary symmetric matrix. Under the stochastic block model, there are $K$ communities (or blocks) and each node belongs to exactly one of the communities. Let $c = (c_1, c_2, \ldots, c_n) \in \{1, 2, \ldots, K\}^n$ denote the true community labels of the nodes, and assume that the $c_i$'s are i.i.d. categorical variables with parameter vector $\pi = (\pi_1, \ldots, \pi_K)$, where $\sum_k \pi_k = 1$. Conditional on the community labels, the edge variables $A_{ij}$ are independent Bernoulli variables with $E(A_{ij} \mid c) = P_{c_i c_j}$, where $P \in [0,1]^{K \times K}$ is the symmetric edge-probability matrix whose $kl$-th entry $P_{kl}$ characterizes the probability of connection between nodes in communities $k$ and $l$. Let $\Omega = (\pi, P)$. Our objective is to estimate the unknown community labels $c$ given the observed adjacency matrix $A$.
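To make the generative model concrete, a network from an SBM can be sampled as in the following minimal sketch (NumPy; function and variable names are ours):

```python
import numpy as np

def sample_sbm(n, pi, P, seed=None):
    """Sample an undirected SBM: labels c_i ~ Categorical(pi) i.i.d.,
    and A_ij ~ Bernoulli(P[c_i, c_j]) independently for i < j."""
    rng = np.random.default_rng(seed)
    c = rng.choice(len(pi), size=n, p=pi)          # community labels
    U = rng.random((n, n)) < P[c[:, None], c]      # independent Bernoulli draws
    A = np.triu(U, 1)                              # keep i < j, zero diagonal
    return (A | A.T).astype(int), c                # symmetrize

# Example: two balanced communities, denser within than between.
P = np.array([[0.20, 0.07],
              [0.07, 0.20]])
A, c = sample_sbm(200, pi=[0.5, 0.5], P=P, seed=0)
```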

Denote the rows of $A$ as $a_i = (A_{i1}, A_{i2}, \ldots, A_{in})$, $1 \le i \le n$, and let $e = (e_1, e_2, \ldots, e_n) \in \{1, 2, \ldots, K\}^n$ denote the column labeling vector. Define the pseudo likelihood function as
$$L_{\mathrm{PL}}(\Omega, e; \{a_i\}) = \prod_{i=1}^n \left\{ \sum_{l=1}^K \pi_l \prod_{j=1}^n P_{l e_j}^{A_{ij}} \left(1 - P_{l e_j}\right)^{1 - A_{ij}} \right\}, \tag{1}$$
with its logarithm
$$\ell_{\mathrm{PL}}(\Omega, e; \{a_i\}) = \sum_{i=1}^n \log \left\{ \sum_{l=1}^K \pi_l \prod_{j=1}^n P_{l e_j}^{A_{ij}} \left(1 - P_{l e_j}\right)^{1 - A_{ij}} \right\}.$$

We make a few remarks on the objective function defined in (1). First, in (1), we treat the row labels as a vector of latent variables and the column labels $e$ as a vector of unknown model parameters. That is, given $e_j$, each $A_{ij}$ is considered a mixture of $K$ Bernoulli random variables with means $P_{l e_j}$, $1 \le l \le K$. This formulation decouples the row and column labels, and allows us to derive a tractable sum when optimizing over the column labels $e$ and the block model parameter $\Omega$. Second, the objective function $L_{\mathrm{PL}}(\Omega, e; \{a_i\})$ is calculated while lifting the symmetry constraint on the adjacency matrix $A$, or equivalently, ignoring the dependence among the rows $a_i$. Hence, we refer to (1) as the pseudo likelihood function, which can be considered an approximation to the SBM likelihood function.

We consider an iterative algorithm that alternates between updating $e$ and updating $\Omega$. In each iteration, the estimation is carried out by first profiling out the nuisance parameter $\Omega$ using $\max_\Omega L_{\mathrm{PL}}(\Omega, e; \{a_i\})$ given the current estimate of $e$, and then maximizing the profile likelihood with respect to $e$. We refer to this as the profile-pseudo likelihood method. We show in Theorem 1 the convergence guarantee of this efficient algorithm, and establish in Theorem 2 the strong consistency of the estimated column labels $e$.

The estimation procedure proceeds in detail as follows. First, given the current $\hat{e}$ and treating the row labels as a vector of latent variables, $L_{\mathrm{PL}}(\Omega, \hat{e}; \{a_i\})$ can be viewed as the likelihood of a mixture model with i.i.d. observations $\{a_i\}$ and parameter $\Omega$. Consequently, $L_{\mathrm{PL}}(\Omega, \hat{e}; \{a_i\})$ can be maximized over $\Omega$ using an expectation-maximization (EM) algorithm, where both the E-step and M-step updates have closed-form expressions. Next, given the estimated $\widehat{\Omega}$, we update $e$, treating $L_{\mathrm{PL}}(\widehat{\Omega}, e; \{a_i\})$ as the objective function. In this step, finding the maximizer of $L_{\mathrm{PL}}(\widehat{\Omega}, e; \{a_i\})$ with respect to $e$ is an NP-hard problem since, in principle, it requires searching over all possible label assignments. As an alternative, we propose a fast updating rule that leads to a non-decreasing objective function $L_{\mathrm{PL}}(\widehat{\Omega}, e; \{a_i\})$ (although it is not necessarily maximized), which ensures the desirable ascent property of the iterative algorithm. This algorithm is summarized in Algorithm 1.

In what follows, we discuss the profile-pseudo likelihood algorithm in detail. We refer to the iterations between updating $e$ and $\Omega$ as the outer iterations, and the iterations in the EM algorithm used to update $\Omega$ as the inner iterations. Specifically, in the $(t+1)$-th step of the EM (inner) iteration, given $e^{(s)}$ and the parameter estimate from the previous EM update $\Omega^{(s,t)} = (\pi^{(s,t)}, P^{(s,t)})$, we let
$$\tau_{ik}^{(s,t+1)} = \frac{\pi_k^{(s,t)} \prod_{j=1}^n \Big(P^{(s,t)}_{k e_j^{(s)}}\Big)^{A_{ij}} \Big(1 - P^{(s,t)}_{k e_j^{(s)}}\Big)^{1 - A_{ij}}}{\sum_{l=1}^K \pi_l^{(s,t)} \prod_{j=1}^n \Big(P^{(s,t)}_{l e_j^{(s)}}\Big)^{A_{ij}} \Big(1 - P^{(s,t)}_{l e_j^{(s)}}\Big)^{1 - A_{ij}}} \tag{2}$$
for each $1 \le i \le n$ and $1 \le k \le K$, which calculates the conditional probability that the row label of node $i$ equals $k$ at the $(t+1)$-th step of the EM iteration. Next, we define
$$Q(\Omega \mid \Omega^{(s,t)}, e^{(s)}) = E_{z \mid \{a_i\};\, \Omega^{(s,t)}, e^{(s)}} \left[ \log f\left(\{a_i\}, z; \Omega, e^{(s)}\right) \right],$$
where $z$ denotes the latent row labels and
$$f(\{a_i\}, z; \Omega, e^{(s)}) = \prod_{i=1}^n \pi_{z_i} \left\{ \prod_{j=1}^n P_{z_i e_j^{(s)}}^{A_{ij}} \Big(1 - P_{z_i e_j^{(s)}}\Big)^{1 - A_{ij}} \right\}.$$

In the M-step, $\Omega^{(s,t+1)}$ is updated by
$$\Omega^{(s,t+1)} = \arg\max_{\Omega}\, Q(\Omega \mid \Omega^{(s,t)}, e^{(s)}),$$
which has closed-form solutions as follows:
$$\pi_k^{(s,t+1)} = \frac{1}{n} \sum_{i=1}^n \tau_{ik}^{(s,t+1)}, \qquad P_{kl}^{(s,t+1)} = \frac{\sum_{i=1}^n \sum_{j=1}^n A_{ij}\, \tau_{ik}^{(s,t+1)} I(e_j^{(s)} = l)}{\sum_{i=1}^n \sum_{j=1}^n \tau_{ik}^{(s,t+1)} I(e_j^{(s)} = l)}, \tag{3}$$
for $1 \le k, l \le K$. Once the EM algorithm has converged, we let $\Omega^{(s+1)}$ and $\{\tau_{il}^{(s+1)}\}$ take the values from the last EM update. Next, given $\Omega^{(s+1)}$, we propose to update $e$ as follows:
$$e_j^{(s+1)} = \arg\max_{k \in \{1, 2, \ldots, K\}} \sum_{i=1}^n \sum_{l=1}^K \tau_{il}^{(s+1)} \left\{ A_{ij} \log P_{lk}^{(s+1)} + (1 - A_{ij}) \log\Big(1 - P_{lk}^{(s+1)}\Big) \right\}. \tag{4}$$

The update for $e^{(s+1)}$ is obtained separately for each node, and can therefore be carried out efficiently. As we discussed earlier, this update is not guaranteed to maximize the pseudo likelihood function $L_{\mathrm{PL}}(\Omega^{(s+1)}, e; \{a_i\})$, which is in fact an intractable problem. Nevertheless, it can be shown that the update in (4) leads to a non-negative increment in the pseudo likelihood. This gives the desirable ascent property, which we formally state in the following theorem.
Algorithm 1 Profile-Pseudo Likelihood Maximization Algorithm.
Step 1: Initialize $e^{(0)}$ using spectral clustering with permutations (SCP).
Step 2: Calculate $\Omega^{(0)} = (\pi^{(0)}, P^{(0)})$. That is, for $1 \le l, k \le K$,
$$\pi_k^{(0)} = \frac{1}{n} \sum_{i=1}^n I(e_i^{(0)} = k), \qquad P_{kl}^{(0)} = \frac{\sum_{i=1}^n \sum_{j=1}^n A_{ij} I(e_i^{(0)} = k) I(e_j^{(0)} = l)}{\sum_{i=1}^n \sum_{j=1}^n I(e_i^{(0)} = k) I(e_j^{(0)} = l)}.$$
Step 3: Initialize $\Omega^{(0,0)} = (\pi^{(0,0)}, P^{(0,0)}) = (\pi^{(0)}, P^{(0)})$.
repeat
  repeat
    Step 4: E-step: compute $\tau_{ik}^{(s,t+1)}$ using (2) for $1 \le k \le K$ and $1 \le i \le n$.
    Step 5: M-step: compute $\pi_k^{(s,t+1)}$ and $P_{kl}^{(s,t+1)}$ using (3) for $1 \le k, l \le K$.
  until the EM algorithm converges.
  Step 6: Set $\Omega^{(s+1)}$ and $\{\tau_{ik}^{(s+1)}\}$ to be the final EM update.
  Step 7: Given $\Omega^{(s+1)}$ and $\{\tau_{ik}^{(s+1)}\}$, update $e_j^{(s+1)}$, $1 \le j \le n$, using (4).
until the profile-pseudo likelihood converges.
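The authors' experiments are implemented in Matlab; purely for illustration, here is a minimal NumPy sketch of Algorithm 1 under our own naming, with labels coded $0, \ldots, K-1$, dense matrix algebra, and clipping of $P$ away from 0 and 1 (a scalable implementation would use sparse matrices and the SCP initializer):

```python
import numpy as np
from scipy.special import logsumexp

def ppl_sbm(A, e, K, n_outer=60, n_inner=20, tol=1e-6):
    """Minimal sketch of Algorithm 1: alternate the inner EM updates
    (2)-(3) for Omega = (pi, P) with the column-label update (4)."""
    n = A.shape[0]
    E = np.eye(K)[e]                          # one-hot column labels, (n, K)
    Nk = E.sum(0)                             # column-block sizes
    pi = Nk / n
    P = np.clip(E.T @ A @ E / np.maximum(np.outer(Nk, Nk), 1.0),
                1e-10, 1 - 1e-10)
    old_ll = -np.inf
    for _ in range(n_outer):
        for _ in range(n_inner):              # inner EM over Omega
            B = A @ E                         # B[i, l] = # neighbours of i in block l
            logw = (np.log(pi) + B @ np.log(P).T
                    + (Nk - B) @ np.log(1 - P).T)        # mixture log-weights
            tau = np.exp(logw - logsumexp(logw, axis=1, keepdims=True))  # E-step (2)
            pi = tau.mean(0)                                             # M-step (3)
            P = np.clip(tau.T @ A @ E
                        / np.maximum(np.outer(tau.sum(0), Nk), 1e-12),
                        1e-10, 1 - 1e-10)
        # node-wise label update (4), computed for all columns j at once
        S = np.log(P).T @ (tau.T @ A) + np.log(1 - P).T @ (tau.T @ (1 - A))
        e = S.argmax(0)
        E = np.eye(K)[e]
        Nk = E.sum(0)
        ll = logsumexp(logw, axis=1).sum()    # pseudo log-likelihood (1)
        if abs(ll - old_ll) < tol:
            break
        old_ll = ll
    return e, pi, P
```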

Theorem 1. For a given initial labeling vector $e^{(0)}$, Algorithm 1 generates a sequence $\{\Omega^{(s)}, e^{(s)}\}$ such that
$$L_{\mathrm{PL}}(\Omega^{(s)}, e^{(s)}; \{a_i\}) \le L_{\mathrm{PL}}(\Omega^{(s+1)}, e^{(s+1)}; \{a_i\}).$$

The proof of Theorem 1 is provided in the supplemental material. Theorem 1 guarantees that the pseudo likelihood function is non-decreasing at each iteration of Algorithm 1. Assuming that the parameter space for $\Omega$ is compact, we conclude that $L_{\mathrm{PL}}(\Omega^{(s)}, e^{(s)}; \{a_i\})$ converges as the number of iterations $s$ increases. This is a desirable property that guarantees the stability of the proposed algorithm. Since the pseudo likelihood function is not concave, Algorithm 1 is not guaranteed to converge to the global optimum; whether it converges to a global or local solution depends on the initial value. In practice, we find that the initialization procedure in Algorithm 1 shows good performance, in that we are able to achieve high clustering accuracy in our simulation studies. To avoid local solutions in real data applications, we recommend considering multiple random initializations in addition to the initialization in Algorithm 1.

Finally, we summarize the differences between our proposal and the method in Amini et al. (2013). Both methods iterate between two parameter updating steps, namely, the step that updates the block model parameter $\Omega$ using EM and the step that updates the membership labels. However, the likelihood function is treated very differently in the two methods. As the row and column labels are enforced to be the same in Amini et al. (2013), a Poisson approximation is needed in the pseudo likelihood calculation. The label $e$ in Amini et al. (2013) is treated as an initial value in the EM estimation, and its value is assigned heuristically in each iteration. As such, the resulting procedure is not guaranteed to converge, as seen in Figure 1. In comparison, our method decouples the row and column labels (i.e., $z$ and $e$), and does not require a Poisson approximation in the pseudo likelihood calculation. When updating the column labels $e$, we use $L_{\mathrm{PL}}(\widehat{\Omega}, e; \{a_i\})$ as the objective function that guides our updating routine. The proposed node-wise update enjoys the ascent property, which in turn guarantees the convergence of the algorithm (see Theorem 1). We also remark that, due to the differences in our problem formulation, our theoretical analysis is nontrivial and new technical tools are needed.

3 Consistency Results

In this section, we investigate the strong consistency of the estimator obtained from one outer loop iteration (i.e., one update of the column labels $e$) of Algorithm 1, denoted as $\hat{c}\{e^{(0)}\}$, where $e^{(0)}$ is an initial value for Algorithm 1. We first consider strong consistency in the case of SBMs with two balanced communities, and then extend our strong consistency result to SBMs with $K$ communities.

We first present the consistency result for directed SBMs with two communities, fitted to directed networks, and then modify the result to handle the more challenging case of undirected SBMs, fitted to undirected networks. To separate the cases of directed and undirected SBMs, we adopt different notations for the corresponding adjacency matrices and edge-probability matrices. First, for a directed SBM, we denote the adjacency matrix as $\tilde{A}$ and assume that its entries $\tilde{A}_{ij}$ are mutually independent given $c$, that is,
$$\text{(directed)} \qquad \tilde{A}_{ij} \mid c \sim \mathrm{Bernoulli}(\tilde{P}_{c_i c_j}), \quad \text{for } 1 \le i, j \le n. \tag{5}$$
For an undirected SBM, we denote the adjacency matrix as $A$ and assume that its entries $A_{ij}$, $i \le j$, are mutually independent given $c$, that is,
$$\text{(undirected)} \qquad A_{ij} \mid c \sim \mathrm{Bernoulli}(P_{c_i c_j}) \ \text{ and } \ A_{ij} = A_{ji}, \quad \text{for } 1 \le i \le j \le n. \tag{6}$$
Furthermore, we assume that the edge-probability matrix of the directed SBM has the form
$$\tilde{P} = \frac{1}{m} \begin{pmatrix} a & b \\ b & a \end{pmatrix}, \tag{7}$$
while that of the undirected SBM has the form
$$P = \frac{2}{m} \begin{pmatrix} a & b \\ b & a \end{pmatrix} - \frac{1}{m^2} \begin{pmatrix} a^2 & b^2 \\ b^2 & a^2 \end{pmatrix}. \tag{8}$$
Note that each entry of (8) satisfies $2\tilde{P}_{kl} - \tilde{P}_{kl}^2 = 1 - (1 - \tilde{P}_{kl})^2$, the edge probability obtained by symmetrizing two independent directed edges. Such a coupling between the directed and undirected models makes it possible to extend the consistency result of the directed SBM to the undirected case.

Given an initial labeling vector $e^{(0)}$ and estimates $\hat{a}$, $\hat{b}$ and $(\hat\pi_1, \hat\pi_2)$, the estimator $\hat{c}\{e^{(0)}\}$ can be written as
$$\hat{c}_j\{e^{(0)}\} = \arg\max_{k \in \{1,2\}} \sum_{i=1}^n \sum_{l=1}^2 \hat\tau_{il} \left\{ A_{ij} \log \hat{P}_{lk} + (1 - A_{ij}) \log(1 - \hat{P}_{lk}) \right\}, \tag{9}$$
where $\hat\tau_{il}$ is defined as in (2), and $\hat{P}$ is defined as in (7) for directed SBMs and as in (8) for undirected SBMs, with $a$ and $b$ replaced by $\hat{a}$ and $\hat{b}$, respectively. Here the estimates $\hat{a}$, $\hat{b}$ and $(\hat\pi_1, \hat\pi_2)$ are outputs from the inner loop (i.e., EM) iterations, and are in effect initial values for the outer loop calculation. Consistency of the inner loop (i.e., EM) outputs $\hat{a}$, $\hat{b}$ and $(\hat\pi_1, \hat\pi_2)$ can be established using the result in Amini et al. (2013). In our theoretical analysis, we focus our efforts on establishing strong consistency of the column labels $e$ estimated in the outer loop, given that the outer loop initial values satisfy $(\hat{a}, \hat{b}) \in \mathcal{P}^{\delta}_{a,b}$ in (10) and $\hat\pi_1 = \hat\pi_2 = 1/2$.

For SBMs with two balanced communities, we make the following assumption:

(A) Each community contains $m = n/2$ nodes and $\hat\pi_1 = \hat\pi_2 = 1/2$.

The assumption that $\hat\pi_1 = \hat\pi_2 = 1/2$ is reasonable, as the inner loop outputs $(\hat\pi_1, \hat\pi_2)$ are consistent estimators of $(\pi_1, \pi_2) = (1/2, 1/2)$, as shown in Amini et al. (2013). Without loss of generality, let $c_i = 1$ for $i \in \{1, \ldots, m\}$ and $c_i = 2$ for $i \in \{m+1, \ldots, n\}$. Assume that $e^{(0)} \in \{1,2\}^n$ assigns equal numbers of nodes to the two communities, i.e., the initial labeling vector is balanced. Let $e^{(0)}$ match the truth on $\gamma m$ labels in each of the two communities for some $\gamma \in (0, 1)$; we assume $\gamma m$ to be an integer. Next, let $\mathcal{E}^\gamma$ denote the set that collects all such initial labeling vectors, i.e.,
$$\mathcal{E}^\gamma = \left\{ e^{(0)} \in \{1,2\}^n : \sum_{i=1}^m I(e_i^{(0)} = 1) = \gamma m, \ \sum_{i=m+1}^n I(e_i^{(0)} = 2) = \gamma m \right\}.$$

Note that $\gamma = 1/2$ corresponds to "no correlation" between $e^{(0)}$ and $c$, whereas $\gamma = 0$ and $\gamma = 1$ both correspond to perfect correlation. In our analysis, we do not require knowing the value of $\gamma$, or knowing which labels are matched. In Theorem 2, we show that the amount of overlap $\gamma$ can take any value, as long as $\gamma \ne 1/2$. Our goal is to establish strong consistency for $\hat{c}\{e^{(0)}\}$. For a constant $\delta > 1$, we define $\mathcal{P}^\delta_{a,b}$ as follows:
$$\mathcal{P}^\delta_{a,b} = \left\{ (\hat{a}, \hat{b}) : \frac{\hat{a}}{\hat{b}}\, I(a > b) + \frac{\hat{b}}{\hat{a}}\, I(a < b) \ge \delta \right\}. \tag{10}$$
The set $\mathcal{P}^\delta_{a,b}$ specifies that $(\hat{a}, \hat{b})$ has the same ordering as $(a, b)$, and that the relative difference between the estimates $\hat{a}$ and $\hat{b}$ is bounded below. Our next theorem considers the collection of estimates $(\hat{a}, \hat{b})$ in $\mathcal{P}^\delta_{a,b}$.

Theorem 2. Assume (A) holds, $\delta > 1$, $\gamma \in (0,1) \setminus \{\tfrac12\}$ and $\frac{(a-b)^2}{a+b} \ge C \log n$ for a sufficiently large constant $C > 0$. For a directed SBM in (5) with the edge probabilities given by (7) with $a \ne b$, we have that for any $\epsilon > 0$ there exists $N > 0$ such that for all $n \ge N$, the following holds:
$$P\left\{ \bigcap_{(\hat{a}, \hat{b}) \in \mathcal{P}^\delta_{a,b}} \hat{c}\{e^{(0)}\} = c \right\} \ge 1 - \left( n\, e^{-\frac{(a-b)^2 - 4\epsilon(a-b) + 4\epsilon^2}{4(a+b)}} + n(n+2)\, e^{-\frac{(2\gamma-1)^2 (a-b)^2}{8(a+b)}} \right),$$
for any $e^{(0)} \in \mathcal{E}^\gamma$, where $\hat{c}\{e^{(0)}\} = c$ means that they belong to the same equivalence class of label permutations.

The proof of Theorem 2 is provided in the supplemental material. It can be seen from Theorem 2 that the one-step estimate $\hat{c}\{e^{(0)}\}$ for a directed SBM is a strongly consistent estimate of $c$ for any $e^{(0)} \in \mathcal{E}^\gamma$. Note that weak consistency was established in Amini et al. (2013) under the assumption that $\frac{(a-b)^2}{a+b} \to \infty$. In comparison, our result requires $\frac{(a-b)^2}{a+b} \ge C \log n$ to establish strong consistency. In the existing literature on strong consistency, the condition $\frac{\lambda_n}{\log n} \to \infty$ is commonly imposed (Bickel and Chen, 2009; Zhao et al., 2012), where $\lambda_n$ denotes the average network degree. Specifically, under the SBM setting considered in Bickel and Chen (2009) and Zhao et al. (2012), we have that $a - b \asymp \lambda_n$ and $a + b \asymp \lambda_n$, where $\asymp$ denotes that the two quantities on both sides are of the same order. In this case, $\frac{\lambda_n}{\log n} \to \infty$ implies $\frac{(a-b)^2}{a+b} \ge C \log n$ for any constant $C > 0$.

Theorem 2 guarantees strong consistency for any fixed $e^{(0)} \in \mathcal{E}^\gamma$. In comparison, the weak consistency in Amini et al. (2013) holds uniformly for all $e^{(0)} \in \mathcal{E}^\gamma$, even if $e^{(0)}$ is derived from the data. Indeed, $e^{(0)}$ is usually derived from the data using initialization procedures such as spectral clustering. For the strong consistency result to apply, one may consider a data splitting strategy following the method in Li et al. (2020). Specifically, we may sample a proportion of the node pairs to produce an initial value $e^{(0)}$ and estimate $\hat{c}(e^{(0)})$ using the rest of the node pairs. In this case, $e^{(0)}$ is independent of the data used for community detection, and the result in Theorem 2 can be used to ensure strong consistency of $\hat{c}(e^{(0)})$. In our numerical studies, for simplicity we did not use data splitting, and the simulation results show that the proposed method still performs well. We also note that Theorem 2 can be adapted to hold uniformly for all $e^{(0)} \in \mathcal{E}^\gamma$ if stronger conditions are placed on $\gamma$ and $a, b$. Specifically, if the misclassification ratio of $e^{(0)}$ is, for example, $O(1/(a+b))$ and the condition on $a, b$ is strengthened to $(a - b) \gtrsim n \log n$ (i.e., the average degree is at least of order $n \log n$), then the strong consistency in Theorem 2 holds uniformly for all such $e^{(0)}$, even if $e^{(0)}$ is derived from the data. This can be shown by combining the union bound argument with a Stirling approximation that gives $\log \binom{n}{n_\gamma} \le n_\gamma \log(en/n_\gamma)$, where $n_\gamma$ is the number of misclassified nodes. The misclassification ratio of $O(1/(a+b))$ imposed above is known to hold with high probability for spectral clustering (see, for example, Corollary 3.2 in Lei and Rinaldo (2015)).
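As an illustration of the data splitting strategy just described, the following sketch (our naming; Li et al. (2020) develop a more general edge cross-validation scheme) randomly assigns each node pair to one of two edge-disjoint halves, one used for initialization and the other for refinement:

```python
import numpy as np

def split_node_pairs(A, frac=0.5, seed=None):
    """Randomly assign each node pair {i, j} to one of two edge-disjoint
    halves: A1 (used to compute the initial labels e^(0)) and A2 (used
    for refinement), so that e^(0) is independent of the refinement data."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    mask = np.triu(rng.random((n, n)) < frac, 1)
    mask = mask | mask.T                   # symmetric pair assignment
    return A * mask, A * ~mask             # diagonal of A is zero anyway
```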

Next, we consider the case of undirected SBMs. Let $a_\gamma = \left\{(1-\gamma)a + \gamma b\right\} I(\gamma > \tfrac12) + \left\{\gamma a + (1-\gamma)b\right\} I(\gamma < \tfrac12)$. We have the following result on the strong consistency of $\hat{c}\{e^{(0)}\}$.

Theorem 3. Assume (A) holds, $\delta > 1$, $\gamma \in (0,1) \setminus \{\tfrac12\}$ and $\frac{(a-b)^2}{a+b} \ge C \log n$ for a sufficiently large constant $C > 0$. For an undirected SBM in (6) with the edge probabilities given by (8) with $2(1+\epsilon) a_\gamma \le |(1 - 2\gamma)(a - b)|$ for some $\epsilon \in (0,1)$, there exist $\rho \in (0,1)$ and $N > 0$ such that for all $n \ge N$, the following holds:
$$P\left\{ \bigcap_{(\hat{a}, \hat{b}) \in \mathcal{P}^\delta_{a,b}} \hat{c}\{e^{(0)}\} = c \right\} \ge 1 - \left[ 3n\, e^{-\frac{\left(\frac{1-\rho}{4}\right)^2 (a-b)^2}{4(a+b)}} + n(n+2) \left\{ e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2 (2\gamma-1)^2 (a-b)^2}{4(a+b)}} + 2\, e^{-\frac{\epsilon^2/2}{1+\epsilon/2}\, a_\gamma} \right\} \right],$$
for any $e^{(0)} \in \mathcal{E}^\gamma$, where $\hat{c}\{e^{(0)}\} = c$ means that they belong to the same equivalence class of label permutations.

The proof of Theorem 3 is provided in the supplemental material. It can be seen that the one-step estimate $\hat{c}\{e^{(0)}\}$ for an undirected SBM is a strongly consistent estimate of $c$ for any $e^{(0)} \in \mathcal{E}^\gamma$. Given $\epsilon$ and $\gamma$, the condition $2(1+\epsilon) a_\gamma \le |(1 - 2\gamma)(a - b)|$ places an upper bound on $b/a$. For example, for $\epsilon = \tfrac13$ and $\gamma < \tfrac{1}{10}$, the above condition is satisfied if $b/a \le (1 - 10\gamma)/(9 - 10\gamma)$.

Strong consistency can be more desirable than weak consistency, as it enables normal distribution based inference and goodness of fit tests (see the numerical studies in Section 5.2). For example, consider an SBM with $K = 2$, $\pi = (\pi_1, \pi_2)$ and true community labels $c = (c_1, c_2, \ldots, c_n)$. Suppose we can construct a label vector $\hat{c}^{(w)}$ such that $\{\hat{c}_i^{(w)}\}_{i=1}^n$ are independent with $P(\hat{c}_i^{(w)} \ne c_i) = 2p_n$ for $c_i = 1$ and $P(\hat{c}_i^{(w)} \ne c_i) = p_n$ for $c_i = 2$, where $p_n = 1/\log n$. Then it can be shown that $\hat{c}^{(w)}$ is weakly consistent, with a misclassification ratio of $O_p(1/\log n)$, but not strongly consistent for $c$. Let $\hat\pi_1^w = \sum_{i=1}^n I(\hat{c}_i^{(w)} = 1)/n$. It holds that $\sqrt{n}\left\{\hat\pi_1^w - \pi_1 - \frac{1 - 3\pi_1}{\log n}\right\} \xrightarrow{d} N\{0, \pi_1(1-\pi_1)\}$ (see the proof in the supplemental material). Thus the bias term of $\hat\pi_1^w$ is $O(1/\log n)$, which can be non-negligible for inference. On the other hand, for a strongly consistent estimator $\hat{c}^{(s)} = (\hat{c}_1^{(s)}, \hat{c}_2^{(s)}, \ldots, \hat{c}_n^{(s)})$, letting $\hat\pi_1^s = \sum_{i=1}^n I(\hat{c}_i^{(s)} = 1)/n$, it holds that $\sqrt{n}\{\hat\pi_1^s - \pi_1\} \xrightarrow{d} N\{0, \pi_1(1-\pi_1)\}$.
Next, we consider the more general case of directed and undirected SBMs with $K$ communities. Similar to Assumption (A), we make the following assumption:

(B) Each community contains $m = n/K$ nodes and $\hat\pi_k = 1/K$ for $k = 1, \ldots, K$.

Let the edge-probability matrix of the directed SBM be
$$\tilde{P}_{kl} = \frac{a}{m}\, 1(k = l) + \frac{b}{m}\, 1(k \ne l), \tag{11}$$
and that of the undirected SBM be
$$P_{kl} = \left( \frac{2a}{m} - \frac{a^2}{m^2} \right) 1(k = l) + \left( \frac{2b}{m} - \frac{b^2}{m^2} \right) 1(k \ne l), \tag{12}$$
for $k, l = 1, \ldots, K$. Without loss of generality, let $c_i = k$ for $i \in \{(k-1)m + 1, \ldots, km\}$, $k = 1, \ldots, K$. Let $\mathcal{E}^\gamma$ denote the set that collects all initial labeling vectors such that
$$\mathcal{E}^\gamma = \left\{ e^{(0)} \in \{1, \ldots, K\}^n : \sum_{i=(k-1)m+1}^{km} I(e_i^{(0)} = k) = \gamma_k m, \ \sum_{i=1}^n I(e_i^{(0)} = k) = m, \ k = 1, \ldots, K \right\},$$
where $\gamma = (\gamma_1, \ldots, \gamma_K)$. Corollaries 1 and 2 establish the strong consistency of the profile-pseudo likelihood estimators for directed and undirected SBMs, respectively.

Corollary 1. Assume (B) holds, $\delta > 1$, $\min\{\gamma_1, \gamma_2, \ldots, \gamma_K\} \in (\tfrac12, 1)$ and $\frac{(a-b)^2}{a+b} \ge C \log n$ for a sufficiently large constant $C > 0$. For a directed SBM in (5) with the edge probabilities given by (11) with $a \ne b$, we have that for each $\epsilon > 0$ there exists $N > 0$ such that for all $n \ge N$, the following holds:
$$P\left\{ \bigcap_{(\hat{a}, \hat{b}) \in \mathcal{P}^\delta_{a,b}} \hat{c}\{e^{(0)}\} = c \right\} \ge 1 - \left\{ (K-1)\, n\, e^{-\frac{(a-b)^2 - 4\epsilon(a-b) + 4\epsilon^2}{4(a+b)}} + \frac{(10K - 8)n}{K} \sum_{k=1}^K \sum_{l=1}^K e^{-\frac{(\gamma_k + \gamma_l - 1)^2 (a-b)^2}{8(a+b)}} \right\}, \tag{13}$$
for any $e^{(0)} \in \mathcal{E}^\gamma$, where $\hat{c}\{e^{(0)}\} = c$ means that they belong to the same equivalence class of label permutations.

Corollary 2. Assume (B) holds, $\delta > 1$, $\min\{\gamma_1, \gamma_2, \ldots, \gamma_K\} \in (\tfrac12, 1)$ and $\frac{(a-b)^2}{a+b} \ge C \log n$ for a sufficiently large constant $C > 0$. For an undirected SBM in (6) with the edge probabilities given by (12) with $2(1+\epsilon) a_{\gamma_k} \le (\gamma_k + \gamma_l - 1)(a - b)$ for all $1 \le k, l \le K$ and some $\epsilon \in (0,1)$, where $a_{\gamma_k} = (1 - \gamma_k)a + \gamma_k b$, there exist $\rho \in (0,1)$ and $N > 0$ such that for all $n \ge N$, the following holds:
$$P\left\{ \bigcap_{(\hat{a}, \hat{b}) \in \mathcal{P}^\delta_{a,b}} \hat{c}\{e^{(0)}\} = c \right\} \ge 1 - \left[ 3(K-1)\, n\, e^{-\frac{\left(\frac{1-\rho}{4}\right)^2 (a-b)^2}{2(a+b)}} + \frac{(10K - 8)n^2}{K} \sum_{k=1}^K \sum_{l=1}^K \left\{ e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2 (\gamma_k + \gamma_l - 1)^2 (a-b)^2}{6(a+b)}} + 2\, e^{-\frac{3\epsilon^2 a_{\gamma_k}}{8(4+\epsilon)}} \right\} \right],$$
for any $e^{(0)} \in \mathcal{E}^\gamma$, where $\hat{c}\{e^{(0)}\} = c$ means that they belong to the same equivalence class of label permutations.

The proofs of Corollaries 1 and 2 follow steps very similar to those in the proofs of Theorems 2 and 3, respectively, and we omit the details.

4 Extensions

In this section, we study two useful extensions of the proposed method. First, we consider fitting the degree-corrected stochastic block model with the proposed profile-pseudo likelihood method. Second, we consider fitting the bipartite stochastic block model (see Section A5 in the supplemental material).

It has often been observed that real-world networks exhibit high degree heterogeneity, with a few nodes having a large number of connections and the majority of the rest having a small number of connections. The stochastic block model, however, cannot accommodate such degree heterogeneity. To incorporate degree heterogeneity in community detection, Karrer and Newman (2011) proposed the degree-corrected SBM. Specifically, conditional on the label vector $c$, it is assumed that the edge variables $A_{ij}$ for all $i \le j$ are mutually independent Poisson variables with
$$E[A_{ij} \mid c] = \theta_i \theta_j \lambda_{c_i c_j},$$
where $\Lambda = [\lambda_{kl}]$ is a $K \times K$ symmetric matrix and $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$ is a degree parameter vector, with the additional constraint $\sum_{i=1}^n \theta_i / n = 1$ that ensures identifiability (Zhao et al., 2012).

Define $\Omega = (\pi, \Lambda, \theta)$. To fit the DCSBM to an observed adjacency matrix $A$, we define the following log pseudo likelihood function:
$$\ell^{\mathrm{DC}}_{\mathrm{PL}}(\Omega, e; \{a_i\}) = \sum_{i=1}^n \log \left\{ \sum_{l=1}^K \pi_l \prod_{j=1}^n e^{-\theta_i \theta_j \lambda_{l e_j}} \left( \theta_i \theta_j \lambda_{l e_j} \right)^{A_{ij}} \right\}.$$
Let $d_i = \sum_{j=1}^n A_{ij}$, $1 \le i \le n$. A profile-pseudo likelihood algorithm that maximizes $\ell^{\mathrm{DC}}_{\mathrm{PL}}(\Omega, e; \{a_i\})$ is described in Algorithm 2. At step 4, we update the conditional probabilities for the row labels by
$$\tau_{ik}^{(s,t+1)} = \frac{\pi_k^{(s,t)} \prod_{j=1}^n e^{-\theta_i^{(s,t)} \theta_j^{(s,t)} \lambda^{(s,t)}_{k e_j^{(s)}}} \Big( \theta_i^{(s,t)} \theta_j^{(s,t)} \lambda^{(s,t)}_{k e_j^{(s)}} \Big)^{A_{ij}}}{\sum_{l=1}^K \pi_l^{(s,t)} \prod_{j=1}^n e^{-\theta_i^{(s,t)} \theta_j^{(s,t)} \lambda^{(s,t)}_{l e_j^{(s)}}} \Big( \theta_i^{(s,t)} \theta_j^{(s,t)} \lambda^{(s,t)}_{l e_j^{(s)}} \Big)^{A_{ij}}}. \tag{14}$$

Algorithm 2 DCSBM Profile-Pseudo Likelihood Maximization Algorithm.
Step 1: Initialize $e^{(0)}$ using spectral clustering with permutations (SCP).
Step 2: Calculate $\Omega^{(0)} = (\pi^{(0)}, \Lambda^{(0)}, \theta^{(0)})$. That is, for $1 \le l, k \le K$ and $1 \le i \le n$,
$$\pi_k^{(0)} = \frac{1}{n} \sum_{i=1}^n I(e_i^{(0)} = k), \quad \theta_i^{(0)} \propto d_i, \quad \lambda_{kl}^{(0)} = \frac{\sum_{i=1}^n \sum_{j=1}^n A_{ij} I(e_i^{(0)} = k) I(e_j^{(0)} = l)}{\sum_{i=1}^n \sum_{j=1}^n I(e_i^{(0)} = k) I(e_j^{(0)} = l)\, \theta_i^{(0)} \theta_j^{(0)}}.$$
Step 3: Initialize $\Omega^{(0,0)} = (\pi^{(0,0)}, \Lambda^{(0,0)}, \theta^{(0,0)}) = (\pi^{(0)}, \Lambda^{(0)}, \theta^{(0)})$.
repeat
  repeat
    Step 4: E-step: compute $\tau_{ik}^{(s,t+1)}$ using (14) for $1 \le k \le K$ and $1 \le i \le n$.
    Step 5: CM-step: compute $\pi^{(s,t+1)}$, $\Lambda^{(s,t+1)}$ and $\theta^{(s,t+1)}$. For $1 \le k, l \le K$, set
    $$\pi_k^{(s,t+1)} = \frac{1}{n} \sum_{i=1}^n \tau_{ik}^{(s,t+1)}, \qquad \lambda_{kl}^{(s,t+1)} = \frac{\sum_{i=1}^n \sum_{j=1}^n \tau_{ik}^{(s,t+1)} I(e_j^{(s)} = l) A_{ij}}{\sum_{i=1}^n \sum_{j=1}^n \tau_{ik}^{(s,t+1)} I(e_j^{(s)} = l)\, \theta_i^{(s,t)} \theta_j^{(s,t)}}.$$
    Letting $g_{ij}^{(s,t+1)} = \sum_{k,l=1}^K \tau_{ik}^{(s,t+1)} I(e_j^{(s)} = l)\, \lambda_{kl}^{(s,t+1)}$, for $1 \le i \le n$, set
    $$\theta_i^{(s,t+1)} = \left\{ -h_i^{(s,t+1)} + \sqrt{\big(h_i^{(s,t+1)}\big)^2 + 8 d_i\, g_{ii}^{(s,t+1)}} \right\} \Big/ \left( 4\, g_{ii}^{(s,t+1)} \right),$$
    where $h_i^{(s,t+1)} = \sum_{j=1}^{i-1} \theta_j^{(s,t+1)} g_{ij}^{(s,t+1)} + \sum_{j=i+1}^{n} \theta_j^{(s,t)} g_{ij}^{(s,t+1)}$.
  until the ECM algorithm converges.
  Step 6: Set $\Omega^{(s+1)}$ to be the final ECM update.
  Step 7: Given $\Omega^{(s+1)}$, update $e_j^{(s+1)}$, $1 \le j \le n$, using
  $$e_j^{(s+1)} = \arg\max_{k \in \{1,2,\ldots,K\}} \sum_{i=1}^n \sum_{l=1}^K \left\{ -\theta_i^{(s+1)} \theta_j^{(s+1)} \lambda_{lk}^{(s+1)} + A_{ij} \log \lambda_{lk}^{(s+1)} \right\} \tau_{il}^{(s+1)}.$$
until the profile-pseudo likelihood converges.

At step 5, we update the parameters by sequentially solving the following optimization problems:
$$(\pi^{(s,t+1)}, \Lambda^{(s,t+1)}) = \arg\max_{(\pi, \Lambda)} Q(\pi, \Lambda, \theta^{(s,t)} \mid \Omega^{(s,t)}, e^{(s)}),$$
$$\theta_i^{(s,t+1)} = \arg\max_{\theta_i} Q(\pi^{(s,t+1)}, \Lambda^{(s,t+1)}, \theta_1^{(s,t+1)}, \ldots, \theta_{i-1}^{(s,t+1)}, \theta_i, \theta_{i+1}^{(s,t)}, \ldots, \theta_n^{(s,t)} \mid \Omega^{(s,t)}, e^{(s)}).$$

Here, the objective function $Q(\Omega \mid \Omega^{(s,t)}, e^{(s)})$ is defined as
$$Q(\Omega \mid \Omega^{(s,t)}, e^{(s)}) = E_{z \mid \{a_i\};\, \Omega^{(s,t)}, e^{(s)}} \left[ \log f\left( \{a_i\}, z; \Omega, e^{(s)} \right) \right],$$
where $z = (z_1, \ldots, z_n)^\top$ denotes the row label vector and
$$f(\{a_i\}, z; \Omega, e^{(s)}) = \prod_{i=1}^n \pi_{z_i} \left\{ \prod_{j=1}^n e^{-\theta_i \theta_j \lambda_{z_i e_j^{(s)}}} \frac{\Big( \theta_i \theta_j \lambda_{z_i e_j^{(s)}} \Big)^{A_{ij}}}{A_{ij}!} \right\}.$$

The inner loop of Algorithm 2, i.e., steps 4 and 5, is different from that in Algorithm 1, as it considers a conditional EM (ECM) update. Specifically, the objective function $Q(\Omega \mid \Omega^{(s,t)}, e^{(s)})$ in the M-step, i.e., step 5, which solves for the block parameters $\lambda_{kl}$ and the degree parameters $\theta_i$, is nonconvex and does not have closed-form solutions. Hence, directly optimizing it using numerical techniques can be computationally costly and is not guaranteed to find the global optimum. The ECM algorithm replaces the challenging optimization problem in the M-step with a sequence of alternating updates, each of which has a closed-form solution. It is easy to implement and enjoys the desirable ascent property (Meng and Rubin, 1993). Consequently, Algorithm 2 has convergence guarantees, which improves over Amini et al. (2013).
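To make the CM-step concrete, the following is a small sketch (ours, not the authors' Matlab implementation) of the coordinate update for the degree parameters in Step 5 of Algorithm 2, assuming the degrees $d_i$ and the matrix $G = [g_{ij}^{(s,t+1)}]$ have already been formed; updating $\theta$ in place reproduces the convention that $h_i$ uses the newest values for $j < i$:

```python
import numpy as np

def update_theta(theta, d, G):
    """One CM sweep over the degree parameters (Step 5 of Algorithm 2).
    Each coordinate maximizer solves 2*g_ii*t^2 + h_i*t - d_i = 0 in
    t = theta_i and takes the positive root."""
    n = len(theta)
    for i in range(n):
        h = theta @ G[i] - theta[i] * G[i, i]  # h_i: sum over j != i
        theta[i] = (-h + np.sqrt(h * h + 8.0 * d[i] * G[i, i])) / (4.0 * G[i, i])
    return theta
```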

We also note that in our profile-pseudo likelihood approach, while the conditional distribution (on node degrees) of the Poisson variables is multinomial, the multinomial coefficient (i.e., the factorial term $\frac{d_i!}{b_{i1}!\, b_{i2}! \cdots b_{iK}!}$) in the density function involves the column labels (through the $b_{ik}$'s). As such, optimizing for the column labels in the outer loop becomes highly challenging. In Algorithm 2, we work with the pseudo likelihood without conditioning on node degrees, which requires estimating the degree parameters in the M-step. This is different from Amini et al. (2013).

5 Simulation Studies

In this section, we carry out simulation studies to investigate the finite sample performance of our proposed profile-pseudo likelihood method (referred to as PPL), and to compare it with existing solutions, including spectral clustering with permutations (referred to as SCP) and the pseudo likelihood method (referred to as PL) proposed in Amini et al. (2013). Both SCP and PL are implemented using the code provided by Amini et al. (2013). We also compare with the strongly consistent majority voting method proposed in Gao et al. (2017) (see Section A6 in the supplemental material).

We consider two evaluation criteria. The first is the normalized mutual information (NMI), which measures the agreement between the true labeling vector and an estimated labeling vector. The NMI takes values between 0 and 1, and a larger value implies higher accuracy. The second is the CPU running time, which measures the computational cost. Note that the reported running times do not include the initialization step (see Section A6 in the supplemental material and the discussion in Section 7). All methods are implemented in Matlab and run on a single processor of an Intel(R) Core(TM) i7-4790 CPU 3.60 GHz PC.
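The NMI can be computed from two label vectors as in the following sketch; note that several normalizations of mutual information are in use, and this one, $2I(X;Y)/\{H(X)+H(Y)\}$, may differ from the exact variant used in our Matlab experiments. Labels are assumed to be 0-based integer NumPy arrays.

```python
import numpy as np

def nmi(x, y):
    """Normalized mutual information between two label vectors:
    NMI = 2 I(X;Y) / {H(X) + H(Y)}; equals 1 for matching partitions."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1.0)              # contingency counts
    joint /= len(x)                            # joint distribution
    px, py = joint.sum(1), joint.sum(0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return 2 * mi / (hx + hy)
```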

5.1 SBM

In this section, we simulate networks from SBMs. Three different settings are considered. In Setting 1, we evaluate the convergence of PPL and PL; in Setting 2, we compare the performance of PPL, SCP and PL when the networks are small and dense; and in Setting 3, we compare the three methods when the networks are large and sparse.

Setting 1: In this simulation, we evaluate the convergence performance of PPL and PL with varying initial labeling vectors. We simulate from SBMs with $n = 500$ nodes that are divided into $K$ equal sized communities, and the within/between community connecting probabilities are $P_{kl} = p_1 + p_2 \times 1(k = l)$, $k, l = 1, \ldots, K$. We consider $(K, p_1, p_2) = (2, 0.13, 0.07)$ and $(K, p_1, p_2) = (5, 0.10, 0.13)$. Both the PPL and PL algorithms are considered to have converged if the change of the latest update (relative to the previous one) is less than $10^{-6}$ or if the number of outer iterations exceeds 60. We let the NMI of the initial labeling vector vary from 0.1 to 0.5. All simulations are repeated 100 times.


Figure 2: Proportion of convergence of PPL and PL with initial labels of varying NMI.

Figure 3: NMI and computing time of PPL and PL with varying network size n.

The proportion of convergence for PPL and PL is presented in Figure 2. It is seen that PL does not have a satisfactory convergence performance. One example (in the case of $K = 2$) of the convergence of PPL and the non-convergence of PL is shown in Figure 1, where it is observed that the PL algorithm did not converge, and the final estimate has a smaller log pseudo likelihood than the initial value.

Setting 2: In this simulation, we compare the performance of SCP, PL, and PPL on small-scale and dense networks. The PL method is not expected to perform well in this setting due to the relatively large Poisson approximation error. We acknowledge that many networks in real applications are large and/or sparse, and we note that here we use simulated examples to investigate a limitation of the PL method. We simulate from SBMs with $n$ nodes that are divided into $K = 2$ equal sized communities, and the within/between community connecting probabilities are $P_{kl} = p_1 + p_2 \times 1(k = l)$, $k, l = 1, \ldots, K$. We consider $(p_1, p_2) = (0.84, 0.06)$. Both PPL and PL are initialized by SCP. Figure 3 reports the NMI from the three methods based on 100 replications. It is seen that PPL outperforms PL both in terms of community detection accuracy (when $n < 1000$) and computational efficiency. The unsatisfactory performance of the PL method when $n < 1000$ is due to the errors from approximating binomial random variables with Poisson random variables; this approximation is not expected to work well when $p_1$ (or $p_2$) is large and when $n$ is small (Hodges and Le Cam, 1960). Also note that the PL method may perform worse than the initial labels, as its iterations do not enjoy the ascent property. It can also be seen that, as $n$ increases, the performance of PL improves notably.

Setting 3: In this simulation, we compare the performance of SCP, PL, and PPL on large-scale and sparse networks. We consider simulation settings similar to those in Amini et al. (2013). As in Decelle et al. (2011), the edge-probability matrix $P$ is controlled by two parameters: the "out-in-ratio" $\beta$, varying from 0 to 0.2, and the weight vector $\omega$, which determines the relative degrees within communities. We set $\omega = (1, 1, 1)$. When $\beta = 0$, $P^*$ is set to be the diagonal matrix $\mathrm{diag}(\omega)$; otherwise, we set the diagonal elements of $P^*$ to $\beta^{-1} \omega$ and set all the off-diagonal elements to 1. Then, the overall expected network degree is set to $\lambda$, which varies from 3 to 5. Finally, we re-scale $P^*$ to obtain this expected degree, giving the resulting $P$ as
$$P = \frac{\lambda}{(n-1)(\pi^T P^* \pi)}\, P^*, \tag{15}$$

which generates sparse networks, since $P_{kl} = O(1/n)$. In this simulation study, both PL and PPL are initialized by SCP. We let $K = 3$ and $\pi = (0.2, 0.3, 0.5)$. We consider three scenarios: 1) varying $\beta$ while setting $\lambda = 5$ and $n = 4000$; 2) varying $\lambda$ while setting $\beta = 0.05$ and $n = 4000$; and 3) varying $n$ while setting $\lambda = 5$ and $\beta = 0.05$. Figure 4 reports the NMI from the three methods and the computing time from PPL and PL, based on 100 replications.
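For concreteness, the construction of $P$ in this setting can be sketched as follows (function and variable names are ours); it builds $P^*$ from $\beta$ and $\omega$ and applies the rescaling in (15):

```python
import numpy as np

def edge_prob_matrix(beta, omega, lam, n, pi):
    """Construct P for Setting 3: P* = diag(omega) when beta = 0, otherwise
    diagonal beta^{-1} * omega with all off-diagonal entries 1; then rescale
    P* via (15) so that the overall expected degree equals lam."""
    omega, pi = np.asarray(omega, float), np.asarray(pi, float)
    K = len(omega)
    if beta == 0:
        P_star = np.diag(omega)
    else:
        P_star = np.ones((K, K))
        np.fill_diagonal(P_star, omega / beta)
    return lam * P_star / ((n - 1) * (pi @ P_star @ pi))

P = edge_prob_matrix(beta=0.05, omega=(1, 1, 1), lam=5, n=4000,
                     pi=(0.2, 0.3, 0.5))
```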


Figure 4: Comparisons of the NMI and computing time from SCP, PL and PPL under different settings. The three rows correspond to the following three scenarios, respectively: 1) varying β while setting λ = 5 and n = 4000; 2) varying λ while setting β = 0.05 and n = 4000; and 3) varying n while setting λ = 5 and β = 0.05.

We note that the reported running times for PPL and PL do not include the initialization step. For comparison, when $\lambda = 5$, $\beta = 0.05$ and $n = 10^6$, the SCP initialization step takes less than 100 seconds (see Section A6 in the supplemental material). It is seen that PPL outperforms both SCP and PL in terms of community detection accuracy. Moreover, PPL consistently outperforms PL in terms of computational efficiency.

5.2 Goodness of fit test and normality of plug-in estimators

To evaluate goodness of fit, we consider the maximum entry-wise deviation based testing procedure in Hu et al. (2020b).

Figure 5: Null densities of the test statistic with n = 600 (left panel) and n = 1200 (right panel). The blue dashed lines, red dash-dotted lines and black solid lines show the densities under SCP, PPL and the theoretical limit, respectively.

The authors showed that the distribution of the test statistic, denoted by $T_n$ and calculated with a strongly consistent community label, converges to a Gumbel distribution. In this simulation study, we consider an SBM with $K = 3$, $\pi = (0.2, 0.3, 0.5)$ and $P_{kl} = 0.12 + 0.08 \times I(k = l)$, and investigate the distribution of $T_n$ calculated using estimates from PPL and SCP, respectively. The results over 1000 replications are shown in Figure 5. It is seen that the sample null distribution of $T_n$ calculated with PPL is very close to the limiting distribution, while that calculated with SCP deviates from the limit considerably. This is because $T_n$ in Hu et al. (2020b) is based on the maximum entry-wise deviation and, as such, the misclassified nodes in SCP, albeit not many, may greatly inflate the test statistic. With the refinement of PPL, the test statistic has a sample null distribution close to the theoretical limit, ensuring a well-controlled test size.

To examine the normality of plug-in estimators, we consider an SBM with $K = 3$, $\pi = (0.2, 0.3, 0.5)$, $P_{kl} = 0.12 + 0.08 \times I(k = l)$ and $n = 800$. We consider the empirical distributions of $\hat\pi_1$, $\hat\pi_2$ and $\hat\pi_3$ calculated using labels produced by PPL and SCP, respectively. The results over 1000 replications are shown in Figure 6. It is seen that the empirical distributions calculated with PPL are very close to the limiting distributions, while those calculated with SCP deviate from the theoretical limits, especially for $\hat\pi_1$ and $\hat\pi_3$.


Figure 6: Empirical distributions of π̂1, π̂2 and π̂3. The blue dashed lines, red dash-dotted lines and black solid lines show the densities under SCP, PPL and the theoretical limit, respectively.


5.3 DCSBM

In this section, we evaluate the performance of the profile-pseudo likelihood method under the DCSBM, referred to as DC-PPL. We fix $K = 3$, $n = 1200$, $\pi = (0.2, 0.3, 0.5)$ and let $P = 10^{-2} \times [J_{K,K} + \mathrm{diag}(2, 3, 4)]$, where $J_{K,K}$ is a $K \times K$ matrix with every element equal to one. The degree parameters $\{\theta_i\}_{i=1}^n$ are generated as in Zhao et al. (2012), i.e.,
$$P(\theta_i = mx) = P(\theta_i = x) = 1/2, \quad \text{with } x = \frac{2}{m+1},$$
which ensures that $E(\theta_i) = 1$. We consider $m = 2, 4, 6$. Given $c$ and $\theta$, the edge variables $A_{ij}$ are independently generated from a Bernoulli distribution with parameters $\theta_i \theta_j P_{c_i c_j}$, $1 \le i \le j \le n$.
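The two-point degree distribution above can be sampled in a couple of lines (a sketch; the function name is ours):

```python
import numpy as np

def sample_theta(n, m, seed=None):
    """Two-point degree parameters: theta_i = x or m*x with probability
    1/2 each, where x = 2/(m+1), so that E(theta_i) = (x + m*x)/2 = 1."""
    rng = np.random.default_rng(seed)
    x = 2.0 / (m + 1)
    return rng.choice([x, m * x], size=n)
```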

We compare DC-PPL with SCP, as well as with CPL, an extension of PL proposed for networks with degree heterogeneity in Amini et al. (2013). The results, based on 100 replications, are summarized in Figure 7. Both DC-PPL and CPL outperform SCP, and DC-PPL performs better than CPL in terms of community detection accuracy.

Figure 7: Comparison of SCP, CPL, DC-PPL under DCSBM with varying m.

6 Real-world Data Examples


6.1 Political blogs data

In this subsection, we apply our proposed method to the network of political blogs collected by Adamic and Glance (2005). The nodes in this network are blogs on US politics and the edges are hyperlinks between these blogs, with directions removed. This data set was collected right after the 2004 presidential election and demonstrates strong divisions. In Adamic and Glance (2005), all the blogs were manually labeled as liberal or conservative, and we take these labels as the ground truth. As in Zhao et al. (2012), we focus on the largest connected component of the original network, which contains 1,222 nodes and 16,714 edges and has an average degree of approximately 27.

To perform community detection, we consider five different methods, namely, PL, PPL, SCP, CPL, and DC-PPL. We compute the NMI between the estimated community labels and the so-called ground truth labels. Figure 8 shows the community detection results from the five methods. It is seen that PPL and PL divide the nodes into two communities consisting of low degree and high degree nodes, respectively. Both the PPL and PL estimates have NMI close to zero, as neither of these two methods takes the degree heterogeneity into consideration. The partition obtained using SCP has NMI = 0.653, while that from CPL has NMI = 0.722 and that from DC-PPL has NMI = 0.727. Both CPL and DC-PPL achieve good performance in this application.

Figure 8: Community detection on the political blogs data: panels (a) True, (b) PL, (c) PPL, (d) SCP, (e) CPL, and (f) DC-PPL. The size of each node is proportional to its degree, and the color corresponds to the community label.

6.2 International trade data

In this subsection, we apply our proposed method to the network of international trade. The data contain yearly international trade among $n = 58$ countries from 1981 to 2000 (Westveld and Hoff, 2011). Each node in the network corresponds to a country, and an edge $(i, j)$ measures the amount of exports from country $i$ to country $j$ in a given year; see Westveld and Hoff (2011) for details. Following Saldana et al. (2017), we focus on the international trade network in 1995 and transform the directed, weighted adjacency matrix into an undirected binary network. Specifically, let $W_{ij} = \mathrm{Trade}_{ij} + \mathrm{Trade}_{ji}$, where $\mathrm{Trade}_{ij}$ records the amount of exports from country $i$ to country $j$, and set $A_{ij} = 1$ if $W_{ij} \ge W_{0.5}$ and $A_{ij} = 0$ otherwise, where $W_{0.5}$ denotes the 50th percentile of $\{W_{ij}\}_{1 \le i < j \le n}$.
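A sketch of this preprocessing step, assuming a dense matrix `trade` of export amounts (names are ours):

```python
import numpy as np

def binarize_trade(trade, q=50):
    """Symmetrize a weighted trade matrix and threshold at the q-th
    percentile of the upper-triangular weights: W = Trade + Trade^T,
    A_ij = 1 iff W_ij >= W_{0.5} (with q = 50), zero diagonal."""
    W = trade + trade.T
    iu = np.triu_indices(W.shape[0], k=1)
    cutoff = np.percentile(W[iu], q)
    A = (W >= cutoff).astype(int)
    np.fill_diagonal(A, 0)
    return A
```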
Group | Countries
1 | Algeria, Barbados, Bolivia, Costa Rica, Cyprus, Ecuador, El Salvador, Guatemala, Honduras, Iceland, Jamaica, Mauritius, Nepal, Oman, Panama, Paraguay, Peru, Trinidad and Tobago, Tunisia, Uruguay, Venezuela
2 | Belgium, Brazil, Canada, France, Germany, Italy, Japan, South Korea, Mexico, Netherlands, Spain, Switzerland, United Kingdom, United States
3 | Argentina, Australia, Austria, Chile, Colombia, Denmark, Egypt, Finland, Greece, India, Indonesia, Ireland, Israel, Malaysia, Morocco, New Zealand, Norway, Philippines, Portugal, Singapore, Sweden, Thailand, Turkey

Table 1: Community detection result on the international trade data using PPL with K = 3.

Using different model selection procedures, both Saldana et al. (2017) and Hu et al. (2020a) selected the number of SBM communities for this data set to be $K = 3$. Saldana et al. (2017) suggested that larger community numbers such as $K = 7$ are also reasonable and tend to provide finer solutions. We apply PPL to this network with $K = 3$, and the community detection result is summarized in Table 1. It is seen that the three communities mostly correspond to developing countries in South America with low GDPs, countries with high GDPs, and industrialized European and Asian countries with medium-level GDPs, respectively.

To evaluate goodness of fit, we consider the maximum entry-wise deviation based testing procedure (Hu et al., 2020b) that we investigated in Section 5.2. The community labels identified using SCP under $K = 3$ give a test statistic value of 52.13 with a p-value less than $10^{-10}$, suggesting a lack of fit. On the other hand, the community labels identified by PPL, initialized using SCP under $K = 3$, give a test statistic of 4.59 with a p-value of 0.03. Therefore, the goodness of fit test for PPL under $K = 3$ is not rejected at the significance level of 0.01. It is also worth noting that with $K = 4$, PPL gives a test statistic of 2.38 with a p-value of 0.08, while SCP gives a p-value less than $10^{-3}$. This data example shows that refinement of the initial clustering solution can be useful in inferential tasks such as the goodness of fit test.

7 Discussion

In this paper, we propose a new profile-pseudo likelihood method for fitting SBMs to large networks. Specifically, we consider a novel approach that decouples the membership labels of the rows and columns in the likelihood function and treats the row labels as a vector of latent variables. Correspondingly, the likelihood can be maximized in an alternating fashion over the block model parameters and over the column community labels. Our proposed method retains and improves on the computational efficiency of the pseudo likelihood method, performs well for both small and large scale networks, and has a provable convergence guarantee. We show that the community labels (i.e., column labels) estimated from our proposed method enjoy strong consistency, as long as the initial labels have an overlap with the truth beyond that of random guessing.

In our approach, we consider spectral clustering as the initialization method, which requires computing the $K$ leading eigenvectors. In real-world applications, many implementations of eigen-decomposition are scalable, such as the PageRank algorithm adopted in Google search (Page et al., 1999). We also note that our method need not limit the initialization algorithm to spectral clustering. For large-scale networks, one may consider the FastGreedy method of Clauset et al. (2004), which has a complexity of $O(n \log^2 n)$, or the Louvain algorithm of Blondel et al. (2008), which has a complexity of $O(n \log n)$ (Yang et al., 2016). These fast algorithms, to the best of our knowledge, may not have theoretical guarantees on their performance. However, they have been validated empirically across various fields (Yang et al., 2016) and can be considered as initialization methods when spectral clustering is not feasible.

Although we focus on SBMs and DCSBMs in this work, we envision that the idea of simplifying the block model likelihoods by decoupling the membership labels of rows and columns can be applied to other network block model problems, such as mixed membership SBMs (Airoldi et al., 2008), block models with additional node features (Zhang et al., 2016) and SBMs with dependent edges (Yuan and Qu, 2018). We plan to investigate these directions in future work.

The code is publicly available on GitHub (https://github.com/WangJiangzhou/Fast-Network-Community-Detection-with-Profile-Pseudo-Likelihood-Methods).

Acknowledgment

Wang, Liu and Guo's research is supported by NSFC grants 11690012 and 11571068, the Fundamental Research Funds for the Central Universities grant 2412017BJ002, the Key Laboratory of Applied Statistics of MOE (KLAS) grants 130026507 and 130028612, and the Special Fund for Key Laboratories of Jilin Province, China, grant 20190201285JC. Zhang's research is supported by NSF DMS-2015190, and Zhu's research is supported by NSF DMS-1821243.

References
Abbe, E. (2017), “Community detection and stochastic block models: recent developments,”
The Journal of Machine Learning Research, 18, 6446–6531.

Abbe, E., Bandeira, A. S. and Hall, G. (2015), “Exact recovery in the stochastic block
model,” IEEE Transactions on Information Theory, 62, 471–487.

Abbe, E., Fan, J., Wang, K., Zhong, Y., et al. (2020), “Entrywise eigenvector analysis of
random matrices with low expected rank,” Annals of Statistics, 48, 1452–1474.

Adamic, L. A. and Glance, N. (2005), "The political blogosphere and the 2004 U.S. election: divided they blog," in International Workshop on Link Discovery, pp. 36–43.

Airoldi, E. M., Blei, D. M., Fienberg, S. E., and Xing, E. P. (2008), “Mixed membership
stochastic block models,” Journal of Machine Learning Research, 9, 1981–2014.

Amini, A. A., Chen, A., Bickel, P. J., and Levina, E. (2013), “Pseudo-likelihood methods for
community detection in large sparse networks,” The Annals of Statistics, 41, 2097–2122.

Bickel, P., Choi, D., Chang, X., and Zhang, H. (2013), “Asymptotic normality of maximum
likelihood and its variational approximation for stochastic block models,” The Annals of
Statistics, 1922–1943.

Bickel, P. J. and Chen, A. (2009), “A nonparametric view of network models and Newman–
Girvan and other modularities,” Proceedings of the National Academy of Sciences, 106,
21068–21073.

Bisson, G. and Hussain, F. (2008), “Chi-sim: A new similarity measure for the co-clustering
task,” in Machine Learning and Applications, 2008. ICMLA’08. Seventh International
Conference on, IEEE, pp. 211–217.

Blondel, V. D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008), “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, 2008, P10008.

Clauset, A., Newman, M. E. and Moore, C. (2004), “Finding community structure in very large networks,” Physical Review E, 70, 066111.

Daudin, J.-J., Picard, F. and Robin, S. (2008), “A mixture model for random graphs,”
Statistics and Computing, 18, 173–183.

Decelle, A., Krzakala, F., Moore, C., and Zdeborová, L. (2011), “Asymptotic analysis of the
stochastic block model for modular networks and its algorithmic applications,” Physical
Review E, 84, 066106.

Fortunato, S. (2010), “Community detection in graphs,” Physics Reports, 486, 75–174.

Fortunato, S. and Hric, D. (2016), “Community detection in networks: A user guide,” Physics
Reports, 659, 1–44.

Gao, C., Ma, Z., Zhang, A. Y., and Zhou, H. H. (2017), “Achieving Optimal Misclassification
Proportion in Stochastic Block Models,” Journal of Machine Learning Research, 18, 1–45.

Gao, C., Ma, Z., Zhang, A. Y., and Zhou, H. H. (2018), “Community detection in degree-
corrected block models,” Annals of Statistics, 46, 2153–2185.

Hodges, J. L. and Le Cam, L. (1960), “The Poisson approximation to the Poisson binomial
distribution,” The Annals of Mathematical Statistics, 31, 737–740.

Holland, P. W., Laskey, K. B. and Leinhardt, S. (1983), “Stochastic block models: First
steps,” Social Networks, 5, 109–137.

Hu, J., Qin, H., Yan, T., and Zhao, Y. (2020a), “Corrected Bayesian information criterion for
stochastic block models,” Journal of the American Statistical Association, 115, 1771–1783.

Hu, J., Zhang, J., Qin, H., Yan, T., and Zhu, J. (2020b), “Using Maximum Entry-Wise De-
viation to Test the Goodness of Fit for Stochastic Block Models,” Journal of the American
Statistical Association, 1–10.

Joseph, A. and Yu, B. (2016), “Impact of regularization on spectral clustering,” Annals of Statistics, 44, 1765–1791.

Karrer, B. and Newman, M. E. (2011), “Stochastic block models and community structure
in networks,” Physical Review E, 83, 016107.

Larremore, D. B., Clauset, A. and Jacobs, A. Z. (2014), “Efficiently inferring community structure in bipartite networks,” Physical Review E, 90, 012805.

Lei, J. (2016), “A goodness-of-fit test for stochastic block models,” The Annals of Statistics,
44, 401–424.

Lei, J. and Rinaldo, A. (2015), “Consistency of spectral clustering in stochastic block mod-
els,” The Annals of Statistics, 43, 215–237.

Lei, J. and Zhu, L. (2017), “Generic Sample Splitting for Refined Community Recovery in
Degree Corrected Stochastic Block Models,” Statistica Sinica, 1639–1659.

Li, T., Levina, E. and Zhu, J. (2020), “Network cross-validation by edge sampling,”
Biometrika, 107, 257–276.

Madeira, S. C., Teixeira, M. C., Sa-Correia, I., and Oliveira, A. L. (2010), “Identification
of regulatory modules in time series gene expression data using a linear time bicluster-
ing algorithm,” IEEE/ACM Transactions on Computational Biology and Bioinformatics
(TCBB), 7, 153–165.

Meng, X.-L. and Rubin, D. B. (1993), “Maximum likelihood estimation via the ECM algo-
rithm: A general framework,” Biometrika, 80, 267–278.

Moody, J. and White, D. R. (2003), “Structural cohesion and embeddedness: A hierarchical concept of social groups,” American Sociological Review, 103–127.

Nowicki, K. and Snijders, T. A. B. (2001), “Estimation and prediction for stochastic block-
structures,” Journal of the American Statistical Association, 96, 1077–1087.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1999), “The PageRank citation ranking:
Bringing order to the web.” Tech. rep., Stanford InfoLab.

Rohe, K., Chatterjee, S. and Yu, B. (2011), “Spectral clustering and the high-dimensional
stochastic blockmodel,” The Annals of Statistics, 39, 1878–1915.

Rohe, K., Qin, T. and Yu, B. (2012), “Co-clustering for directed graphs: the Stochastic
co-Blockmodel and spectral algorithm Di-Sim,” arXiv preprint arXiv:1204.2296.

Saldana, D. F., Yu, Y. and Feng, Y. (2017), “How many communities are there?” Journal
of Computational and Graphical Statistics, 26, 171–181.

Sarkar, S. and Dong, A. (2011), “Community detection in graphs using singular value decom-
position,” Physical Review E Statistical Nonlinear and Soft Matter Physics, 83, 046114.

Snijders, T. A. and Nowicki, K. (1997), “Estimation and prediction for stochastic block
models for graphs with latent block structure,” Journal of Classification, 14, 75–100.

Spirin, V. and Mirny, L. A. (2003), “Protein complexes and functional modules in molecular
networks,” Proceedings of the National Academy of Sciences, 100, 12123–12128.

Su, L., Wang, W. and Zhang, Y. (2019), “Strong consistency of spectral clustering for
stochastic block models,” IEEE Transactions on Information Theory, 66, 324–338.

Westveld, A. H. and Hoff, P. D. (2011), “A mixed effects model for longitudinal relational
and network data, with applications to international trade and conflict,” The Annals of
Applied Statistics, 5, 843–872.

Wu, C. J. (1983), “On the convergence properties of the EM algorithm,” The Annals of Statistics, 11, 95–103.

Yang, Z., Algesheimer, R. and Tessone, C. J. (2016), “A comparative analysis of community detection algorithms on artificial networks,” Scientific Reports, 6, 1–18.

Yuan, Y. and Qu, A. (2018), “Community Detection with Dependent Connectivity,” arXiv
preprint arXiv:1812.06406.

Zhang, J. and Chen, Y. (2018), “Modularity based community detection in heterogeneous networks,” arXiv preprint arXiv:1803.07961.

Zhang, Y., Levina, E. and Zhu, J. (2016), “Community detection in networks with node
features,” Electronic Journal of Statistics, 10, 3153–3178.

Zhao, Y. (2017), “A survey on theoretical advances of community detection in networks,” Wiley Interdisciplinary Reviews: Computational Statistics, 9, e1403.

Zhao, Y., Levina, E. and Zhu, J. (2012), “Consistency of community detection in networks
under degree-corrected stochastic block models,” The Annals of Statistics, 40, 2266–2292.

Supplementary Materials
Fast Network Community Detection with Profile-Pseudo
Likelihood Methods

Jiangzhou Wang, Jingfei Zhang, Binghui Liu, Ji Zhu, and Jianhua Guo

A1 Proof of Theorem 1

To prove Theorem 1, it suffices to show

\[
\mathcal{L}_{\mathrm{PL}}\big(\Omega^{(s)}, e^{(s)}; \{a_i\}\big) \le \mathcal{L}_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s)}; \{a_i\}\big), \tag{S1}
\]
\[
\mathcal{L}_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s)}; \{a_i\}\big) \le \mathcal{L}_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s+1)}; \{a_i\}\big). \tag{S2}
\]

Consider (S1). The updating procedure from {Ω^(s), e^(s)} to {Ω^(s+1), e^(s)} can be seen as fitting a mixture model; thus, inequality (S1) holds by the ascent property of the EM algorithm (Wu, 1983).

Consider (S2). It is equivalent to

\[
\ell_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s)}; \{a_i\}\big) \le \ell_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s+1)}; \{a_i\}\big). \tag{S3}
\]
We have, writing τ_{il}^{(s+1)} for the E-step posterior probabilities computed at {Ω^{(s+1)}, e^{(s)}},
\[
\begin{aligned}
&\ell_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s+1)}; \{a_i\}\big) - \ell_{\mathrm{PL}}\big(\Omega^{(s+1)}, e^{(s)}; \{a_i\}\big)\\
&= \sum_{i=1}^{n}\log\Bigg[\sum_{l=1}^{K}\pi_l^{(s+1)}\prod_{j=1}^{n}\Big\{P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{1-A_{ij}}\Bigg]
-\sum_{i=1}^{n}\log\Bigg[\sum_{l=1}^{K}\pi_l^{(s+1)}\prod_{j=1}^{n}\Big\{P^{(s+1)}_{l e_j^{(s)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s)}}\Big\}^{1-A_{ij}}\Bigg]\\
&= \sum_{i=1}^{n}\log\Bigg[\sum_{l=1}^{K}\tau_{il}^{(s+1)}\,
\frac{\prod_{j=1}^{n}\Big\{P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{1-A_{ij}}}{\prod_{j=1}^{n}\Big\{P^{(s+1)}_{l e_j^{(s)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s)}}\Big\}^{1-A_{ij}}}\Bigg]\\
&\ge \sum_{i=1}^{n}\sum_{l=1}^{K}\tau_{il}^{(s+1)}\log
\frac{\prod_{j=1}^{n}\Big\{P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{1-A_{ij}}}{\prod_{j=1}^{n}\Big\{P^{(s+1)}_{l e_j^{(s)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s)}}\Big\}^{1-A_{ij}}}\\
&= \sum_{j=1}^{n}\Bigg[\sum_{i=1}^{n}\sum_{l=1}^{K}\tau_{il}^{(s+1)}\log\Big[\Big\{P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s+1)}}\Big\}^{1-A_{ij}}\Big]
-\sum_{i=1}^{n}\sum_{l=1}^{K}\tau_{il}^{(s+1)}\log\Big[\Big\{P^{(s+1)}_{l e_j^{(s)}}\Big\}^{A_{ij}}\Big\{1-P^{(s+1)}_{l e_j^{(s)}}\Big\}^{1-A_{ij}}\Big]\Bigg]\\
&\ge 0,
\end{aligned}
\]
where the first inequality is due to Jensen’s inequality, and the second inequality is due to

the update strategy for e(s) in Algorithm 1. The proof is completed.

A2 Proof of Theorem 2

We focus on the case of γ ∈ (1/2, 1) and a > b. For the remaining three cases of (i) γ ∈ (1/2, 1), a < b, (ii) γ ∈ (0, 1/2), a > b, and (iii) γ ∈ (0, 1/2), a < b, the proofs are similar.
For any (â, b̂) ∈ P^δ_{a,b}, we have â > b̂. The PPL estimate can be written as follows:
\[
\hat{c}_j\{e^{(0)}\} = \arg\max_{k\in\{1,2\}}\sum_{i=1}^{n}\sum_{l=1}^{2}\log\Big[\big(\hat{P}_{lk}\big)^{\tilde{A}_{ij}}\big(1-\hat{P}_{lk}\big)^{1-\tilde{A}_{ij}}\Big]\,\hat{\tau}_{il}\{e^{(0)}\}.
\]

Consider j ∈ {1, 2, . . . , m}. Then ĉ_j{e^{(0)}} = 1 if
\[
\sum_{i=1}^{n}\sum_{l=1}^{2}\log\Big[\big(\hat{P}_{l1}\big)^{\tilde{A}_{ij}}\big(1-\hat{P}_{l1}\big)^{1-\tilde{A}_{ij}}\Big]\,\hat{\tau}_{il}\{e^{(0)}\} > \sum_{i=1}^{n}\sum_{l=1}^{2}\log\Big[\big(\hat{P}_{l2}\big)^{\tilde{A}_{ij}}\big(1-\hat{P}_{l2}\big)^{1-\tilde{A}_{ij}}\Big]\,\hat{\tau}_{il}\{e^{(0)}\},
\]

which is equivalent to
\[
\sum_{l=1}^{2}\Bigg\{\sum_{i=1}^{n}\tilde{A}_{ij}\,\hat{\tau}_{il}\{e^{(0)}\}\log\hat{P}_{l1}+\sum_{i=1}^{n}\big(1-\tilde{A}_{ij}\big)\hat{\tau}_{il}\{e^{(0)}\}\log\big(1-\hat{P}_{l1}\big)\Bigg\} >
\sum_{l=1}^{2}\Bigg\{\sum_{i=1}^{n}\tilde{A}_{ij}\,\hat{\tau}_{il}\{e^{(0)}\}\log\hat{P}_{l2}+\sum_{i=1}^{n}\big(1-\tilde{A}_{ij}\big)\hat{\tau}_{il}\{e^{(0)}\}\log\big(1-\hat{P}_{l2}\big)\Bigg\}. \tag{S4}
\]
We let B̃^0_{lj} ≜ Σ_{i=1}^{n} Ã_{ij} τ̂_{il}{e^{(0)}} and n^0_l ≜ Σ_{i=1}^{n} τ̂_{il}{e^{(0)}} for all j = 1, 2, . . . , n and l = 1, 2, and recall
\[
\hat{P} = \begin{pmatrix}\hat{P}_{11} & \hat{P}_{12}\\ \hat{P}_{21} & \hat{P}_{22}\end{pmatrix} = \frac{1}{m}\begin{pmatrix}\hat{a} & \hat{b}\\ \hat{b} & \hat{a}\end{pmatrix}. \tag{S5}
\]

By simplifying (S4), we can restate that ĉ_j{e^{(0)}} = 1 if
\[
\big(\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\big)\log\frac{\hat{P}_{11}}{\hat{P}_{12}}+\Big\{\tilde{B}^0_{1j}-\tilde{B}^0_{2j}-(n^0_1-n^0_2)\Big\}\log\frac{1-\hat{P}_{12}}{1-\hat{P}_{11}} > 0. \tag{S6}
\]

Since â > b̂, we have P̂_{11} > P̂_{12}. Thus by (S6), we have
\[
\begin{aligned}
P\big[\hat{c}_j\{e^{(0)}\}\neq 1\big]
&\le P\Big[\big\{\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le 0\big\}\cup\big\{\tilde{B}^0_{1j}-\tilde{B}^0_{2j}-(n^0_1-n^0_2)\le 0\big\}\Big]\\
&\le P\Big[\big\{\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le\epsilon\big\}\cup\big\{|n^0_1-n^0_2|\ge\epsilon\big\}\Big]\\
&\le P\big[\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le\epsilon\big]+P\big[|n^0_1-n^0_2|\ge\epsilon\big]. \tag{S7}
\end{aligned}
\]
Next, we upper bound the two terms P[B̃^0_{1j} − B̃^0_{2j} ≤ ε] and P[|n^0_1 − n^0_2| ≥ ε] separately.

Firstly, we have
\[
\begin{aligned}
P\big[\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le\epsilon\big]
&= P\Bigg[\sum_{i=1}^{n}\tilde{A}_{ij}\hat{\tau}_{i1}\{e^{(0)}\}-\sum_{i=1}^{n}\tilde{A}_{ij}\hat{\tau}_{i2}\{e^{(0)}\}\le\epsilon\Bigg]\\
&\le P\Bigg[\sum_{i=1}^{n}\tilde{A}_{ij}I(c_i=1)-\sum_{i=1}^{n}\tilde{A}_{ij}I(c_i=2)+2\sum_{i=1}^{m}\tilde{A}_{ij}\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\le\epsilon\Bigg]\\
&\le P\Bigg[\sum_{i=1}^{n}\tilde{A}_{ij}I(c_i=1)-\sum_{i=1}^{n}\tilde{A}_{ij}I(c_i=2)\le 2\epsilon\Bigg]+\sum_{i=1}^{m}P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big]. \tag{S8}
\end{aligned}
\]
Next, we have
\[
\begin{aligned}
P\big[|n^0_1-n^0_2|\ge\epsilon\big]
&\le P\Big[\Big\{n^0_1\le\frac{n}{2}-\frac{\epsilon}{2}\Big\}\cup\Big\{n^0_2\le\frac{n}{2}-\frac{\epsilon}{2}\Big\}\Big]
\le P\Big[n^0_1\le\frac{n}{2}-\frac{\epsilon}{2}\Big]+P\Big[n^0_2\le\frac{n}{2}-\frac{\epsilon}{2}\Big]\\
&\le \sum_{i=1}^{m}P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big]+\sum_{i=m+1}^{n}P\Big[\big|\hat{\tau}_{i2}\{e^{(0)}\}-I(c_i=2)\big|\ge\frac{\epsilon}{n}\Big]. \tag{S9}
\end{aligned}
\]
Similar to Lemma 1 in Amini et al. (2013), we can upper bound the term P[Σ_{i=1}^{n} Ã_{ij} I(c_i=1) − Σ_{i=1}^{n} Ã_{ij} I(c_i=2) ≤ 2ε] as follows. Let
\[
\tilde{\eta}_j\big(\sigma(c)\big) = \sum_{i=1}^{n}\tilde{A}_{ij}I(c_i=1)-\sum_{i=1}^{n}\tilde{A}_{ij}I(c_i=2) \triangleq \sum_{i=1}^{n}\tilde{A}_{ij}\sigma_i(c),
\]
where σ_i(c) = 1 if c_i = 1 and σ_i(c) = −1 if c_i = 2, and σ(c) = (σ_1(c), σ_2(c), . . . , σ_n(c)). Let α̃_{ij} = E[Ã_{ij}]. Since |Ã_{ij}σ_i(c) − E[Ã_{ij}σ_i(c)]| ≤ max{α̃_{ij}, 1−α̃_{ij}} ≤ 1, we have, for j = 1, 2, . . . , m,
\[
E\big[\tilde{\eta}_j(\sigma(c))\big] = m\cdot\frac{a}{m}-m\cdot\frac{b}{m} = a-b,
\]
\[
\upsilon = \mathrm{Var}\big(-\tilde{\eta}_j(\sigma(c))\big) = \sum_{i=1}^{n}\mathrm{Var}\big(\tilde{A}_{ij}\big) \le \sum_{i=1}^{n}E\big[\tilde{A}_{ij}^2\big] = \sum_{i=1}^{n}E\big[\tilde{A}_{ij}\big] = a+b.
\]

Then by applying the Bernstein inequality to −η̃_j(σ(c)), we have
\[
P\Big[\tilde{\eta}_j\big(\sigma(c)\big)\le E\big[\tilde{\eta}_j(\sigma(c))\big]-t\Big] = P\Big[-\tilde{\eta}_j\big(\sigma(c)\big)\ge -E\big[\tilde{\eta}_j(\sigma(c))\big]+t\Big] \le e^{-\frac{t^2}{2(\upsilon+t/3)}}, \quad \forall t\ge 0. \tag{S10}
\]
Note that for t ∈ [0, 3(a+b)], we have 2(υ + t/3) ≤ 4(a+b). It follows from (S10) that
\[
P\Big[\tilde{\eta}_j\big(\sigma(c)\big)\le (a-b)-t\Big] \le e^{-\frac{t^2}{4(a+b)}}, \quad \forall t\in[0,3(a+b)]. \tag{S11}
\]
In order to bound P[η̃_j(σ(c)) ≤ 2ε], we take t = (a−b) − 2ε. Then t ∈ [0, 3(a+b)] when n is large enough, as (a−b)²/(a+b) ≥ C log n for a sufficiently large C. Thus we have
\[
P\big[\tilde{\eta}_j(\sigma(c))\le 2\epsilon\big] \le e^{-\frac{\{(a-b)-2\epsilon\}^2}{4(a+b)}} = e^{-\frac{(a-b)^2-4\epsilon(a-b)+4\epsilon^2}{4(a+b)}}. \tag{S12}
\]
To obtain upper bounds of (S8) and (S9), we need to upper bound P[|τ̂_{i1}{e^{(0)}} − I(c_i=1)| ≥ ε/n] for all i ∈ {1, 2, . . . , m} and P[|τ̂_{i2}{e^{(0)}} − I(c_i=2)| ≥ ε/n] for all i ∈ {m+1, m+2, . . . , n}. Firstly, we consider the case of i ∈ {1, 2, . . . , m}. With (â, b̂) ∈ P^δ_{a,b} and (S5), we have
\[
\frac{\hat{P}_{11}/(1-\hat{P}_{11})}{\hat{P}_{12}/(1-\hat{P}_{12})} = \frac{\hat{a}/(m-\hat{a})}{\hat{b}/(m-\hat{b})} \ge \frac{\hat{a}/(m-\hat{b})}{\hat{b}/(m-\hat{b})} = \frac{\hat{a}}{\hat{b}} \ge \delta.
\]
Let B̃_{ik} = Σ_{j=1}^{n} Ã_{ij} I(e_j=k) and n_k = Σ_{i=1}^{n} I(e_i=k); we then have
\[
\begin{aligned}
P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big]
&= P\Bigg[\frac{\hat{\tau}_{i1}\{e^{(0)}\}}{\hat{\tau}_{i2}\{e^{(0)}\}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]\\
&= P\Bigg[\frac{(\hat{P}_{11})^{\tilde{B}_{i1}}(\hat{P}_{12})^{\tilde{B}_{i2}}(1-\hat{P}_{11})^{n_1-\tilde{B}_{i1}}(1-\hat{P}_{12})^{n_2-\tilde{B}_{i2}}}{(\hat{P}_{21})^{\tilde{B}_{i1}}(\hat{P}_{22})^{\tilde{B}_{i2}}(1-\hat{P}_{21})^{n_1-\tilde{B}_{i1}}(1-\hat{P}_{22})^{n_2-\tilde{B}_{i2}}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]\\
&= P\Bigg[\Bigg\{\frac{\hat{P}_{11}/(1-\hat{P}_{11})}{\hat{P}_{12}/(1-\hat{P}_{12})}\Bigg\}^{\tilde{B}_{i1}-\tilde{B}_{i2}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]\\
&\le P\Bigg[\delta^{\tilde{B}_{i1}-\tilde{B}_{i2}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]+P\big[\tilde{B}_{i1}-\tilde{B}_{i2}<0\big]\\
&\le 2P\Bigg[\tilde{B}_{i1}-\tilde{B}_{i2}\le\frac{1}{\log\delta}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg]. \tag{S13}
\end{aligned}
\]
Let
\[
\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big) = \tilde{B}_{i1}-\tilde{B}_{i2} = \sum_{j=1}^{n}\tilde{A}_{ij}\sigma_j\{e^{(0)}\},
\]
where σ_j{e^{(0)}} = 1 if e_j = 1 and σ_j{e^{(0)}} = −1 if e_j = 2, and σ{e^{(0)}} = (σ_1{e^{(0)}}, σ_2{e^{(0)}}, . . . , σ_n{e^{(0)}}). Note that |Ã_{ij}σ_j{e^{(0)}} − E[Ã_{ij}σ_j{e^{(0)}}]| ≤ max{α̃_{ij}, 1−α̃_{ij}} ≤ 1. For i ∈ {1, 2, . . . , m}, we have
\[
E\Big[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\Big] = \Big\{\gamma m\cdot\frac{a}{m}+(1-\gamma)m\cdot\frac{b}{m}\Big\}-\Big\{(1-\gamma)m\cdot\frac{a}{m}+\gamma m\cdot\frac{b}{m}\Big\} = (2\gamma-1)(a-b),
\]
\[
\upsilon = \mathrm{Var}\Big(-\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\Big) = \sum_{j=1}^{n}\mathrm{Var}\big(\tilde{A}_{ij}\big) \le \sum_{j=1}^{n}E\big[\tilde{A}_{ij}^2\big] = \sum_{j=1}^{n}E\big[\tilde{A}_{ij}\big] = a+b.
\]

Then by applying the Bernstein inequality to −ξ̃_i(σ{e^{(0)}}), we have
\[
P\Big[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\le E\big[\tilde{\xi}_i(\sigma\{e^{(0)}\})\big]-t\Big] \le e^{-\frac{t^2}{2(\upsilon+t/3)}}, \quad \forall t\ge 0. \tag{S14}
\]
Note that for t ∈ [0, 3(a+b)], we have 2(υ + t/3) ≤ 4(a+b). It follows from (S14) that
\[
P\Big[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\le (2\gamma-1)(a-b)-t\Big] \le e^{-\frac{t^2}{4(a+b)}}, \quad \forall t\in[0,3(a+b)]. \tag{S15}
\]
In order to bound P[ξ̃_i(σ{e^{(0)}}) ≤ (1/log δ) log{(1−ε/n)/(ε/n)}], we take t = (2γ−1)(a−b) − (1/log δ) log{(1−ε/n)/(ε/n)}. Then t ∈ [0, 3(a+b)] when n is large enough. Thus, we have
\[
\begin{aligned}
P\Bigg[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\le\frac{1}{\log\delta}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg]
&\le \exp\Bigg[-\frac{\Big\{(2\gamma-1)(a-b)-\frac{1}{\log\delta}\log\big(\frac{1-\epsilon/n}{\epsilon/n}\big)\Big\}^2}{4(a+b)}\Bigg]\\
&\le e^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}} \quad \text{(when } n \text{ is large enough)}.
\end{aligned}
\]
It follows from (S13) that (when n is large enough)
\[
P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big] \le 2e^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}, \quad \forall i\in\{1,2,\dots,m\}. \tag{S16}
\]
Similar results for P[|τ̂_{i2}{e^{(0)}} − I(c_i=2)| ≥ ε/n] can be obtained by using similar arguments. Specifically, we have
\[
P\Big[\big|\hat{\tau}_{i2}\{e^{(0)}\}-I(c_i=2)\big|\ge\frac{\epsilon}{n}\Big] \le 2e^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}, \quad \forall i\in\{m+1,m+2,\dots,n\}. \tag{S17}
\]
Thus by (S8), (S12), (S16) and (S17), for j = 1, 2, . . . , m, we have
\[
P\big[\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le\epsilon\big] \le e^{-\frac{(a-b)^2-4\epsilon(a-b)+4\epsilon^2}{4(a+b)}}+ne^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}. \tag{S18}
\]
For j = m+1, m+2, . . . , n, the term P[B̃^0_{2j} − B̃^0_{1j} ≤ ε] can be bounded as follows,
\[
P\big[\tilde{B}^0_{2j}-\tilde{B}^0_{1j}\le\epsilon\big] \le e^{-\frac{(a-b)^2-4\epsilon(a-b)+4\epsilon^2}{4(a+b)}}+ne^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}. \tag{S19}
\]
According to (S9), (S16), and (S17), we have
\[
P\big[|n^0_1-n^0_2|\ge\epsilon\big] \le 2ne^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}. \tag{S20}
\]

Finally, by (S18), (S19), and (S20), we have
\[
\begin{aligned}
P\big[\hat{c}\{e^{(0)}\}\neq c\big]
&= P\Bigg[\bigcup_{j\in\{1,2,\dots,n\}}\big\{\hat{c}_j\{e^{(0)}\}\neq c_j\big\}\Bigg]\\
&\le P\Bigg[\bigcup_{j\in\{1,\dots,m\}}\big\{\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le\epsilon\big\}\cup\bigcup_{j\in\{m+1,\dots,n\}}\big\{\tilde{B}^0_{2j}-\tilde{B}^0_{1j}\le\epsilon\big\}\cup\big\{|n^0_1-n^0_2|\ge\epsilon\big\}\Bigg]\\
&\le \sum_{j=1}^{m}P\big[\tilde{B}^0_{1j}-\tilde{B}^0_{2j}\le\epsilon\big]+\sum_{j=m+1}^{n}P\big[\tilde{B}^0_{2j}-\tilde{B}^0_{1j}\le\epsilon\big]+P\big[|n^0_1-n^0_2|\ge\epsilon\big]\\
&= ne^{-\frac{(a-b)^2-4\epsilon(a-b)+4\epsilon^2}{4(a+b)}}+n(n+2)e^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}.
\end{aligned}
\]
Therefore, we have that
\[
P\big[\hat{c}\{e^{(0)}\}=c\big] = 1-P\big[\hat{c}\{e^{(0)}\}\neq c\big] \ge 1-\Big\{ne^{-\frac{(a-b)^2-4\epsilon(a-b)+4\epsilon^2}{4(a+b)}}+n(n+2)e^{-\frac{(2\gamma-1)^2(a-b)^2}{8(a+b)}}\Big\}.
\]

A3 Proof of Theorem 3
Recall that A and Ã are the adjacency matrices of the undirected and directed networks, respectively. Similar to the technique in Amini et al. (2013), we introduce a deterministic coupling between A and Ã, which allows us to carry over the results from the directed SBM. Let
\[
A = T\big(\tilde{A}\big), \qquad \big[T\big(\tilde{A}\big)\big]_{ij} = \begin{cases} 0, & \tilde{A}_{ij}=\tilde{A}_{ji}=0,\\ 1, & \text{otherwise}. \end{cases} \tag{S21}
\]
That is, the graph of A is obtained from that of Ã by removing directions. Note that
\[
P_{kl} = P(A_{ij}=1) = 1-P\big(\tilde{A}_{ij}=0\big)P\big(\tilde{A}_{ji}=0\big) = 2\tilde{P}_{kl}-\big(\tilde{P}_{kl}\big)^2,
\]
which matches the relationship between (7) and (8). From (S21), it is not difficult to see that A_{ij} ≥ Ã_{ij} for all i, j ∈ {1, 2, . . . , n}.
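As a remark, this coupling is easy to check numerically; the following minimal sketch (Python, with illustrative parameter values) verifies both the dominance A_ij ≥ Ã_ij and the edge-probability identity above:

import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 0.01
A_tilde = rng.binomial(1, p, size=(n, n))   # directed network: independent Bernoulli edges
np.fill_diagonal(A_tilde, 0)

# T removes directions: A_ij = 0 only when both A~_ij = 0 and A~_ji = 0, as in (S21)
A = ((A_tilde + A_tilde.T) > 0).astype(int)

assert np.all(A >= A_tilde)                  # A_ij >= A~_ij for all i, j
iu = np.triu_indices(n, 1)
print(A[iu].mean(), 2 * p - p ** 2)          # empirical edge rate vs 2*P~ - P~^2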

We focus on the case of γ ∈ (1/2, 1) and a > b. For the remaining three cases of (i) γ ∈ (1/2, 1), a < b, (ii) γ ∈ (0, 1/2), a > b, and (iii) γ ∈ (0, 1/2), a < b, the proofs are similar. For any (â, b̂) ∈ P^δ_{a,b}, we have â > b̂. The PPL estimate can be written as
\[
\hat{c}_j\{e^{(0)}\} = \arg\max_{k\in\{1,2\}}\sum_{i=1}^{n}\sum_{l=1}^{2}\log\Big[\big(\hat{P}_{lk}\big)^{A_{ij}}\big(1-\hat{P}_{lk}\big)^{1-A_{ij}}\Big]\,\hat{\tau}_{il}\{e^{(0)}\}. \tag{S22}
\]

We first consider j ∈ {1, 2, . . . , m}. Then ĉ_j{e^{(0)}} = 1 if
\[
\sum_{i=1}^{n}\sum_{l=1}^{2}\log\Big[\big(\hat{P}_{l1}\big)^{A_{ij}}\big(1-\hat{P}_{l1}\big)^{1-A_{ij}}\Big]\,\hat{\tau}_{il}\{e^{(0)}\} > \sum_{i=1}^{n}\sum_{l=1}^{2}\log\Big[\big(\hat{P}_{l2}\big)^{A_{ij}}\big(1-\hat{P}_{l2}\big)^{1-A_{ij}}\Big]\,\hat{\tau}_{il}\{e^{(0)}\},
\]
which is equivalent to
\[
\sum_{l=1}^{2}\Bigg\{\sum_{i=1}^{n}A_{ij}\,\hat{\tau}_{il}\{e^{(0)}\}\log\hat{P}_{l1}+\sum_{i=1}^{n}\big(1-A_{ij}\big)\hat{\tau}_{il}\{e^{(0)}\}\log\big(1-\hat{P}_{l1}\big)\Bigg\} >
\sum_{l=1}^{2}\Bigg\{\sum_{i=1}^{n}A_{ij}\,\hat{\tau}_{il}\{e^{(0)}\}\log\hat{P}_{l2}+\sum_{i=1}^{n}\big(1-A_{ij}\big)\hat{\tau}_{il}\{e^{(0)}\}\log\big(1-\hat{P}_{l2}\big)\Bigg\}. \tag{S23}
\]
Let B^0_{lj} = Σ_{i=1}^{n} A_{ij} τ̂_{il}{e^{(0)}} and n^0_l = Σ_{i=1}^{n} τ̂_{il}{e^{(0)}} for all j ∈ {1, 2, . . . , n} and l ∈ {1, 2}. We have
\[
\hat{P} = \begin{pmatrix}\hat{P}_{11} & \hat{P}_{12}\\ \hat{P}_{21} & \hat{P}_{22}\end{pmatrix} = \frac{2}{m}\begin{pmatrix}\hat{a} & \hat{b}\\ \hat{b} & \hat{a}\end{pmatrix}-\frac{1}{m^2}\begin{pmatrix}\hat{a}^2 & \hat{b}^2\\ \hat{b}^2 & \hat{a}^2\end{pmatrix}. \tag{S24}
\]

By simplifying (S23), we can restate that ĉ_j{e^{(0)}} = 1 if
\[
\big(B^0_{1j}-B^0_{2j}\big)\log\frac{\hat{P}_{11}}{\hat{P}_{12}}+\Big\{B^0_{1j}-B^0_{2j}-(n^0_1-n^0_2)\Big\}\log\frac{1-\hat{P}_{12}}{1-\hat{P}_{11}} > 0. \tag{S25}
\]

Since â > b̂, we have P̂_{11} > P̂_{12}. Thus by (S25), we have
\[
\begin{aligned}
P\big[\hat{c}_j\{e^{(0)}\}\neq 1\big]
&\le P\Big[\big\{B^0_{1j}-B^0_{2j}\le 0\big\}\cup\big\{B^0_{1j}-B^0_{2j}-(n^0_1-n^0_2)\le 0\big\}\Big]\\
&\le P\Big[\big\{B^0_{1j}-B^0_{2j}\le\epsilon\big\}\cup\big\{|n^0_1-n^0_2|\ge\epsilon\big\}\Big]\\
&\le P\big[B^0_{1j}-B^0_{2j}\le\epsilon\big]+P\big[|n^0_1-n^0_2|\ge\epsilon\big]. \tag{S26}
\end{aligned}
\]
Now we bound P[B^0_{1j} − B^0_{2j} ≤ ε] and P[|n^0_1 − n^0_2| ≥ ε] separately. Firstly,
\[
\begin{aligned}
P\big[B^0_{1j}-B^0_{2j}\le\epsilon\big]
&= P\Bigg[\sum_{i=1}^{n}A_{ij}\hat{\tau}_{i1}\{e^{(0)}\}-\sum_{i=1}^{n}A_{ij}\hat{\tau}_{i2}\{e^{(0)}\}\le\epsilon\Bigg]\\
&\le P\Bigg[\sum_{i=1}^{n}A_{ij}I(c_i=1)-\sum_{i=1}^{n}A_{ij}I(c_i=2)+2\sum_{i=1}^{m}A_{ij}\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\le\epsilon\Bigg]\\
&\le P\Bigg[\sum_{i=1}^{n}A_{ij}I(c_i=1)-\sum_{i=1}^{n}A_{ij}I(c_i=2)\le 2\epsilon\Bigg]+\sum_{i=1}^{m}P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big]. \tag{S27}
\end{aligned}
\]
Secondly,
\[
\begin{aligned}
P\big[|n^0_1-n^0_2|\ge\epsilon\big]
&\le P\Big[n^0_1\le\frac{n}{2}-\frac{\epsilon}{2}\Big]+P\Big[n^0_2\le\frac{n}{2}-\frac{\epsilon}{2}\Big]\\
&\le \sum_{i=1}^{m}P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big]+\sum_{i=m+1}^{n}P\Big[\big|\hat{\tau}_{i2}\{e^{(0)}\}-I(c_i=2)\big|\ge\frac{\epsilon}{n}\Big]. \tag{S28}
\end{aligned}
\]

Similar to Lemma 1 in Amini et al. (2013), we upper bound P[|τ̂_{i1}{e^{(0)}} − I(c_i=1)| ≥ ε/n] for all i ∈ {1, 2, . . . , m} and P[|τ̂_{i2}{e^{(0)}} − I(c_i=2)| ≥ ε/n] for all i ∈ {m+1, m+2, . . . , n} as follows. With (S24), it can be deduced that
\[
\frac{\hat{P}_{11}/(1-\hat{P}_{11})}{\hat{P}_{12}/(1-\hat{P}_{12})} \ge \frac{\delta^2}{1-\delta} > 1.
\]
Let B_{ik} = Σ_{j=1}^{n} A_{ij} I(e_j=k), n_k = Σ_{i=1}^{n} I(e_i=k) and δ̃ = δ²/(1−δ); we have
\[
\begin{aligned}
P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big]
&= P\Bigg[\frac{\hat{\tau}_{i1}\{e^{(0)}\}}{\hat{\tau}_{i2}\{e^{(0)}\}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]\\
&= P\Bigg[\frac{(\hat{P}_{11})^{B_{i1}}(\hat{P}_{12})^{B_{i2}}(1-\hat{P}_{11})^{n_1-B_{i1}}(1-\hat{P}_{12})^{n_2-B_{i2}}}{(\hat{P}_{21})^{B_{i1}}(\hat{P}_{22})^{B_{i2}}(1-\hat{P}_{21})^{n_1-B_{i1}}(1-\hat{P}_{22})^{n_2-B_{i2}}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]\\
&\le P\Bigg[\tilde{\delta}^{\,B_{i1}-B_{i2}}\le\frac{1-\epsilon/n}{\epsilon/n}\Bigg]+P\big[B_{i1}-B_{i2}<0\big]\\
&\le 2P\Bigg[B_{i1}-B_{i2}\le\frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg]. \tag{S29}
\end{aligned}
\]
Let ξ_i(σ{e^{(0)}}) = B_{i1} − B_{i2}, and recall that ξ̃_i(σ{e^{(0)}}) = B̃_{i1} − B̃_{i2}. Then we have
\[
\begin{aligned}
\xi_i\big(\sigma\{e^{(0)}\}\big)-\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)
&= \big(B_{i1}-\tilde{B}_{i1}\big)-\big(B_{i2}-\tilde{B}_{i2}\big)\\
&= \sum_{j=1}^{n}\big(A_{ij}-\tilde{A}_{ij}\big)I(e_j=1)-\sum_{j=1}^{n}\big(A_{ij}-\tilde{A}_{ij}\big)I(e_j=2)\\
&\ge -\sum_{j=1}^{n}\big(A_{ij}-\tilde{A}_{ij}\big)I(e_j=2)\\
&\ge -\sum_{j=1}^{n}\big(\tilde{A}_{ij}+\tilde{A}_{ji}\big)I(e_j=2) \qquad \big(\text{by } A_{ij}-\tilde{A}_{ij}\le\tilde{A}_{ij}+\tilde{A}_{ji}\big).
\end{aligned}
\]
Thus, we have shown that
\[
\xi_i\big(\sigma\{e^{(0)}\}\big) \ge \tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)-\sum_{j=1}^{n}\tilde{A}_{ij}I(e_j=2)-\sum_{j=1}^{n}\tilde{A}_{ji}I(e_j=2).
\]

Consequently, we have
\[
\begin{aligned}
&P\Bigg[\xi_i\big(\sigma\{e^{(0)}\}\big)\le\frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg]\\
&\quad\le P\Bigg[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)-\sum_{j=1}^{n}\tilde{A}_{ij}I(e_j=2)-\sum_{j=1}^{n}\tilde{A}_{ji}I(e_j=2)\le\frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg]\\
&\quad\le P\Bigg[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\le 2(1+\epsilon)a_\gamma+\frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg]
+P\Bigg[\sum_{j=1}^{n}\tilde{A}_{ij}I(e_j=2)\ge(1+\epsilon)a_\gamma\Bigg]\\
&\qquad+P\Bigg[\sum_{j=1}^{n}\tilde{A}_{ji}I(e_j=2)\ge(1+\epsilon)a_\gamma\Bigg]. \tag{S30}
\end{aligned}
\]
Now we consider the term P[ξ̃_i(σ{e^{(0)}}) ≤ 2(1+ε)a_γ + (1/log δ̃) log{(1−ε/n)/(ε/n)}]. Recall that
\[
\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big) = \tilde{B}_{i1}-\tilde{B}_{i2} = \sum_{j=1}^{n}\tilde{A}_{ij}\sigma_j\{e^{(0)}\}, \qquad \text{where } \sigma_j\{e^{(0)}\} = \begin{cases} 1, & e_j=1,\\ -1, & e_j=2. \end{cases}
\]
We have shown in (S15) that
\[
P\Big[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\le(2\gamma-1)(a-b)-t\Big] \le e^{-\frac{t^2}{4(a+b)}}, \quad \forall t\in[0,3(a+b)]. \tag{S31}
\]
Take t = (2γ−1)(a−b) − [2(1+ε)a_γ + (1/log δ̃) log{(1−ε/n)/(ε/n)}]. Recall a_γ = (1−γ)a + γb and (a−b) → ∞ as n → ∞. Then, when n is large enough, we have
\[
\frac{1-\epsilon}{2}(2\gamma-1)(a-b) > \frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big). \tag{S32}
\]
With the assumption that ε(2γ−1)(a−b) ≥ 2(1+ε)a_γ and (S32), we have
\[
2(1+\epsilon)a_\gamma+\frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big) \le \frac{1+\epsilon}{2}(2\gamma-1)(a-b).
\]
Thus, we have
\[
0 < \frac{1-\epsilon}{2}(2\gamma-1)(a-b) \le t \le (2\gamma-1)(a-b) \le 3(a+b). \tag{S33}
\]
By plugging t = (2γ−1)(a−b) − [2(1+ε)a_γ + (1/log δ̃) log{(1−ε/n)/(ε/n)}] into (S31), it follows that
\[
P\Bigg[\tilde{\xi}_i\big(\sigma\{e^{(0)}\}\big)\le 2(1+\epsilon)a_\gamma+\frac{1}{\log\tilde{\delta}}\log\Big(\frac{1-\epsilon/n}{\epsilon/n}\Big)\Bigg] \le e^{-\frac{t^2}{4(a+b)}} \le e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}. \tag{S34}
\]
Next, we consider the terms P[Σ_{j=1}^{n} Ã_{ij} I(e_j=2) ≥ (1+ε)a_γ] and P[Σ_{j=1}^{n} Ã_{ji} I(e_j=2) ≥ (1+ε)a_γ]. Let Ã_{i∗}{e^{(0)}} = Σ_{j=1}^{n} Ã_{ij} I(e_j=2) and Ã_{∗i}{e^{(0)}} = Σ_{j=1}^{n} Ã_{ji} I(e_j=2). By symmetry, we have that
\[
P\Big[\tilde{A}_{i*}\{e^{(0)}\}\ge(1+\epsilon)a_\gamma\Big] = P\Big[\tilde{A}_{*i}\{e^{(0)}\}\ge(1+\epsilon)a_\gamma\Big].
\]

Note that since both Ã_{i∗}{e^{(0)}} and Ã_{∗i}{e^{(0)}} are sums of independent bounded random variables, we can apply the Bernstein inequality to obtain upper bounds. For Ã_{i∗}{e^{(0)}}, we have |Ã_{ij}I(e_j=2) − E[Ã_{ij}I(e_j=2)]| ≤ 1, and
\[
E\Big[\tilde{A}_{i*}\{e^{(0)}\}\Big] = (1-\gamma)m\cdot\frac{a}{m}+\gamma m\cdot\frac{b}{m} = (1-\gamma)a+\gamma b = a_\gamma,
\]
\[
\upsilon = \mathrm{Var}\Big(\tilde{A}_{i*}\{e^{(0)}\}\Big) = \sum_{j=1}^{n}\mathrm{Var}\big(\tilde{A}_{ij}\big)I(e_j=2) \le \sum_{j=1}^{n}E\big[\tilde{A}_{ij}^2 I(e_j=2)\big] = a_\gamma.
\]
Then by applying the Bernstein inequality to Ã_{i∗}{e^{(0)}}, we have
\[
P\Big[\tilde{A}_{i*}\{e^{(0)}\}\ge E\big[\tilde{A}_{i*}\{e^{(0)}\}\big]+t\Big] \le e^{-\frac{t^2/2}{\upsilon+t/3}}, \quad \forall t\ge 0. \tag{S35}
\]
Let t = εa_γ ≥ 0 in (S35) and by noting that υ ≤ a_γ, we have
\[
P\Big[\tilde{A}_{*i}\{e^{(0)}\}\ge(1+\epsilon)a_\gamma\Big] = P\Big[\tilde{A}_{i*}\{e^{(0)}\}\ge(1+\epsilon)a_\gamma\Big] \le e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}. \tag{S36}
\]

Thus, by (S29), (S30), (S34) and (S36), it follows that for i = 1, 2, . . . , m,
\[
P\Big[\big|\hat{\tau}_{i1}\{e^{(0)}\}-I(c_i=1)\big|\ge\frac{\epsilon}{n}\Big] \le 2\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}. \tag{S37}
\]
Similarly, we can also obtain, for i = m+1, m+2, . . . , n,
\[
P\Big[\big|\hat{\tau}_{i2}\{e^{(0)}\}-I(c_i=2)\big|\ge\frac{\epsilon}{n}\Big] \le 2\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}. \tag{S38}
\]
By (S28), (S37) and (S38), we have
\[
P\big[|n^0_1-n^0_2|\ge\epsilon\big] \le 2n\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}. \tag{S39}
\]
According to (S27), we still need to obtain the upper bound of P[Σ_{i=1}^{n} A_{ij} I(c_i=1) − Σ_{i=1}^{n} A_{ij} I(c_i=2) ≤ 2ε]. Let η_j(σ(c)) = Σ_{i=1}^{n} A_{ij} I(c_i=1) − Σ_{i=1}^{n} A_{ij} I(c_i=2). We then have
\[
\begin{aligned}
\eta_j\big(\sigma(c)\big)-\tilde{\eta}_j\big(\sigma(c)\big)
&= \sum_{i=1}^{n}\big(A_{ij}-\tilde{A}_{ij}\big)I(c_i=1)-\sum_{i=1}^{n}\big(A_{ij}-\tilde{A}_{ij}\big)I(c_i=2)\\
&\ge -\sum_{i=1}^{n}\big(A_{ij}-\tilde{A}_{ij}\big)I(c_i=2)\\
&\ge -\sum_{i=1}^{n}\big(\tilde{A}_{ij}+\tilde{A}_{ji}\big)I(c_i=2) \qquad \big(\text{by } A_{ij}-\tilde{A}_{ij}\le\tilde{A}_{ij}+\tilde{A}_{ji}\big).
\end{aligned}
\]
Let Ã_{∗j}(c) = Σ_{i=1}^{n} Ã_{ij} I(c_i=2) and Ã_{j∗}(c) = Σ_{i=1}^{n} Ã_{ji} I(c_i=2). We have
\[
\eta_j\big(\sigma(c)\big) \ge \tilde{\eta}_j\big(\sigma(c)\big)-\tilde{A}_{*j}(c)-\tilde{A}_{j*}(c).
\]

From the assumption that ε(2γ−1)(a−b) ≥ 2(1+ε)a_γ, γ ∈ (1/2, 1) and a > b, we can get that
\[
a \ge \frac{2(1+\epsilon)\gamma+\epsilon(2\gamma-1)}{\epsilon(2\gamma-1)-2(1+\epsilon)(1-\gamma)}\,b > b.
\]
It is not difficult to check that there exists ρ ∈ (0, 1) such that
\[
\rho(a-b)-2(1+\epsilon)b > 0. \tag{S40}
\]

Then, we have
\[
\begin{aligned}
P\big[\eta_j(\sigma(c))\le 2\epsilon\big]
&\le P\Big[\tilde{\eta}_j\big(\sigma(c)\big)-\tilde{A}_{*j}(c)-\tilde{A}_{j*}(c)\le 2\epsilon\Big]\\
&\le P\Big[\tilde{\eta}_j\big(\sigma(c)\big)\le\frac{1-\rho}{2}(a-b)+2(1+\epsilon)b+2\epsilon\Big]
+P\Big[\tilde{A}_{*j}(c)\ge(1+\epsilon)b+\frac{1-\rho}{4}(a-b)\Big]\\
&\quad+P\Big[\tilde{A}_{j*}(c)\ge(1+\epsilon)b+\frac{1-\rho}{4}(a-b)\Big]. \tag{S41}
\end{aligned}
\]
Consider the term P[η̃_j(σ(c)) ≤ (1−ρ)(a−b)/2 + 2(1+ε)b + 2ε]. Recall that in (S11), we have shown
\[
P\big[\tilde{\eta}_j(\sigma(c))\le(a-b)-t\big] \le e^{-\frac{t^2}{4(a+b)}}, \quad \forall t\in[0,3(a+b)]. \tag{S42}
\]
Then we can take
\[
t = (a-b)-\Big\{\frac{1-\rho}{2}(a-b)+2(1+\epsilon)b+2\epsilon\Big\} = \frac{1-\rho}{2}(a-b)-2\epsilon+\big\{\rho(a-b)-2(1+\epsilon)b\big\}. \tag{S43}
\]
With (S40), (S43) and (a−b) → ∞ as n → ∞, it follows that when n is large enough we have
\[
0 < \frac{1-\rho}{4}(a-b) \le t \le 3(a+b). \tag{S44}
\]
With (S42), (S43), and (S44), we get (when n is large enough)
\[
P\Big[\tilde{\eta}_j\big(\sigma(c)\big)\le\frac{1-\rho}{2}(a-b)+2(1+\epsilon)b+2\epsilon\Big] \le e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}. \tag{S45}
\]
To bound the term P[Ã_{∗j}(c) ≥ (1+ε)b + (1−ρ)(a−b)/4], first recall that Ã_{∗j}(c) = Σ_{i=1}^{n} Ã_{ij} I(c_i=2) is a sum of independent random variables with |Ã_{ij}I(c_i=2) − E[Ã_{ij}I(c_i=2)]| ≤ 1. Therefore we can also apply the Bernstein inequality. We have
\[
E\big[\tilde{A}_{*j}(c)\big] = m\cdot\frac{b}{m} = b, \qquad
v = \mathrm{Var}\big(\tilde{A}_{*j}(c)\big) = \sum_{i=1}^{n}\mathrm{Var}\big(\tilde{A}_{ij}\big)I(c_i=2) \le \sum_{i=1}^{n}E\big[\tilde{A}_{ij}^2 I(c_i=2)\big] = b. \tag{S46}
\]

Thus, by applying the Bernstein inequality to Ã_{∗j}(c), we have
\[
P\Big[\tilde{A}_{*j}(c)\ge E\big[\tilde{A}_{*j}(c)\big]+t\Big] \le e^{-\frac{t^2/2}{v+t/3}} \le e^{-\frac{t^2/2}{b+t/3}}, \quad \forall t\ge 0. \tag{S47}
\]

Take t = εb + (1−ρ)(a−b)/4. With (S47) and (S46), we have
\[
P\Big[\tilde{A}_{*j}(c)\ge(1+\epsilon)b+\frac{1-\rho}{4}(a-b)\Big] \le e^{-\frac{\frac{1}{2}\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{b+2a/3}} \le e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{2(a+b)}}. \tag{S48}
\]
By symmetry, we also have
\[
P\Big[\tilde{A}_{j*}(c)\ge(1+\epsilon)b+\frac{1-\rho}{4}(a-b)\Big] \le e^{-\frac{\frac{1}{2}\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{b+2a/3}} \le e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{2(a+b)}}. \tag{S49}
\]
Therefore, with (S41), (S45), (S48) and (S49), it follows that
\[
P\big[\eta_j(\sigma(c))\le 2\epsilon\big] \le e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{2(a+b)}} \le 3e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}. \tag{S50}
\]
With (S27), (S37) and (S50), we can get that, for j = 1, 2, . . . , m,
\[
P\big[B^0_{1j}-B^0_{2j}\le\epsilon\big] \le 3e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}+n\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}. \tag{S51}
\]
Similarly, with the same arguments, we can get that for j = m+1, m+2, . . . , n,
\[
P\big[B^0_{2j}-B^0_{1j}\le\epsilon\big] \le 3e^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}+n\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}. \tag{S52}
\]

Finally, with (S39), (S51) and (S52), it follows that when n is large enough, we have
\[
\begin{aligned}
P\big[\hat{c}\{e^{(0)}\}\neq c\big]
&= P\Bigg[\bigcup_{j\in\{1,2,\dots,n\}}\big\{\hat{c}_j\{e^{(0)}\}\neq c_j\big\}\Bigg]\\
&\le \sum_{j=1}^{m}P\big[B^0_{1j}-B^0_{2j}\le\epsilon\big]+\sum_{j=m+1}^{n}P\big[B^0_{2j}-B^0_{1j}\le\epsilon\big]+P\big[|n^0_1-n^0_2|\ge\epsilon\big]\\
&= 3ne^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}+n(n+2)\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}.
\end{aligned}
\]
Therefore, we have
\[
P\big[\hat{c}\{e^{(0)}\}=c\big] = 1-P\big[\hat{c}\{e^{(0)}\}\neq c\big]
\ge 1-\Bigg[3ne^{-\frac{\left(\frac{1-\rho}{4}\right)^2(a-b)^2}{4(a+b)}}+n(n+2)\Bigg\{e^{-\frac{\left(\frac{1-\epsilon}{2}\right)^2(2\gamma-1)^2(a-b)^2}{4(a+b)}}+2e^{-\frac{\epsilon^2/2}{1+\epsilon/3}a_\gamma}\Bigg\}\Bigg].
\]

Thus we complete the proof of Theorem 3.

A4 Distributions of ĉ(s) and ĉ(w)


We first show that ĉ^(w) is weakly consistent for c. Let X_i ≜ 1(ĉ_i^{(w)} ≠ c_i) − P(ĉ_i^{(w)} ≠ c_i), where P(ĉ_i^{(w)} ≠ c_i) = (1+π_1)p_n with p_n = 1/log n. Then, it can be seen that
\[
|X_i|\le 1 \text{ and } EX_i = 0 \text{ for all } i=1,2,\dots,n, \qquad \sum_{i=1}^{n}EX_i^2 = n\big[(1+\pi_1)p_n\{1-(1+\pi_1)p_n\}\big].
\]
Thus, by applying the Bernstein inequality to Σ_{i=1}^{n} X_i, we can get that
\[
P\Bigg[\sum_{i=1}^{n}X_i\ge t\Bigg] \le \exp\Bigg(-\frac{t^2/2}{n\big[(1+\pi_1)p_n\{1-(1+\pi_1)p_n\}\big]+t/3}\Bigg), \quad \forall t\ge 0. \tag{S53}
\]
Recall X_i ≜ 1(ĉ_i^{(w)} ≠ c_i) − P(ĉ_i^{(w)} ≠ c_i). We plug t = nε − Σ_{i=1}^{n} P(ĉ_i^{(w)} ≠ c_i) (which is nonnegative when n is sufficiently large) into (S53) and get that
\[
P\Bigg[\frac{1}{n}\sum_{i=1}^{n}1\big(\hat{c}_i^{(w)}\neq c_i\big)\ge\epsilon\Bigg]
= P\Bigg[\sum_{i=1}^{n}1\big(\hat{c}_i^{(w)}\neq c_i\big)-\sum_{i=1}^{n}P\big(\hat{c}_i^{(w)}\neq c_i\big)\ge n\epsilon-\sum_{i=1}^{n}P\big(\hat{c}_i^{(w)}\neq c_i\big)\Bigg] \tag{S54}
\]
\[
\le \exp\Bigg(-\frac{\{n\epsilon-n(1+\pi_1)p_n\}^2/2}{n\big[(1+\pi_1)p_n\{1-(1+\pi_1)p_n\}\big]+\{n\epsilon-n(1+\pi_1)p_n\}/3}\Bigg). \tag{S55}
\]

Thus, ĉ^(w) is weakly consistent for c. Next, we show that ĉ^(w) is not strongly consistent for c. Specifically, we have
\[
P\big(\hat{c}^{(w)}=c\big) = \prod_{i=1}^{n}P\big(\hat{c}_i^{(w)}=c_i\big) \le \prod_{i=1}^{n}\Big(1-\frac{1}{\log n}\Big) = \Bigg\{\Big(1-\frac{1}{\log n}\Big)^{-\log n}\Bigg\}^{-\frac{n}{\log n}}. \tag{S56}
\]
Thus by (S56), we know that ĉ^(w) is not strongly consistent for c.


By the classical central limit theorem for independent and identically distributed random variables, we have
\[
\sqrt{n}\Bigg\{\frac{1}{n}\sum_{i=1}^{n}1(c_i=1)-\pi_1\Bigg\} \xrightarrow{d} N\big\{0,\pi_1(1-\pi_1)\big\}. \tag{S57}
\]

We also have that
\[
\sqrt{n}\Bigg\{\frac{1}{n}\sum_{i=1}^{n}1\big(\hat{c}_i^{(s)}=1\big)-\pi_1\Bigg\}-\sqrt{n}\Bigg\{\frac{1}{n}\sum_{i=1}^{n}1(c_i=1)-\pi_1\Bigg\}
= \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\Big\{1\big(\hat{c}_i^{(s)}=1\big)-1(c_i=1)\Big\} = o_p(1),
\]
which is based on the fact that, for all ε > 0,
\[
P\Bigg[\Bigg|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\Big\{1\big(\hat{c}_i^{(s)}=1\big)-1(c_i=1)\Big\}\Bigg|\ge\epsilon\Bigg] \le P\big(\hat{c}^{(s)}\neq c\big) = o(1).
\]
Thus, √n{(1/n) Σ_{i=1}^{n} 1(ĉ_i^{(s)}=1) − π_1} has the same limiting distribution as √n{(1/n) Σ_{i=1}^{n} 1(c_i=1) − π_1}.
Finally, we show that
\[
\sqrt{n}\Bigg\{\frac{1}{n}\sum_{i=1}^{n}1\big(\hat{c}_i^{(w)}=1\big)-\Big(\pi_1+\frac{1-3\pi_1}{\log n}\Big)\Bigg\} \xrightarrow{d} N\big\{0,\pi_1(1-\pi_1)\big\}.
\]
Let X_{ni} ≜ 1(ĉ_i^{(w)}=1) − P(ĉ_i^{(w)}=1). We have
\[
EX_{ni} = 0, \qquad s_n^2 = \frac{1}{n}\sum_{i=1}^{n}EX_{ni}^2 = (\pi_1-\pi_1^2)-O(p_n) \to s^2 = \pi_1(1-\pi_1) \neq 0, \quad \text{as } n\to\infty.
\]
We verify the following Lindeberg condition. Specifically, note that P(ĉ_i^{(w)}=1) = π_1 + (1−3π_1)/log n; then for every ε > 0, we have
\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}E\Big[X_{ni}^2\,1\big\{|X_{ni}|\ge\epsilon\sqrt{n}\big\}\Big]
&= E\Big[\big|1\big(\hat{c}_i^{(w)}=1\big)-P\big(\hat{c}_i^{(w)}=1\big)\big|^2\,1\Big\{\big|1\big(\hat{c}_i^{(w)}=1\big)-P\big(\hat{c}_i^{(w)}=1\big)\big|\ge\epsilon\sqrt{n}\Big\}\Big]\\
&\le P\Big[\big|1\big(\hat{c}_i^{(w)}=1\big)-P\big(\hat{c}_i^{(w)}=1\big)\big|\ge\epsilon\sqrt{n}\Big]\\
&= P\Bigg[\Big|1\big(\hat{c}_i^{(w)}=1\big)-\Big(\pi_1+\frac{1-3\pi_1}{\log n}\Big)\Big|\ge\epsilon\sqrt{n}\Bigg]. \tag{S58}
\end{aligned}
\]

Also note that
\[
P\Bigg[\Big|1\big(\hat{c}_i^{(w)}=1\big)-\Big(\pi_1+\frac{1-3\pi_1}{\log n}\Big)\Big|\ge\epsilon\sqrt{n}\Bigg] \le P\big[1\ge\epsilon\sqrt{n}\big] \to 0, \quad \text{as } n\to\infty. \tag{S59}
\]

Thus, putting (S58) and (S59) together yields
\[
\frac{1}{n}\sum_{i=1}^{n}E\Big[X_{ni}^2\,1\big\{|X_{ni}|\ge\epsilon\sqrt{n}\big\}\Big] \to 0, \quad \text{as } n\to\infty. \tag{S60}
\]

By the Lindeberg–Feller central limit theorem, we can get that
\[
\sqrt{n}\Bigg(\frac{1}{n}\sum_{i=1}^{n}X_{ni}\Bigg) \xrightarrow{d} N(0,s^2),
\]
which is equivalent to
\[
\sqrt{n}\Bigg\{\frac{1}{n}\sum_{i=1}^{n}1\big(\hat{c}_i^{(w)}=1\big)-\Big(\pi_1+\frac{1-3\pi_1}{\log n}\Big)\Bigg\} \xrightarrow{d} N\big\{0,\pi_1(1-\pi_1)\big\}. \tag{S61}
\]
A5 Extension to the Bipartite SBM

Bipartite networks are a ubiquitous class of networks, in which nodes are of two disjoint types and edges are only formed between nodes of different types. Bipartite networks can be used to characterize many real-world systems, such as authorship of papers and people attending events (Zhang and Chen, 2018). Community detection in bipartite networks has been studied in many scientific fields, such as text mining (Bisson and Hussain, 2008), physics (Larremore et al., 2014), and genetic studies (Madeira et al., 2010). In this section, we extend the proposed profile-pseudo likelihood method to the case of bipartite stochastic block models (BiSBM).

Let G(V1 , V2 , E) denote a bipartite network, where V1 = {1, . . . , m} and V2 = {1, . . . , n}

are node sets of the two different types of nodes, respectively, and E is the set of edges

between nodes in V1 and V2 . The network G(V1 , V2 , E) can be uniquely represented by the

corresponding m × n bi-adjacency matrix A = [Aij ], where Aij = 1 if there is an edge from

node i of type 1 to node j of type 2 and Aij = 0 otherwise. Under the BiSBM, nodes in V1

form K1 blocks and nodes in V2 form K2 blocks. Specifically, for nodes in V1 , the labels c1 =

(c11 , c12 , . . . , c1m ) are drawn independently from a multinomial distribution with parameters

π1 = (π11 , π12 , . . . , π1K1 ), and for nodes in V2 , the labels c2 = (c21 , c22 , . . . , c2n ) are drawn

independently from a multinomial distribution with parameters π2 = (π21 , π22 , . . . , π2K2 ).

Conditional on c_1 and c_2, the edges A_ij are independent Bernoulli variables with
\[
E[A_{ij}\,|\,c_1,c_2] = P_{c_{1i}c_{2j}},
\]
where P = [P_kl] is a K_1 × K_2 matrix. The goal of community detection is then to estimate the node labels c_1 and c_2 from the bi-adjacency matrix A.
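For concreteness, a BiSBM can be sampled as follows (a minimal sketch in Python; the function name and parameter values are illustrative):

import numpy as np

def sample_bisbm(m, n, pi1, pi2, P, seed=0):
    # Draw type-1 and type-2 labels, then independent edges A_ij ~ Bernoulli(P[c1_i, c2_j])
    rng = np.random.default_rng(seed)
    c1 = rng.choice(len(pi1), size=m, p=pi1)
    c2 = rng.choice(len(pi2), size=n, p=pi2)
    A = rng.binomial(1, P[np.ix_(c1, c2)])
    return A, c1, c2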

Define Ω = (π_1, P) and e_2 = (e_{21}, e_{22}, . . . , e_{2n}). To estimate the node labels c_2 from the bi-adjacency matrix A, we define the following log pseudo-likelihood function
\[
\ell^{B}_{\mathrm{PL}}(\Omega, e_2; \{a_i\}) = \sum_{i=1}^{m}\log\Bigg\{\sum_{k=1}^{K_1}\pi_{1k}\prod_{j=1}^{n}P_{k e_{2j}}^{A_{ij}}\big(1-P_{k e_{2j}}\big)^{1-A_{ij}}\Bigg\}.
\]
Algorithm 3 BiSBM Profile-Pseudo Likelihood Maximization Algorithm.
Step 1: Initialize e_1^{(0)} and e_2^{(0)} by applying SCP to AA^⊤ and A^⊤A, respectively.
Step 2: Calculate Ω^{(0)} = (π_1^{(0)}, P^{(0)}). That is, for 1 ≤ k ≤ K_1 and 1 ≤ l ≤ K_2,
\[
\pi_{1k}^{(0)} = \frac{1}{m}\sum_{i=1}^{m}I\big(e_{1i}^{(0)}=k\big), \qquad
P_{kl}^{(0)} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}\,I\big(e_{1i}^{(0)}=k\big)I\big(e_{2j}^{(0)}=l\big)}{\sum_{i=1}^{m}\sum_{j=1}^{n}I\big(e_{1i}^{(0)}=k\big)I\big(e_{2j}^{(0)}=l\big)}.
\]
Step 3: Initialize Ω^{(0,0)} = (π_1^{(0,0)}, P^{(0,0)}) = (π_1^{(0)}, P^{(0)}).
repeat
    repeat
        Step 4: E-step: compute τ_{ik}^{(s,t+1)}. That is, for 1 ≤ k ≤ K_1 and 1 ≤ i ≤ m,
\[
\tau_{ik}^{(s,t+1)} = \frac{\pi_{1k}^{(s,t)}\prod_{j=1}^{n}\Big\{P^{(s,t)}_{k e_{2j}^{(s)}}\Big\}^{A_{ij}}\Big\{1-P^{(s,t)}_{k e_{2j}^{(s)}}\Big\}^{1-A_{ij}}}{\sum_{l=1}^{K_1}\pi_{1l}^{(s,t)}\prod_{j=1}^{n}\Big\{P^{(s,t)}_{l e_{2j}^{(s)}}\Big\}^{A_{ij}}\Big\{1-P^{(s,t)}_{l e_{2j}^{(s)}}\Big\}^{1-A_{ij}}}.
\]
        Step 5: M-step: compute π_1^{(s,t+1)} and P^{(s,t+1)}. That is, for 1 ≤ k ≤ K_1 and 1 ≤ l ≤ K_2,
\[
\pi_{1k}^{(s,t+1)} = \frac{1}{m}\sum_{i=1}^{m}\tau_{ik}^{(s,t+1)}, \qquad
P_{kl}^{(s,t+1)} = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}A_{ij}\,\tau_{ik}^{(s,t+1)}I\big(e_{2j}^{(s)}=l\big)}{\sum_{i=1}^{m}\sum_{j=1}^{n}\tau_{ik}^{(s,t+1)}I\big(e_{2j}^{(s)}=l\big)}.
\]
    until the EM algorithm converges.
    Step 6: Set Ω^{(s+1)} to be the final EM update.
    Step 7: Given Ω^{(s+1)}, update e_{2j}^{(s+1)}, 1 ≤ j ≤ n, using
\[
e_{2j}^{(s+1)} = \arg\max_{l\in\{1,2,\dots,K_2\}}\sum_{i=1}^{m}\sum_{k=1}^{K_1}\tau_{ik}^{(s+1)}\Big\{A_{ij}\log P_{kl}^{(s+1)}+(1-A_{ij})\log\big(1-P_{kl}^{(s+1)}\big)\Big\}.
\]
until the profile-pseudo likelihood converges.
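The following is a compact sketch of Algorithm 3 in Python (not the authors' released code; it assumes a dense binary bi-adjacency matrix and computes the E-step in the log domain to avoid underflow in the products over j):

import numpy as np

def ppl_bisbm(A, K1, K2, e2, n_outer=20, n_em=50, eps=1e-10):
    # A: m x n binary bi-adjacency matrix; e2: initial column labels in {0, ..., K2-1}
    m, n = A.shape
    pi1 = np.full(K1, 1.0 / K1)
    P = np.full((K1, K2), A.mean())
    tau = np.full((m, K1), 1.0 / K1)
    for _ in range(n_outer):
        # Sufficient statistics for the current column labels e2
        S = np.stack([A[:, e2 == l].sum(axis=1) for l in range(K2)], axis=1)  # m x K2
        N = np.bincount(e2, minlength=K2)                                     # block sizes
        for _ in range(n_em):
            logP, log1P = np.log(P + eps), np.log(1 - P + eps)
            # Step 4 (E-step): posterior row memberships tau_ik
            ll = S @ logP.T + (N[None, :] - S) @ log1P.T + np.log(pi1 + eps)  # m x K1
            tau = np.exp(ll - ll.max(axis=1, keepdims=True))
            tau /= tau.sum(axis=1, keepdims=True)
            # Step 5 (M-step): update pi1 and P
            pi1 = tau.mean(axis=0)
            P = (tau.T @ S) / (np.outer(tau.sum(axis=0), N) + eps)
        # Step 7: update each column label given the fitted (pi1, P) and tau
        M = tau.T @ A                                                         # K1 x n
        scores = M.T @ np.log(P + eps) + (tau.sum(axis=0)[None, :] - M.T) @ np.log(1 - P + eps)
        e2 = scores.argmax(axis=1)
    return tau.argmax(axis=1), e2  # estimated row labels (c1) and column labels (c2)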

A profile-pseudo likelihood algorithm that maximizes ℓ^B_PL(Ω, e_2; {a_i}) is described in Algorithm 3. Note that c_1 can be estimated similarly to c_2, and we omit the details.

We investigate the performance of the proposed profile-pseudo likelihood method for the BiSBM. We fix m = n = 1200, K_1 = K_2 = 2, π_1 = (1/2, 1/2), π_2 = (1/2, 1/2), and edge probability P_kl = 0.1{1.2 + 0.4 × 1(k = l)} between communities k and l, for all k, l = 1, 2. We compare PPL with two other clustering methods, namely SCP and SVD (Rohe et al., 2012; Sarkar and Dong, 2011).

Figure 9: Left: comparison of PPL, SVD and SCP for estimating c1 in BiSBM; right: comparison of PPL, SVD and SCP for estimating c2 in BiSBM.

As for SCP, to deal with bipartite networks, we apply it to AA^⊤ to obtain an estimate of c_1, and to A^⊤A to obtain an estimate of c_2. The results are summarized in Figure 9, based on 100 replications. It is seen that PPL outperforms both SCP and SVD for community detection in bipartite networks.

A6 Additional Numerical Results


A6.1 Running time for SCP

We report the computing time for SCP in Setting 3 of Section 5.1. Specifically, we set K = 3, π = (0.2, 0.3, 0.5), λ = 5, β = 0.05 and vary the network size n from 10^{2.5} to 10^6. The results from 100 data replicates are reported in Figure 10. It is seen that SCP takes less than 100 seconds when the network has one million nodes. This efficiency is largely due to the eigs() function in Matlab, which computes eigensystems of large sparse matrices iteratively using ARPACK. We note that the computational efficiency of eigs() can decrease when the network density and the number of communities K increase.
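For readers working in Python rather than Matlab, an analogous timing experiment can be sketched with scipy's eigsh, which also wraps ARPACK (illustrative only; the random sparse matrix below is a stand-in for an SBM draw with expected degree λ):

import time
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import eigsh

n, K, lam = 10 ** 5, 3, 5.0
A = sparse.random(n, n, density=lam / n, format='csr', data_rvs=lambda s: np.ones(s))
A = ((A + A.T) > 0).astype(float)        # symmetrize to get an undirected network

start = time.time()
vals, vecs = eigsh(A, k=K, which='LA')   # K leading eigenpairs via ARPACK
print(f"n = {n}: {time.time() - start:.2f} seconds")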


Figure 10: Computing time from SCP for large-scale and sparse networks under the SBM
with K = 3, π = (0.2, 0.3, 0.5), λ = 5, β = 0.05 and varying n.

A6.2 Comparison with Gao et al. (2017)

In this simulation study, we compare the performance of SCP, PPL and the majority voting

method proposed in Gao et al. (2017) (referred to as MV) on networks simulated from the

SBM. Specifically, we consider the simulation Setting 3 in Section 5.1, where the parameter

β controls the “out-in-ratio” and λ controls the overall expected network degree. We set

K = 3 and π = (0.2, 0.3, 0.5), and we consider three scenarios, 1) varying β while λ = 5

and n = 1200, 2) varying λ while β = 0.05 and n = 1200, and 3) varying n while λ = 5

and β = 0.05. Figure 11 reports the NMI from the three methods and the computing time

from PPL and MV, based on 100 replications. The running time for PPL does not include

the initialization step, which takes no more than a few seconds. Both PPL and MV use

SCP as the initial clustering method. It is seen that PPL and MV have comparable clustering

accuracies and they both outperform SCP in terms of NMI. Moreover, PPL is computationally more efficient than MV, as it need not repeatedly perform the leave-one-node-out spectral clustering.


Figure 11: Comparisons of the NMI and computing time from SCP, MV and PPL under different
settings.

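For reference, the NMI reported throughout can be computed as follows (a sketch with scikit-learn; its default normalization may differ slightly from the one used in our experiments):

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

c_true = np.repeat([0, 1, 2], [200, 300, 500])      # true communities, n = 1000
c_hat = c_true.copy()
c_hat[:50] = (c_hat[:50] + 1) % 3                   # perturb 5% of the labels
print(normalized_mutual_info_score(c_true, c_hat))  # NMI is invariant to label permutations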

