Completion: Matrix
A New Theory for Matrix Completion 3
Noisy Tensor Completion via the Sum-of-Squares Hierarchy 13
The Power of Convex Relaxation: Near-Optimal Matrix Completion 37
Matrix Completion with Noise 89
High-Rank Matrix Completion and Subspace Clustering with Missing Data 101
Low-Rank Matrix Completion Survey 127
A New Theory for Matrix Completion
B-DAT, School of Information & Control, Nanjing Univ Informat Sci & Technol
NO 219 Ningliu Road, Nanjing, Jiangsu, China, 210044
{gcliu,qsliu,xtyuan}@nuist.edu.cn
Abstract
Prevalent matrix completion theories rely on the assumption that the locations of
the missing data are distributed uniformly and randomly (i.e., uniform sampling).
Nevertheless, the reason for observations being missing often depends on the unseen
observations themselves, and thus the missing data in practice usually occurs in a
nonuniform and deterministic fashion rather than randomly. To break through the
limits of random sampling, this paper introduces a new hypothesis called isomeric
condition, which is provably weaker than the assumption of uniform sampling and
arguably holds even when the missing data is placed irregularly. Equipped with
this new tool, we prove a series of theorems for missing data recovery and matrix
completion. In particular, we prove that the exact solutions that identify the target
matrix are included as critical points by the commonly used nonconvex programs.
Unlike the existing theories for nonconvex matrix completion, which are built
upon the same condition as convex programs, our theory shows that nonconvex
programs have the potential to work with a much weaker condition. Compared to the existing studies on nonuniform sampling, our setup is more general.
1 Introduction
Missing data is a common occurrence in modern applications such as computer vision and image processing, significantly reducing the representativeness of data samples and thereby seriously distorting inferences about the data. Given this pressing situation, it is crucial to study the problem
of recovering the unseen data from a sampling of observations. Since the data in reality is often
organized in matrix form, it is of considerable practical significance to study the well-known problem
of matrix completion [1] which is to fill in the missing entries of a partially observed matrix.
Problem 1.1 (Matrix Completion). Denote the (i, j)th entry of a matrix as [·]ij . Let L0 ∈ Rm×n be
an unknown matrix of interest. In particular, the rank of L0 is also unknown. Given a sampling of
the entries in L0 and a 2D index set Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n} consisting of the locations
of the observed entries, i.e., given
{[L0 ]ij |(i, j) ∈ Ω} and Ω,
can we restore the missing entries whose indices are not included in Ω, in an exact and scalable
fashion? If so, under which conditions?
Due to its unique role in a broad range of applications, e.g., structure from motion and magnetic
resonance imaging, matrix completion has received extensive attention in the literature, e.g., [2–13].
∗ The work of Guangcan Liu is supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61622305 and Grant 61502238, and in part by the Natural Science Foundation of Jiangsu Province of China (NSFJPC) under Grant BK20160040.
† The work of Qingshan Liu is supported by NSFC under Grant 61532009.
‡ The work of Xiao-Tong Yuan is supported in part by NSFC under Grant 61402232 and Grant 61522308, and in part by NSFJPC under Grant BK20141003.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
In general, given no presumption about the nature of matrix entries, it is virtually impossible to
restore L0 as the missing entries can be of arbitrary values. That is, some assumptions are necessary
for solving Problem 1.1. Given the high-dimensional and massive nature of today's data, it is arguable that the target matrix L0 we wish to recover is often low rank [23]. Hence,
one may perform matrix completion by seeking a matrix with the lowest rank that also satisfies the
constraints given by the observed entries:
min_L rank(L), s.t. [L]ij = [L0]ij, ∀(i, j) ∈ Ω. (1)
Unfortunately, this idea is of little practical use because the problem above is NP-hard and cannot be
solved in polynomial time [15]. To achieve practical matrix completion, Candès and Recht [4]
suggested to consider an alternative that minimizes instead the nuclear norm which is a convex
envelope of the rank function [12]. Namely,
min_L ‖L‖∗, s.t. [L]ij = [L0]ij, ∀(i, j) ∈ Ω, (2)
where ‖·‖∗ denotes the nuclear norm, i.e., the sum of the singular values of a matrix. Rather
surprisingly, it is proved in [4] that the missing entries, with high probability, can be exactly restored
by the convex program (2), as long as the target matrix L0 is low rank and incoherent and the set Ω of
locations corresponding to the observed entries is a set sampled uniformly at random. This pioneering
work provides people several useful tools to investigate matrix completion and many other related
problems. Those assumptions, including low-rankness, incoherence and uniform sampling, are now
standard and widely used in the literature, e.g., [14, 17, 22, 24, 28, 33, 34, 36]. In particular, the
analyses in [17, 33, 36] show that, in terms of theoretical completeness, many nonconvex optimization
based methods are as powerful as the convex program (2). Unfortunately, these theories still depend
on the assumption of uniform sampling, and thus they cannot explain why there are many nonconvex
methods which often do better than the convex program (2) in practice.
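As a concrete illustration of how the convex program (2) is typically handled in practice, the following is a minimal sketch of an iterative soft-thresholded SVD, in the spirit of the SoftImpute algorithm of [10]. It solves a nuclear-norm-regularized surrogate rather than the exact equality-constrained program (2), and the shrinkage parameter and iteration count are our illustrative choices, not values from this paper.

```python
import numpy as np

def soft_impute(M, mask, lam=1.0, n_iter=300):
    """SoftImpute-style sketch for nuclear-norm matrix completion:
    repeatedly fill the unobserved entries with the current estimate,
    then soft-threshold the singular values of the filled-in matrix.
    `mask` is a boolean array marking the observed entries of M."""
    L = np.zeros_like(M)
    for _ in range(n_iter):
        Z = np.where(mask, M, L)                      # keep observed entries fixed
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)                  # shrink singular values
        L = (U * s) @ Vt                              # low-rank re-estimate
    return L
```

On a random rank-2 matrix with 60% of entries observed uniformly, this sketch recovers the target to within a few percent relative error, consistent with the recovery guarantees discussed above.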
The missing data in practice, however, often occurs in a nonuniform and deterministic fashion instead
of randomly. This is because the reason for an observation being missing usually depends on the
unseen observations themselves. For example, in structure from motion and magnetic resonance
imaging, typically the locations of the observed entries are concentrated around the main diagonal of
a matrix4 , as shown in Figure 1. Moreover, as pointed out by [19, 21, 23], the incoherence condition
is indeed not so consistent with the mixture structure of multiple subspaces, which is also a ubiquitous
phenomenon in practice. There has been sparse research in the direction of nonuniform sampling,
e.g., [18, 25–27, 31]. In particular, Negahban and Wainwright [26] studied the case of weighted
entrywise sampling, which is more general than the setup of uniform sampling but still a special
form of random sampling. Király et al. [18] considered deterministic sampling; their work is the most closely related to ours. However, they only established conditions for deciding whether a particular entry of the matrix can be restored. In other words, the setup of [18] may not handle well the dependence among the missing entries. In summary, matrix completion still lacks practical theories and methods, despite the considerable progress made in recent years.
To break through the limits of the setup of random sampling, in this paper we introduce a new
hypothesis called isomeric condition, which is a mixed concept that combines together the rank and
coherence of L0 with the locations and amount of the observed entries. In general, isomerism (noun
4 This statement means that the observed entries are concentrated around the main diagonal after a permutation of the sampling pattern Ω.
• We introduce a new hypothesis called the isomeric condition, which provably holds under the standard assumptions of uniform sampling, low-rankness and incoherence. In addition, we exemplify that the isomeric condition can hold even if the target matrix L0 is not incoherent and the missing entries are placed irregularly. Compared to the existing studies on nonuniform sampling, our setup is more general.
• Equipped with the isomeric condition, we prove that the exact solutions that identify L0 are included as critical points by the commonly used bilinear programs. Compared to the existing theories for nonconvex matrix completion, our theory is built upon a much weaker assumption and can therefore partially reveal the superiority of nonconvex programs over the convex methods based on (2).
• We prove that the isomeric condition is necessary and sufficient for the column and row projectors of L0 to be invertible given the sampling pattern Ω. This result implies that the isomeric condition is necessary for ensuring that the minimal-rank solution to (1) can identify the target L0.
The rest of this paper is organized as follows. Section 2 summarizes the mathematical notations used
in the paper. Section 3 introduces the proposed isomeric condition, along with some theorems for
matrix completion. Section 4 shows some empirical results and Section 5 concludes this paper. The
detailed proofs to all the proposed theorems are presented in the Supplementary Materials.
2 Notations
Capital and lowercase letters are used to represent matrices and vectors, respectively, except that the
lowercase letters, i, j, k, m, n, l, p, q, r, s and t, are used to denote some integers, e.g., the location of
an observation, the rank of a matrix, etc. For a matrix M , [M ]ij is its (i, j)th entry, [M ]i,: is its ith row
and [M ]:,j is its jth column. Let ω1 and ω2 be two 1D index sets; namely, ω1 = {i1 , i2 , · · · , ik } and
ω2 = {j1 , j2 , · · · , js }. Then [M ]ω1 ,: denotes the submatrix of M obtained by selecting the rows with
indices i1 , i2 , · · · , ik , [M ]:,ω2 is the submatrix constructed by choosing the columns j1 , j2 , · · · , js ,
and similarly for [M ]ω1 ,ω2 . For a 2D index set Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}, we imagine it
as a sparse matrix and, accordingly, define its “rows”, “columns” and “transpose” as follows: The
ith row Ωi = {j1 |(i1 , j1 ) ∈ Ω, i1 = i}, the jth column Ωj = {i1 |(i1 , j1 ) ∈ Ω, j1 = j} and the
transpose ΩT = {(j1 , i1 )|(i1 , j1 ) ∈ Ω}.
The special symbol (·)^+ is reserved to denote the Moore–Penrose pseudo-inverse of a matrix. More precisely, for a matrix M with Singular Value Decomposition (SVD)^5 M = U_M Σ_M V_M^T, its pseudo-inverse is given by M^+ = V_M Σ_M^{−1} U_M^T. For convenience, we adopt the conventions of using
span{M } to denote the linear space spanned by the columns of a matrix M , using y ∈ span{M } to
denote that a vector y belongs to the space span{M }, and using Y ∈ span{M } to denote that all the
column vectors of a matrix Y belong to span{M }.
Capital letters U , V , Ω and their variants (complements, subscripts, etc.) are reserved for left singular
vectors, right singular vectors and index set, respectively. For convenience, we shall abuse the
notation U (resp. V ) to denote the linear space spanned by the columns of U (resp. V ), i.e., the
column space (resp. row space). The orthogonal projection onto the column space U , is denoted by
PU and given by PU(M) = U U^T M, and similarly for the row space, PV(M) = M V V^T. The same
5 In this paper, SVD always refers to the skinny SVD. For a rank-r matrix M ∈ R^{m×n}, its SVD is of the form U_M Σ_M V_M^T, where U_M ∈ R^{m×r}, Σ_M ∈ R^{r×r} and V_M ∈ R^{n×r}.
3.1.1 Definitions
For the ease of understanding, we shall begin with a concept called k-isomerism (or k-isomeric in
adjective form), which could be regarded as an extension of low-rankness.
Definition 3.1 (k-isomeric). A matrix M ∈ Rm×l is called k-isomeric if and only if any k rows of
M can linearly represent all rows in M . That is,
rank ([M ]ω,: ) = rank (M ) , ∀ω ⊆ {1, 2, · · · , m}, |ω| = k,
where | · | is the cardinality of an index set.
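Since Definition 3.1 quantifies over all 1D index sets of cardinality k, it can be checked directly, albeit in time exponential in m. A brute-force sketch (the function name is ours, for illustration only):

```python
import numpy as np
from itertools import combinations

def is_k_isomeric(M, k):
    """Brute-force check of Definition 3.1: M is k-isomeric iff every
    k-row submatrix of M has the same rank as M itself.  The number of
    subsets is C(m, k), so this is only feasible for small matrices."""
    m = M.shape[0]
    r = np.linalg.matrix_rank(M)
    return all(
        np.linalg.matrix_rank(M[list(omega), :]) == r
        for omega in combinations(range(m), k)
    )
```

This reproduces the extreme example from the text: a rank-1 matrix with a single nonzero row is only m-isomeric, since any row subset missing the nonzero row has rank 0.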
In general, k-isomerism is somewhat similar to the Spark [37], which is the size of the smallest linearly dependent subset of the rows of a matrix. For a matrix M to be k-isomeric, it is necessary that rank(M) ≤ k, but not sufficient. In fact, k-isomerism is also related to the concept of
coherence [4, 21]. When the coherence of a matrix M ∈ R^{m×l} is not too high, the rows of M are sufficiently spread out, and thus M can be k-isomeric with a small k, e.g., k = rank(M). Whenever
the coherence of M is very high, one may need a large k to satisfy the k-isomeric property. For
example, consider an extreme case where M is a rank-1 matrix with one row being 1 and everywhere
else being 0. In this case, we need k = m to ensure that M is k-isomeric.
While Definition 3.1 involves all 1D index sets of cardinality k, we often need the isomeric property
to be associated with a certain 2D index set Ω. To this end, we define below a concept called
Ω-isomerism (or Ω-isomeric in adjective form).
Definition 3.2 (Ω-isomeric). Let M ∈ R^{m×l} and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Suppose that Ωj ≠ ∅ (the empty set), ∀1 ≤ j ≤ n. Then the matrix M is called Ω-isomeric if and only if
rank([M]Ωj,:) = rank(M), ∀j = 1, 2, · · · , n.
Note here that only the number of rows in M is required to be compatible with the row indices included in Ω; hence l ≠ n is allowed.
Generally, Ω-isomerism is less strict than k-isomerism. Provided that |Ωj| ≥ k, ∀1 ≤ j ≤ n, a matrix M being k-isomeric ensures that M is Ω-isomeric as well, but not vice versa. For the extreme example
where M is nonzero at only one row, interestingly, M can be Ω-isomeric as long as the locations of
the nonzero elements are included in Ω.
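Definition 3.2, by contrast, only inspects the n column index sets of Ω, so it can be verified in polynomial time given M and Ω. A small sketch (the function name and the input encoding of Ω as a set of (i, j) pairs are our own choices):

```python
import numpy as np

def is_omega_isomeric(M, Omega, n):
    """Check Definition 3.2: for every column j of the 2D index set
    Omega, the rows of M indexed by Omega^j = {i : (i, j) in Omega}
    must have the same rank as M itself.  Omega is a set of (i, j)
    pairs and n is the number of columns of the sampling pattern."""
    r = np.linalg.matrix_rank(M)
    for j in range(n):
        rows = sorted({i for (i, jj) in Omega if jj == j})
        if not rows:                  # Definition 3.2 requires Omega^j nonempty
            return False
        if np.linalg.matrix_rank(M[rows, :]) < r:
            return False
    return True
```

For the extreme example in the text (M nonzero at only one row), this confirms that M is Ω-isomeric exactly when every column of Ω contains the nonzero row.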
With the notation of ΩT = {(j1 , i1 )|(i1 , j1 ) ∈ Ω}, the isomeric property could be also defined on
the column vectors of a matrix, as shown in the following definition.
To solve Problem 1.1 without the imperfect assumption of missing at random, as will be shown later,
we need to assume that L0 is Ω/ΩT -isomeric. This condition has excluded the unidentifiable cases
where any rows or columns of L0 are wholly missing. In fact, whenever L0 is Ω/ΩT -isomeric, the
number of observed entries in each row and column of L0 has to be greater than or equal to the rank
of L0; this is consistent with the results in [20]. Moreover, Ω/ΩT-isomerism in fact handles well the cases where L0 has high coherence. For example, consider an extreme case where L0 is 1 at only
one element and 0 everywhere else. In this case, L0 cannot be Ω/ΩT -isomeric unless the nonzero
element is observed. So, generally, it is possible to restore the missing entries of a highly coherent
matrix, as long as the Ω/ΩT -isomeric condition is obeyed.
It is easy to see that the above lemma is still valid even when the condition of Ω-isomerism is replaced
by k-isomerism. Thus, hereafter, we may say that a space is isomeric (k-isomeric, Ω-isomeric or
ΩT -isomeric) as long as its basis matrix is isomeric. In addition, the isomeric property is subspace
successive, as shown in the next lemma.
Lemma 3.2. Let Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n} and U0 ∈ Rm×r be the basis matrix of a
Euclidean subspace embedded in Rm . Suppose that U is a subspace of U0 , i.e., U = U0 U0T U . If U0
is Ω-isomeric then U is Ω-isomeric as well.
In short, the above lemma states that any subspace of an isomeric space is itself isomeric.
3.2 Results
In this subsection, we shall show how the isomeric condition can take effect in the context of
nonuniform sampling, establishing some theorems pertaining to missing data recovery [35] as well
as matrix completion.
Unlike the theory in [35], whose condition is unverifiable, our k-isomeric condition can be verified in finite time. Notice that the problem of missing data recovery is closely related to matrix completion, which in effect restores the missing entries in multiple data vectors simultaneously.
Hence, Theorem 3.3 can be naturally generalized to the case of matrix completion, as will be shown
in the next subsection.
Theorem 3.4 tells us that, in general, even when the locations of the missing entries are interrelated
and nonuniformly distributed, the target matrix L0 can be restored as long as we have found a proper
dictionary A. This motivates us to consider the commonly used bilinear program that seeks both A
and X simultaneously:
min_{A,X} (1/2)‖A‖_F^2 + (1/2)‖X‖_F^2, s.t. PΩ(AX − L0) = 0, (8)
where A ∈ Rm×p and X ∈ Rp×n . The problem above is bilinear and therefore nonconvex. So, it
would be hard to obtain a strong performance guarantee as done in the convex programs, e.g., [4, 21].
Interestingly, under a very mild condition, the problem in (8) is proved to include the exact solutions
that identify the target matrix L0 as the critical points.
Theorem 3.5. Let L0 ∈ Rm×n and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Denote the rank and SVD
of L0 as r0 and U0 Σ0 V0T , respectively. If L0 is Ω/ΩT -isomeric then the exact solution, denoted by
(A0 , X0 ) and given by
A0 = U0 Σ0^{1/2} Q^T, X0 = Q Σ0^{1/2} V0^T, ∀Q ∈ R^{p×r0}, Q^T Q = I,
is a critical point to the problem in (8).
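The construction in Theorem 3.5 can be sanity-checked numerically: the pair (A0, X0) is feasible for any Ω (since A0 X0 = L0 exactly), and its objective value in (8) equals ‖L0‖∗, consistent with the variational characterization of the nuclear norm proven in [16]. The check below verifies only the stated construction, not criticality itself; the dimensions are arbitrary small values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r0, p = 8, 6, 2, 4
L0 = rng.normal(size=(m, r0)) @ rng.normal(size=(r0, n))   # rank-r0 target

# Skinny SVD of L0: keep only the r0 nonzero singular values.
U0, s0, V0t = np.linalg.svd(L0, full_matrices=False)
U0, s0, V0t = U0[:, :r0], s0[:r0], V0t[:r0, :]

# Any Q in R^{p x r0} with Q^T Q = I works; take one from a QR factorization.
Q = np.linalg.qr(rng.normal(size=(p, r0)))[0]

A0 = U0 @ np.diag(np.sqrt(s0)) @ Q.T                       # m x p
X0 = Q @ np.diag(np.sqrt(s0)) @ V0t                        # p x n

assert np.allclose(A0 @ X0, L0)   # feasible for any sampling pattern Omega
obj = 0.5 * np.linalg.norm(A0, "fro") ** 2 + 0.5 * np.linalg.norm(X0, "fro") ** 2
assert np.isclose(obj, s0.sum())  # objective of (8) equals ||L0||_*
```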
To exhibit the power of program (8), however, the parameter p, which indicates the number of
columns in the dictionary matrix A, must be close to the true rank of the target matrix L0 . This is
[Figure 2: four panels plotting success rate against rank(L0); axis ticks omitted.]
Figure 2: Comparing the bilinear program (9) (p = m) with the convex method (2). The numbers plotted in the figures are the success rates over 20 random trials. The white and black points mean "succeed" and "fail", respectively. Here a trial counts as a success if PSNR ≥ 40dB, where PSNR stands for peak signal-to-noise ratio.
impractical in the cases where the rank of L0 is unknown. Notice that the Ω-isomeric condition imposed on A requires
rank(A) ≤ |Ωj|, ∀j = 1, 2, · · · , n.
This, together with the condition L0 ∈ span{A}, essentially requires solving a low-rank matrix recovery problem [14]. Hence, we suggest combining the formulation (7) with the popular idea of
nuclear norm minimization, resulting in a bilinear program that jointly estimates both the dictionary
matrix A and the representation matrix X by
min_{A,X} ‖A‖∗ + (1/2)‖X‖_F^2, s.t. PΩ(AX − L0) = 0, (9)
which, by coincidence, has been mentioned in a paper about optimization [32]. Similar to (8), the
program in (9) has the following theorem to guarantee its performance.
Theorem 3.6. Let L0 ∈ Rm×n and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Denote the rank and SVD
of L0 as r0 and U0 Σ0 V0T , respectively. If L0 is Ω/ΩT -isomeric then the exact solution, denoted by
(A0 , X0 ) and given by
A0 = U0 Σ0^{2/3} Q^T, X0 = Q Σ0^{1/3} V0^T, ∀Q ∈ R^{p×r0}, Q^T Q = I,
is a critical point to the problem in (9).
Unlike (8), which possesses superior performance only if p is close to rank (L0 ) and the initial
solution is chosen carefully, the bilinear program in (9) can work well by simply choosing p = m
and using A = I as the initial solution. To see why, one essentially needs to figure out the conditions
under which a specific optimization procedure can produce an optimal solution that meets an exact
solution. This requires extensive justifications and we leave it as future work.
4 Simulations
To verify the superiority of the nonconvex matrix completion methods over the convex program (2), we experiment with randomly generated matrices. We generate a collection of m × n
(m = n = 100) target matrices according to the model of L0 = BC, where B ∈ Rm×r0 and
C ∈ Rr0 ×n are N (0, 1) matrices. The rank of L0 , i.e., r0 , is configured as r0 = 1, 5, 10, · · · , 90, 95.
Regarding the index set Ω consisting of the locations of the observed entries, we consider two settings: one creates Ω by using a Bernoulli model to randomly sample a subset from {1, · · · , m} × {1, · · · , n} (referred to as "uniform"); the other, as in Figure 1, concentrates the locations of the observed entries around the main diagonal of a matrix (referred to as "nonuniform"). The observation fraction is set to |Ω|/(mn) = 0.01, 0.05, · · · , 0.9, 0.95. For each
pair of (r0 , |Ω|/(mn)), we run 20 trials, resulting in 8000 simulations in total.
When p = m and the identity matrix is used to initialize the dictionary A, we have empirically found
that program (8) has the same performance as (2). This is not strange, because it has been proven in [16] that ‖L‖∗ = min_{A,X} (1/2)(‖A‖_F^2 + ‖X‖_F^2), s.t. L = AX. Figure 2 compares the bilinear
Acknowledgment
We would like to thank the anonymous reviewers and meta-reviewers for providing many valuable comments that helped us refine this paper.
References
[1] Emmanuel Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[2] Emmanuel Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
[3] William E. Bishop and Byron M. Yu. Deterministic symmetric positive semidefinite matrix completion.
In Neural Information Processing Systems, pages 2762–2770, 2014.
[4] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations
of Computational Mathematics, 9(6):717–772, 2009.
[5] Eyal Heiman, Gideon Schechtman, and Adi Shraibman. Deterministic algorithms for matrix completion.
Random Structures and Algorithms, 45(2):306–317, 2014.
[6] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries.
IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
[7] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries.
Journal of Machine Learning Research, 11:2057–2078, 2010.
[8] Akshay Krishnamurthy and Aarti Singh. Low-rank matrix and tensor completion via adaptive sampling.
In Neural Information Processing Systems, pages 836–844, 2013.
[9] Troy Lee and Adi Shraibman. Matrix completion from any given set of observations. In Neural Information
Processing Systems, pages 1781–1787, 2013.
[10] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning
large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.
[11] Karthik Mohan and Maryam Fazel. New restricted isometry results for noisy low-rank recovery. In IEEE
International Symposium on Information Theory, pages 1573–1577, 2010.
arXiv:1501.06521v3 [cs.LG] 18 Feb 2016
Abstract
In the noisy tensor completion problem we observe m entries (whose locations are chosen uniformly at random) from an unknown n1 × n2 × n3 tensor T. We assume that T is entry-wise close to being rank r. Our goal is to fill in its missing entries using as few observations as possible. Let n = max(n1, n2, n3). We show that if m = n^{3/2} r then there is a polynomial time algorithm based on the sixth level of the sum-of-squares hierarchy for completing it. Our estimate agrees with almost all of T's entries almost exactly and works even when our observations are corrupted by noise. This is also the first algorithm for tensor completion that works in the overcomplete case when r > n, and in fact it works all the way up to r = n^{3/2−ε}.
Our proofs are short and simple and are based on establishing a new connection between noisy tensor
completion (through the language of Rademacher complexity) and the task of refuting random constant
satisfaction problems. This connection seems to have gone unnoticed even in the context of matrix
completion. Furthermore, we use this connection to show matching lower bounds. Our main technical
result is in characterizing the Rademacher complexity of the sequence of norms that arise in the sum-of-
squares relaxations to the tensor nuclear norm. These results point to an interesting new direction: Can
we explore computational vs. sample complexity tradeoffs through the sum-of-squares hierarchy?
∗ Harvard John A. Paulson School of Engineering and Applied Sciences. Email: b@boazbarak.org
† Massachusetts Institute of Technology, Department of Mathematics and the Computer Science and Artificial Intelligence Lab. Email: moitra@mit.edu. This work is supported in part by a grant from the MIT NEC Corporation and a Google Research Award.
There are extensions to non-uniform sampling models [55, 24], as well as various efficiency improvements
[47, 40]. What is particularly remarkable about these guarantees is that the number of observations needed
is within a logarithmic factor of the number of parameters — (n1 + n2 )r — that define the model.
In fact, there are benefits to working with even higher-order structure but so far there has been little
progress on natural extensions to the tensor setting. To motivate this problem, consider the Groupon
Problem (which we introduce here to illustrate this point) where the goal is to predict user-activity ratings.
The challenge is that which activities we should recommend (and how much a user liked a given activity)
depends on time as well — weekday/weekend, day/night, summer/fall/winter/spring, etc. or even some
combination of these. As above, we can cast this problem as a large, partially observed tensor where the
first index represents a user, the second index represents an activity and the third index represents the time
period. It is again natural to model it as being close to low rank, under the assumption that a much smaller
number of (latent) factors about the interests of the user, the type of activity and the time period should
contribute to the rating. How many entries of the tensor do we need to observe in order to fill in its missing
entries? This problem is emblematic of a larger issue: Can we always solve linear inverse problems when
the number of observations is comparable to the number of parameters in the mode, or is computational
intractability an obstacle?
In fact, one of the advantages of working with tensors is that their decompositions are unique in important
ways that matrix decompositions are not. There has been a groundswell of recent work that uses tensor
decompositions for exactly this reason for parameter learning in phylogenetic trees [60], HMMs [60], mixture
models [46], topic models [2] and to solve community detection [3]. In these applications, one assumes access
to the entire tensor (up to some sampling noise). But given that the underlying tensors are low-rank, can
we observe fewer of their entries and still utilize tensor methods?
A wide range of approaches to solving tensor completion have been proposed [56, 35, 70, 73, 61, 52, 48, 14, 74]. However, in terms of provable guarantees, none¹ of them improves upon the following naïve algorithm. If the unknown tensor T is n1 × n2 × n3, we can treat it as a collection of n1 matrices, each of size n2 × n3. It
1 Most of the existing approaches rely on computing the tensor nuclear norm, which is hard to compute [39, 41]. The only
other algorithms we are aware of [48, 14] require that the factors be orthogonal. This is a rather strong assumption. First,
orthogonality requires the rank to be at most n. Second, even when r ≤ n, most tensors need to be “whitened” to be put in this
form and then a random sample from the “whitened” tensor would correspond to a (dense) linear combination of the entries of
the original tensor, which would be quite a different sampling model.
observations. Moreover, our algorithm works even when the observations are corrupted by noise. When n = n1 = n2 = n3, this amounts to about n^{1/2} r observations per slice, which is much smaller than what we would need to apply matrix completion on each slice separately. Our algorithm needs to leverage the structure between the various slices.
where σℓ is a scalar and aℓ, bℓ and cℓ are vectors of length n1, n2 and n3 respectively. Here ∆ is a tensor that represents noise. Its entries can be thought of as representing model misspecification, because T is not exactly low rank, or noise in our observations, or both. We will only make assumptions about the average and maximum absolute value of the entries in ∆. The vectors aℓ, bℓ and cℓ are called factors, and we will assume that their norms are roughly √n_i for reasons that will become clear later. Moreover, we will assume that the magnitude of each of their entries is bounded by C, in which case we call the vectors C-incoherent². (Note that a random vector of dimension n_i and norm √n_i will be O(√(log n_i))-incoherent with high probability.) The advantage of these conventions is that a typical entry in T does not become vanishingly small as we increase the dimensions of the tensor. This will make it easier to state and interpret the error bounds of our algorithm.
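To make these conventions concrete, here is a small sketch that samples a noiseless tensor of the form (1) with N(0,1) factors (whose rows then have norm ≈ √n_i, as in the text) and reports its empirical incoherence; the function name and parameters are our own choices for illustration.

```python
import numpy as np

def low_rank_tensor(n1, n2, n3, r, seed=0):
    """Sample a noiseless tensor of the form (1),
        T = sum_l sigma_l * (a_l outer b_l outer c_l),
    with N(0,1) coefficients and factors, and return T together with
    the largest factor entry in absolute value (the empirical C)."""
    rng = np.random.default_rng(seed)
    sigma = rng.normal(size=r)
    a = rng.normal(size=(r, n1))      # each row has norm ~ sqrt(n1)
    b = rng.normal(size=(r, n2))
    c = rng.normal(size=(r, n3))
    T = np.einsum("l,li,lj,lk->ijk", sigma, a, b, c)
    C = max(np.abs(a).max(), np.abs(b).max(), np.abs(c).max())
    return T, C
```

With these scalings, the entries of T stay O(√r) rather than shrinking as the dimensions grow, which is exactly the point of the convention above.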
Let Ω represent the locations of the entries that we observe, which (as is standard) are chosen uniformly
at random and without replacement. Set |Ω| = m. Our goal is to output a hypothesis X that has small
entry-wise error, defined as:
err(X) = (1/(n1 n2 n3)) Σ_{i,j,k} |X_{i,j,k} − T_{i,j,k}|.
This measures the error on both the observed and unobserved entries of T. Our goal is to give algorithms that achieve vanishing error as the size of the problem increases. Moreover, we will want algorithms that need as few observations as possible. Here and throughout, let n1 ≤ n2 ≤ n3 and n = max{n1, n2, n3}. Our main result is:
main result is:
Theorem 1.1 (Main theorem). Suppose we are given m observations whose locations are chosen uniformly at random (and without replacement) from a tensor T of the form (1), where each of the factors aℓ, bℓ and cℓ is C-incoherent. Let δ = (1/(n1 n2 n3)) Σ_{i,j,k} |∆_{i,j,k}|, and let r* = Σ_{ℓ=1}^{r} |σℓ|. Then there is a polynomial time algorithm that outputs a hypothesis X that with probability 1 − ε satisfies
err(X) ≤ 4C³ r* √( ((n1)^{1/2} (n2 + n3) log⁴ n + log(2/ε)) / m ) + 2δ.
2 Incoherence is often defined based on the span of the factors, but we will allow the number of factors to be larger than any of the dimensions of the tensor, so we will need an alternative way to ensure that the non-zero entries of the factors are spread out.
Since the error bound above is quite involved, let us dissect the terms in it. In fact, having an additive
δ in the error bound is unavoidable. We have not assumed anything about ∆ in (1) except a bound on
the average and maximum magnitude of its entries. If ∆ were a random tensor whose entries are +δ and
−δ then no matter how many entries of T we observe, we cannot hope to obtain error less than δ on the
unobserved entries³. The crucial point is that the remaining term in the error bound becomes o(1) when m = Ω̃((r*)² n^{3/2}), which for polylogarithmic r* improves over the naïve algorithm for tensor completion by a polynomial factor in terms of the number of observations. Moreover, our algorithm works without any constraints that the factors aℓ, bℓ and cℓ be orthogonal or even have low inner-product.
In non-degenerate cases we can even remove another factor of r* from the number of observations we need. Suppose that T is a tensor as in (1), but let the σℓ be Gaussian random variables with mean zero and variance one. The factors aℓ, bℓ and cℓ are still fixed, but because of the randomness in the coefficients σℓ, the entries of T are now random variables.
Corollary 1.2. Suppose we are given m observations whose locations are chosen uniformly at random (and without replacement) from a tensor T of the form (1), where each coefficient σℓ is a Gaussian random variable with mean zero and variance one, and each of the factors aℓ, bℓ and cℓ is C-incoherent. Further, suppose that for a 1 − o(1) fraction of the entries of T we have var(T_{i,j,k}) ≥ r/polylog(n) = V, and that ∆ is a tensor where each entry is a Gaussian with mean zero and variance o(V). Then there is a polynomial time algorithm that outputs a hypothesis X that satisfies
Xi,j,k = (1 ± o(1)) Ti,j,k
for a 1 − o(1) fraction of the entries. The algorithm succeeds with probability at least 1 − o(1) over the randomness of the locations of the observations, and the realizations of the random variables σℓ and the entries of ∆. Moreover, the algorithm uses m = C⁶ n^{3/2} r polylog(n) observations.
In the setting above, it is enough that the coefficients σℓ are random and that the non-zero entries in the factors are spread out to ensure that the typical entry in T has variance about r. Consequently, the typical entry in T is about √r. This fact combined with the error bounds in Theorem 1.1 immediately yields the above corollary. Remarkably, the guarantee is interesting even when r = n^{3/2−ε} (the so-called overcomplete
case). In this setting, if we observe a subpolynomial fraction of the entries of T we are able to recover almost
all of the remaining entries almost entirely, even though there are no known algorithms for decomposing
an overcomplete, third-order tensor even if we are given all of its entries, at least without imposing much
stronger conditions that the factors be nearly orthogonal [36].
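To make the scaling concrete, here is a small numerical sketch (illustrative, not from the paper) of the random model in Corollary 1.2: with 1-incoherent factors (entries ±1, norm √n) and Gaussian coefficients σ_ℓ, the typical entry of T has variance about r, hence magnitude about √r.

```python
import numpy as np

# Sketch of the random model: T = sum_l sigma_l * a_l ⊗ b_l ⊗ c_l,
# with sign factors (C = 1 incoherent: entries ±1, norm sqrt(n))
# and Gaussian coefficients sigma_l. All sizes are illustrative.
rng = np.random.default_rng(0)
n, r = 20, 50

A = rng.choice([-1.0, 1.0], size=(r, n))
B = rng.choice([-1.0, 1.0], size=(r, n))
C = rng.choice([-1.0, 1.0], size=(r, n))
sigma = rng.standard_normal(r)

# T[i,j,k] = sum_l sigma_l * A[l,i] * B[l,j] * C[l,k]
T = np.einsum('l,li,lj,lk->ijk', sigma, A, B, C)

# Each entry is a sum of r independent ±sigma_l terms, so its
# variance is r and its typical magnitude is about sqrt(r).
mean_sq = (T ** 2).mean()
print(mean_sq, r)  # mean squared entry is on the order of r
```

The point of the sketch is only the scaling: the average of T² concentrates near r, so thresholding errors at the √r scale (as in Corollary 1.2) is meaningful.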
We believe that this work is a natural first step in designing practically efficient algorithms for tensor
completion. Our algorithms manage to leverage the structure across the slices through the tensor, instead
of treating each slice as an independent matrix completion problem. Now that we know this is possible,
a natural follow-up question is to get more efficient algorithms. Our algorithms are based on the sixth
level of the sum-of-squares hierarchy and run in polynomial time, but are quite far from being practically
efficient as stated. Recent work of Hopkins et al. [44] shows how to speed up sum-of-squares and obtain
nearly linear time algorithms for a number of problems where the only previously known algorithms ran
in prohibitively large polynomial time. Another approach would be to obtain similar
guarantees for alternating minimization. Currently, the only known approaches [48] require that the factors
are orthonormal and only work in the undercomplete case. Finally, it would be interesting to get algorithms
that recover a low rank tensor exactly when there is no noise.
3 The factor of 2 is not important, and comes from needing a bound on the empirical error of how well the low rank part of
T itself agrees with our observations so far. We could replace it with any other constant factor that is larger than 1.
Organization
In Section 2 we introduce Rademacher complexity, the tensor nuclear norm and strong refutation. We
connect these concepts by showing that any norm that can be computed in polynomial time and has good
Rademacher complexity yields an algorithm for strongly refuting random 3-SAT. In Section 3 we show
how a particular algorithm for strong refutation can be embedded into the sum-of-squares hierarchy and
directly leads to a norm that can be computed in polynomial time and has good Rademacher complexity.
Recall that err(X) is the average entry-wise error between X and T, over all (observed and unobserved)
entries. Also recall that among the candidate X's that have low empirical error, the convex program finds
the one that minimizes ‖X‖_K for some polynomial time computable norm. The way we will choose the norm
‖ · ‖_K and our bound on the maximum magnitude of an entry of ∆ will guarantee that the low rank part
of T will with high probability be a feasible solution. This ensures that ‖X‖_K for the X we find is not too
large either. One way to bound err(X) is to show that no hypothesis in the unit norm ball can have too
large a gap between its error and its empirical error (and then dilate the unit norm ball so that it contains
X). With this in mind, we define:
Definition 2.2. For a norm ‖ · ‖_K and a set Ω of observations, the generalization error is

sup_{‖X‖_K ≤ 1} ( err(X) − emp-err(X) )
It turns out that one can bound the generalization error via the Rademacher complexity.
Definition 2.3. Let Ω = {(i₁, j₁, k₁), (i₂, j₂, k₂), ..., (i_m, j_m, k_m)} be a set of m locations chosen uniformly
at random (and without replacement) from [n₁] × [n₂] × [n₃], and let σ₁, σ₂, ..., σ_m be independent random
±1 variables. The Rademacher complexity of (the unit ball of) the norm ‖ · ‖_K is defined as

R^m(‖ · ‖_K) = E_{Ω,σ} [ sup_{‖X‖_K ≤ 1} | (1/m) Σ_{ℓ=1}^m σ_ℓ X_{i_ℓ,j_ℓ,k_ℓ} | ]
It follows from a standard symmetrization argument from empirical process theory [51, 11] that the
Rademacher complexity does indeed bound the generalization error.
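As a toy illustration (not from the paper), the Rademacher complexity of a simple norm ball can be estimated by Monte Carlo. The tensor norms studied here have no closed-form supremum, so we use the entrywise ℓ1 ball instead, for which the supremum of a linear functional is just the largest absolute signed count landing on any entry.

```python
import numpy as np

# Monte Carlo estimate of the Rademacher complexity in Definition 2.3
# for a toy norm: the entrywise l1 ball {X : sum |X_ijk| <= 1}.
# For a linear functional, the sup over the l1 ball is the max absolute
# coefficient, i.e. max over entries of |sum of the sigmas landing there|.
rng = np.random.default_rng(1)
n1 = n2 = n3 = 10
m = 200
trials = 300

vals = []
for _ in range(trials):
    # m locations with replacement (negligibly different when m << n^3)
    locs = rng.integers(0, [n1, n2, n3], size=(m, 3))
    sigma = rng.choice([-1.0, 1.0], size=m)
    counts = np.zeros((n1, n2, n3))
    np.add.at(counts, (locs[:, 0], locs[:, 1], locs[:, 2]), sigma)
    vals.append(np.abs(counts).max() / m)

R_hat = np.mean(vals)
print(R_hat)  # small: the l1 ball has very low Rademacher complexity
```

The estimate is tiny because the ℓ1 ball is very small; the interesting norms in this paper (tensor nuclear norm, SOS norms) have much larger unit balls, and the whole game is bounding their complexity.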
Theorem 2.4. Let δ ∈ (0, 1) and suppose each X with ‖X‖_K ≤ 1 has bounded loss — i.e. |X_{i,j,k} − T_{i,j,k}| ≤ a
— and that the locations (i, j, k) are chosen uniformly at random and without replacement. Then with probability
at least 1 − δ, for every X with ‖X‖_K ≤ 1, we have

err(X) ≤ emp-err(X) + 2 R^m(‖ · ‖_K) + 2a √( ln(1/δ) / m )
where the last line follows by the concavity of sup(·). Now we can use the Rademacher (random ±1) variables
{σ_ℓ}_ℓ and rewrite the right hand side of the above expression as follows:

(∗) ≤ E_{Ω,Ω′,σ} [ sup_{‖X‖_K ≤ 1} (1/m) Σ_{ℓ=1}^m σ_ℓ ( |X_{i_ℓ,j_ℓ,k_ℓ} − T_{i_ℓ,j_ℓ,k_ℓ}| − |X_{i′_ℓ,j′_ℓ,k′_ℓ} − T_{i′_ℓ,j′_ℓ,k′_ℓ}| ) ]

≤ E_{Ω,Ω′,σ} [ sup_{‖X‖_K ≤ 1} (1/m) Σ_{ℓ=1}^m σ_ℓ |X_{i_ℓ,j_ℓ,k_ℓ} − T_{i_ℓ,j_ℓ,k_ℓ}| + (1/m) Σ_{ℓ=1}^m σ_ℓ |X_{i′_ℓ,j′_ℓ,k′_ℓ} − T_{i′_ℓ,j′_ℓ,k′_ℓ}| ]

≤ 2 E_{Ω,σ} [ sup_{‖X‖_K ≤ 1} (1/m) Σ_{ℓ=1}^m σ_ℓ |X_{i_ℓ,j_ℓ,k_ℓ} − T_{i_ℓ,j_ℓ,k_ℓ}| ]

≤ 2 E_{Ω,σ} [ sup_{‖X‖_K ≤ 1} (1/m) Σ_{ℓ=1}^m σ_ℓ ( |X_{i_ℓ,j_ℓ,k_ℓ}| + |T_{i_ℓ,j_ℓ,k_ℓ}| ) ]

≤ 2 E_{Ω,σ} [ (1/m) Σ_{ℓ=1}^m σ_ℓ |T_{i_ℓ,j_ℓ,k_ℓ}| ] + 2 E_{Ω,σ} [ sup_{‖X‖_K ≤ 1} (1/m) Σ_{ℓ=1}^m σ_ℓ |X_{i_ℓ,j_ℓ,k_ℓ}| ]

= 2 E_{Ω,σ} [ (1/m) Σ_{ℓ=1}^m σ_ℓ T_{i_ℓ,j_ℓ,k_ℓ} ] + 2 E_{Ω,σ} [ sup_{‖X‖_K ≤ 1} (1/m) Σ_{ℓ=1}^m σ_ℓ X_{i_ℓ,j_ℓ,k_ℓ} ]

where the second, fourth and fifth inequalities use the triangle inequality. The equality uses the fact that the
σ_ℓ's are random signs and hence can absorb the absolute value around the terms that they multiply.
second term above in the last expression is exactly the Rademacher complexity that we defined earlier. This
argument only shows that the Rademacher complexity bounds the expected generalization error. However
it turns out that we can also use the Rademacher complexity to bound the generalization error with high
probability by applying McDiarmid’s inequality. See for example [5]. We also remark that generalization
bounds are often stated in the setting where samples are drawn i.i.d., but here the locations of our observations
are sampled without replacement. Nevertheless for the settings of m we are interested in, the fraction of
our observations that are repeats is o(1) — in fact it is subpolynomial — and we can move back and forth
between both sampling models at negligible loss in our bounds.
In much of what follows it will be convenient to think of Ω = {(i₁, j₁, k₁), (i₂, j₂, k₂), ..., (i_m, j_m, k_m)} and
{σ_ℓ}_ℓ as being represented by a sparse tensor Z, defined below.

Definition 2.5. Let Z be an n₁ × n₂ × n₃ tensor such that

Z_{i,j,k} = 0 if (i, j, k) ∉ Ω, and Z_{i,j,k} = Σ_{ℓ : (i,j,k)=(i_ℓ,j_ℓ,k_ℓ)} σ_ℓ otherwise.
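A direct sketch of this construction (illustrative sizes, not from the paper):

```python
import numpy as np

# Build the sparse sign tensor Z of Definition 2.5 from observed
# locations and Rademacher signs. Sizes are illustrative.
rng = np.random.default_rng(2)
n1, n2, n3, m = 8, 9, 10, 40

# m locations sampled without replacement from [n1] x [n2] x [n3]
flat = rng.choice(n1 * n2 * n3, size=m, replace=False)
locs = np.stack(np.unravel_index(flat, (n1, n2, n3)), axis=1)
sigma = rng.choice([-1.0, 1.0], size=m)

Z = np.zeros((n1, n2, n3))
# locations are distinct, so each observed entry is just its sign
Z[locs[:, 0], locs[:, 1], locs[:, 2]] = sigma

# Z is zero off the observed set and ±1 on it
print(int(np.count_nonzero(Z)))  # m
```

Because the locations are sampled without replacement, no two signs ever land on the same entry; with replacement one would instead accumulate the signs, as in the definition.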
The tensor nuclear norm⁴ of X, which is denoted by ‖X‖_A, is the infimum over α such that X/α ∈ A.
In particular ‖T − ∆‖_A ≤ r∗. Finally we give an elementary bound on the Rademacher complexity of the
tensor nuclear norm. Recall that n = max(n₁, n₂, n₃).

Lemma 2.8. R^m(‖ · ‖_A) = O( C³ √(n/m) )
Proof. Recall the definition of Z given in Definition 2.5. With this we can write

E_{Ω,σ} [ sup_{‖X‖_A ≤ 1} Σ_{ℓ=1}^m σ_ℓ X_{i_ℓ,j_ℓ,k_ℓ} ] = E_{Ω,σ} [ sup_{C-incoherent a,b,c} |⟨Z, a ⊗ b ⊗ c⟩| ]
We can now adapt the discretization approach in [33], although our task is considerably simpler because
we are constrained to C-incoherent a's. In particular, let

S = { a | a is C-incoherent and a ∈ (ϵ/√n) Zⁿ }

By standard bounds on the size of an ϵ-net [58], we get that |S| ≤ O(C/ϵ)ⁿ. Suppose that |⟨Z, a ⊗ b ⊗ c⟩| ≤ M
for all a, b, c ∈ S. Then for an arbitrary, but C-incoherent a we can expand it as a = Σ_i ϵⁱ a_i where each
a_i ∈ S, and similarly for b and c. And now

|⟨Z, a ⊗ b ⊗ c⟩| ≤ Σ_i Σ_j Σ_k ϵ^{i+j+k} |⟨Z, a_i ⊗ b_j ⊗ c_k⟩| ≤ (1 − ϵ)⁻³ M

Moreover since each entry in a ⊗ b ⊗ c has magnitude at most C³ we can apply a Chernoff bound to conclude
that for any particular a, b, c ∈ S we have

|⟨Z, a ⊗ b ⊗ c⟩| ≤ O( C³ √( m log 1/γ ) )
4 The usual definition of the tensor nuclear norm has no constraints that the vectors a, b and c be C-incoherent. However,
adding this additional requirement only serves to further restrict the unit norm ball, while ensuring that the low rank part of T
(when scaled down) is still in it, since the factors of T are anyways assumed to be C-incoherent. This makes it easier to prove
recovery guarantees because we do not need to worry about sparse vectors behaving very differently than incoherent ones, and
since we are not going to compute this norm anyways this modification will make our analysis easier.
R^m(‖ · ‖_A) ≤ ( (1 − ϵ)⁻³ / m ) max_{a,b,c∈S} |⟨Z, a ⊗ b ⊗ c⟩| = O( C³ √(n/m) )

and this completes the proof.
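The Chernoff step above can be seen numerically (an illustrative sketch, not from the paper): for fixed sign vectors a, b, c (1-incoherent), the inner product ⟨Z, a ⊗ b ⊗ c⟩ is a sum of m bounded random signs and so concentrates at scale √m rather than m.

```python
import numpy as np

# For fixed sign vectors a, b, c (1-incoherent: entries ±1, norm sqrt(n))
# and the random sign tensor Z with m nonzeros, <Z, a⊗b⊗c> is a sum of
# m independent ±1 terms, hence of typical size sqrt(m), not m.
rng = np.random.default_rng(3)
n, m = 15, 2000  # m <= n^3 observed locations

a = rng.choice([-1.0, 1.0], size=n)
b = rng.choice([-1.0, 1.0], size=n)
c = rng.choice([-1.0, 1.0], size=n)

flat = rng.choice(n ** 3, size=m, replace=False)
i, j, k = np.unravel_index(flat, (n, n, n))
sigma = rng.choice([-1.0, 1.0], size=m)

# <Z, a⊗b⊗c> = sum over observed locations of sigma_l * a_i b_j c_k
inner = np.sum(sigma * a[i] * b[j] * c[k])
print(abs(inner), np.sqrt(m))  # |inner| is O(sqrt(m)) with high probability
```

Taking a union bound over the |S|³ net points then gives the M = O(C³√(mn)) bound used in the proof.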
The important point is that the Rademacher complexity of the tensor nuclear norm is o(1) whenever
m = ω(n). In the next subsection we will connect this to refutation in a way that allows us to strengthen
known hardness results for computing the tensor nuclear norm [39, 41] and show that it is even hard to
compute in an average-case sense based on some standard conjectures about the difficulty of refuting random
3-SAT.
The right hand side is exactly alg(φ) and is 1/2 + o(1) with high probability, which implies that both
conditions in the definition of strong refutation hold, and this completes the proof.
We can now combine Theorem 2.11 with the bound on the Rademacher complexity of the tensor nuclear
norm given in Lemma 2.8 to conclude that if we could compute the tensor nuclear norm we would also obtain
an algorithm for strongly refuting random 3-XOR with only m = Ω(n log n) clauses. It is not obvious, but
it turns out that any algorithm for strongly refuting random 3-XOR implies one for 3-SAT. Let us define
strong refutation for 3-SAT. We will refer to any variable v_i or its negation v̄_i as a literal. We will use the
term random 3-SAT formula to refer to a formula where each clause is generated by choosing an ordered
triple of literals (y_i, y_j, y_k) uniformly at random (and without replacement) and setting y_i ∨ y_j ∨ y_k = 1.
Definition 2.12. An algorithm for strongly refuting random 3-SAT takes as input a 3-SAT formula φ and
outputs a quantity alg(φ) that satisfies

(1) opt(φ) ≤ alg(φ) for any 3-SAT formula φ, and

(2) if φ is a random 3-SAT formula, then alg(φ) = 7/8 + o(1) with probability 1 − o(1),

where opt(φ) denotes the maximum fraction of clauses of φ that can be satisfied simultaneously.
Corollary 2.13. Suppose that ‖ · ‖_K is computable in polynomial time and satisfies ‖X‖_K ≤ 1 whenever
X = a ⊗ a ⊗ a and a is a vector with ±1 entries. Suppose further that for any X with ‖X‖_K ≤ 1 its entries
are bounded by C³ in absolute value and that R^m(‖ · ‖_K) = o(1). Then there is a polynomial time algorithm
for strongly refuting a random 3-SAT formula with O(C⁶ m log n) clauses.
Now we can get a better understanding of the obstacles to noisy tensor completion by connecting it to the
literature on refuting random 3-SAT. Despite a long line of work on refuting random 3-SAT [37, 32, 31, 30, 25],
there is no known polynomial time algorithm that works with m = n^{3/2−ϵ} clauses for any ϵ > 0. Feige [29]
conjectured that for any constant C, there is no polynomial time algorithm for refuting random 3-SAT with
m = Cn clauses⁵. Daniely et al. [26] conjectured that there is no polynomial time algorithm for m = n^{3/2−ϵ}
for any ϵ > 0. What we have shown above is that any norm that is a relaxation of the tensor nuclear
norm, can be computed in polynomial time, and has Rademacher complexity R^m(‖ · ‖_K) = o(1) for
m = n^{3/2−ϵ} would disprove the conjecture of Daniely et al. [26] and would yield much better algorithms for
refuting random 3-SAT than we currently know, despite fifteen years of work on the subject.

⁵ In Feige's paper [29] there was no need to make the conjecture any stronger because it was already strong enough for all of
the applications in inapproximability.
(1) Ẽ[1] = 1 (normalization)

The SOS_k norm of X ∈ R^{n₁×n₂×n₃}, which is denoted by ‖X‖_{K_k}, is the infimum over α such that X/α ∈ K_k.
The constraints in Definition 3.1 can be expressed as an O(n^k)-sized semidefinite program. This implies
that given any set of polynomial constraints of the form {p = 0}, {p ≥ 0}, one can efficiently find a degree
k pseudo-distribution satisfying those constraints if one exists. This is often called the degree k Sum-of-
Squares algorithm [69, 62, 53, 63]. Hence we can compute the norm ‖X‖_{K_k} of any tensor X to within
arbitrary accuracy in polynomial time. And because it is a relaxation of the tensor nuclear norm, which is
defined analogously but over a distribution on C-incoherent vectors instead of a pseudo-distribution over
them, we have that ‖X‖_{K_k} ≤ ‖X‖_A for every tensor X. Throughout most of this paper, we will be interested
in the case k = 6.
which we will use repeatedly. If d is even then any degree d pseudo-expectation operator satisfies the
constraint (Ẽ[p])² ≤ Ẽ[p²] for every polynomial p of degree at most d/2 (e.g., see Lemma A.4 in [6]). Hence
the right hand side of (4) can be bounded as:

( Σ_i Ẽ[ Y_i^{(1)} Q_{i,Z}(Y^{(2)}, Y^{(3)}) ] )² ≤ n₁ Σ_i Ẽ[ ( Y_i^{(1)} Q_{i,Z}(Y^{(2)}, Y^{(3)}) )² ]    (5)
It turns out that bounding the right-hand side of (5) boils down to bounding the spectral norm of the
following matrix.
Definition 3.3. Let A be the n₂n₃ × n₂n₃ matrix whose rows and columns are indexed over ordered pairs
(j, k′) and (j′, k) respectively, defined as

A_{(j,k′),(j′,k)} = Σ_i Z_{i,j,k} Z_{i,j′,k′}
We can now make the connection to resolution more explicit: We can think of a pair of observations
Z_{i,j,k}, Z_{i,j′,k′} as a pair of 3-XOR constraints, as usual. Resolving them (i.e. multiplying them) we obtain a
4-XOR constraint

x_j · x_k · x_{j′} · x_{k′} = Z_{i,j,k} Z_{i,j′,k′}

A captures the effect of resolving certain pairs of 3-XOR constraints into 4-XOR constraints. The challenge
is that the entries in A are not independent, so bounding its maximum singular value will require some care.
It is important that the rows of A are indexed by (j, k′) and the columns are indexed by (j′, k), so that j
and j′ come from different 3-XOR clauses, as do k and k′; otherwise the spectral bounds that we will
want to prove about A would simply not be true! This is perhaps the key insight in [25].
It will be more convenient to decompose A and reason about its two types of contributions separately.
To that end, we let R be the n₂n₃ × n₂n₃ matrix whose non-zero entries are of the form

R_{(j,k),(j,k)} = Σ_i Z_{i,j,k} Z_{i,j,k}

and all of its other entries are set to zero. Then let B be the n₂n₃ × n₂n₃ matrix whose entries are of the
form

B_{(j,k′),(j′,k)} = 0 if j = j′ and k = k′, and B_{(j,k′),(j′,k)} = Σ_i Z_{i,j,k} Z_{i,j′,k′} otherwise.
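As a quick sanity check (an illustrative sketch, not from the paper), one can build A, R and B from a random sign tensor Z and confirm that, in the (j, k′)/(j′, k) indexing, R is exactly the diagonal of A and A = R + B:

```python
import numpy as np

# Build A (Definition 3.3) and the decomposition A = R + B from a
# random sparse sign tensor Z. Sizes are illustrative.
rng = np.random.default_rng(4)
n1, n2, n3, m = 6, 7, 8, 60

Z = np.zeros((n1, n2, n3))
flat = rng.choice(n1 * n2 * n3, size=m, replace=False)
i, j, k = np.unravel_index(flat, (n1, n2, n3))
Z[i, j, k] = rng.choice([-1.0, 1.0], size=m)

# A[(j,k'),(j',k)] = sum_i Z[i,j,k] * Z[i,j',k']
A = np.zeros((n2 * n3, n2 * n3))
for jj in range(n2):          # j  (row, first coordinate)
    for kp in range(n3):      # k' (row, second coordinate)
        for jp in range(n2):  # j' (col, first coordinate)
            for kk in range(n3):  # k (col, second coordinate)
                A[jj * n3 + kp, jp * n3 + kk] = np.dot(Z[:, jj, kk], Z[:, jp, kp])

# A diagonal position (row = col) forces j = j' and k = k', which is
# exactly where R lives; B is everything else, so A = R + B.
R = np.diag(np.diag(A))
B = A - R
print(np.allclose(A, R + B))  # True
```

Note that A is symmetric in this indexing (swapping the roles of the two clauses gives the same product), while the factor-wise structure of B is what the spectral argument in Section 4 exploits.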
Proof. The pseudo-expectation operator satisfies the constraints {(Y_i^{(1)})² ≤ C²} for all i, and hence we have

Σ_i Ẽ[ ( Y_i^{(1)} Q_{i,Z}(Y^{(2)}, Y^{(3)}) )² ] ≤ C² Σ_i Ẽ[ Q_{i,Z}(Y^{(2)}, Y^{(3)})² ] = C² Σ_i Σ_{j,k,j′,k′} Z_{i,j,k} Z_{i,j′,k′} Ẽ[ Y_j^{(2)} Y_k^{(3)} Y_{j′}^{(2)} Y_{k′}^{(3)} ]
Now let Y^{(2)} ∈ R^{n₂} be a vector of variables whose jth entry is Y_j^{(2)}, and similarly for Y^{(3)}. Then we can
re-write the right hand side as a matrix inner-product:

C² Σ_i Σ_{j,k,j′,k′} Z_{i,j,k} Z_{i,j′,k′} Ẽ[ Y_j^{(2)} Y_k^{(3)} Y_{j′}^{(2)} Y_{k′}^{(3)} ] = C² ⟨B, Ẽ[(Y^{(2)} ⊗ Y^{(3)})(Y^{(2)} ⊗ Y^{(3)})ᵀ]⟩ + C² ⟨R, Ẽ[(Y^{(2)} ⊗ Y^{(3)})(Y^{(2)} ⊗ Y^{(3)})ᵀ]⟩

and Tr( Ẽ[(Y^{(2)} ⊗ Y^{(3)})(Y^{(2)} ⊗ Y^{(3)})ᵀ] ) = n₂n₃, where the last equality follows because the pseudo-expectation
operator satisfies the constraints { Σ_{i=1}^{n₂} (Y_i^{(2)})² = n₂ } and { Σ_{i=1}^{n₃} (Y_i^{(3)})² = n₃ }.

Hence we can bound the contribution of the first term as C² ⟨B, Ẽ[(Y^{(2)} ⊗ Y^{(3)})(Y^{(2)} ⊗ Y^{(3)})ᵀ]⟩ ≤
C² n₂n₃ ‖B‖. Now we proceed to bound the contribution of the second term:
Claim 3.6. Ẽ[ (Y_j^{(2)})² (Y_k^{(3)})² ] ≤ C⁴

Proof. It is easy to verify by direct computation that the following equality holds:

C⁴ − (Y_j^{(2)})² (Y_k^{(3)})² = ( C² − (Y_j^{(2)})² )( C² − (Y_k^{(3)})² ) + ( C² − (Y_k^{(3)})² )(Y_j^{(2)})² + ( C² − (Y_j^{(2)})² )(Y_k^{(3)})²

Moreover the pseudo-expectation of each of the three terms above is nonnegative, by construction. This
implies the claim.
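The polynomial identity in the proof can be checked symbolically (an illustrative sketch using sympy, not part of the paper):

```python
import sympy as sp

# Verify the identity behind Claim 3.6:
# C^4 - y^2 z^2 = (C^2 - y^2)(C^2 - z^2) + (C^2 - z^2) y^2 + (C^2 - y^2) z^2
C, y, z = sp.symbols('C y z')

lhs = C**4 - y**2 * z**2
rhs = (C**2 - y**2) * (C**2 - z**2) + (C**2 - z**2) * y**2 + (C**2 - y**2) * z**2

print(sp.expand(lhs - rhs))  # 0
```

Each summand on the right is a product of expressions that the pseudo-distribution certifies to be nonnegative, which is exactly what makes the claim a valid sum-of-squares argument.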
Moreover each entry in Z is in the set {−1, 0, +1} and there are precisely m non-zeros. Thus the sum of
the absolute values of all entries in R is at most m. Now we have:

C² ⟨R, Ẽ[(Y^{(2)} ⊗ Y^{(3)})(Y^{(2)} ⊗ Y^{(3)})ᵀ]⟩ ≤ C² Σ_{j,k} R_{(j,k),(j,k)} Ẽ[ (Y_j^{(2)})² (Y_k^{(3)})² ] ≤ C⁶ m
4 Spectral Bounds

Recall the definition of B given in the previous section. In fact, for our spectral bounds it will be more
convenient to relabel the variables (but keeping the definition intact):

B_{(j,k),(j′,k′)} = 0 if j = j′ and k = k′, and B_{(j,k),(j′,k′)} = Σ_i Z_{i,j,k′} Z_{i,j′,k} otherwise.
Z_{i,j′,k} if (i, j′, k) ∈ T_r and zero otherwise. Also let E_{i,j,j′,k,k′,r} be the event that there is no r′ < r where
the pair is already covered (i.e. both U^{r′}_{i,j,k′} and V^{r′}_{i,j′,k} are non-zero). Then set

B^r_{(j,k),(j′,k′)} = Σ_i U^r_{i,j,k′} V^r_{i,j′,k} 1_E

where 1_E is short-hand for the indicator function of the event E_{i,j,j′,k,k′,r}. The idea behind this construction
is that each pair of triples (i, j, k′) and (i, j′, k) that contributes to B will contribute to some B^r with high
probability. Moreover it will not contribute to any later matrix in the ensemble. Hence with high probability

B = Σ_{r=1}^{O(log n)} B^r
Throughout the rest of this section, we will suppress the superscript r and work with a particular matrix
in the ensemble, B. Now let ℓ be even and consider

Tr( BBᵀ BBᵀ ⋯ BBᵀ )

with ℓ factors of B and Bᵀ in total.
As is standard, we are interested in bounding E[Tr(BBᵀBBᵀ⋯BBᵀ)] in order to bound ‖B‖. But note that
B is not symmetric. Also note that the random variables U and V are not independent; however, whether or
not they are non-zero is non-positively correlated and their signs are mutually independent. Expanding the
trace above we have

Tr(BBᵀBBᵀ⋯BBᵀ) = Σ_{j₁,k₁} Σ_{j₂,k₂} ⋯ Σ_{j_ℓ,k_ℓ} B_{j₁,k₁,j₂,k₂} B_{j₃,k₃,j₂,k₂} ⋯ B_{j₁,k₁,j_ℓ,k_ℓ}

= Σ_{j₁,k₁} Σ_{i₁} Σ_{j₂,k₂} Σ_{i₂} ⋯ Σ_{j_ℓ,k_ℓ} Σ_{i_ℓ} U_{i₁,j₁,k₂} V_{i₁,j₂,k₁} 1_{E₁} U_{i₂,j₃,k₂} V_{i₂,j₂,k₃} 1_{E₂} ⋯ U_{i_ℓ,j₁,k_ℓ} V_{i_ℓ,j_ℓ,k₁} 1_{E_ℓ}

where 1_{E₁} is the indicator for the event that the entry B_{j₁,k₁,j₂,k₂} is not covered by an earlier matrix in the
ensemble, and similarly for 1_{E₂}, ..., 1_{E_ℓ}.
Notice that there are 2ℓ random variables in the above sum (ignoring the indicator variables). Moreover
if any U or V random variable appears an odd number of times, then the contribution of the term to
E[Tr(BBᵀBBᵀ⋯BBᵀ)] is zero. We will give an encoding for each term that has a non-zero contribution, and
we will prove that it is injective.

Fix a particular term in the above sum where each random variable appears an even number of times.
Let s be the number of distinct values for i. Moreover let i₁, i₂, ..., i_s be the order in which these indices first
appear. Now let r₁ʲ denote the number of distinct values for j that appear with i₁ in U terms — i.e. r₁ʲ is the
number of distinct j's that appear as U_{i₁,j,∗}. Let r₁ᵏ denote the number of distinct values for k that appear
with i₁ in U terms — i.e. r₁ᵏ is the number of distinct k's that appear as U_{i₁,∗,k}. Similarly let q₁ʲ denote
the number of distinct values for j that appear with i₁ in V terms — i.e. q₁ʲ is the number of distinct j's
that appear as V_{i₁,j,∗}. And finally let q₁ᵏ denote the number of distinct values for k that appear with i₁ in
V terms — i.e. q₁ᵏ is the number of distinct k's that appear as V_{i₁,∗,k}.
We give our encoding below. It is more convenient to think of the encoding as any way to answer the
following questions about the term.

(a) What is the order i₁, i₂, ..., i_s of the first appearance of each distinct value of i?

(b) For each i that appears, what is the order of each of the distinct values of j's and k's that appear along
with it in U? Similarly, what is the order of each of the distinct values of j's and k's that appear along
with it in V?

(c) At each of the ℓ steps, is each of the indices encountered new or previously visited, and if previously
visited, which earlier value is it?
Let r_j = r₁ʲ + r₂ʲ + ... + r_sʲ and r_k = r₁ᵏ + r₂ᵏ + ... + r_sᵏ. Similarly let q_j = q₁ʲ + q₂ʲ + ... + q_sʲ and q_k = q₁ᵏ + q₂ᵏ + ... + q_sᵏ.
Then the number of possible answers to (a) and (b) is at most n₁ˢ and n₂^{r_j} n₃^{r_k} n₂^{q_j} n₃^{q_k} respectively. It is also easy
to see that the number of answers to (c) that arise over the sequence of ℓ steps is at most 8^ℓ ( s (r_j + r_k)(q_j + q_k) )^ℓ.
We remark that much of the work on bounding the maximum eigenvalue of a random matrix is in removing
any ℓ^ℓ type terms, and so one needs to encode re-visiting indices more compactly. However such terms will
only cost us polylogarithmic factors in our bound on ‖B‖.
It is easy to see that this encoding is injective, since given the answers to the above questions one can
simulate each step and recover the sequence of random variables. Next we establish some easy facts that
allow us to bound E[Tr(BBᵀBBᵀ⋯BBᵀ)].
Claim 4.1. For any term that has a non-zero contribution to E[Tr(BBᵀBBᵀ⋯BBᵀ)], we must have s ≤ ℓ/2
and r_j + q_j + r_k + q_k ≤ ℓ.

Proof. Recall that there are 2ℓ random variables in the product, and precisely ℓ of them correspond to U
variables and ℓ of them to V variables. Suppose that s > ℓ/2. Then there must be at least one U variable
and at least one V variable that occur exactly once, which implies that its expectation is zero because the
signs of the non-zero entries are mutually independent. Similarly suppose r_j + q_j + r_k + q_k > ℓ. Then there
must be at least one U or V variable that occurs exactly once, which also implies that its expectation is
zero.
Claim 4.2. For any valid encoding, s ≤ r_j + q_j and s ≤ r_k + q_k.

Proof. This holds because in each step where the i variable is new and has not been visited before, by
definition the j variable is new too (for the current i) and similarly for the k variable.
Finally, if s, r_j, q_j, r_k and q_k are defined as above then for any contributing term

U_{i₁,j₁,k₂} V_{i₁,j₂,k₁} U_{i₂,j₃,k₂} V_{i₂,j₂,k₃} ⋯ U_{i_ℓ,j₁,k_ℓ} V_{i_ℓ,j_ℓ,k₁}

its expectation is at most p^{r_j+r_k} p^{q_j+q_k} where p = m/(n₁n₂n₃), because there are exactly r_j + r_k distinct U
variables and q_j + q_k distinct V variables whose values are in the set {−1, 0, +1}, and whether or not a
variable is non-zero is non-positively correlated and the signs are mutually independent.
This now implies the main lemma:

Lemma 4.3. E[Tr(BBᵀBBᵀ⋯BBᵀ)] ≤ n₁^{ℓ/2} (max(n₂, n₃))^ℓ p^ℓ (ℓ)^{3ℓ+3}

Proof. Note that the indicator variables only have the effect of zeroing out some terms that could otherwise
contribute to E[Tr(BBᵀBBᵀ⋯BBᵀ)]. Returning to the task at hand, we have

E[Tr(BBᵀBBᵀ⋯BBᵀ)] ≤ Σ_{s,r_j,r_k,q_j,q_k} n₁ˢ n₂^{r_j} n₃^{r_k} n₂^{q_j} n₃^{q_k} p^{r_j+r_k} p^{q_j+q_k} 8^ℓ ( s (r_j + r_k)(q_j + q_k) )^ℓ

where the sum is over all valid tuples s, r_j, r_k, q_j, q_k, and hence s, r_j, r_k, q_j, q_k ≤ ℓ/2 and s ≤ r_j + q_j and
s ≤ r_k + q_k using Claim 4.1 and Claim 4.2. We can upper bound the above as

E[Tr(BBᵀBBᵀ⋯BBᵀ)] ≤ Σ_{s,r_j,r_k,q_j,q_k} n₁ˢ (p n₂)^{r_j+q_j} (p n₃)^{r_k+q_k} (ℓ)^{3ℓ+3}

≤ Σ_{s,r_j,r_k,q_j,q_k} n₁ˢ ( p max(n₂, n₃) )^{r_j+q_j+r_k+q_k} (ℓ)^{3ℓ+3}

Now if p max(n₂, n₃) ≤ 1 then using Claim 4.2 followed by the first half of Claim 4.1 we have:

E[Tr(BBᵀBBᵀ⋯BBᵀ)] ≤ n₁ˢ ( p max(n₂, n₃) )^{2s} (ℓ)^{3ℓ+3} ≤ n₁^{ℓ/2} ( p max(n₂, n₃) )^ℓ (ℓ)^{3ℓ+3}
Pr[ ‖B‖ ≥ n₁^{1/2} max(n₂, n₃) p (2ℓ)³ ] = Pr[ ‖B‖^ℓ ≥ ( n₁^{1/2} max(n₂, n₃) p (2ℓ)³ )^ℓ ] ≤ E[Tr(BBᵀBBᵀ⋯BBᵀ)] / ( n₁^{ℓ/2} max(n₂, n₃)^ℓ p^ℓ (2ℓ)^{3ℓ} ) ≤ ℓ³ / 2^{3ℓ}

and hence setting ℓ = Θ(log n) we conclude that ‖B‖ ≤ 8 n₁^{1/2} max(n₂, n₃) p log³ n holds with high probability.
Moreover B = Σ_{r=1}^{O(log n)} B^r also holds with high probability. If this equality holds and each B^r satisfies
‖B^r‖ ≤ 8 n₁^{1/2} max(n₂, n₃) p log³ n, we have

‖B‖ ≤ max_r O( ‖B^r‖ log n ) = O( m log⁴ n / ( n₁^{1/2} min(n₂, n₃) ) )

where we have used the fact that p = m/(n₁n₂n₃). This completes the proof of the theorem.
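As a rough numerical illustration (a sketch, not from the paper), one can build B from a random sign tensor and compare its spectral norm to the m/(√n₁ min(n₂, n₃)) scale in the bound, ignoring the polylog factor:

```python
import numpy as np

# Empirically compare ||B|| to the scale m / (sqrt(n1) * min(n2, n3))
# from the spectral bound (up to polylog factors). Sizes illustrative;
# the ensemble/indicator bookkeeping of the proof is omitted.
rng = np.random.default_rng(5)
n1 = n2 = n3 = 12
m = 400

Z = np.zeros((n1, n2, n3))
flat = rng.choice(n1 * n2 * n3, size=m, replace=False)
i, j, k = np.unravel_index(flat, (n1, n2, n3))
Z[i, j, k] = rng.choice([-1.0, 1.0], size=m)

# B[(j,k),(j',k')] = sum_i Z[i,j,k'] * Z[i,j',k], zero when (j,k)=(j',k')
B = np.einsum('iab,icd->adcb', Z, Z).reshape(n2 * n3, n2 * n3)
np.fill_diagonal(B, 0.0)

norm_B = np.linalg.norm(B, 2)  # largest singular value
scale = m / (np.sqrt(n1) * min(n2, n3))
print(norm_B, scale)  # same order of magnitude, up to log factors
```

The observed ratio between the two quantities stays polylogarithmic as the sizes grow, which is the content of the theorem.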
Proof. Consider any X with ‖X‖_{K₆} ≤ 1. Then using Lemma 3.4 and Theorem 4.4 we have

⟨Z, X⟩² ≤ n₁ Σ_i ( Σ_{j,k} Z_{i,j,k} X_{i,j,k} )² ≤ C² n₁ n₂ n₃ ‖B‖ + C⁶ m n₁ = O( m n₁^{1/2} max(n₂, n₃) log⁴ n + m n₁ )

Recall that Z was defined in Definition 2.5. The Rademacher complexity can now be bounded as

(1/m) ⟨Z, X⟩ ≤ O( √( (n₁)^{1/2} (n₂ + n₃) log⁴ n / m ) )
We can now invoke Theorem 1.1, which guarantees that the hypothesis X that results from solving (2)
satisfies err(X) = o(1/log n) with probability 1 − o(1) provided that m = Ω̃(n^{3/2} r). This bound on the error
immediately implies that |R′| = o(n₁n₂n₃) and so |R \ R′| = (1 − o(1)) n₁n₂n₃. This completes the proof of
the corollary.
u_S ≡ v_g − v_f
Ẽ[p²] = Σ_{S,T} c_S c_T ⟨u_∅, u_{S∆T}⟩ = Σ_{S,T} c_S c_T ⟨u_S, u_T⟩ = ‖ Σ_S c_S u_S ‖² ≥ 0
Theorem 5.5. [38, 68] Let φ be a random 3-XOR formula on n variables with m = n^{3/2−ϵ} clauses. Then
for any ϵ > 0 and any c < 2, the k = Ω(n^c) round Lasserre hierarchy given in Definition 5.1 permits a
feasible solution, with probability 1 − o(1).

Note that the constant in the Ω(·) depends on ϵ and c. Then using the above reductions, we have the
following as an immediate corollary:

Corollary 5.6. For any ϵ > 0 and any c < 2 and k = Ω(n^c), if m = n^{3/2−ϵ} then the Rademacher complexity
satisfies R^m(‖ · ‖_{K_k}) = 1 − o(1).
Thus there is a sharp phase transition (as a function of the number of observations) in the Rademacher
complexity of the norms derived from the sum-of-squares hierarchy. At level six, R^m(‖ · ‖_{K₆}) = o(1) whenever
m = ω(n^{3/2} log⁴ n). In contrast, R^m(‖ · ‖_{K_k}) = 1 − o(1) when m = n^{3/2−ϵ}, even for very strong relaxations
derived from nearly n² rounds of the sum-of-squares hierarchy. These norms require time 2^{n^{Ω(1)}} to compute
but still achieve essentially no better bounds on their Rademacher complexity.
References
[1] S. Allen, R. O’Donnell and D. Witmer. How to refute a random CSP. FOCS 2015, to appear.
[2] A. Anandkumar, D. Foster, D. Hsu, S. Kakade, Y. Liu. A spectral algorithm for latent Dirichlet
allocation. NIPS, pages 926–934, 2012.
[3] A. Anandkumar, R. Ge, D. Hsu and S. Kakade. A tensor spectral approach to learning mixed member-
ship community models. COLT, pages 867–881, 2013.
[4] A. Anandkumar, D. Hsu and S. Kakade. A method of moments for mixture models and hidden Markov
models. COLT, pages 1–33, 2012.
[5] N. Balcan. Machine Learning Theory Notes. http://www.cc.gatech.edu/~ninamf/ML11/lect1115.pdf
[6] B. Barak, F. Brandao, A. Harrow, J. Kelner, D. Steurer and Y. Zhou. Hypercontractivity, sum-of-
squares proofs, and their applications. STOC, pages 307–326, 2012.
[7] B. Barak, J. Kelner and D. Steurer. Rounding sum-of-squares relaxations. STOC, pages 31–40, 2014.
[8] B. Barak, J. Kelner and D. Steurer. Dictionary learning and tensor decomposition via the sum-of-squares
method. STOC, pages 143–151, 2015.
[9] B. Barak, G. Kindler and D. Steurer. On the optimality of semidefinite relaxations for average-case and
generalized constraint satisfaction. ITCS, pages 197–214, 2013.
[10] B. Barak and D. Steurer. Sum-of-squares proofs and the quest toward optimal algorithms. Proceedings
of the ICM, 2014.
[11] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural
results. Journal of Machine Learning Research, 3:463–482, 2003.
[12] Q. Berthet and P. Rigollet. Computational lower bounds for sparse principal component detection.
COLT, pages 1046–1066, 2013.
[13] A. Bhaskara, M. Charikar, A. Moitra and A. Vijayaraghavan. Smoothed analysis of tensor decomposi-
tions. STOC, pages 594–603, 2014.
[14] S. Bhojanapalli and S. Sanghavi. A new sampling technique for tensors. arXiv:1502.05023
[15] E. Candes, Y. Eldar, T. Strohmer and V. Voroninski. Phase retrieval via matrix completion. SIAM
Journal on Imaging Sciences, 6(1):199–225, 2013.
[16] E. Candes and C. Fernandez-Granda. Towards a mathematical theory of super-resolution. Communi-
cations on Pure and Applied Mathematics, 67(6):906–956, 2014.
[17] E. Candes, X. Li, Y. Ma and J. Wright. Robust principal component analysis? Journal of the ACM,
58(3):1–37, 2011.
[18] E. Candes and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
[19] E. Candes and B. Recht. Exact matrix completion via convex optimization. Foundations of Computa-
tional Math., 9(6):717–772, 2008.
[20] E. Candes and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE
Transactions on Information Theory, 56(5):2053–2080, 2010.
[22] V. Chandrasekaran and M. Jordan. Computational and statistical tradeoffs via convex relaxation.
Proceedings of the National Academy of Sciences, 110(13):E1181–E1190, 2013.
[23] V. Chandrasekaran, B. Recht, P. Parrilo and A. Willsky. The convex geometry of linear inverse problems.
Foundations of Computational Math., 12(6):805–849, 2012.
[24] Y. Chen, S. Bhojanapalli, S. Sanghavi and R. Ward. Coherent matrix completion. ICML, pages 674–682,
2014.
[25] A. Coja-Oghlan, A. Goerdt and A. Lanka. Strong refutation heuristics for random k-SAT. Combina-
torics, Probability and Computing, 16(1):5–28, 2007.
[26] A. Daniely, N. Linial and S. Shalev-Shwartz. More data speeds up training time in learning half spaces
over sparse vectors. NIPS, pages 145–153, 2013.
[27] A. Daniely, N. Linial and S. Shalev-Shwartz. From average case complexity to improper learning
complexity. STOC, pages 441–448, 2014.
[28] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.
[29] U. Feige. Relations between average case complexity and approximation complexity. STOC, pages
534–543, 2002.
[30] U. Feige, J.H. Kim and E. Ofek. Witnesses for non-satisfiability of dense random 3CNF formulas. In
Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages
497–508, 2006.
[31] U. Feige and E. Ofek. Easily refutable subformulas of large random 3-CNF formulas. Theory of Com-
puting 3:25–43, 2007.
[32] J. Friedman, A. Goerdt and M. Krivelevich. Recognizing more unsatisfiable random k-SAT instances
efficiently. SIAM Journal on Computing 35(2):408–430, 2005.
[33] J. Friedman, J. Kahn and E. Szemerédi. On the second eigenvalue of random regular graphs. STOC,
pages 534–543, 1989.
[34] Z. Füredi and J. Komlós. The eigenvalues of random symmetric matrices. Combinatorica, 1:233–241,
1981.
[35] S. Gandy, B. Recht and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex
optimization. Inverse Problems, 27(2):1–19, 2011.
[36] R. Ge and T. Ma. Decomposing overcomplete 3rd order tensors using sum-of-squares algorithms.
RANDOM, pages 829–849, 2015.
[37] A. Goerdt and M. Krivelevich. Efficient recognition of random unsatisfiable k-SAT instances by spectral
methods. In Annual Symposium on Theoretical Aspects of Computer Science, pages 294–304, 2001.
[38] D. Grigoriev. Linear lower bound on degrees of Positivstellensatz calculus proofs for the parity. Theo-
retical Computer Science 259(1-2):613–622, 2001.
[39] L. Gurvits. Classical deterministic complexity of Edmonds’ problem and quantum entanglement. STOC,
pages 10–19, 2003.
[40] M. Hardt. Understanding alternating minimization for matrix completion. FOCS, pages 651–660, 2014.
[41] A. Harrow and A. Montanaro. Testing product states, quantum Merlin-Arthur games and tensor opti-
mization. Journal of the ACM, 60(1):1–43, 2013.
[66] B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–
3430, 2011.
[67] B. Recht, M. Fazel and P. Parrilo. Guaranteed minimum rank solutions of matrix equations via nuclear
norm minimization. SIAM Review, 52(3):471–501, 2010.
[68] G. Schoenebeck. Linear level Lasserre lower bounds for certain k-CSPs. FOCS, pages 593–602, 2008.
[71] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. COLT, pages 545–560, 2005.
[72] G. Tang, B. Bhaskar and B. Recht. Compressed sensing off the grid. IEEE Transactions on Information
Theory, 59(11):7465–7490, 2013.
[73] R. Tomioka, K. Hayashi and H. Kashima. Estimation of low-rank tensors via convex optimization.
arXiv:1010.0789, 2011.
[74] M. Yuan and C.H. Zhang. On tensor completion via nuclear norm minimization. Foundations of
Computational Mathematics, to appear.
S = [ 0   Mᵀ ]
    [ M   0  ].
We have not precisely defined the notion of incoherence that is used in the matrix completion literature, but
it turns out to be easy to see that S is low rank and incoherent as well.
The important point is that given m samples generated uniformly at random from M, we can generate
random samples from S too. It will be more convenient to think of these random samples as being generated
with replacement, but this reduction works just as well without replacement too. Let M ∈ R^{n₁×n₂}. Now
for each sample from S, with probability p = (n₁² + n₂²)/(n₁ + n₂)² we reveal a uniformly random entry in
one of the two diagonal blocks of zeros. And with probability 1 − p we reveal a uniformly random entry from M.
Each entry in M appears exactly twice in S, and we choose to reveal this entry of M with probability 1/2 from
the top-right block, and otherwise from the bottom-left block. Thus given m samples from M, we can generate
m samples from S (in fact we can generate even more, because some of the revealed entries will be zeros). It is
easy to see that this approach works for the case of sampling without replacement too, in that m samples
without replacement from M can be used to generate at least m samples without replacement from S.
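A minimal sketch (illustrative, not from the paper) of the symmetric embedding itself:

```python
import numpy as np

# Embed an asymmetric matrix M into the symmetric matrix
# S = [[0, M^T], [M, 0]]; S stays low rank.
rng = np.random.default_rng(6)
n1, n2, r = 8, 5, 2

M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
S = np.block([
    [np.zeros((n2, n2)), M.T],
    [M, np.zeros((n1, n1))],
])

print(np.allclose(S, S.T))       # True: S is symmetric
print(np.linalg.matrix_rank(S))  # 2 * rank(M) for generic M
```

The singular values of S are those of M, each repeated twice, which is why rank doubles and why incoherence of M carries over to S.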
Now let us proceed to the tensor case. Let us introduce the following definition, for ease of notation:
Definition A.1. Let m(n, r, ε, f, C) be such that there is an algorithm that, on a rank-r, order-d, size n × n × ... × n symmetric tensor where each factor has norm at most C, returns an estimate X with err(X) ≤ f with probability 1 − ε when it is given m(n, r, ε, f, C) samples chosen uniformly at random (and without replacement).
where each factor is unit norm. There is an algorithm that, with probability at least 1 − ε, returns an estimate Y with

err(Y) \le \frac{\left(\sum_{j=1}^{d} n_j\right)^{d}}{d!\,2^{d-1} \prod_{j=1}^{d} n_j}\, f.
Proof. Our goal is to symmetrize an asymmetric tensor in such a way that each entry of the symmetrized tensor is either zero or corresponds to an entry of the original tensor. Our reduction works for any odd order d. In particular, let

T = \sum_{i=1}^{r} a_i^1 \otimes a_i^2 \otimes \dots \otimes a_i^d

be an order-d tensor where the dimension of a_i^j is n_j. Also let n = \sum_{j=1}^{d} n_j. We will construct a symmetric order-d tensor as follows. Let σ_1, σ_2, ..., σ_d be a collection of d random ±1 variables chosen uniformly at random from the 2^{d-1} configurations where \prod_{j=1}^{d} σ_j = 1. Then we consider the following random vector

a_i(σ_1, σ_2, \dots, σ_d) = [σ_1 a_i^1, σ_2 a_i^2, \dots, σ_d a_i^d].

Here a_i(σ_1, σ_2, ..., σ_d) is the n-dimensional vector that results from concatenating the vectors a_i^1, a_i^2, ..., a_i^d after flipping some of their signs according to σ_1, σ_2, ..., σ_d. Then we set

S = \sum_{i=1}^{r} \mathop{\mathbb{E}}_{σ_1, σ_2, \dots, σ_d}\left[ a_i(σ_1, σ_2, \dots, σ_d)^{\otimes d} \right].
It is immediate that S is symmetric and has rank at most 2^{d-1} r, by expanding the expectation into a sum over the valid sign configurations. Moreover, each rank-one term in the decomposition is of the form a^{\otimes d} where ‖a‖_2^2 = d, because a is the concatenation of d unit vectors.
For each fixed i, each entry of a_i(σ_1, ..., σ_d)^{\otimes d} is a degree-d monomial in the σ_j variables. By our construction of the σ_j variables, and because d is odd so there are no terms where every variable appears to an even power, all terms vanish in expectation except those containing a factor of \prod_{j=1}^{d} σ_j; these are exactly the terms corresponding to some permutation π : [d] → [d], of the form

\sum_{i=1}^{r} a_i^{π(1)} \otimes a_i^{π(2)} \otimes \dots \otimes a_i^{π(d)}.
Hence all of the entries in S are either zero or are 2^{d-1} times an entry in T. As before, we can generate m uniformly random samples from S given m uniformly random samples from T, by simply choosing to sample an entry from one of the blocks of zeros with the appropriate probability, or else revealing an entry of T and choosing where in S to reveal this entry uniformly at random. Hence:
\frac{1}{\left(\sum_{j=1}^{d} n_j\right)^{d}} \sum_{(i_1, i_2, \dots, i_d) \in \Gamma} |Y_{i_1, i_2, \dots, i_d} - S_{i_1, i_2, \dots, i_d}| \;\le\; \frac{1}{\left(\sum_{j=1}^{d} n_j\right)^{d}} \sum_{i_1, i_2, \dots, i_d} |Y_{i_1, i_2, \dots, i_d} - S_{i_1, i_2, \dots, i_d}|,

where Γ denotes the locations in S where an entry of T appears. The right-hand side above is at most f with probability 1 − ε. Moreover, each entry of T appears in exactly d! locations in S, and when it does appear it is scaled by 2^{d-1}. Hence if we multiply the left-hand side by

\frac{\left(\sum_{j=1}^{d} n_j\right)^{d}}{d!\,2^{d-1} \prod_{j=1}^{d} n_j}

we obtain err(Y). This completes the reduction.
Note that in the case where n_1 = n_2 = \dots = n_d, the error and the rank in this reduction increase only by at most an e^d and a 2^d factor, respectively.
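For concreteness, the construction can be checked numerically for d = 3 (a toy sketch with our own choice of dimensions; note we sum over the valid sign configurations rather than averaging, so the nonzero entries of S come out as exactly 2^{d−1} = 4 times entries of T, matching the scaling used above):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
d, r = 3, 2
dims = (2, 3, 4)
n = sum(dims)                   # 9
offs = np.cumsum((0,) + dims)   # block offsets [0, 2, 5, 9]

# Asymmetric rank-r tensor T = sum_i a_i^1 (x) a_i^2 (x) a_i^3.
A = [rng.standard_normal((r, nj)) for nj in dims]
T = np.einsum('ix,iy,iz->xyz', A[0], A[1], A[2])

# Symmetrize: sum over the 2^{d-1} sign patterns with sigma_1*...*sigma_d = 1.
S = np.zeros((n, n, n))
for sigma in product([1, -1], repeat=d):
    if np.prod(sigma) != 1:
        continue
    for i in range(r):
        v = np.concatenate([s * A[j][i] for j, s in enumerate(sigma)])
        S += np.einsum('x,y,z->xyz', v, v, v)
```

S is fully symmetric, its (block-1, block-2, block-3) corner equals 4T, and since each entry of T occupies d! = 6 locations at scale 4, the absolute entries of S sum to 24 times those of T.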
March 9, 2009
Abstract
This paper is concerned with the problem of recovering an unknown matrix from a small
fraction of its entries. This is known as the matrix completion problem, and comes up in a
great number of applications, including the famous Netflix Prize and other similar questions in
collaborative filtering. In general, accurate recovery of a matrix from a small number of entries
is impossible; but the knowledge that the unknown matrix has low rank radically changes this
premise, making the search for solutions meaningful.
This paper presents optimality results quantifying the minimum number of entries needed to
recover a matrix of rank r exactly by any method whatsoever (the information theoretic limit).
More importantly, the paper shows that, under certain incoherence assumptions on the singular
vectors of the matrix, recovery is possible by solving a convenient convex program as soon as the
number of entries is on the order of the information theoretic limit (up to logarithmic factors).
This convex program simply finds, among all matrices consistent with the observed entries, the one with minimum nuclear norm. As an example, we show that on the order of nr log(n) samples are needed to recover a random n × n matrix of rank r by any method, and, to be sure, nuclear norm minimization succeeds as soon as the number of entries is of the form nr polylog(n).
1 Introduction
1.1 Motivation
Imagine we have an n_1 × n_2 array of real numbers and that we are interested in knowing the value of each of the n_1 n_2 entries in this array.¹ Suppose, however, that we only get to see a small number of the entries, so that most of the elements about which we wish information are simply missing. Is it possible from the available entries to guess the many entries that we have not seen? This problem is now known as the matrix completion problem [7], and comes up in a great number of applications, including the famous Netflix Prize and other similar questions in collaborative filtering.

¹Much of the discussion below, as well as our main results, applies also to the case of complex matrix completion, with some minor adjustments in the absolute constants; but for simplicity we restrict attention to the real case.
where σ_1, ..., σ_r ≥ 0 are the singular values, and the singular vectors u_1, ..., u_r ∈ R^{n_1} = R^n and v_1, ..., v_r ∈ R^{n_2} = R^n are two sets of orthonormal vectors, is useful to reveal these degrees of freedom. Informally, the singular values σ_1 ≥ ... ≥ σ_r depend on r degrees of freedom, the left singular vectors u_k on (n − 1) + (n − 2) + ... + (n − r) = nr − r(r + 1)/2 degrees of freedom, and similarly for the right singular vectors v_k. If m < 2nr − r², then no matter which entries are available, recovery is impossible by any method whatsoever. One could attempt to recover the unknown matrix by solving

minimize rank(X)
subject to P_Ω(X) = P_Ω(M). (1.2)
Knowing when this happens is a delicate question which shall be addressed later. For the moment,
note that attempting recovery via (1.2) is not practical as rank minimization is in general an NP-
hard problem for which there are no known algorithms capable of solving problems in practical
time once, say, n ≥ 10.
In [7], it was proved 1) that matrix completion is not as ill-posed as previously thought and
2) that exact matrix completion is possible by convex programming. The authors of [7] proposed
recovering the unknown matrix by solving the nuclear norm minimization problem

minimize ‖X‖_*
subject to P_Ω(X) = P_Ω(M), (1.3)

where the nuclear norm ‖X‖_* of a matrix X is defined as the sum of its singular values,

\|X\|_* := \sum_i \sigma_i(X). (1.4)
(The problem (1.3) is a semidefinite program [11].) They proved that if Ω is sampled uniformly at random among all subsets of cardinality m and M obeys a low coherence condition which we will review later, then with large probability the unique solution to (1.3) is exactly M, provided that the number of samples obeys

m \ge C\, n^{6/5}\, r \log n (1.5)

(to be completely exact, there is a restriction on the range of values that r can take on).
In (1.5), the number of samples per degree of freedom is not logarithmic or polylogarithmic in the dimension, and one would like to know whether better results approaching the nr log n limit are possible. This paper provides a positive answer. In detail, this work develops many useful matrix models for which nuclear norm minimization is guaranteed to succeed as soon as the number of entries is of the form nr polylog(n).
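To give a concrete feel for the completion problem, here is a small numerical sketch. It replaces the semidefinite program (1.3) with a simple rank-projection/imputation heuristic (our own stand-in, not the paper's method) on a synthetic rank-2 matrix with roughly half the entries observed:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 30, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-2 target
mask = rng.random((n, n)) < 0.5                                # observed set Omega

# Alternate between projecting onto rank-r matrices (truncated SVD) and
# re-imposing the observed entries; a heuristic surrogate for (1.3).
X = np.where(mask, M, 0.0)
start_err = np.linalg.norm(X - M) / np.linalg.norm(M)
for _ in range(300):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X = (U[:, :r] * s[:r]) @ Vt[:r]   # nearest rank-r matrix
    X[mask] = M[mask]                 # enforce P_Omega(X) = P_Omega(M)

rel_err = np.linalg.norm(X - M) / np.linalg.norm(M)
```

With this many observations per degree of freedom, the iteration drives the relative error well below that of the zero-filled initialization; it is of course only an illustration, with none of the guarantees proved for nuclear-norm minimization.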
We observe that E interacts well with PU and PV , in particular obeying the identities
PU E = E = EPV ; E ∗ E = PV ; EE ∗ = PU .
One can view E as a sort of matrix-valued “sign pattern” for M (compare (1.7) with (1.1)); it is also closely related to the subgradient ∂‖M‖_* of the nuclear norm at M (see (3.2)).
It is clear that some assumptions on the singular vectors u_i, v_i (or on the spaces U, V) are needed in order to have any hope of efficient matrix completion. For instance, if u_1 and v_1 are Kronecker delta functions at positions i and j respectively, then the singular value σ_1 can only be recovered if one actually samples the (i, j) coordinate, which is only likely if one is sampling a significant fraction of the entire matrix. Thus we need the vectors u_i, v_i to be “spread out” or “incoherent” in some sense. In our arguments, it will be convenient to phrase these incoherence assumptions using the projection matrices P_U, P_V and the sign pattern matrix E. More precisely, our assumptions are as follows.
A1 There exists µ_1 > 0 such that for all pairs (a, a′) ∈ [n_1] × [n_1] and (b, b′) ∈ [n_2] × [n_2],

\left| \langle e_a, P_U e_{a'} \rangle - \frac{r}{n_1} 1_{a=a'} \right| \le \mu_1 \frac{\sqrt{r}}{n_1}, (1.8a)

\left| \langle e_b, P_V e_{b'} \rangle - \frac{r}{n_2} 1_{b=b'} \right| \le \mu_1 \frac{\sqrt{r}}{n_2}. (1.8b)

A2 There exists µ_2 > 0 such that for all (a, b) ∈ [n_1] × [n_2],

|E_{ab}| \le \mu_2 \frac{\sqrt{r}}{\sqrt{n_1 n_2}}. (1.9)
We will say that the matrix M obeys the strong incoherence property with parameter µ if one can take µ_1 and µ_2 both less than or equal to µ. (This property is related to, but slightly different from, the incoherence property, which will be discussed in Section 1.6.1.)
Remark. Our assumptions only involve the singular vectors u1 , . . . , ur , v1 , . . . , vr of M ; the
singular values σ1 , . . . , σr are completely unconstrained. This lack of dependence on the singular
values is a consequence of the geometry of the nuclear norm (and in particular, the fact that the
subgradient ∂kXk∗ of this norm is independent of the singular values, see (3.2)).
Theorem 1.1 (Matrix completion I) Let M ∈ Rn1 ×n2 be a fixed matrix of rank r = O(1)
obeying the strong incoherence property with parameter µ. Write n := max(n1 , n2 ). Suppose we
observe m entries of M with locations sampled uniformly at random. Then there is a positive
numerical constant C such that if
m \ge C\, \mu^4\, n (\log n)^2, (1.10)
then M is the unique solution to (1.3) with probability at least 1 − n−3 . In other words: with high
probability, nuclear-norm minimization recovers all the entries of M with no error.
This result is noteworthy for two reasons. The first is that the matrix model is deterministic
and only needs the strong incoherence assumption. The second is more substantial. Consider the
class of bounded rank matrices obeying µ = O(1). We shall see that no method whatsoever can
recover those matrices unless the number of entries obeys m ≥ c0 n log n for some positive numerical
constant c0 ; this is the information theoretic limit. Thus Theorem 1.1 asserts that exact recovery by
nuclear-norm minimization occurs nearly as soon as it is information theoretically possible. Indeed,
if the number of samples is slightly larger, by a logarithmic factor, than the information theoretic
limit, then (1.3) fills in the missing entries with no error.
We stated Theorem 1.1 for bounded ranks, but our proof gives a result for all values of r. Indeed, the argument will establish that the recovery is exact with high probability provided that

m \ge C\, \mu^4\, n r^2 (\log n)^2.

When r = O(1), this is Theorem 1.1. We will prove a stronger and near-optimal result below (Theorem 1.2), in which we replace the quadratic dependence on r with a linear one:

m \ge C\, \mu^2\, n r \log^6 n. (1.12)

The reason why we state Theorem 1.1 first is that its proof is somewhat simpler than that of Theorem 1.2, and we hope that it will provide the reader with a useful lead-in to the claims and proof of our main result.
1.4 A surprise
We find it unexpected that nuclear-norm minimization works so well, for reasons we now pause to discuss. For simplicity, consider matrices with a strong incoherence parameter µ polylogarithmic in the dimension. We know that for the rank minimization program (1.2) to succeed, or equivalently for the problem to be well posed, the number of samples must exceed a constant times nr log n. However, Theorem 1.2 proves that the convex relaxation is rigorously exact nearly as soon as our problem has a unique low-rank solution. There is a priori no good reason to suspect that convex relaxation might work so well, nor that the gap between what combinatorial and convex optimization can do is this small. In this sense, we find these findings a little unexpected.
The reader will note an analogy with the recent literature on compressed sensing, which shows
that under some conditions, the sparsest solution to an underdetermined system of linear equations
is that with minimum `1 norm.
1. Select r left singular vectors uα(1) , . . . , uα(r) at random with replacement from the first family,
and r right singular vectors vβ(1) , . . . , vβ(r) from the second family, also at random. We do
not require that the β are chosen independently from the α; for instance one could have
β(k) = α(k) for all k ∈ [r].
We emphasize that the only assumptions about the families [u_1, ..., u_n] and [v_1, ..., v_n] are that they have small components. For example, the two families may be identical. Note also that this model allows any kind of dependence between the selected left and right singular vectors; for instance, one may select the same columns so as to obtain a symmetric matrix, as in the case where the two families are the same. Thus, one can think of our model as producing a generic matrix with uniformly bounded singular vectors.
We now show that P_U, P_V and E obey (1.8) and (1.9), with µ_1, µ_2 = O(µ_B \sqrt{\log n}), with large probability. For (1.9), observe that

E = \sum_{k \in [r]} \epsilon_k\, u_{\alpha(k)} v_{\beta(k)}^*,

P\left( \left| \langle P_U e_a, P_U e_{a'} \rangle - 1_{a=a'}\, r/n \right| \ge \lambda \mu_B \frac{\sqrt{r}}{n} \right) \le 2 e^{-\lambda^2/2}.

Taking λ proportional to \sqrt{\log n} and applying the union bound over a, a′ ∈ [n] proves (1.8) with probability at least 1 − n^{-3} (say), with µ_1 = O(µ_B \sqrt{\log n}).
Combining this computation with Theorems 1.1 and 1.2, we have established the following corollary:

Corollary 1.4 (Matrix completion, uniformly bounded model) Let M be a matrix sampled from a uniformly bounded model. Under the hypotheses of Theorem 1.1, if

m \ge C\, \mu_B^2\, n r \log^7 n,

then M is the unique solution to (1.3) with probability at least 1 − n^{-3}. As we shall see below, when r = O(1), it suffices to have

m \ge C\, \mu_B^2\, n \log^2 n.
\langle P_U e_1, P_U e_2 \rangle = r/n.

Obviously, this does not scale like \sqrt{r}/n. Similarly, the sign flip (step 2) is also necessary, as otherwise we could have E = P_U, as in the case where [u_1, ..., u_n] = [v_1, ..., v_n] and the same columns are selected. Here,

\max_a E_{aa} = \max_a \|P_U e_a\|^2 \ge \frac{1}{n} \sum_a \|P_U e_a\|^2 = \frac{r}{n},

which does not scale like \sqrt{r}/n either.
Corollary 1.6 (Matrix completion, random orthogonal model) Let M be a matrix sampled from the random orthogonal model. Under the hypotheses of Theorem 1.1, if

m \ge C\, n r \log^8 n,

then M is the unique solution to (1.3) with probability at least 1 − n^{-3}. The exponent 8 can be lowered to 7 when r ≥ log n, and to 6 when r = O(1).
As mentioned earlier, we have a lower bound m ≥ 2nr − r² for matrix completion, which can be improved to m ≥ C nr log n under reasonable hypotheses on the matrix M. Thus, the hypothesis on m in Corollary 1.6 cannot be substantially improved. However, it is likely that by specializing the proofs of our general results (Theorems 1.1 and 1.2) to this special case, one may be able to improve the power of the logarithm here, though it seems that a substantial effort would be needed to reach the optimal level of nr log n even in the bounded-rank case.
Speaking of logarithmic improvements, we have shown that µ = O(log n), which is sharp since for r = 1 one cannot hope for better estimates. For r much larger than log n, however, one can improve this to µ = O(\sqrt{\log n}). As far as µ_1 is concerned, this is essentially a consequence of the Johnson–Lindenstrauss lemma. For a ≠ a′, write

\langle P_U e_a, P_U e_{a'} \rangle = \frac{1}{4} \left( \|P_U(e_a + e_{a'})\|^2 - \|P_U(e_a - e_{a'})\|^2 \right).

We claim that for each a ≠ a′,

\left| \|P_U(e_a \pm e_{a'})\|^2 - \frac{2r}{n} \right| \le C\, \frac{\sqrt{r \log n}}{n} (1.17)

with probability at least 1 − n^{-5}, say. This inequality is indeed well known. Observe that ‖P_U x‖ has the same distribution as the Euclidean norm of the first r components of a vector uniformly distributed on the (n − 1)-dimensional sphere of radius ‖x‖. Then we have [4]:

P\left( (1-\varepsilon)\sqrt{\tfrac{r}{n}}\, \|x\| \le \|P_U x\| \le (1-\varepsilon)^{-1}\sqrt{\tfrac{r}{n}}\, \|x\| \right) \ge 1 - 2e^{-\varepsilon^2 r/4} - 2e^{-\varepsilon^2 n/4}.

Choosing x = e_a ± e_{a'} and ε = C_0 \sqrt{\log n / r}, and applying the union bound, proves the claim as long as r is sufficiently larger than log n. Finally, since a bound on the diagonal term ‖P_U e_a‖^2 − r/n in (1.8) follows from the same inequality by simply choosing x = e_a, we have µ_1 = O(\sqrt{\log n}). Similar arguments for µ_2 exist, but we forgo the details.
Theorem 1.7 (Lower bound, Bernoulli model) Fix 1 ≤ m, r ≤ n and µ_0 ≥ 1, let 0 < δ < 1/2, and suppose that we do not have the condition

-\log\left(1 - \frac{m}{n^2}\right) \ge \frac{\mu_0 r}{n} \log\left(\frac{n}{2\delta}\right). (1.20)

Then there exist infinitely many pairs of distinct n × n matrices M ≠ M′ of rank at most r and obeying the incoherence property (1.18) with parameter µ_0 such that P_Ω(M) = P_Ω(M′) with probability at least δ. Here, each entry is observed with probability p = m/n² independently from the others.
Clearly, even if one knows the rank and the coherence of a matrix ahead of time, no algorithm can be guaranteed to succeed based on the knowledge of P_Ω(M) only, since there are many candidates consistent with these data. We prove this theorem in Section 2. Informally, Theorem 1.7 asserts that (1.20) is a necessary condition for matrix completion to work with high probability if all we know about the matrix M is that it has rank at most r and the incoherence property with parameter µ_0. When the right-hand side of (1.20) is less than ε < 1, this implies

m \ge (1 - \varepsilon/2)\, \mu_0\, n r \log\frac{n}{2\delta}. (1.21)
Recall that the number of degrees of freedom of a rank-r matrix is 2nr(1 − r/2n). Hence,
to recover an arbitrary rank-r matrix with the incoherence property with parameter µ0 with any
decent probability by any method whatsoever, the minimum number of samples must be about
the number of degrees of freedom times µ0 log n; in other words, the oversampling factor is directly
proportional to the coherence. Since µ0 ≥ 1, this justifies our earlier assertions that nr log n samples
are really needed.
In the Bernoulli model used in Theorem 1.7, the number of entries is a binomial random variable
sharply concentrating around its mean m. There is very little difference between this model and
the uniform model which assumes that Ω is sampled uniformly at random among all subsets of
cardinality m. Results holding for one hold for the other with only very minor adjustments. Because
we are concerned with essential difficulties, not technical ones, we will often prove our results using
the Bernoulli model, and indicate how the results may easily be adapted to the uniform model.
n_1 = n_2 = n.

The results for non-square matrices (with n = max(n_1, n_2)) are proven in exactly the same fashion, but would add more subscripts to a notational system which is already quite complicated, so we leave the details to the interested reader. We will also assume that n ≥ C for some sufficiently large absolute constant C, as our results are vacuous in the regime n = O(1). Throughout, we will always assume that m is at least as large as 2nr; thus m ≥ 2nr.
A variety of norms on matrices X ∈ R^{n×n} will be discussed. The spectral norm (or operator norm) of a matrix is denoted by ‖X‖. The Euclidean inner product between two matrices is defined by the formula

\langle X, Y \rangle := \mathrm{trace}(X^* Y),

and the corresponding Euclidean norm, called the Frobenius norm or Hilbert–Schmidt norm, is denoted

\|X\|_F := \langle X, X \rangle^{1/2} = \Big( \sum_{j=1}^{n} \sigma_j(X)^2 \Big)^{1/2}.

For vectors, we will only consider the usual Euclidean ℓ_2 norm, which we simply write as ‖x‖. Further, we will also manipulate linear transformations which act on the space R^{n×n} of matrices, such as P_Ω, and we will use calligraphic letters for these operators, as in A(X). In particular, the identity operator on this space will be denoted by I : R^{n×n} → R^{n×n}, and should not be confused with the identity matrix I ∈ R^{n×n}. The only norm we will consider for these operators is their spectral norm (the top singular value)
\sum_{a_1, \dots, a_k \in [n]} \prod_{i=1}^{k} f(a_i) = \Big( \sum_{a \in [n]} f(a) \Big)^{k}
is valid both for positive integers k and for k = 0 (and both for non-zero f and for zero f , recalling
of course that 00 = 1). We will refer to sums over the empty tuple as trivial sums to distinguish
them from empty sums.
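The convention can be sanity-checked in a few lines (a throwaway illustration; the function f is an arbitrary choice):

```python
from itertools import product

def lhs(f, n, k):
    """Sum over all tuples (a_1, ..., a_k) in [n]^k of prod_i f(a_i)."""
    total = 0
    for tup in product(range(1, n + 1), repeat=k):
        prod_val = 1
        for a in tup:
            prod_val *= f(a)
        total += prod_val
    return total

def rhs(f, n, k):
    # The k = 0 case relies on Python's 0**0 == 1, matching the text's convention.
    return sum(f(a) for a in range(1, n + 1)) ** k

f = lambda a: a * a - 3
n = 4
```

For k = 0 the left-hand side is a trivial sum over the single empty tuple, whose empty product is 1, agreeing with the right-hand side even for the zero function.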
2 Lower bounds
This section proves Theorem 1.7, which asserts that no method can recover an arbitrary n × n
matrix of rank r and coherence at most µ0 unless the number of random samples obeys (1.20). As
stated in the theorem, we establish lower bounds for the Bernoulli model, which then apply to the
model where exactly m entries are selected uniformly at random, see the Appendix for details.
It may be best to consider a simple example first to understand the main idea behind the proof of Theorem 1.7. Suppose that r = 1 and µ_0 > 1, in which case M = xy^*. For simplicity, suppose that y is fixed, say y = (1, ..., 1), and x is chosen arbitrarily from the cube [1, \sqrt{\mu_0}]^n of R^n. One easily verifies that M obeys the coherence property with parameter µ_0 (and in fact also obeys the strong incoherence property with a comparable parameter). Then to recover M, we need to see at least one entry per row. For instance, if the first row is unsampled, one has no information about the first coordinate x_1 of x other than that it lies in [1, \sqrt{\mu_0}], and so the claim follows in this case by varying x_1 along the infinite set [1, \sqrt{\mu_0}].
Now under the Bernoulli model, the number of observed entries in the first row—and in any fixed row or column—is a binomial random variable with n trials and success probability p. Therefore, the probability π_0 that any given row is unsampled equals π_0 = (1 − p)^n. By independence, the probability that all rows are sampled at least once is (1 − π_0)^n, and any method succeeding with probability greater than 1 − δ would need

(1 − π_0)^n ≥ 1 − δ,

or −nπ_0 ≥ n log(1 − π_0) ≥ log(1 − δ). When δ < 1/2, log(1 − δ) ≥ −2δ, and thus any method would need

\pi_0 \le \frac{2\delta}{n}.

This is the desired conclusion when µ_0 > 1, r = 1.
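The counting argument can be checked numerically (a toy computation with our own choice of n and δ):

```python
import math

n, delta = 1000, 0.1

# Each row of an n x n matrix is missed with probability pi0 = (1-p)^n
# under the Bernoulli(p) model; all rows are hit with probability (1-pi0)^n.
def all_rows_hit(p):
    pi0 = (1.0 - p) ** n
    return (1.0 - pi0) ** n

# Find the smallest per-entry probability p (on a coarse grid) for which
# every row is sampled with probability at least 1 - delta.
p = 0.0
while all_rows_hit(p) < 1.0 - delta:
    p += 1e-4

pi0 = (1.0 - p) ** n
```

At the resulting p, the per-row miss probability indeed obeys π_0 ≤ 2δ/n, and the expected sample count m = p n² is on the order of n log(n/2δ), as (1.21) predicts.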
where the σ_k are drawn arbitrarily from [0, 1] (say), and the singular vectors u_1, ..., u_r are defined as follows:

u_k := \frac{1}{\sqrt{\ell}} \sum_{i \in B_k} e_i, \qquad B_k = \{(k-1)\ell + 1, (k-1)\ell + 2, \dots, k\ell\};
that is to say, uk vanishes everywhere except on a support of ` consecutive indices. Clearly, this
matrix is incoherent with parameter µ0 . Because the supports of the singular vectors are disjoint,
M is a block-diagonal matrix with diagonal blocks of size ` × `. We now argue as before. Recovery
with positive probability is impossible unless we have sampled at least one entry per row of each
diagonal block, since otherwise we would be forced to guess at least one of the σk based on no
information (other than that σk lies in [0, 1]), and the theorem will follow by varying this singular
value. Now the probability π_1 that the first row of the first block—and any fixed row of any fixed block—is unsampled is equal to (1 − p)^ℓ. Therefore, any method succeeding with probability greater than 1 − δ would need

(1 − π_1)^n ≥ 1 − δ,

which implies π_1 ≤ 2δ/n just as before. With π_1 = (1 − p)^ℓ, this gives (1.20) under the Bernoulli model. The second part of the theorem, namely (1.21), follows from the equivalent characterization

m \ge n^2 \left( 1 - e^{-\frac{\mu_0 r}{n} \log(n/2\delta)} \right).
3.1 Duality
We begin by recalling some calculations from [7, Section 3]. From standard duality theory, we know that the correct matrix M ∈ R^{n×n} is a solution to (1.3) if and only if there exists a dual certificate Y ∈ R^{n×n} with the property that P_Ω(Y) is a subgradient of the nuclear norm at M, which we write as

P_\Omega(Y) \in \partial \|M\|_*. (3.1)
\partial \|M\|_* = \left\{ E + W : W \in \mathbb{R}^{n \times n},\ P_U W = 0,\ W P_V = 0,\ \|W\| \le 1 \right\}. (3.2)
There is a more compact way to write (3.2). Let T ⊂ R^{n×n} be the span of matrices of the form u_k y^* and x v_k^*, and let T^⊥ be its orthogonal complement. Let P_T : R^{n×n} → T be the orthogonal projection onto T; one easily verifies the explicit formula. In particular, P_{T^⊥} is a contraction:

\|P_{T^\perp}\| \le 1. (3.5)

In this language, Y is a valid dual certificate when

(a) P_Ω(Y) = Y,
(b) P_T(Y) = E, and
(c) ‖P_{T^⊥}(Y)‖ < 1.
Theorem 3.2 (Rudelson selection estimate) [7, Theorem 4.1] Suppose Ω is sampled according to the Bernoulli model and put n := max(n_1, n_2). Assume that M obeys (1.18). Then there is a numerical constant C_R such that for all β > 1, we have the bound with probability at least 1 − 3n^{-β}, provided that a < 1, where a is the quantity

a := C_R \sqrt{ \frac{\mu_0\, n r\, (\beta \log n)}{m} }. (3.7)
m \ge C_0\, \mu_0\, n r \log n (3.8)

for a suitably large constant C_0. But this follows from the hypotheses in either Theorem 1.1 or Theorem 1.2, for reasons that we now pause to explain. In either of these theorems we have that the operator

P_T P_\Omega P_T : T \to T, \qquad X \mapsto P_T P_\Omega P_T(X)

is invertible, and we denote its inverse by (P_T P_Ω P_T)^{-1} : T → T. Introduce the dual matrix Y ∈ P_Ω(R^{n×n}) ⊂ R^{n×n} defined via

minimize ‖Z‖_F
subject to P_T P_Ω(Z) = E.
where I : R^{n×n} → R^{n×n} is the identity operator on matrices (not the identity matrix I ∈ R^{n×n}!). Note that with the Bernoulli model for selecting Ω, Q_Ω has expectation zero. From (3.12) we have P_T P_Ω P_T = p P_T(I + Q_Ω) P_T, and owing to Theorem 3.2 one can write (P_T P_Ω P_T)^{-1} as the convergent Neumann series

p\, (P_T P_\Omega P_T)^{-1} = \sum_{k \ge 0} (-1)^k (P_T Q_\Omega P_T)^k,

where we have used P_T^2 = P_T and P_T(E) = E. By the triangle inequality and (3.5), it thus suffices to show that

\sum_{k \ge 0} \|(Q_\Omega P_T)^k Q_\Omega(E)\| < 1.
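The Neumann-series inversion can be illustrated on a generic small operator (a toy numerical check, with a 6 × 6 matrix standing in for P_T Q_Ω P_T; the norm bound 0.4 is our choice):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
Q = rng.standard_normal((d, d))
Q *= 0.4 / np.linalg.norm(Q, 2)   # ensure spectral norm ||Q|| = 0.4 < 1

# Neumann series: (I + Q)^{-1} = sum_{k>=0} (-1)^k Q^k, convergent for ||Q|| < 1,
# mirroring p (P_T P_Omega P_T)^{-1} = sum_k (-1)^k (P_T Q_Omega P_T)^k.
I = np.eye(d)
approx = np.zeros((d, d))
term = I.copy()
for k in range(60):
    approx += (-1) ** k * term
    term = term @ Q               # term is now Q^{k+1}

exact = np.linalg.inv(I + Q)
err = np.linalg.norm(approx - exact, 2)
```

Truncating at 60 terms leaves an error of at most ‖Q‖^60/(1 − ‖Q‖), which is negligible here.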
Second, this theorem also bounds ‖Q_Ω P_T‖ (recall that this is the spectral norm), since
provided that a < 1/2. With p = m/n² and a defined by (3.7) with β = 4, we have

\sum_{k \ge k_0} \|(Q_\Omega P_T)^k Q_\Omega(E)\|_F \le \sqrt{n} \times O\left( \left( \frac{\mu_0\, n r \log n}{m} \right)^{\frac{k_0+1}{2}} \right)

with probability at least 1 − n^{-4}. When k_0 + 1 ≥ log n, n^{\frac{1}{k_0+1}} \le n^{\frac{1}{\log n}} = e, and thus for each such k_0,

\sum_{k \ge k_0} \|(Q_\Omega P_T)^k Q_\Omega(E)\|_F \le O\left( \left( \frac{\mu_0\, n r \log n}{m} \right)^{\frac{k_0+1}{2}} \right). (3.14)
3.4 Centering
We have already normalised P_Ω to have “mean zero” in some sense by replacing it with Q_Ω. We now perform a similar operation for the projection P_T : X ↦ P_U X + X P_V − P_U X P_V. The eigenvalues of P_T are centered around ρ′ := 2ρ − ρ², where ρ := r/n; this follows from the fact that P_T is an orthogonal projection onto a space of dimension 2nr − r². Therefore, we simply split P_T as

P_T = Q_T + \rho' I, (3.17)

so that the eigenvalues of Q_T are centered around zero. From now on, ρ and ρ′ will always be the numbers defined above.
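A quick numerical check of the centering value (a toy verification with our own small dimensions): the trace of the operator P_T equals dim T = 2nr − r², so its eigenvalue average over the n²-dimensional matrix space is (2nr − r²)/n² = 2ρ − ρ².

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 8, 3
# Random orthonormal bases for the r-dimensional column and row spaces.
U = np.linalg.qr(rng.standard_normal((n, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]
PU, PV = U @ U.T, V @ V.T

def P_T(X):
    """Orthogonal projection onto T (column space of U / row space of V)."""
    return PU @ X + X @ PV - PU @ X @ PV

# Trace of the operator P_T: sum of <e_ab, P_T(e_ab)> over the standard basis.
tr = 0.0
for a in range(n):
    for b in range(n):
        E = np.zeros((n, n)); E[a, b] = 1.0
        tr += P_T(E)[a, b]

rho = r / n
eig_avg = tr / n**2
```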
From (3.19) and the geometric series formula we obtain the corollary

\sum_{k=0}^{k_0 - 1} \|(Q_\Omega P_T)^k Q_\Omega(E)\| \le 5\sqrt{\sigma}\, \frac{1}{1 - 4\sqrt{\sigma}}. (3.20)

Let σ_0 be such that the right-hand side is less than 1/4, say. Applying this with σ = σ_0, we conclude that to prove (3.15) with probability at least 1 − n^{-3}/4, it suffices by the union bound to show that (3.18) holds for this value of σ. (Note that the hypothesis 8nr/m < σ^{3/2} follows from the hypotheses in either Theorem 1.1 or Theorem 1.2.)
Lemma 3.3, which is proven in the Appendix, is useful because the operator Q_T is easier to work with than P_T, in the sense that it is more homogeneous and obeys better estimates. If we split the projections P_U, P_V as

P_U = \rho I + Q_U, \qquad P_V = \rho I + Q_V, (3.21)

then Q_T obeys

Q_T(X) = (1-\rho) Q_U X + (1-\rho) X Q_V - Q_U X Q_V.

Let U_{a,a'}, V_{b,b'} denote the matrix elements of Q_U, Q_V:

c_{ab,a'b'} := \langle e_a e_b^*, Q_T(e_{a'} e_{b'}^*) \rangle = (1-\rho)\, 1_{b=b'}\, U_{a,a'} + (1-\rho)\, 1_{a=a'}\, V_{b,b'} - U_{a,a'} V_{b,b'}. (3.23)
Theorem 3.4 (Moment bound I) Set A = (Q_Ω Q_T)^k Q_Ω(E) for a fixed k ≥ 0. Under the assumptions of Theorem 1.1, we have that for each j > 0,

\mathbb{E}\, \mathrm{trace}(A^* A)^j = O\big( j(k+1) \big)^{2j(k+1)} \left( \frac{n r_\mu}{m} \right)^{j(k+1)} n, \qquad r_\mu := \mu^2 r, (3.25)

provided that m ≥ n r_µ and n ≥ c_0\, j(k+1) for some numerical constant c_0.
By Markov’s inequality, this result automatically estimates the norm of (QΩ QT )k QΩ (E) and im-
mediately gives the following corollary.
Corollary 3.5 (Existence of dual certificate I) Under the assumptions of Theorem 1.1, the matrix Y (3.10) is a dual certificate and obeys ‖P_{T^⊥}(Y)‖ ≤ 1/2 with probability at least 1 − n^{-3}, provided that m obeys (1.10).
Proof. Set A = (Q_Ω Q_T)^k Q_Ω(E) with k ≤ log n, and set σ ≤ σ_0. By Markov's inequality,

\mathbb{P}\left( \|A\| \ge \sigma^{\frac{k+1}{2}} \right) \le \frac{\mathbb{E}\, \|A\|^{2j}}{\sigma^{j(k+1)}}.
Now choose j > 0 to be the smallest integer such that j(k + 1) ≥ log n. Since
for some
\gamma = O\left( \frac{(j(k+1))^2\, n r_\mu}{a m} \right)
where we have used the fact that n^{\frac{1}{j(k+1)}} \le n^{\frac{1}{\log n}} = e. Hence, if
Therefore, the union

\bigcup_{0 \le k < \log n} \left\{ \|(Q_\Omega Q_T)^k Q_\Omega(E)\| \ge a^{\frac{k+1}{2}} \right\}
Theorem 3.6 (Moment bound II) Set A = (Q_Ω Q_T)^k Q_Ω(E) for a fixed k ≥ 0. Under the assumptions of Theorem 1.2, we have that for each j > 0 (r_µ is given in (3.25)),

\mathbb{E}\, \mathrm{trace}(A^* A)^j \le \left( \frac{(j(k+1))^6\, n r_\mu}{m} \right)^{j(k+1)}, (3.27)

provided that n ≥ c_0\, j(k+1) for some numerical constant c_0.
3.6 Novelty
As explained earlier, this paper derives near-optimal sampling results which are stronger than those in [7]. One of the reasons underlying this improvement is that we use completely different techniques. In detail, [7] constructs the dual certificate (3.10) and proceeds by showing that ‖P_{T^⊥}(Y)‖ < 1, by bounding each term in the series \sum_{k \ge 0} \|(Q_\Omega P_T)^k Q_\Omega(E)\|. Further, to prove that the early terms (small values of k) are appropriately small, the authors employ a sophisticated array of tools from asymptotic geometric analysis, including noncommutative Khintchine inequalities [16], decoupling techniques of Bourgain and Tzafriri and of de la Peña [10], and large deviations inequalities [14]. They bound each term individually up to k = 4 and use the same argument as that in Section 3.3 to bound the rest of the series. Since the tail starts at k_0 = 5, this gives that a sufficient condition is that the number of samples exceeds a constant times µ_0 n^{6/5} r log n. Bounding each term ‖(Q_Ω P_T)^k Q_Ω(E)‖ with the tools put forth in [7] for larger values of k becomes increasingly delicate because of the coupling between the indicator variables defining the random set Ω. In addition, the noncommutative Khintchine inequality seems less effective in higher dimensions, that is, for large values of k. Informally speaking, the reason seems to be that the types of random sums that appear in the moments (Q_Ω P_T)^k Q_Ω(E) for large k involve complicated combinations of the coefficients of P_T that are not simply components of some product matrix, and which do not simplify substantially after a direct application of the Khintchine inequality.
In this paper, we use a very different strategy to estimate the spectral norm of (Q_Ω Q_T)^k Q_Ω(E), and employ moment methods, which have a long history in random matrix theory, dating back at
(the largest element dominates the sum). We then need to compute the expectation of the right-hand side, and reduce matters to a purely combinatorial question involving the statistics of various types of paths in a plane. It is rather remarkable that carrying out these combinatorial calculations nearly gives the quantitatively correct answer; the moment method seems to come close to the ultimate limit of performance one can expect from nuclear-norm minimization.
As we shall shortly see, the expression trace(A∗ A)j expands as a sum over “paths” of products
of various coefficients of the operators QΩ , QT and the matrix E. These paths can be viewed as
complicated variants of Dyck paths. However, it does not seem that one can simply invoke standard
moment method calculations in the literature to compute this sum, as in order to obtain efficient
bounds, we will need to take full advantage of identities such as PT PT = PT (which capture certain
cancellation properties of the coefficients of PT or QT ) to simplify various components of this sum.
It is only after performing such simplifications that one can afford to estimate all the coefficients
by absolute values and count paths to conclude the argument.
4 Moments
Let j ≥ 0 be a fixed integer. The goal of this section is to develop a formula for
This will clearly be of use in the proofs of the moment bounds (Theorems 3.4, 3.6).
First, write $A$ in coefficients as
$$A = \sum_{a,b \in [n]} A_{ab}\, e_{ab}$$
for some scalars $A_{ab}$, where $e_{ab}$ is the standard basis for the $n \times n$ matrices and $A_{ab}$ is the $(a, b)$th entry of $A$. Then
$$\operatorname{trace}(A^* A)^j = \sum_{\substack{a_1, \dots, a_j \in [n] \\ b_1, \dots, b_j \in [n]}} \prod_{i \in [j]} A_{a_i b_i}\, A_{a_{i+1} b_i},$$
with the cyclic convention $a_{j+1} := a_1$. Equivalently,
$$\operatorname{trace}(A^* A)^j = \sum \prod_{i \in [j]} \prod_{\mu=0}^{1} A_{a_{i,\mu} b_{i,\mu}}, \qquad (4.2)$$
where the sum is over all $a_{i,\mu}, b_{i,\mu} \in [n]$ for $i \in [j]$, $\mu \in \{0, 1\}$ obeying the compatibility conditions
$$a_{i,1} = a_{i+1,0}, \qquad b_{i,1} = b_{i,0} \qquad (i \in [j]),$$
with the cyclic convention $a_{j+1,0} := a_{1,0}$.
Example. If $j = 2$, then
$$\operatorname{trace}(A^* A)^2 = \sum_{a_1, a_2, b_1, b_2 \in [n]} A_{a_1 b_1}\, A_{a_2 b_1}\, A_{a_2 b_2}\, A_{a_1 b_2},$$
or equivalently as
$$\sum \prod_{i=1}^{2} \prod_{\mu=0}^{1} A_{a_{i,\mu} b_{i,\mu}},$$
where the sum is over all $a_{1,0}, a_{1,1}, a_{2,0}, a_{2,1}, b_{1,0}, b_{1,1}, b_{2,0}, b_{2,1} \in [n]$ obeying the compatibility conditions
$$a_{1,1} = a_{2,0}; \qquad a_{2,1} = a_{1,0}; \qquad b_{1,1} = b_{1,0}; \qquad b_{2,1} = b_{2,0}.$$
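For real matrices (so that $A^* = A^\top$), the $j = 2$ case of this expansion is easy to sanity-check numerically. The following sketch is an illustration only, not part of the proof; the random matrix is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))

# operator side: trace((A^T A)^2)
lhs = np.trace(np.linalg.matrix_power(A.T @ A, 2))

# path-sum side: sum over a1, a2, b1, b2 of A[a1,b1] A[a2,b1] A[a2,b2] A[a1,b2]
rhs = 0.0
for a1 in range(n):
    for a2 in range(n):
        for b1 in range(n):
            for b2 in range(n):
                rhs += A[a1, b1] * A[a2, b1] * A[a2, b2] * A[a1, b2]

assert abs(lhs - rhs) < 1e-9
```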
Remark. The sum in (4.2) can be viewed as over all closed paths of length 2j in [n] × [n],
where the edges of the paths alternate between “horizontal rook moves” and “vertical rook moves”
respectively; see Figure 1.
Second, write $Q_T$ and $Q_\Omega$ in coefficients as
$$Q_T(e_{a_0 b_0}) = \sum_{ab} c_{ab,\, a_0 b_0}\, e_{ab}, \qquad Q_\Omega(e_{ab}) = \xi_{ab}\, e_{ab}.$$
We then have
$$A_{a_0, b_0} := \sum_{a_1, b_1, \dots, a_k, b_k \in [n]} \Big[\prod_{l \in [k]} c_{a_{l-1} b_{l-1},\, a_l b_l}\Big] \Big[\prod_{l=0}^{k} \xi_{a_l b_l}\Big]\, E_{a_k b_k} \qquad (4.3)$$
for any $a_0, b_0 \in [n]$. Note that this formula is even valid in the base case $k = 0$, where it simplifies to just $A_{a_0 b_0} = \xi_{a_0 b_0} E_{a_0 b_0}$ due to our conventions on trivial sums and empty products.
Example. If $k = 2$, then
$$A_{a_0, b_0} = \sum_{a_1, a_2, b_1, b_2 \in [n]} \xi_{a_0 b_0}\, c_{a_0 b_0,\, a_1 b_1}\, \xi_{a_1 b_1}\, c_{a_1 b_1,\, a_2 b_2}\, \xi_{a_2 b_2}\, E_{a_2 b_2}.$$
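The coefficient formula (4.3) can be checked against a direct operator computation. In the sketch below, $Q_\Omega$ acts entrywise through an array ξ and $Q_T$ is an arbitrary linear map with coefficient tensor c; the particular random values are illustrative only, since only linearity is used:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 3, 2
c = rng.standard_normal((n, n, n, n))   # c[a, b, a', b']: coefficients of Q_T
xi = rng.standard_normal((n, n))        # Q_Omega acts entrywise by xi
E = rng.standard_normal((n, n))

def QT(M):
    # (Q_T M)_{ab} = sum_{a', b'} c[a, b, a', b'] M[a', b']
    return np.einsum('abcd,cd->ab', c, M)

def QO(M):
    return xi * M

# operator form: A = (Q_Omega Q_T)^k Q_Omega (E)
A = QO(E)
for _ in range(k):
    A = QO(QT(A))

# coefficient form (4.3) at a single entry (a0, b0), k = 2
a0, b0 = 0, 1
s = 0.0
for a1 in range(n):
    for b1 in range(n):
        for a2 in range(n):
            for b2 in range(n):
                s += (xi[a0, b0] * c[a0, b0, a1, b1] * xi[a1, b1]
                      * c[a1, b1, a2, b2] * xi[a2, b2] * E[a2, b2])

assert abs(A[a0, b0] - s) < 1e-9
```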
Remark. One can view the right-hand side of (4.3) as the sum over paths of length k + 1 in
[n] × [n] starting at the designated point (a0 , b0 ) and ending at some arbitrary point (ak , bk ). Each
edge (from $(a_i, b_i)$ to $(a_{i+1}, b_{i+1})$) may be a horizontal or vertical "rook move" (in that at least one of the a or b coordinates does not change), or a "non-rook move" in which both the a and b
coordinates change. It will be important later on to keep track of which edges are rook moves and
which ones are not, basically because of the presence of the delta functions 1a=a0 , 1b=b0 in (3.23).
Each edge in this path is weighted by a c factor, and each vertex in the path is weighted by a ξ
factor, with the final vertex also weighted by an additional E factor. It is important to note that
the path is allowed to cross itself, in which case weights such as $\xi^2, \xi^3$, etc. may appear; see Figure 2.
Inserting (4.3) into (4.2), we see that X can thus be expanded as
$$X = \sum_{*} \mathbb{E} \prod_{i \in [j]} \prod_{\mu=0}^{1} \Big[\prod_{l \in [k]} c_{a_{i,\mu,l-1} b_{i,\mu,l-1},\, a_{i,\mu,l} b_{i,\mu,l}}\Big] \Big[\prod_{l=0}^{k} \xi_{a_{i,\mu,l} b_{i,\mu,l}}\Big]\, E_{a_{i,\mu,k} b_{i,\mu,k}}, \qquad (4.4)$$
where the sum $\sum_*$ is over all combinations of $a_{i,\mu,l}, b_{i,\mu,l} \in [n]$ for $i \in [j]$, $\mu \in \{0, 1\}$ and $0 \le l \le k$ obeying the compatibility conditions.
Note that despite the small values of j and k, this is already a rather complicated sum, ranging over $n^{2j(2k+1)} = n^{20}$ summands, each of which is the product of $4j(k+1) = 24$ terms.
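The quoted counts determine the (implicit) example parameters: $2j(2k+1) = 20$ and $4j(k+1) = 24$ force $j = k = 2$, which one can confirm with a line of arithmetic:

```python
# the counts quoted in the text correspond to j = k = 2
j, k = 2, 2
summand_exponent = 2 * j * (2 * k + 1)   # number of summands is n**summand_exponent
terms_per_summand = 4 * j * (k + 1)      # factors appearing in each summand

assert summand_exponent == 20
assert terms_per_summand == 24
```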
Remark. The expansion (4.4) is the sum over a sort of combinatorial “spider”, whose “body”
is a closed path of length 2j in [n] × [n] of alternating horizontal and vertical rook moves, and
whose 2j “legs” are paths of length k, emanating out of each vertex of the body. The various
“segments” of the legs (which can be either rook or non-rook moves) acquire a weight of c, and
the “joints” of the legs acquire a weight of ξ, with an additional weight of E at the tip of each leg.
To complicate things further, it is certainly possible for a vertex of one leg to overlap with another
vertex from either the same leg or a different leg, introducing weights such as ξ 2 , ξ 3 , etc.; see Figure
3. As one can see, the set of possible configurations that this “spider” can be in is rather large and
complicated.
$$X = \sum_{(s,t)} \sum_{\alpha,\beta} \mathbb{E} \prod_{i \in [j]} \prod_{\mu=0}^{1} \Big[\prod_{l \in [k]} c_{\alpha(s_{i,\mu,l-1})\beta(t_{i,\mu,l-1}),\, \alpha(s_{i,\mu,l})\beta(t_{i,\mu,l})}\Big] \Big[\prod_{l=0}^{k} \xi_{\alpha(s_{i,\mu,l})\beta(t_{i,\mu,l})}\Big]\, E_{\alpha(s_{i,\mu,k})\beta(t_{i,\mu,k})},$$
where the outer sum is over all admissible pairs (s, t), and the inner sum is over all injections.
Remark. As with preceding identities, the above formula is also valid when $k = 0$ (with our conventions on trivial sums and empty products), in which case it simplifies to
$$X = \sum_{(s,t)} \sum_{\alpha,\beta} \mathbb{E} \prod_{i \in [j]} \prod_{\mu=0}^{1} \xi_{\alpha(s_{i,\mu,0})\beta(t_{i,\mu,0})}\, E_{\alpha(s_{i,\mu,0})\beta(t_{i,\mu,0})}.$$
and hence
$$\Big|\mathbb{E}\Big(\frac{1}{p}\delta - 1\Big)^s\Big| \le p^{1-s}.$$
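The moment bound above can be verified exactly, since $\frac{1}{p}\delta - 1$ takes only two values; a small check over a range of $p$ and $s$:

```python
# xi = delta/p - 1 with delta ~ Bernoulli(p) takes the value (1-p)/p with
# probability p, and -1 with probability 1-p, so its s-th moment is exact.
ratios = []
for p in (0.05, 0.3, 0.5, 0.9):
    for s in range(1, 9):
        moment = p * ((1 - p) / p) ** s + (1 - p) * (-1.0) ** s
        ratios.append(abs(moment) / p ** (1 - s))

max_ratio = max(ratios)
assert max_ratio <= 1 + 1e-12   # |E(delta/p - 1)^s| <= p^(1-s)
```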
The value of the expectation $\mathbb{E}\,\Xi$ does not depend on the choice of $\alpha$ or $\beta$, and the calculation above shows that $\Xi$ obeys
$$|\mathbb{E}\,\Xi| \le (1/p)^{2j(k+1) - |\Omega|},$$
where
$$\Omega := \{(s_{i,\mu,l}, t_{i,\mu,l}) : (i,\mu,l) \in [j] \times \{0,1\} \times \{0,\dots,k\}\} \subset J \times K. \qquad (4.9)$$
Applying this estimate and the triangle inequality, we can thus bound X by
$$X \le \sum_{(s,t)\ \text{strongly admissible}} (1/p)^{2j(k+1) - |\Omega|} \sum_{\alpha,\beta} \prod_{i \in [j]} \prod_{\mu=0}^{1} \Big[\prod_{l \in [k]} c_{\alpha(s_{i,\mu,l-1})\beta(t_{i,\mu,l-1}),\, \alpha(s_{i,\mu,l})\beta(t_{i,\mu,l})}\Big]\, E_{\alpha(s_{i,\mu,k})\beta(t_{i,\mu,k})}, \qquad (4.10)$$
where the sum is over those admissible (s, t) such that each element of Ω is visited at least twice
by the sequence (si,µ,l , ti,µ,l ); we shall call such (s, t) strongly admissible. We will use the bound
(4.10) as a starting point for proving the moment estimates (3.25) and (3.27).
Example. The pair (s, t) in the Example in Section 4.2 is admissible but not strongly admissible, because not every element of the set Ω (which, in this example, is {(1, 1), (2, 2), (3, 1), (2, 3), (3, 4), (1, 2), (1, 4)}) is visited twice by (s, t).
Remark. Once again, the formula (4.10) is valid when k = 0, with the usual conventions on
empty products (in particular, the factor involving the c coefficients can be deleted in this case).
where we recall that $r_\mu = \mu^2 r$, and Q is the set of all $(i,\mu,l) \in [j] \times \{0,1\} \times [k]$ such that $s_{i,\mu,l-1} \neq s_{i,\mu,l}$ and $t_{i,\mu,l-1} \neq t_{i,\mu,l}$. Thinking of the sequence $\{(s_{i,\mu,l}, t_{i,\mu,l})\}$ as a path in $J \times K$, we have that $(i,\mu,l) \in Q$ if and only if the move from $(s_{i,\mu,l-1}, t_{i,\mu,l-1})$ to $(s_{i,\mu,l}, t_{i,\mu,l})$ is neither horizontal nor vertical; per our earlier discussion, this is a "non-rook" move.
Example. The example in Section 4.2 is admissible, but not strongly admissible. Nevertheless,
the above definitions can still be applied, and we see that Q = {(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)}
in this case, because all of the four associated moves are non-rook moves.
As the number of injections $\alpha$, $\beta$ is at most $n^{|J|}$, $n^{|K|}$ respectively, we thus have
$$X \le O(1)^{j(k+1)} \sum_{(s,t)\ \text{str. admiss.}} (1/p)^{2j(k+1) - |\Omega|}\, n^{|J|+|K|}\, \big(\sqrt{r_\mu}/n\big)^{2jk + |Q| + 2j}.$$
Since (s, t) is strongly admissible and every point in Ω needs to be visited at least twice, we see
that
|Ω| ≤ j(k + 1).
Also, since Q ⊂ [j] × {0, 1} × [k], we have the trivial bound
|Q| ≤ 2jk.
Remark. In the case where $k = 0$, in which $Q = \emptyset$, one can easily obtain a better estimate, namely (if $np \ge r_\mu$)
$$X \le \sum_{(s,t)\ \text{str. admiss.}} O\Big(\frac{r_\mu}{np}\Big)^{j} n^{|J| + |K| - |\Omega|}.$$
(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)
are all recycled (because they either reuse an existing value of s or t or both), while the triple
(1, 1, 1) is totally recycled (it visits the same location as the earlier triple (0, 0, 1)). Thus in this
case, we have $Q' = \{(0, 1, 1), (1, 0, 1), (1, 1, 1)\}$.
We observe that if (i, µ, l) ∈ [j] × {0, 1} × [k] is not recycled, then it must have been reached
from (i, µ, l − 1) by a non-rook move, and thus (i, µ, l) lies in Q.
Lemma 5.1 (Exponent bound) For any admissible tuple, we have $|J| + |K| - |Q| - |\Omega| \le -|Q'| + 1$.
Proof We let $(i,\mu,l)$ increase from $(1,0,0)$ to $(j,1,k)$ and see how each $(i,\mu,l)$ influences the quantity $|J| + |K| - |Q \setminus Q'| - |\Omega|$.
Firstly, we see that the triple $(1,0,0)$ initialises $|J|, |K|, |\Omega| = 1$ and $|Q \setminus Q'| = 0$, so $|J| + |K| - |Q \setminus Q'| - |\Omega| = 1$ at this initial stage. Now we see how each subsequent $(i,\mu,l)$ adjusts this quantity.
If $(i,\mu,l)$ is totally recycled, then $J, K, \Omega, Q \setminus Q'$ are unchanged by the addition of $(i,\mu,l)$, and so $|J| + |K| - |Q \setminus Q'| - |\Omega|$ does not change.
If $(i,\mu,l)$ is recycled but not totally recycled, then one of $J, K$ increases in size by at most one, as does $\Omega$, but the other set of $J, K$ remains unchanged, as does $Q \setminus Q'$, and so $|J| + |K| - |Q \setminus Q'| - |\Omega|$ does not increase.
If $(i,\mu,l)$ is not recycled at all, then (by (4.6)) we must have $l > 0$, and then (by definition of $Q, Q'$) we have $(i,\mu,l) \in Q \setminus Q'$, and so $|Q \setminus Q'|$ and $|\Omega|$ both increase by one. Meanwhile, $|J|$ and $|K|$ increase by 1, and so $|J| + |K| - |Q \setminus Q'| - |\Omega|$ does not change. Putting all this together we obtain the claim.
This lemma gives
$$X \le \sum_{(s,t)\ \text{str. admiss.}} O\Big(\frac{r_\mu^2}{np}\Big)^{j(k+1)} n^{-|Q'| + 1}.$$
To estimate the above sum, we need to count strongly admissible pairs. This is achieved by the
following lemma.
Lemma 5.2 (Pair counting) For fixed $q \ge 0$, the number of strongly admissible pairs (s, t) with $|Q'| = q$ is at most $O(j(k+1))^{2j(k+1)+q}$.
Under the assumption n ≥ c0 j(k + 1) for some numerical constant c0 , we can sum the series and
obtain Theorem 3.4.
Remark. When $k = 0$, we have the better bound
$$X \le O(j)^{2j}\, n\, \Big(\frac{r_\mu}{np}\Big)^{j}.$$
after one writes the projection identity $P_U^2 = P_U$ in terms of $Q_U$ using (3.21), and similarly for the second identity.
In a similar vein, we also have the identities
$$\sum_{a'} U_{a,a'}\, E_{a',b} = (1 - \rho)\, E_{a,b} = \sum_{b'} E_{a,b'}\, V_{b',b}, \qquad (6.3)$$
which simply come from $Q_U E = P_U E - \rho E = (1-\rho)E$ together with $E Q_V = E P_V - \rho E = (1-\rho)E$.
Finally, we observe the two equalities
$$\sum_{b} E_{a,b}\, E_{a',b} = U_{a,a'} + \rho\, 1_{a=a'}, \qquad \sum_{a} E_{a,b}\, E_{a,b'} = V_{b,b'} + \rho\, 1_{b=b'}. \qquad (6.4)$$
The first identity follows from the fact that $\sum_b E_{a,b} E_{a',b}$ is the $(a,a')$th element of $EE^* = P_U = Q_U + \rho I$, and the second one similarly follows from the identity $E^*E = P_V = Q_V + \rho I$.
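These identities follow from $EE^* = P_U$ and $E^*E = P_V$ alone, so they can be checked numerically for any orthonormal factors; a sketch with illustrative dimensions $n = 8$, $r = 3$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 8, 3
rho = r / n

# orthonormal singular-vector bases (reduced QR gives n x r with orthonormal columns)
Uhat = np.linalg.qr(rng.standard_normal((n, r)))[0]
Vhat = np.linalg.qr(rng.standard_normal((n, r)))[0]

E = Uhat @ Vhat.T                      # E E^T = P_U and E^T E = P_V
U = Uhat @ Uhat.T - rho * np.eye(n)    # coefficients of Q_U = P_U - rho I
V = Vhat @ Vhat.T - rho * np.eye(n)    # coefficients of Q_V = P_V - rho I

# (6.3): sum_a' U[a,a'] E[a',b] = (1 - rho) E[a,b] = sum_b' E[a,b'] V[b',b]
assert np.allclose(U @ E, (1 - rho) * E)
assert np.allclose(E @ V, (1 - rho) * E)

# (6.4): sum_b E[a,b] E[a',b] = U[a,a'] + rho 1_{a=a'}, and its column analogue
assert np.allclose(E @ E.T, U + rho * np.eye(n))
assert np.allclose(E.T @ E, V + rho * np.eye(n))
```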
$$X \le (1 - \rho)^{2j(k+1) - |L_U \cap L_V|} \sum_{(s,t,L_U,L_V)} (1/p)^{2j(k+1) - |\Omega|}\, |X_{s,t,L_U,L_V}|,$$
where the sum ranges over all strongly admissible quadruplets, and
$$X_{s,t,L_U,L_V} := \sum_{\alpha,\beta} \Big[\prod_{i \in [j]} \prod_{\mu=0}^{1} E_{\alpha(s_{i,\mu,k})\beta(t_{i,\mu,k})}\Big] \Big[\prod_{(i,\mu,l) \in L_U} U_{\alpha(s_{i,\mu,l-1}),\,\alpha(s_{i,\mu,l})}\Big] \Big[\prod_{(i,\mu,l) \in L_V} V_{\beta(t_{i,\mu,l-1}),\,\beta(t_{i,\mu,l})}\Big].$$
Remark. A strongly admissible quadruplet can be viewed as the configuration of a “spider” with
several additional constraints. Firstly, the spider must visit each of its vertices at least twice (strong
admissibility). When (i, µ, l) ∈ [j] × {0, 1} × [k] lies out of LU , then only horizontal rook moves are
allowed when reaching (i, µ, l) from (i, µ, l − 1); similarly, when (i, µ, l) lies out of LV , then only
vertical rook moves are allowed from (i, µ, l − 1) to (i, µ, l). In particular, non-rook moves are only
allowed inside $L_U \cap L_V$; in the notation of the previous section, we have $Q \subset L_U \cap L_V$. Note though that while one is allowed to execute a non-rook move inside $L_U \cap L_V$, it is not mandatory; it could still be that $(s_{i,\mu,l-1}, t_{i,\mu,l-1})$ shares a common row or column (or even both) with $(s_{i,\mu,l}, t_{i,\mu,l})$.
We claim the following fundamental bound on the summand |Xs,t,LU ,LV |:
Proposition 6.1 (Summand bound) Let $(s, t, L_U, L_V)$ be a strongly admissible quadruplet. Then we have
$$|X_{s,t,L_U,L_V}| \le O(j(k+1))^{2j(k+1)}\, (r/n)^{2j(k+1) - |\Omega|}\, n,$$
and since $|\Omega| \le j(k+1)$ (by strong admissibility) and $r \le np$, and the number of $(s, t, L_U, L_V)$ can be crudely bounded by $O(j(k+1))^{4j(k+1)}$, this gives (3.27) as desired. The bound on the number of quadruplets follows from the fact that there are at most $O(j(k+1))^{4j(k+1)}$ strongly admissible pairs and that the number of $(L_U, L_V)$ per pair is at most $O(1)^{j(k+1)}$.
Remark. It seems clear that the exponent 6 can be lowered by a finer analysis, for instance
by using counting bounds such as Lemma 5.2. However, substantial effort seems to be required in
order to obtain the optimal exponent of 1 here.
• An integer j ≥ 1, and a map k : [j] × {0, 1} → {0, 1, 2, . . .}, generating a set Γ := {(i, µ, l) :
i ∈ [j], µ ∈ {0, 1}, 0 ≤ l ≤ k(i, µ)};
• Sets $L_U, L_V$ with
$$L_U \cup L_V = \Gamma_+ := \{(i, \mu, l) \in \Gamma : l > 0\}$$
and such that $s_{i,\mu,l-1} = s_{i,\mu,l}$ whenever $(i,\mu,l) \in \Gamma_+ \setminus L_U$, and $t_{i,\mu,l-1} = t_{i,\mu,l}$ whenever $(i,\mu,l) \in \Gamma_+ \setminus L_V$.
Remark. Note we do not require configurations to be strongly admissible, although for our
application to Proposition 6.1 strong admissibility is required. Similarly, we no longer require that
the segments (4.7) be initial segments. This removal of hypotheses will give us a convenient amount
of flexibility in a certain induction argument that we shall perform shortly. One can think of a
configuration as describing a “generalized spider” whose legs are allowed to be of unequal length,
but for which certain of the segments (indicated by the sets LU , LV ) are required to be horizontal
or vertical. The freedom to extend or shorten the legs of the spider separately will be of importance
when we use the identities (6.1), (6.3), (6.4) to simplify the expression Xs,t,LU ,LV , see Figure 4.
$$X_C := \sum_{\alpha,\beta} \Big[\prod_{i \in [j]} \prod_{\mu=0}^{1} E_{\alpha(s(i,\mu,k(i,\mu)))\,\beta(t(i,\mu,k(i,\mu)))}\Big] \Big[\prod_{(i,\mu,l) \in L_U} U_{\alpha(s(i,\mu,l-1)),\,\alpha(s(i,\mu,l))}\Big] \Big[\prod_{(i,\mu,l) \in L_V} V_{\beta(t(i,\mu,l-1)),\,\beta(t(i,\mu,l))}\Big], \qquad (6.5)$$
where α : J → [n], β : K → [n] range over all injections. To prove Proposition 6.1, it then suffices
to show that
$$|X_C| \le \big(C_0(1 + |J| + |K|)\big)^{|J|+|K|}\, (r_\mu/n)^{|\Gamma| - |\Omega|}\, n \qquad (6.6)$$
for some absolute constant $C_0 > 0$, where
since Proposition 6.1 then follows from the special case in which k(i, µ) = k is constant and (s, t)
is strongly admissible, in which case we have
• $j' := j$, $J' := J$, and $K' := K$.
• $(s'_{i,\mu,l}, t'_{i,\mu,l}) := (s_{i,\mu,l}, t_{i,\mu,l})$ whenever $(i,\mu) \neq (i_0, \mu_0)$, or when $(i,\mu) = (i_0, \mu_0)$ and $l < l_0$.
• $(s'_{i_0,\mu_0,l_0}, t'_{i_0,\mu_0,l_0}) := (s_{i_0,\mu_0,l_0-1},\, t_{i_0,\mu_0,l_0})$.
• We have
and
6.3.2 Second case: a low multiplicity row or column, no unguarded non-rook moves
Next, given any x ∈ J, define the row multiplicity τx to be
Remark. Informally, $\tau_x$ measures the number of times $\alpha(x)$ appears in (6.5), and similarly for $\tau_y$ and $\beta(y)$. Alternatively, one can think of $\tau_x$ as counting the number of times the spider has the opportunity to "enter" and "exit" the row $s = x$, and similarly $\tau_y$ measures the number of opportunities to enter or exit the column $t = y$.
By surjectivity we know that $\tau_x, \tau_y$ are strictly positive for each $x \in J$, $y \in K$. We also observe that $\tau_x, \tau_y$ must be even. To see this, write
$$\tau_x = \sum_{(i,\mu,l) \in L_U} \big(1_{s(i,\mu,l)=x} + 1_{s(i,\mu,l-1)=x}\big) + \sum_{(i,\mu) \in [j] \times \{0,1\}} 1_{s(i,\mu,k(i,\mu))=x}.$$
Now observe that if $(i,\mu,l) \in \Gamma_+ \setminus L_U$, then $1_{s(i,\mu,l)=x} = 1_{s(i,\mu,l-1)=x}$. Thus we have
$$\tau_x = \sum_{(i,\mu,l) \in \Gamma_+} \big(1_{s(i,\mu,l)=x} + 1_{s(i,\mu,l-1)=x}\big) + \sum_{(i,\mu) \in [j] \times \{0,1\}} 1_{s(i,\mu,k(i,\mu))=x} \pmod 2,$$
Figure 6: In (a), a multiplicity 2 row is shown. After using the identity (6.1), the contribution
of this configuration is replaced with a number of terms one of which is shown in (b), in which
the x row is deleted and replaced with another existing row x̃.
and the right-hand side vanishes by (4.6), showing that $\tau_x$ is even, and similarly $\tau_y$ is even.
In this subsection, we dispose of the case of a low-multiplicity row, or more precisely when $\tau_x = 2$ for some $x \in J$. By symmetry, the argument will also dispose of the case of a low-multiplicity column, when $\tau_y = 2$ for some $y \in K$.
Suppose that $\tau_x = 2$ for some $x \in J$. We first remark that this implies that there does not exist $(i,\mu,l) \in L_U$ with $s(i,\mu,l) = s(i,\mu,l-1) = x$. We argue by contradiction and define $l_\star$ to be the first integer larger than $l$ for which $(i,\mu,l_\star) \in L_U$. First, suppose that $l_\star$ does not exist (which, for instance, happens when $l = k(i,\mu)$). Then in this case it is not hard to see that $s(i,\mu,k(i,\mu)) = x$, since for $(i,\mu,l') \notin L_U$ we have $s(i,\mu,l') = s(i,\mu,l'-1)$. In this case, $\tau_x$ exceeds 2. Else, $l_\star$ does exist, but then $s(i,\mu,l_\star - 1) = x$, since $s(i,\mu,l') = s(i,\mu,l'-1)$ for $l < l' < l_\star$. Again, $\tau_x$ exceeds 2 and this is a contradiction. Thus, if $(i,\mu,l) \in L_U$ and $s(i,\mu,l) = x$, then $s(i,\mu,l-1) \neq x$, and similarly if $(i,\mu,l) \in L_U$ and $s(i,\mu,l-1) = x$, then $s(i,\mu,l) \neq x$.
Now let us look at the terms in (6.5) which involve $\alpha(x)$. Since $\tau_x = 2$, there are only two such terms, and each of the terms is either of the form $U_{\alpha(x),\alpha(x')}$ or $E_{\alpha(x),\beta(y)}$ for some $y \in K$ or $x' \in J \setminus \{x\}$. We now have to divide into three subcases.
Subcase 1: (6.5) contains two terms $U_{\alpha(x),\alpha(x')}$, $U_{\alpha(x),\alpha(x'')}$. See Figure 6(a) for a typical configuration in which this is the case.
The idea is to use the identity (6.1) to "delete" the row $x$, thus reducing $|J| + |K|$ and allowing us to use an induction hypothesis. Accordingly, let us define $\tilde J := J \setminus \{x\}$, and let $\tilde\alpha : \tilde J \to [n]$ be the restriction of $\alpha$ to $\tilde J$. We also write $a := \alpha(x)$ for the deleted row $a$.
We now isolate the two terms $U_{\alpha(x),\alpha(x')}$, $U_{\alpha(x),\alpha(x'')}$ from the rest of (6.5), expressing this sum as
$$\sum_{\tilde\alpha,\beta} \big[\dots\big] \Big[\sum_{a \in [n] \setminus \tilde\alpha(\tilde J)} U_{a,\tilde\alpha(x')}\, U_{a,\tilde\alpha(x'')}\Big]$$
where the $\dots$ denotes the product of all the terms in (6.5) other than $U_{\alpha(x),\alpha(x')}$ and $U_{\alpha(x),\alpha(x'')}$, but with $\alpha$ replaced by $\tilde\alpha$, and $\tilde\alpha, \beta$ ranging over injections from $\tilde J$ and $K$ to $[n]$ respectively.
From (6.1) we have
$$\sum_{a \in [n]} U_{a,\tilde\alpha(x')}\, U_{a,\tilde\alpha(x'')} = (1 - 2\rho)\, U_{\tilde\alpha(x'),\tilde\alpha(x'')} + \rho(1-\rho)\, 1_{x'=x''}$$
and thus
$$\sum_{a \in [n] \setminus \tilde\alpha(\tilde J)} U_{a,\tilde\alpha(x')}\, U_{a,\tilde\alpha(x'')} = (1 - 2\rho)\, U_{\tilde\alpha(x'),\tilde\alpha(x'')} + \rho(1-\rho)\, 1_{x'=x''} - \sum_{\tilde x \in \tilde J} U_{\tilde\alpha(\tilde x),\tilde\alpha(x')}\, U_{\tilde\alpha(\tilde x),\tilde\alpha(x'')}. \qquad (6.10)$$
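The sign of the $\rho(1-\rho)$ term can be double-checked from $Q_U = P_U - \rho I$ and $P_U^2 = P_U$, which give $Q_U^2 = (1-2\rho)Q_U + \rho(1-\rho)I$; numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 8, 3
rho = r / n
Uhat = np.linalg.qr(rng.standard_normal((n, r)))[0]   # orthonormal columns
U = Uhat @ Uhat.T - rho * np.eye(n)                   # Q_U = P_U - rho I

# entrywise: sum_a U[a,x'] U[a,x''] = (1-2rho) U[x',x''] + rho(1-rho) 1_{x'=x''}
lhs = U.T @ U
rhs = (1 - 2 * rho) * U + rho * (1 - rho) * np.eye(n)
assert np.allclose(lhs, rhs)
```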
Consider the contribution of one of the final terms $U_{\tilde\alpha(\tilde x),\tilde\alpha(x')}\, U_{\tilde\alpha(\tilde x),\tilde\alpha(x'')}$ of (6.10). This contribution is equal to $X_{C'}$, where $C'$ is formed from $C$ by replacing $J$ with $\tilde J$, and replacing every occurrence of $x$ in the range of $\alpha$ with $\tilde x$, but leaving all other components of $C$ unchanged (see Figure 6(b)). Observe that $|\Gamma'| = |\Gamma|$, $|\Omega'| \le |\Omega|$, $|J'| + |K'| < |J| + |K|$, so the contribution of these terms is acceptable by the (first) induction hypothesis (for $C_0$ large enough).
Next, we consider the contribution of the term $U_{\tilde\alpha(x'),\tilde\alpha(x'')}$ of (6.10). This contribution is equal to $X_{C''}$, where $C''$ is formed from $C$ by replacing $J$ with $\tilde J$, replacing every occurrence of $x$ in the range of $\alpha$ with $x'$, and also deleting the one element $(i_0, \mu_0, l_0)$ in $L_U$ from $\Gamma_+$ (relabeling the remaining triples $(i_0, \mu_0, l)$ for $l_0 < l \le k(i_0, \mu_0)$ by decrementing $l$ by 1) that gave rise to $U_{\alpha(x),\alpha(x')}$, unless this element $(i_0, \mu_0, l_0)$ also lies in $L_V$, in which case one removes $(i_0, \mu_0, l_0)$ from $L_U$ but leaves it in $L_V$ (and does not relabel any further triples) (see Figure 7 for an example of the former case, and Figure 8 for the latter case). One observes that $|\Gamma''| \ge |\Gamma| - 1$, $|\Omega''| \le |\Omega| - 1$ (here we use (6.8), (6.9)), $|J''| + |K''| < |J| + |K|$, and so this term also is controlled by the (first) induction hypothesis (for $C_0$ large enough).
Finally, we consider the contribution of the term $\rho(1-\rho) 1_{x'=x''}$ of (6.10), which of course is only non-trivial when $x' = x''$. This contribution is equal to $\rho(1-\rho) X_{C'''}$, where $C'''$ is formed from $C$ by deleting $x$ from $J$, replacing every occurrence of $x$ in the range of $\alpha$ with $x' = x''$, and also deleting the two elements $(i_0, \mu_0, l_0)$, $(i_1, \mu_1, l_1)$ of $L_U$ from $\Gamma_+$ that gave rise to the factors $U_{\alpha(x),\alpha(x')}$, $U_{\alpha(x),\alpha(x'')}$ in (6.5), unless these elements also lie in $L_V$, in which case one deletes them just from $L_U$ but leaves them in $L_V$ and $\Gamma_+$; one also decrements the labels of any subsequent $(i_0, \mu_0, l)$, $(i_1, \mu_1, l)$ accordingly (see Figure 9). One observes that $|\Gamma'''| - |\Omega'''| \ge |\Gamma| - |\Omega| - 1$, $|J'''| + |K'''| < |J| + |K|$, and $|J'''| + |K'''| + |L_U''' \cap L_V'''| < |J| + |K| + |L_U \cap L_V|$, and so this term also is controlled by the induction hypothesis. (Note we need to use the additional $\rho$ factor (which is less than $r_\mu/n$) in order to make up for a possible decrease in $|\Gamma| - |\Omega|$ by 1.)
This deals with the case when there are two U terms involving $\alpha(x)$.
Subcase 2: (6.5) contains a term $U_{\alpha(x),\alpha(x')}$ and a term $E_{\alpha(x),\beta(y)}$.
A typical case here is depicted in Figure 10.
The strategy here is similar to Subcase 1, except that one uses (6.3) instead of (6.1). Letting $\tilde J, \tilde\alpha, a$ be as before, we can express (6.5) as
$$\sum_{\tilde\alpha,\beta} \big[\dots\big] \Big[\sum_{a \in [n] \setminus \tilde\alpha(\tilde J)} U_{a,\tilde\alpha(x')}\, E_{a,\beta(y)}\Big]$$
where the $\dots$ denotes the product of all the terms in (6.5) other than $U_{\alpha(x),\alpha(x')}$ and $E_{\alpha(x),\beta(y)}$, but with $\alpha$ replaced by $\tilde\alpha$, and $\tilde\alpha, \beta$ ranging over injections from $\tilde J$ and $K$ to $[n]$ respectively.
From (6.3) we have
$$\sum_{a \in [n]} U_{a,\tilde\alpha(x')}\, E_{a,\beta(y)} = (1 - \rho)\, E_{\tilde\alpha(x'),\beta(y)}$$
and hence
$$\sum_{a \in [n] \setminus \tilde\alpha(\tilde J)} U_{a,\tilde\alpha(x')}\, E_{a,\beta(y)} = (1 - \rho)\, E_{\tilde\alpha(x'),\beta(y)} - \sum_{\tilde x \in \tilde J} U_{\tilde\alpha(\tilde x),\tilde\alpha(x')}\, E_{\tilde\alpha(\tilde x),\beta(y)}. \qquad (6.11)$$
Figure 10: A configuration involving a U and E factor on the left. After applying (6.3), one gets some terms associated to configurations such as those in the upper right, in which the x row has been deleted and replaced with another existing row x̃, plus a term coming from a configuration in the lower right, in which the U E terms have been collapsed to a single E term.
The contribution of the final terms in (6.11) is treated in exactly the same way as the final terms in (6.10), and the main term $E_{\tilde\alpha(x'),\beta(y)}$ is treated in exactly the same way as the term $U_{\tilde\alpha(x'),\tilde\alpha(x'')}$ in (6.10). This concludes the treatment of the case when there is one U term and one E term involving $\alpha(x)$.
Subcase 3: (6.5) contains two terms $E_{\alpha(x),\beta(y)}$, $E_{\alpha(x),\beta(y')}$.
A typical case here is depicted in Figure 11. The strategy here is similar to that in the previous two subcases, but now one uses (6.4) rather than (6.1). The combinatorics of the situation are, however, slightly different.
By considering the path from $E_{\alpha(x),\beta(y)}$ to $E_{\alpha(x),\beta(y')}$ along the spider, we see (from the hypothesis $\tau_x = 2$) that this path must be completely horizontal (with no elements of $L_U$ present), and the two legs of the spider that give rise to $E_{\alpha(x),\beta(y)}$, $E_{\alpha(x),\beta(y')}$ at their tips must be adjacent, with their bases connected by a horizontal line segment. In other words, up to interchange of $y$ and $y'$, and cyclic permutation of the $[j]$ indices, we may assume that
$$(x, y) = \big(s(1,1,k(1,1)),\, t(1,1,k(1,1))\big); \qquad (x, y') = \big(s(2,0,k(2,0)),\, t(2,0,k(2,0))\big)$$
with
$$s(1,1,l) = s(2,0,l') = x$$
for all $0 \le l \le k(1,1)$ and $0 \le l' \le k(2,0)$, where the index 2 is understood to be identified with 1 in the degenerate case $j = 1$. Also, $L_U$ cannot contain any triple of the form $(1,1,l)$ for $l \in [k(1,1)]$ or $(2,0,l')$ for $l' \in [k(2,0)]$ (and so all these triples lie in $L_V$ instead).
For technical reasons we need to deal with the degenerate case $j = 1$ separately. In this case, $s$ is identically equal to $x$, and so (6.5) simplifies to
$$\sum_{\beta} \Big[\sum_{a \in [n]} E_{a,\beta(y)}\, E_{a,\beta(y')}\Big] \prod_{\mu=0}^{1} \prod_{l \in [k(1,\mu)]} V_{\beta(t(1,\mu,l-1)),\,\beta(t(1,\mu,l))}.$$
In the extreme degenerate case when $k(1,0) = k(1,1) = 0$, the sum is just $\sum_{a,b \in [n]} E_{ab}^2 = r$, which is acceptable, so we may assume that $k(1,0) + k(1,1) > 0$. We may assume that the column multiplicity $\tau_{\tilde y} \ge 4$ for every $\tilde y \in K$, since otherwise we could use (the reflected form of) one of the previous two subcases to conclude (6.6) from the induction hypothesis. (Note when $y = y'$, it is not possible for $\tau_y$ to equal 2 since $k(1,0) + k(1,1) > 0$.)
The number of possible $\beta$ is at most $n^{|K|}$, so to establish (6.6) in this case it suffices to show that
$$n^{|K|}\, (r_\mu/n)\, \big(\sqrt{r_\mu}/n\big)^{k(1,0)+k(1,1)} \lesssim (r_\mu/n)^{|\Gamma| - |\Omega|}\, n.$$
Observe that in this degenerate case $j = 1$, we have $|\Omega| = |K|$ and $|\Gamma| = k(1,0) + k(1,1) + 2$. One then checks that the claim is true when $r_\mu = 1$, so it suffices to check the other extreme case $r_\mu = n$, i.e.
$$|K| - \frac{1}{2}\big(k(1,0) + k(1,1)\big) \le 1.$$
But as $\tau_y \ge 4$ for all $y \in K$, every element in $K$ must be visited at least twice, and the claim follows.
Now we deal with the non-degenerate case $j > 1$. Letting $\tilde J, \tilde\alpha, a$ be as in previous subcases, we can express (6.5) as
$$\sum_{\tilde\alpha,\beta} \big[\dots\big] \Big[\sum_{a \in [n] \setminus \tilde\alpha(\tilde J)} E_{a,\beta(y)}\, E_{a,\beta(y')}\Big] \qquad (6.12)$$
where the $\dots$ denotes the product of all the terms in (6.5) other than $E_{\alpha(x),\beta(y)}$ and $E_{\alpha(x),\beta(y')}$, but with $\alpha$ replaced by $\tilde\alpha$, and $\tilde\alpha, \beta$ ranging over injections from $\tilde J$ and $K$ to $[n]$ respectively.
From (6.4), we have
$$\sum_{a \in [n]} E_{a,\beta(y)}\, E_{a,\beta(y')} = V_{\beta(y),\beta(y')} + \rho\, 1_{y=y'}$$
and hence
$$\sum_{a \in [n] \setminus \tilde\alpha(\tilde J)} E_{a,\beta(y)}\, E_{a,\beta(y')} = V_{\beta(y),\beta(y')} + \rho\, 1_{y=y'} - \sum_{\tilde x \in \tilde J} E_{\tilde\alpha(\tilde x),\beta(y)}\, E_{\tilde\alpha(\tilde x),\beta(y')}. \qquad (6.13)$$
The final terms are treated here in exactly the same way as the final terms in (6.10) or (6.11).
Now we consider the main term $V_{\beta(y),\beta(y')}$. The contribution of this term will be of the form $X_{C'}$, where the configuration $C'$ is formed from $C$ by "detaching" the two legs $(i,\mu) = (1,1), (2,0)$ from the spider, "gluing them together" at the tips using the $V_{\beta(y),\beta(y')}$ term, and then "inserting" those two legs into the base of the $(i,\mu) = (1,0)$ leg. To explain this procedure more formally, observe that the $\dots$ term in (6.12) can be expanded further (isolating out the terms coming from $(i,\mu) = (1,1), (2,0)$) as
$$\Big[\prod_{l=1}^{k(2,0)} V_{\beta(t(2,0,l-1)),\,\beta(t(2,0,l))}\Big] \Big[\prod_{l=k(1,1)}^{1} V_{\beta(t(1,1,l-1)),\,\beta(t(1,1,l))}\Big] \dots$$
where the $\dots$ now denote all the terms that do not come from $(i,\mu) = (1,1)$ or $(i,\mu) = (2,0)$, and we have reversed the order of the second product for reasons that will be clearer later. Recalling
that $y = t(1,1,k(1,1))$ and $y' = t(2,0,k(2,0))$, we see that the contribution of the first term of (6.13) to (6.12) is now of the form
$$\sum_{\tilde\alpha,\beta} \Big[\prod_{l=1}^{k(2,0)} V_{\beta(t(2,0,l-1)),\,\beta(t(2,0,l))}\Big]\, V_{\beta(t(2,0,k(2,0))),\,\beta(t(1,1,k(1,1)))}\, \Big[\prod_{l=k(1,1)}^{1} V_{\beta(t(1,1,l-1)),\,\beta(t(1,1,l))}\Big] \dots.$$
But this expression is simply $X_{C'}$, where the configuration $C'$ is formed from $C$ in the following fashion:
• $j'$ is equal to $j - 1$, $J'$ is equal to $\tilde J$, and $K'$ is equal to $K$.
• $k'(1,0) := k(2,0) + 1 + k(1,1) + k(1,0)$, and $k'(i,\mu) := k(i+1,\mu)$ for $(i,\mu) \neq (1,0)$.
• The path $\{(s'(1,0,l), t'(1,0,l)) : l = 0, \dots, k'(1,0)\}$ is formed by concatenating the path $\{(s(1,0,0), t(2,0,l)) : l = 0, \dots, k(2,0)\}$, with an edge from $(s(1,0,0), t(2,0,k(2,0)))$ to $(s(1,0,0), t(1,1,k(1,1)))$, with the path $\{(s(1,0,0), t(1,1,l)) : l = k(1,1), \dots, 0\}$, with the path $\{(s(1,0,l), t(1,0,l)) : l = 0, \dots, k(1,0)\}$.
• For any $(i,\mu) \neq (1,0)$, the path $\{(s'(i,\mu,l), t'(i,\mu,l)) : l = 0, \dots, k'(i,\mu)\}$ is equal to the path $\{(s(i+1,\mu,l), t(i+1,\mu,l)) : l = 0, \dots, k(i+1,\mu)\}$.
• We have
and
We have now made the maximum use we can of the cancellation identities (6.1), (6.3), (6.4), and have no further use for them. Instead, we shall now place absolute values everywhere and estimate $X_C$ using (1.9), (1.8a), (1.8b), obtaining the bound
$$|X_C| \le n^{|J|+|K|}\, O\big(\sqrt{r_\mu}/n\big)^{|\Gamma| + |L_U \cap L_V|}.$$
Comparing this with (6.6), we see that it will suffice (by taking $C_0$ large enough) to show that
$$n^{|J|+|K|}\, \big(\sqrt{r_\mu}/n\big)^{|\Gamma| + |L_U \cap L_V|} \le (r_\mu/n)^{|\Gamma| - |\Omega|}\, n.$$
Using the extreme cases $r_\mu = 1$ and $r_\mu = n$ as test cases, we see that our task is to show that
and
$$|J| + |K| \le \frac{1}{2}\big(|\Gamma| + |L_U \cap L_V|\big) + 1. \qquad (6.17)$$
The first inequality (6.16) is proven by Lemma 5.1. The second is a consequence of the double counting identity
$$4(|J| + |K|) \le \sum_{x \in J} \tau_x + \sum_{y \in K} \tau_y = 2|\Gamma| + 2|L_U \cap L_V|,$$
where the inequality follows from (6.14)–(6.15) (and we do not even need the +1 in this case).
7 Discussion
Interestingly, there is an emerging literature on the development of efficient algorithms for solving
the nuclear-norm minimization problem (1.3) [6, 17]. For instance, in [6], the authors show that
the singular-value thresholding algorithm can solve certain problem instances in which the matrix
has close to a billion unknown entries in a matter of minutes on a personal computer. Hence, the
8 Appendix
8.1 Equivalence between the uniform and Bernoulli models
8.1.1 Lower bounds
For the sake of completeness, we explain how Theorem 1.7 implies nearly identical results for the uniform model. We have established the lower bound by showing that there are two fixed matrices $M \neq M'$ for which $P_\Omega(M) = P_\Omega(M')$ with probability greater than $\delta$ unless $m$ obeys the bound (1.20). Suppose that $\Omega$ is sampled according to the Bernoulli model with $p_0 \ge m/n^2$ and let $F$ be the event $\{P_\Omega(M) = P_\Omega(M')\}$. Then
$$\mathbb{P}(F) = \sum_{k=0}^{n^2} \mathbb{P}(F \mid |\Omega| = k)\, \mathbb{P}(|\Omega| = k) \le \sum_{k=0}^{m-1} \mathbb{P}(|\Omega| = k) + \sum_{k=m}^{n^2} \mathbb{P}(F \mid |\Omega| = k)\, \mathbb{P}(|\Omega| = k) \le \mathbb{P}(|\Omega| < m) + \mathbb{P}(F \mid |\Omega| = m),$$
where we have used the fact that for $k \ge m$, $\mathbb{P}(F \mid |\Omega| = m) \ge \mathbb{P}(F \mid |\Omega| = k)$. The conditional distribution of $\Omega$ given its cardinality is uniform and, therefore,
in which $\mathbb{P}_{\mathrm{Unif}(m)}$ and $\mathbb{P}_{\mathrm{Ber}(p_0)}$ are probabilities calculated under the uniform and Bernoulli models. If we choose $p_0 = 2m/n^2$, we have that $\mathbb{P}_{\mathrm{Ber}(p_0)}(|\Omega| < m) \le \delta/2$ provided $\delta$ is not ridiculously small. Thus if $\mathbb{P}_{\mathrm{Ber}(p_0)}(F) \ge \delta$, we have
$$\mathbb{P}_{\mathrm{Unif}(m)}(F) \ge \delta/2.$$
In short, we get a lower bound for the uniform model by applying the bound for the Bernoulli model with a value of $p = 2m/n^2$ and a probability of failure equal to $2\delta$.
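The step $\mathbb{P}_{\mathrm{Ber}(p_0)}(|\Omega| < m) \le \delta/2$ is just a binomial tail bound; with illustrative (hypothetical) sizes $n^2 = 100$ and $m = 10$, the exact left tail at rate $p_0 = 2m/n^2$ is already small:

```python
from math import comb

n2, m = 100, 10          # n^2 entries, target sample size m (illustrative values)
p0 = 2 * m / n2          # Bernoulli rate, so E|Omega| = 2m

# exact P(|Omega| < m) for |Omega| ~ Binomial(n2, p0)
tail = sum(comb(n2, k) * p0 ** k * (1 - p0) ** (n2 - k) for k in range(m))

# with mean 2m = 20, seeing fewer than m = 10 successes is very unlikely
assert tail < 0.05
```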
where, starting from $\alpha_0^{(0)} = 1$, the sequences $\{\alpha^{(k)}\}$, $\{\beta^{(k)}\}$, $\{\gamma^{(k)}\}$ and $\{\delta^{(k)}\}$ are inductively defined via
$$\alpha_j^{(k+1)} = \big[\alpha_{j-1}^{(k)} + (1-\rho_0)\gamma_{j-1}^{(k)}\big] + \frac{\rho_0(1-2p)}{p}\big[\alpha_j^{(k)} + (1-\rho_0)\gamma_j^{(k)}\big] + 1_{j=0}\,\rho_0\big[\beta_0^{(k)} + (1-\rho_0)\delta_0^{(k)}\big],$$
$$\beta_j^{(k+1)} = \big[\beta_{j-1}^{(k)} + (1-\rho_0)\delta_{j-1}^{(k)}\big] + \frac{\rho_0(1-2p)}{p}\big[\beta_j^{(k)} + (1-\rho_0)\delta_j^{(k)}\big] 1_{j>0} + 1_{j=0}\,\frac{\rho_0(1-p)}{p}\big[\alpha_0^{(k)} + (1-\rho_0)\gamma_0^{(k)}\big],$$
and
$$\gamma_j^{(k+1)} = \frac{\rho_0(1-p)}{p}\big[\alpha_{j+1}^{(k)} + (1-\rho_0)\gamma_{j+1}^{(k)}\big], \qquad \delta_j^{(k+1)} = \frac{\rho_0(1-p)}{p}\big[\beta_{j+1}^{(k)} + (1-\rho_0)\delta_{j+1}^{(k)}\big].$$
In the above recurrence relations, we adopt the convention that $\alpha_j^{(k)} = 0$ whenever $j$ is not in the range specified by (8.2), and similarly for $\beta_j^{(k)}$, $\gamma_j^{(k)}$ and $\delta_j^{(k)}$.
and
$$Q_\Omega (Q_\Omega Q_T)^j = \begin{cases} Q_\Omega, & j = 0,\\[4pt] \dfrac{1-2p}{p}\,(Q_\Omega Q_T)^j + \dfrac{1-p}{p}\, Q_T (Q_\Omega Q_T)^{j-1}, & j > 0, \end{cases}$$
which both follow from (3.13), gives the desired recurrence relations. The calculation is rather straightforward and omitted.
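The second identity reduces to $Q_\Omega^2 = \frac{1-2p}{p} Q_\Omega + \frac{1-p}{p} I$, which holds deterministically for any mask under the normalization $Q_\Omega = p^{-1} P_\Omega - I$ (assumed here, following (3.13)); a quick check:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 6, 0.4
mask = (rng.random((n, n)) < p).astype(float)   # indicator of the set Omega
M = rng.standard_normal((n, n))

def QO(X):
    # Q_Omega = p^{-1} P_Omega - I, acting entrywise (assumed normalization)
    return mask * X / p - X

lhs = QO(QO(M))
rhs = (1 - 2 * p) / p * QO(M) + (1 - p) / p * M
assert np.allclose(lhs, rhs)
```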
We note that the recurrence relations give $\alpha_k^{(k)} = 1$ for all $k \ge 0$,
$$\gamma_{k-2}^{(k)} = \frac{\rho_0(1-p)}{p}\,\alpha_{k-1}^{(k-1)} = \frac{\rho_0(1-p)}{p},$$
$$\delta_{k-3}^{(k)} = \frac{\rho_0(1-p)}{p}\,\beta_{k-2}^{(k-1)} = \Big(\frac{\rho_0(1-p)}{p}\Big)^2,$$
Lemma 8.2 Put $\lambda = \rho_0/p$ and observe that by assumption (1.22), $\lambda < 1$. Then for all $j, k \ge 0$, we have
$$\max\big(|\alpha_j^{(k)}|, |\beta_j^{(k)}|, |\gamma_j^{(k)}|, |\delta_j^{(k)}|\big) \le \lambda^{\lceil \frac{k-j}{2} \rceil}\, 4^k. \qquad (8.3)$$
Proof We prove the lemma by induction on $k$. The claim is true for $k = 0$. Suppose it is true up to $k$; we then use the recurrence relations given by Lemma 8.1 to establish the claim up to $k + 1$. In detail, since $|1 - \rho_0| < 1$, $\rho_0 < \lambda$ and $|1 - 2p| < 1$, the recurrence relation for $\alpha^{(k+1)}$ gives
$$|\alpha_j^{(k+1)}| \le |\alpha_{j-1}^{(k)}| + |\gamma_{j-1}^{(k)}| + \lambda\big[|\alpha_j^{(k)}| + |\gamma_j^{(k)}|\big] + 1_{j=0}\,\lambda\big[|\beta_0^{(k)}| + |\delta_0^{(k)}|\big]$$
$$\le 2\lambda^{\lceil \frac{k+1-j}{2} \rceil} 4^k\, 1_{j>0} + 2\lambda^{\lceil \frac{k-j}{2} \rceil + 1} 4^k + 2\lambda^{\lceil \frac{k}{2} \rceil + 1} 4^k\, 1_{j=0}$$
$$\le 2\lambda^{\lceil \frac{k+1-j}{2} \rceil} 4^k\, 1_{j>0} + 2\lambda^{\lceil \frac{k+1-j}{2} \rceil} 4^k + 2\lambda^{\lceil \frac{k+1}{2} \rceil} 4^k\, 1_{j=0}$$
$$\le \lambda^{\lceil \frac{k+1-j}{2} \rceil}\, 4^{k+1},$$
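The exponent bookkeeping in this induction step rests on the elementary inequality $\lceil \frac{k-j}{2} \rceil + 1 \ge \lceil \frac{k+1-j}{2} \rceil$ (so that, for $0 < \lambda < 1$, $\lambda^{\lceil (k-j)/2 \rceil + 1} \le \lambda^{\lceil (k+1-j)/2 \rceil}$), which can be checked exhaustively over a small range:

```python
import math

# ceil((k-j)/2) + 1 >= ceil((k+1-j)/2) for all integers k, j in range
for k in range(0, 20):
    for j in range(0, k + 2):
        assert math.ceil((k - j) / 2) + 1 >= math.ceil((k + 1 - j) / 2)
```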
QT = PT − ρ0 I = (I − PT ⊥ ) − ρ0 I = (1 − ρ0 )I − PT ⊥ ,
Now
$$\|(Q_\Omega P_T)^k Q_\Omega(E)\| \le \sum_{j=0}^{k} |\alpha_j^{(k)}|\, \|(Q_\Omega Q_T)^j Q_\Omega(E)\| + \sum_{j=0}^{k-1} |\beta_j^{(k)}|\, \|(Q_\Omega Q_T)^j(E)\| + \sum_{j=0}^{k-2} |\gamma_j^{(k)}|\, \|Q_T (Q_\Omega Q_T)^j Q_\Omega(E)\| + \sum_{j=0}^{k-3} |\delta_j^{(k)}|\, \|Q_T (Q_\Omega Q_T)^j(E)\|,$$
since QT (E) = (1 − ρ0 )(E). By using the size estimates given by Lemma 8.2 on the coefficients, we
have
$$\|(Q_\Omega P_T)^k Q_\Omega(E)\| \le \frac{1}{3}\sigma^{\frac{k+1}{2}} + \frac{1}{3}\, 4^k \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil} \sigma^{\frac{j+1}{2}} + 4^k \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil} \sigma^{\frac{j}{2}}$$
$$\le \frac{1}{3}\sigma^{\frac{k+1}{2}} + 4^k \sigma^{\frac{k+1}{2}} \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil} \sigma^{-\frac{k-j}{2}} + 4^k \sigma^{\frac{k}{2}} \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil} \sigma^{-\frac{k-j}{2}}$$
$$\le \frac{1}{3}\sigma^{\frac{k+1}{2}} + 4^k \big(\sigma^{\frac{k+1}{2}} + \sigma^{\frac{k}{2}}\big) \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil} \sigma^{-\frac{k-j}{2}},$$
where the last inequality holds provided that $4\lambda \le \sigma^{3/2}$. The conclusion is
$$\|(Q_\Omega P_T)^k Q_\Omega(E)\| \le \big(1 + 4^{k+1}\big)\,\sigma^{\frac{k+1}{2}}.$$
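The geometric-series step can be spot-checked at the boundary of the assumption $4\lambda = \sigma^{3/2}$; the following sketch (illustrative values $\sigma = 1/2$, $k = 5$) confirms that the chain of inequalities lands below $(1 + 4^{k+1})\sigma^{(k+1)/2}$:

```python
import math

sigma, k = 0.5, 5
lam = sigma ** 1.5 / 4          # boundary case of the assumption 4*lam <= sigma^{3/2}

# last displayed bound, writing i = k - j for the summation index
lhs = (sigma ** ((k + 1) / 2) / 3
       + 4 ** k * (sigma ** ((k + 1) / 2) + sigma ** (k / 2))
         * sum(lam ** math.ceil(i / 2) * sigma ** (-i / 2) for i in range(1, k + 1)))

rhs = (1 + 4 ** (k + 1)) * sigma ** ((k + 1) / 2)
assert lhs <= rhs
```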
Acknowledgements
E. C. is supported by ONR grants N00014-09-1-0469 and N00014-08-1-0749 and by the Waterman
Award from NSF. E. C. would like to thank Xiaodong Li and Chiara Sabatti for helpful conversa-
tions related to this project. T. T. is supported by a grant from the MacArthur Foundation, by
NSF grant DMS-0649473, and by the NSF Waterman award.
References
[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes.
Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.
[2] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification.
Proceedings of the Twenty-fourth International Conference on Machine Learning, 2007.
[3] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. Neural Information Processing
Systems, 2007.
[4] A. Barvinok. A course in convexity, volume 54 of Graduate Studies in Mathematics. American Mathe-
matical Society, Providence, RI, 2002.
[5] P. Biswas, T-C. Lian, T-C. Wang, and Y. Ye. Semidefinite programming based algorithms for sensor
network localization. ACM Trans. Sen. Netw., 2(2):188–220, 2006.
[6] J-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion.
Technical report, 2008. Preprint available at http://arxiv.org/abs/0810.3286.
[7] E. J. Candès and B. Recht. Exact Matrix Completion via Convex Optimization. To appear in Found.
of Comput. Math., 2008.
[8] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from
highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.
[9] P. Chen and D. Suter. Recovering the missing components in a large noisy low-rank matrix: application
to SFM source. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):1051–1063,
2004.
[10] V. H. de la Peña and S. J. Montgomery-Smith. Decoupling inequalities for the tail probabilities of
multivariate U -statistics. Ann. Probab., 23(2):806–816, 1995.
[11] M. Fazel, H. Hindi, and S. Boyd. Log-det heuristic for matrix rank minimization with applications to
Hankel and Euclidean distance matrices. Proc. Am. Control Conf, June 2003.
[12] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information
tapestry. Communications of the ACM, 35:61–70, 1992.
arXiv:0903.3131v1 [cs.IT] 18 Mar 2009

Abstract—On the heels of compressed sensing, a remarkable new field has very recently emerged. This field addresses a broad range of problems of significant practical interest, namely, the recovery of a data matrix from what …

… of the rapidly developing field of compressed sensing, and is already changing the way engineers think about data acquisition, hence this special issue and others, see [2] for example. Concretely, if a signal has a sparse …
then M is the unique solution to (II.4) with probability at least 1 − n⁻³. In other words: with high probability, nuclear-norm minimization recovers all the entries of M with no error.

As a side remark, one can obtain a probability of success at least 1 − n⁻β for β > 0 by taking C in (II.6) of the form C′β for some universal constant C′.

An n1 × n2 matrix of rank r depends upon r(n1 + n2 − r) degrees of freedom¹. When r is small, the number of degrees of freedom is much less than n1n2 and this is the reason why subsampling is possible. (In compressed sensing, the number of degrees of freedom corresponds to the sparsity of the signal; i.e. the number of nonzero entries.) What is remarkable here is that exact recovery by nuclear norm minimization occurs as soon as the sample size exceeds the number of degrees of freedom by a couple of logarithmic factors. Further, observe that if Ω completely misses one of the rows (e.g. one has no rating about one user) or one of the columns (e.g. one has no rating about one movie), then one cannot hope to recover even a matrix of rank 1 of the form M = xy∗. Thus one needs to sample every row (and also every column) of the matrix. When Ω is sampled at random, it is well established that one needs …

These conditions do not assume anything about the singular values. As we will see, incoherent matrices with a small value of the strong incoherence parameter µ can be recovered from a minimal set of entries. Before we state this result, it is important to note that many model matrices obey the strong incoherence property with a small value of µ.

• Suppose the singular vectors obey (II.2) with µB = O(1) (which informally says that the singular vectors are not spiky); then, with the exception of a very few peculiar matrices, M obeys the strong incoherence property with µ = O(√(log n)).

• Assume that the column matrices [u1, . . . , ur] and [v1, . . . , vr] are independent random orthogonal matrices; then with high probability M obeys the strong incoherence property with µ = O(√(log n)), at least when r ≥ log n so as to avoid small-sample effects.

The sampling result below is general, nonasymptotic and optimal up to a few logarithmic factors.

Theorem 2: [12] With the same notations as in Theorem 1, there is a numerical constant C such that if

m ≥ C µ² nr log⁶ n, (II.8)

then M is the unique solution to (II.4) with probability at least 1 − n⁻³.

¹This can be seen by counting the degrees of freedom in the singular value decomposition.
III-B. To this end, Figure 2 plots three curves for varying values of n, p, and r: 1) the RMS error introduced above, 2) the RMS error achievable when the oracle reveals …

V. DISCUSSION

This paper reviewed and developed some new results about matrix completion. By and large, low-rank matrix recovery is a field in complete infancy abounding with interesting and open questions, and if the recent …

Fig. 2: Comparison between the recovery error, the oracle error times 1.68, and the estimated oracle error times 1.68. Each point on the plot corresponds to an average over 20 trials. Top: in this experiment, n = 600, r = 2 and p varies. The x-axis is the number of measurements per degree of freedom (df). Middle: n varies whereas r = 2, p = .2. Bottom: n = 600, r varies and p = .2.

⁴The number 2 is somewhat arbitrary here, although we picked it because there is a large drop-off in the size of the singular values after the second. If, for example, M10 is the best rank-10 approximation, then ‖M10 − M‖F /‖M‖F = .081.
eriksson@cs.bu.edu
Robert Nowak
University of Wisconsin - Madison
nowak@ece.wisc.edu
December 2011
Abstract
This paper considers the problem of completing a matrix with many missing entries under the assumption that
the columns of the matrix belong to a union of multiple low-rank subspaces. This generalizes the standard low-rank
matrix completion problem to situations in which the matrix rank can be quite high or even full rank. Since the
columns belong to a union of subspaces, this problem may also be viewed as a missing-data version of the subspace
clustering problem. Let X be an n × N matrix whose (complete) columns lie in a union of at most k subspaces, each
of rank ≤ r < n, and assume N ≫ kn. The main result of the paper shows that under mild assumptions each column
of X can be perfectly recovered with high probability from an incomplete version so long as at least CrN log²(n)
entries of X are observed uniformly at random, with C > 1 a constant depending on the usual incoherence conditions,
the geometrical arrangement of subspaces, and the distribution of columns over the subspaces. The result is illustrated
with numerical experiments and an application to Internet distance matrix completion and topology identification.
1 Introduction
Consider a real-valued n × N dimensional matrix X. Assume that the columns of X lie in the union of at most k
subspaces of Rn , each having dimension at most r < n and assume that N > kn. We are especially interested
in “high-rank” situations in which the total rank (the rank of the union of the subspaces) may be n. Our goal is to
complete X based on an observation of a small random subset of its entries. We propose a novel method for this
matrix completion problem. In the applications we have in mind N may be arbitrarily large, and so we will focus on
quantifying the probability that a given column is perfectly completed, rather than the probability that the whole matrix is
perfectly completed (i.e., every column is perfectly completed). Of course it is possible to translate between these two
quantifications using a union bound, but that bound becomes meaningless if N is extremely large.
Suppose the entries of X are observed uniformly at random with probability p0 . Let Ω denote the set of indices
of observed entries and let XΩ denote the observations of X. Our main result shows that under a mild set of assump-
tions each column of X can be perfectly recovered from XΩ with high probability using a computationally efficient
procedure if
p0 ≥ C (r/n) log²(n) (1)
where C > 1 is a constant depending on the usual incoherence conditions as well as the geometrical arrangement of
subspaces and the distribution of the columns in the subspaces.
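This setup is easy to simulate. The sketch below (with illustrative parameter values, not the paper's) generates an n × N matrix whose columns lie in a union of k rank-r subspaces, normalizes the columns to unit norm as in assumption A1 below, and reveals each entry independently with probability p0:

```python
import numpy as np

def union_of_subspaces_data(n=50, N=600, k=3, r=2, p0=0.3, seed=1):
    """Columns of X lie in a union of k rank-r subspaces of R^n; each entry
    is observed independently with probability p0 (mask marks Omega)."""
    rng = np.random.default_rng(seed)
    bases = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(k)]
    labels = rng.integers(0, k, size=N)            # subspace of each column
    coeffs = rng.standard_normal((r, N))
    X = np.column_stack([bases[labels[j]] @ coeffs[:, j] for j in range(N)])
    X /= np.linalg.norm(X, axis=0)                 # unit-norm columns (A1)
    mask = rng.random((n, N)) < p0                 # observed entries
    return X, mask, labels
```

Note that while each subspace has rank at most r, the full matrix can have rank up to kr, which is the "high-rank" regime the paper targets.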
∗ The first two authors contributed equally to this paper.
High-Rank Matrix Completion and Subspace Clustering with Missing Data 101
1.1 Connections to Low-Rank Completion
Low-rank matrix completion theory [1] shows that an n × N matrix of rank r can be recovered from incomplete obser-
vations, as long as the number of entries observed (with locations sampled uniformly at random) exceeds rN log² N
(within a constant factor and assuming n ≤ N ). It is also known that, in the same setting, completion is impossible if
the number of observed entries is less than a constant times rN log N [2]. These results imply that if the rank of X is
close to n, then all of the entries are needed in order to determine the matrix.
Here we consider a matrix whose columns lie in the union of at most k subspaces of Rn. Restricting the rank of each subspace to at most r, the rank of the full matrix in our situation could be as large as kr, yielding the requirement krN log² N using current matrix completion theory. In contrast, the bound in (1) implies that the completion of
each column is possible from a constant times rN log² n entries sampled uniformly at random. Exact completion of
every column can be guaranteed by replacing log² n with log² N in this bound, but since we allow N to be very large
we prefer to state our result in terms of per-column completion. Our method, therefore, improves significantly upon
conventional low-rank matrix completion, especially when k is large. This does not contradict the lower bound in [2],
because the matrices we consider are not arbitrary high-rank matrices, rather the columns must belong to a union of
rank ≤ r subspaces.
Our work builds upon the results of [12], which quantifies the deviation of an incomplete vector norm with respect
to the incoherence of the sampling pattern. While this work also examines subspace detection using incomplete data,
it assumes complete knowledge of the subspaces.
While research that examines subspace learning has been presented in [13], the work in this paper differs in its concentration on learning from incomplete observations (i.e., when there are missing elements in the matrix), and in its methodological focus (i.e., nearest-neighbor clustering versus a multiscale Singular Value Decomposition approach).
Figure 1: Example of nearest-neighborhood selection of points from a single subspace. For illustration, samples from three one-
dimensional subspaces are depicted as small dots. The large dot is the seed. The subset of samples with significant observed support
in common with that of the seed are depicted by ∗’s. If the density of points is high enough, then the nearest neighbors we identify
will belong to the same subspace as the seed. In this case we depict the ball containing the 3 nearest neighbors of the seed with
significant support overlap.
2 Key Assumptions and Main Result
The notion of incoherence plays a key role in matrix completion and subspace recovery from incomplete observations.
Definition 1. The coherence of an r-dimensional subspace S ⊆ Rn is

µ(S) := (n/r) max_j ‖PS ej‖₂²,

where PS is the projection operator onto S and {ej} are the canonical unit vectors for Rn.
Note that 1 ≤ µ(S) ≤ n/r. The coherence of a single vector x ∈ Rn is µ(x) = n‖x‖∞²/‖x‖₂², which is precisely the coherence of the one-dimensional subspace spanned by x. With this definition, we can state the main assumptions we
make about the matrix X.
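As an aside, both coherence quantities are straightforward to compute from an orthonormal basis, since ‖PS ej‖₂ is simply the norm of the j-th row of the basis matrix (a minimal sketch; function names are ours):

```python
import numpy as np

def subspace_coherence(U):
    """mu(S) = (n/r) * max_j ||P_S e_j||_2^2 for S = span(U), where U is
    n x r with orthonormal columns, so P_S = U U^T and ||P_S e_j||_2 is
    the norm of the j-th row of U."""
    n, r = U.shape
    return (n / r) * np.max(np.sum(U**2, axis=1))

def vector_coherence(x):
    """mu(x) = n * ||x||_inf^2 / ||x||_2^2, the coherence of span{x}."""
    return x.size * np.max(np.abs(x))**2 / np.sum(x**2)
```

A canonical basis vector is maximally coherent (µ = n), while a perfectly flat vector attains the minimum µ = 1.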
A1. The columns of X lie in the union of at most k subspaces, with k = o(nd ) for some d > 0. The subspaces are
denoted by S1 , . . . , Sk and each has rank at most r < n. The ℓ2 -norm of each column is ≤ 1.
A2. The coherence of each subspace is bounded above by µ0 . The coherence of each column is bounded above by µ1
and for any pair of columns, x1 and x2 , the coherence of x1 − x2 is also bounded above by µ1 .
A3. The columns of X do not lie in the intersection(s) of the subspaces with probability 1, and if rank(Si ) = ri , then
any subset of ri columns from Si spans Si with probability 1. Let 0 < ǫ0 < 1 and Si,ǫ0 denote the subset of
points in Si at least ǫ0 distance away from any other subspace. There exists a constant 0 < ν0 ≤ 1, depending
on ǫ0 , such that
(i) The probability that a column selected uniformly at random belongs to Si,ǫ0 is at least ν0 /k.
(ii) If x ∈ Si,ǫ0, then the probability that a column selected uniformly at random belongs to the ball of radius ǫ0 centered at x is at least ν0 ǫ0^r / k.
The conditions of A3 are met if, for example, the columns are drawn from a mixture of continuous distributions on
each of the subspaces. The value of ν0 depends on the geometrical arrangement of the subspaces and the distribution
of the columns within the subspaces. If the subspaces are not too close to each other, and the distributions within
the subspaces are fairly uniform, then typically ν0 will not be too close to 0. We define three key quantities: the confidence parameter δ0, the required number of “seed” columns s0, and a quantity ℓ0 related to the neighborhood formation process (see Algorithm 1 in Section 3):

δ0 := n^{2−2β} log^{1/2}(n), for some β > 1, (2)

s0 := k(log k + log(1/δ0)) / ((1 − e^{−4}) ν0),

ℓ0 := ⌈ max{ 2k / (ν0 (ǫ0/√3)^r) , 8k log(s0/δ0) / (n ν0 (ǫ0/√3)^r) } ⌉.
then each column of X can be perfectly recovered with probability at least 1 − (6 + 15s0 ) δ0 , using the methodology
sketched above (and detailed later in the paper).
The requirements on sampling are essentially the same as those for standard low-rank matrix completion, apart
from requirement that the total number of columns N is sufficiently large. This is needed to ensure that each of the
subspaces is sufficiently represented in the matrix. The requirement on N is polynomial in n for fixed p0 , which is
easy to see based on the definitions of δ0 , s0 , and ℓ0 (see further discussion at the end of Section 3).
Perfect recovery of each column is guaranteed with probability that decreases linearly in s0 , which itself is linear
in k (ignoring log factors). This is expected since this problem is more difficult than k individual low-rank matrix
completions. We state our results in terms of a per-column (rather than full matrix) recovery guarantee. A full matrix
recovery guarantee can be given by replacing log² n with log² N. This is evident from the final completion step
discussed in Lemma 8, below. However, since N may be quite large (perhaps arbitrarily large) in the applications we
envision, we chose to state our results in terms of a per-column guarantee.
The details of the methodology and lemmas leading to the theorem above are developed in the subsequent sections
following the four steps of the methodology outlined above. In certain cases it will be more convenient to consider
sampling the locations of observed entries uniformly at random with replacement rather than without replacement, as
assumed above. The following lemma will be useful for translating bounds derived assuming sampling with replace-
ment to our situation (the same sort of relation is noted in Proposition 3.1 in [1]).
Lemma 1. Draw m samples independently and uniformly from {1, . . . , n} and let Ω′ denote the resulting subset of
unique values. Let Ωm be a subset of size m selected uniformly at random from {1, . . . , n}. Let E denote an event
depending on a random subset of {1, . . . , n}. If P(E(Ωm )) is a non-increasing function of m, then P(E(Ω′ )) ≥
P(E(Ωm )).
Proof. For k = 1, . . . , m, let Ωk denote a subset of size k sampled uniformly at random from {1, . . . , n}, and let
m′ = |Ω′ |.
Then

P(E(Ω′)) = Σ_{k=0}^{m} P(E(Ω′) | m′ = k) P(m′ = k) = Σ_{k=0}^{m} P(E(Ωk)) P(m′ = k) ≥ P(E(Ωm)) Σ_{k=0}^{m} P(m′ = k) = P(E(Ωm)),

where the inequality uses m′ ≤ m together with the assumption that P(E(Ωk)) is non-increasing in k, and the last step uses Σ_{k=0}^{m} P(m′ = k) = 1.
3 Local Neighborhoods
In this first step, s columns of XΩ are selected uniformly at random and a set of “nearby” columns is identified for each, constituting a local neighborhood of size n. All bounds are designed to hold with probability at least 1 − δ0, where δ0 is defined in (2) above. The s columns are called “seeds.” The required size of s is determined as follows.
Lemma 2. Assume A3 holds. If the number of chosen seeds,
k(log k + log 1/δ0 )
s ≥ ,
(1 − e−4 )ν0
then with probability greater than 1 − δ0 for each i = 1, . . . , k, at least one seed is in Si,ǫ0 and each seed column has
at least
η0 := (64 β max{µ1², µ0} / ν0) r log²(n) (3)
observed entries.
Proof. First note that from Theorem 2.1, the expected number of observed entries per column is at least
η = (128 β max{µ1², µ0} / ν0) r log²(n).
Therefore, the number of observed entries η̂ in a column selected uniformly at random is probably not significantly less. More precisely, by Chernoff's bound we have

P(η̂ ≤ η/2) ≤ exp(−η/8) < e^{−4}.
Combining this with A3, we have the probability that a randomly selected column belongs to Si,ǫ0 and has η/2 or
more observed entries is at least ν0′ /k, where ν0′ := (1 − e−4 )ν0 . Then, the probability that the set of s columns does
not contain a column from Si,ǫ0 with at least η/2 observed entries is less than (1 − ν0′ /k)s . The probability that the
set does not contain at least one column from Si,ǫ0 with η/2 or more observed entries, for i = 1, . . . , k is less than
δ0 = k(1 − ν0′/k)^s. Solving for s in terms of δ0 yields

s = (log k + log(1/δ0)) / log( (k/ν0′) / (k/ν0′ − 1) ).

The result follows by noting that log(x/(x − 1)) ≥ 1/x, for x > 1.
Next, for each seed we must find a set of n columns from the same subspace as the seed. This will be accomplished
by identifying columns that are ǫ0 -close to the seed, so that if the seed belongs to Si,ǫ0 , the columns must belong to
the same subspace. Clearly the total number of columns N must be sufficiently large so that n or more such columns
can be found. We will return to the requirement on N a bit later, after first dealing with the following challenge.
Since the columns are only partially observed, it may not be possible to determine how close each is to the seed.
We address this by showing that if a column and the seed are both observed on enough common indices, then the
incoherence assumption A2 allows us to reliably estimate the distance.
Lemma 3. Assume A2 and let y = x1 − x2 , where x1 and x2 are two columns of X. Assume there is a common set of
indices of size q ≤ n where both x1 and x2 are observed. Let ω denote this common set of indices and let yω denote
the corresponding subset of y. Then for any δ0 > 0, if the number of commonly observed elements
q ≥ 8µ1² log(2/δ0),

then with probability at least 1 − δ0,

(1/2)‖y‖₂² ≤ (n/q)‖yω‖₂² ≤ (3/2)‖y‖₂².
Proof. Note that ‖yω‖₂² is the sum of q random variables drawn uniformly at random without replacement from the set {y1², y2², . . . , yn²}, and E‖yω‖₂² = (q/n)‖y‖₂². We will prove the bound under the assumption that, instead, the q variables are sampled with replacement, so that they are independent. By Lemma 1, this will provide the desired result. Note that if one variable in the sum ‖yω‖₂² is replaced with another value, then the sum changes in value by at most 2‖y‖∞². Therefore, McDiarmid's Inequality shows that for t > 0

P( |‖yω‖₂² − (q/n)‖y‖₂²| ≥ t ) ≤ 2 exp( −t² / (2q‖y‖∞⁴) ),

or equivalently

P( |(n/q)‖yω‖₂² − ‖y‖₂²| ≥ t ) ≤ 2 exp( −qt² / (2n²‖y‖∞⁴) ).
Suppose that x1 ∈ Si,ǫ0 (for some i) and that x2 ∉ Si, and that both x1, x2 observe q ≥ 2µ0² log(2/δ0) common indices. Let yω denote the difference between x1 and x2 on the common support set. If the partial distance (n/q)‖yω‖₂² ≤ ǫ0²/2, then the result above implies that with probability at least 1 − δ0,

‖x1 − x2‖₂² ≤ 2(n/q)‖yω‖₂² ≤ ǫ0².
On the other hand, if x2 ∈ Si and ‖x1 − x2‖₂² ≤ ǫ0²/3, then with probability at least 1 − δ0,

(n/q)‖yω‖₂² ≤ (3/2)‖x1 − x2‖₂² ≤ ǫ0²/2.
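The rescaled partial distance (n/q)‖yω‖₂² used throughout Lemma 3 can be sketched as follows (the function and argument names are ours, not the paper's):

```python
import numpy as np

def partial_distance_sq(x1, x2, mask1, mask2):
    """Estimate ||x1 - x2||_2^2 as (n/q) * ||y_omega||_2^2, where omega is
    the set of q indices observed in both columns. Returns None when the
    two columns share no observed support."""
    common = mask1 & mask2
    q = int(common.sum())
    if q == 0:
        return None
    n = x1.size
    return (n / q) * np.sum((x1[common] - x2[common]) ** 2)
```

Lemma 3 guarantees that, for incoherent differences and q large enough, this estimate lies between one half and three halves of the true squared distance with high probability.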
Using these results we will proceed as follows. For each seed we find all columns that have at least t0 > 2µ0² log(2/δ0) observations at indices in common with the seed (the precise value of t0 will be specified in a moment). Assuming that this set is sufficiently large, we will select ℓn of these columns uniformly at random, for some integer ℓ ≥ 1. In particular, ℓ will be chosen so that with high probability at least n of the columns will be within ǫ0/√3 of the seed, ensuring that with probability at least 1 − δ0 the corresponding partial distance of each will be within ǫ0/√2. That is enough to guarantee with the same probability that the columns are within ǫ0 of the seed. Of course, a union bound will be needed so that the distance bounds above hold uniformly over the set of sℓn columns under consideration, which means that we will need each to have at least t0 := 2µ0² log(2sℓn/δ0) observations at indices in common with the corresponding seed. All this is predicated on N being large enough so that such columns exist in XΩ. We will return to this issue later, after determining the requirement for ℓ. For now we will simply assume that N ≥ ℓn.
Lemma 4. Assume A3 and for each seed x let Tx,ǫ0 denote the number of columns of X in the ball of radius ǫ0/√3 about x. If the number of columns selected for each seed, ℓn, is such that

ℓ ≥ max{ 2k / (ν0 (ǫ0/√3)^r) , 8k log(s/δ0) / (n ν0 (ǫ0/√3)^r) },

then

E[Tx,ǫ0] ≥ ℓn ν0 (ǫ0/√3)^r / k.
By Chernoff's bound, for any 0 < γ < 1,

P( Tx,ǫ0 ≤ (1 − γ) ℓn ν0 (ǫ0/√3)^r / k ) ≤ exp( −γ² ℓn ν0 (ǫ0/√3)^r / (2k) ).

Taking γ = 1/2, we would like to choose ℓ so that ℓn ν0 (ǫ0/√3)^r / (2k) ≥ n and so that exp( −ℓn ν0 (ǫ0/√3)^r / (8k) ) ≤ δ0/s (so that the probability that the desired result fails for one or more of the s seeds is less than δ0). The first condition leads to the requirement ℓ ≥ 2k / (ν0 (ǫ0/√3)^r). The second condition produces the requirement ℓ ≥ 8k log(s/δ0) / (n ν0 (ǫ0/√3)^r).
We can now formally state the procedure for finding local neighborhoods in Algorithm 1. Recall that the number
of observed entries in each seed is at least η0 , per Lemma 2.
Lemma 5. If N is sufficiently large and η0 > t0 , then the Local Neighborhood Procedure in Algorithm 1 produces
at least n columns within ǫ0 of each seed, and at least one seed will belong to each of Si,ǫ0 , for i = 1, . . . , k, with
probability at least 1 − 3δ0 .
Proof. Lemma 2 states that if we select s0 seeds, then with probability at least 1 − δ0 there is a seed in each Si,ǫ0 ,
i = 1, . . . , k, with at least η0 observed entries, where η0 is defined in (3). Lemma 4 implies that if ℓ0 n columns are
selected uniformly at random for each seed, then with probability at least 1 − δ0 for each seed at least n of the columns
Algorithm 1 - Local Neighborhood Procedure
Input: n, k, µ0, ǫ0, ν0, η0, δ0 > 0.

s0 := k(log k + log(1/δ0)) / ((1 − e^{−4}) ν0)

ℓ0 := ⌈ max{ 2k / (ν0 (ǫ0/√3)^r) , 8k log(s0/δ0) / (n ν0 (ǫ0/√3)^r) } ⌉

t0 := ⌈ 2µ0² log(2 s0 ℓ0 n / δ0) ⌉

Steps:
1. Select s0 “seed” columns uniformly at random and discard all with fewer than η0 observations
2. For each seed, find all columns with at least t0 observations at locations observed in the seed
3. Randomly select ℓ0 n columns from each such set
4. Form the local neighborhood for each seed by randomly selecting n columns with partial distance less than ǫ0/√2 from the seed
will be within a distance ǫ0/√3 of the seed. Each seed has at least η0 observed entries and we need to find ℓ0 n other columns with at least t0 observations at indices where the seed was observed. Provided that η0 ≥ t0, this is certainly possible if N is large enough. It follows from Lemma 3 that if ℓ0 n columns have at least t0 observations at indices where the seed was also observed, then with probability at least 1 − δ0 the partial distances will be within ǫ0/√2, which implies the true distances are within ǫ0. The result follows by the union bound.
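Algorithm 1 can be sketched in code as follows. This is a simplified rendering (it takes s0, ℓ0, t0, η0 as inputs rather than computing them, and ignores the thinning discussed later); all names are ours:

```python
import numpy as np

def local_neighborhoods(X_obs, mask, s0, l0, t0, eta0, eps0, rng):
    """Sketch of Algorithm 1. X_obs: n x N with unobserved entries set to 0;
    mask: boolean n x N marking observed locations."""
    n, N = X_obs.shape
    seeds = rng.choice(N, size=s0, replace=False)
    seeds = [j for j in seeds if mask[:, j].sum() >= eta0]        # step 1
    neighborhoods = {}
    for j in seeds:
        overlap = (mask & mask[:, [j]]).sum(axis=0)               # step 2
        cands = np.flatnonzero(overlap >= t0)
        cands = cands[cands != j]
        if cands.size > l0 * n:                                   # step 3
            cands = rng.choice(cands, size=l0 * n, replace=False)
        keep = []
        for c in cands:                                           # step 4
            com = mask[:, j] & mask[:, c]
            q = com.sum()
            if q == 0:
                continue
            d2 = (n / q) * np.sum((X_obs[com, j] - X_obs[com, c]) ** 2)
            if np.sqrt(d2) < eps0 / np.sqrt(2):
                keep.append(int(c))
        neighborhoods[int(j)] = keep[:n]
    return neighborhoods
```

On fully observed data with well-separated subspaces, every retained neighbor agrees with its seed's subspace, matching the guarantee of Lemma 5.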
Finally, we quantify just how large N needs to be. Lemma 4 also shows that we require at least

N ≥ ℓn ≥ max{ 2kn / (ν0 (ǫ0/√3)^r) , 8k log(s/δ0) / (ν0 (ǫ0/√3)^r) }.
However, we must also determine a lower bound on the probability that a column selected uniformly at random has at
least t0 observed indices in common with a seed. Let γ0 denote this probability, and let p0 denote the probability of
observing each entry in X. Note that our main result, Theorem 2.1, assumes a lower bound on p0.
This implies that the expected number of columns with t0 or more observed indices in common with a seed is at least γ0 N. If ñ is the actual number with this property, then by Chernoff's bound, P(ñ ≤ γ0 N/2) ≤ exp(−γ0 N/8). So N ≥ 2ℓ0 γ0⁻¹ n will suffice to guarantee that enough columns can be found for each seed with probability at least 1 − s0 exp(−ℓ0 n/4), which is far larger than 1 − δ0 since δ0 decays only polynomially in n.
To take this a step further, a simple lower bound on γ0 is obtained as follows. Suppose we consider only a t0-sized subset of the indices where the seed is observed. The probability that another column selected at random is observed at all t0 indices in this subset is p0^{t0}. Clearly γ0 ≥ p0^{t0} = exp(t0 log p0) ≥ (2s0 ℓ0 n/δ0)^{2µ0² log p0}. This yields the following sufficient condition on the size of N:

N ≥ ℓ0 n (2s0 ℓ0 n/δ0)^{2µ0² log(1/p0)}.
From the definitions of s0 and ℓ0, this implies that if 2µ0² log(1/p0) is a fixed constant, then a sufficient number of columns will exist if N = O(poly(kn/δ0)). For example, if µ0² = 1 and p0 = 1/2, then N = O(((kn)/δ0)^{2.4}) will suffice; i.e., N need only grow polynomially in n. On the other hand, in the extremely undersampled case p0 scales like log²(n)/n (as n grows and r and k stay constant) and N will need to grow almost exponentially in n, like n^{log n − 2 log log n}.
We wish to apply these results to our local neighbor sets, but we have three issues we must address: First, the
sampling of the matrices formed by local neighborhood sets is not uniform since the set is selected based on the
observed indices of the seed. Second, given Lemma 2 we must complete not one, but s0 (see Algorithm 1) incomplete
matrices simultaneously with high probability. Third, some of the local neighbor sets may have columns from more
than one subspace. Let us consider each issue separately.
First consider the fact that our incomplete submatrices are not sampled uniformly. The non-uniformity can be
corrected with a simple thinning procedure. Recall that the columns in the seed’s local neighborhood are identified
first by finding columns with sufficient overlap with each seed’s observations. To refer to the seed’s observations, we
will say “the support of the seed.”
Due to this selection of columns, the resulting neighborhood columns are highly sampled on the support of the
seed. In fact, if we again use the notation q for the minimum overlap between two columns needed to calculate
distance, then these columns have at least q observations on the support of the seed. Off the support, these columns
are still sampled uniformly at random with the same probability as the entire matrix. Therefore we focus only on
correcting the sampling pattern on the support of the seed.
Let t be the cardinality of the support of a particular seed. Because all entries of the entire matrix are sampled independently with probability p0, for a randomly selected column the random variable which generates t is binomial. For neighbors selected to have at least q overlap with a particular seed, we denote by t′ the number of samples overlapping with the support of the seed. The probability density for t′ is positive only for j = q, . . . , t:

P(t′ = j) = (t choose j) p0^j (1 − p0)^{t−j} / ρ ,

where ρ = Σ_{j=q}^{t} (t choose j) p0^j (1 − p0)^{t−j}.
In order to thin the common support, we need two new random variables. The first is a Bernoulli, call it Y, which takes the value 1 with probability ρ and 0 with probability 1 − ρ. The second random variable, call it Z, takes values j = 0, . . . , q − 1 with probability

P(Z = j) = (t choose j) p0^j (1 − p0)^{t−j} / (1 − ρ),
which is the Binomial(t, p0) distribution conditioned on being less than q; mixing t′ and Z with weights ρ and 1 − ρ recovers the desired binomial distribution. Thus, the thinning is accomplished as follows. For each column draw
an independent sample of Y . If the sample is 1, then the column is not altered. If the sample is 0, then a realization of
Z is drawn, which we denote by z. Select a random subset of size z from the observed entries in the seed support and
discard the remainder. We note that the seed itself should not be used in completion, because there is a dependence
between the sample locations of the seed column and its selected neighbors which cannot be eliminated.
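The thinning step can be sketched directly from the distributions of Y and Z above (a minimal rendering; argument names are ours):

```python
import math
import random

def thin_support(obs_idx, t, q, p0, rng=random):
    """Thin a neighbor's observations on the seed's support (of size t) so
    the retained count follows the unconditioned Binomial(t, p0) law.
    obs_idx: the >= q indices on the support where the neighbor is observed."""
    pmf = [math.comb(t, j) * p0**j * (1 - p0)**(t - j) for j in range(t + 1)]
    rho = sum(pmf[q:])                 # P(t' >= q): mass kept by the selection
    if rng.random() < rho:             # Y = 1: keep the column unchanged
        return list(obs_idx)
    # Y = 0: draw Z from the renormalized lower tail j = 0, ..., q - 1
    z = rng.choices(range(q), weights=pmf[:q], k=1)[0]
    return rng.sample(list(obs_idx), z)
```

Mixing the unchanged case (probability ρ) with the lower-tail redraw (probability 1 − ρ) reproduces the binomial sampling pattern of an unconditioned column.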
Now after thinning, we have the following matrix completion guarantee for each neighborhood matrix.
Lemma 7. Assume all s0 seed neighborhood matrices are thinned according to the discussion above, have rank ≤ r,
and the matrix entries are observed uniformly at random with probability,
Finally, let us consider the third issue, the possibility that one or more of the points in the neighborhood of a seed
lies in a subspace different than the seed subspace. When this occurs, the rank of the submatrix formed by the seed’s
neighbor columns will be larger than the dimension of the seed subspace. Without loss of generality assume that we
have only two subspaces represented in the neighbor set, and assume their dimensions are r′ and r′′ . First, in the case
that r′ + r′′ > r, when a rank ≥ r matrix is completed to a rank r matrix, with overwhelming probability there will
be errors with respect to the observations as long as the number of samples in each column is O(r log r), which is
assumed in our case; see [12]. Thus we can detect and discard these candidates. Secondly, in the case that r′ + r′′ ≤ r,
we still have enough samples to complete this matrix successfully with high probability. However, since we have
drawn enough seeds to guarantee that every subspace has a seed with a neighborhood entirely in that subspace, we
will find that this problem seed is redundant. This is determined in the Subspace Refinement step.
5 Subspace Refinement
Each of the matrix completion steps above yields a low-rank matrix with a corresponding column subspace, which
we will call the candidate subspaces. While the true number of subspaces will not be known in advance, since s0 = O(k(log k + log(1/δ0))), the candidate subspaces will contain the true subspaces with high probability (see Lemma 4). We must now deal with the algorithmic issue of determining the true set of subspaces.
We first note that, from Assumption A3, with probability 1 a set of points of size ≥ r all drawn from a single
subspace S of dimension ≤ r will span S. In fact, any b < r points will span a b-dimensional subspace of the
r-dimensional subspace S.
Assume that r < n, since otherwise it is clearly necessary to observe all entries. Therefore, if a seed's nearest neighborhood set is confined to a single subspace, then the columns in it span their subspace. And if the seed's nearest
neighborhood contains columns from two or more subspaces, then the matrix will have rank larger than that of any
of the constituent subspaces. Thus, if a certain candidate subspace is spanned by the union of two or more smaller
candidate subspaces, then it follows that that subspace is not a true subspace (since we assume that none of the true
subspaces are contained within another).
This observation suggests the following subspace refinement procedure. The s0 matrix completions yield s ≤ s0 candidate column subspaces; s may be less than s0 since completions that fail are discarded as described above. First sort the estimated subspaces in order of rank from smallest to largest (with arbitrary ordering of subspaces of the same rank), which we write as S(1), . . . , S(s). We will denote the final set of estimated subspaces as Ŝ1, . . . , Ŝk. The first subspace Ŝ1 := S(1), a lowest-rank subspace in the candidate set. Next, Ŝ2 = S(2) if and only if S(2) is not contained in Ŝ1. Following this simple sequential strategy, suppose that when we reach the candidate S(j) we have so far determined Ŝ1, . . . , Ŝi, i < j. If S(j) is not in the span of ∪_{ℓ=1}^{i} Ŝℓ, then we set Ŝi+1 = S(j); otherwise we move on to the next candidate. In this way, we can proceed sequentially through the rank-ordered list of candidates, and we will identify all true subspaces.
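The sequential refinement just described amounts to rank-ordered containment tests, which can be sketched as follows (assuming each candidate is given by an orthonormal basis matrix):

```python
import numpy as np

def refine_subspaces(bases, tol=1e-10):
    """Process candidate subspaces (orthonormal basis matrices) in order of
    increasing rank, keeping each one that is not contained in the span of
    the union of those already kept."""
    kept = []
    for U in sorted(bases, key=lambda B: B.shape[1]):
        if kept:
            Q, _ = np.linalg.qr(np.hstack(kept))    # basis for the kept union
            # U lies in span(kept) iff projecting onto span(Q) preserves U
            if np.linalg.norm(U - Q @ (Q.T @ U)) <= tol:
                continue                             # contained: discard
        kept.append(U)
    return kept
```

A candidate spanned by two smaller kept subspaces is discarded, exactly the situation described above for problem seeds whose neighborhoods mix subspaces.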
and for j = 2, . . . , k,

‖xΩ − PΩ,Sj xΩ‖₂² > 0 . (8)
Proof. We wish to use results from [12, 14], which require a fixed number of measurements |Ω|. By Chernoff’s bound
np0 −np0
P |Ω| ≤ ≤ exp .
2 8
Note that np0 > 16rβ log² n, therefore exp(−np0/8) < (n^{−2β})^{log n} < δ0; in other words, we observe |Ω| > np0/2
entries of x with probability 1 − δ0 . This set Ω is selected uniformly at random among all sets of size |Ω|, but using
Lemma 1 we can assume that the samples are drawn uniformly with replacement in order to apply results of [12, 14].
Now we show that |Ω| > np0 /2 samples selected uniformly with replacement implies that
|Ω| > max{ (8rµ0/3) log(2r/δ0), rµ0(1 + ξ)²/((1 − α)(1 − γ)) }, (9)
where ξ, α > 0 and γ ∈ (0, 1) are defined as α = √((2µ1²/|Ω|) log(1/δ0)), ξ = √(2µ1 log(1/δ0)), and γ = √((8rµ0/(3|Ω|)) log(2r/δ0)).
We start with the second term in the max of (9). Substituting δ0 and the bound for p0, one can show that for
n ≥ 15 both α ≤ 1/2 and γ ≤ 1/2. This makes (1 + ξ)²/((1 − α)(1 − γ)) ≤ 4(1 + ξ)² ≤ 8ξ² for ξ > 2.5, i.e., for
δ0 < 0.04.
We finish this argument by noting that 8ξ² = 16µ1 log(1/δ0) < np0/2; there is in fact an O(r log(n)) gap
between the two. Similarly for the first term in the max of (9), (8/3) rµ0 log(2r/δ0) < np0/2; here the gap is O(log(n)).
Now we prove (7), which follows from [12]. With |Ω| > (8/3) rµ0 log(2r/δ0), we have that UΩ^T UΩ is invertible with
probability at least 1 − δ0 according to Lemma 3 of [12]. This implies that
U^T x = (UΩ^T UΩ)^{−1} UΩ^T xΩ. (10)
Call a1 = U^T x. Since x ∈ S, U a1 = x, and a1 is in fact the unique solution to U a = x. Now consider the equation
UΩ a = xΩ. The assumption that UΩ^T UΩ is invertible implies that a2 = (UΩ^T UΩ)^{−1} UΩ^T xΩ exists and is the unique
solution to UΩ a = xΩ. However, UΩ a1 = xΩ as well, meaning that a1 = a2. Thus, we have
‖xΩ − PΩ,S1 xΩ‖₂² = ‖xΩ − UΩ U^T x‖₂² = 0
with probability at least 1 − δ0.
Now we prove (8), paralleling Theorem 1 in [14]. We use Assumption A3 to ensure that x ∉ Sj, j = 2, . . . , k.
This along with (9) and Theorem 1 from [12] guarantees that
‖xΩ − PΩ,Sj xΩ‖₂² ≥ [ (|Ω|(1 − α) − rµ0(1 + ξ)²/(1 − γ)) / n ] ‖x − PSj x‖₂² > 0
for each j = 2, . . . , k with probability at least 1 − 3δ0 . With a union bound this holds simultaneously for all k − 1
alternative subspaces with probability at least 1 − 3(k − 1)δ0 . When we also include the events that (7) holds and that
|Ω| > np0 /2, we get that the entire theorem holds with probability at least 1 − (3(k − 1) + 2)δ0 .
Finally, denote the column to be completed by xΩ. To complete xΩ we first determine which subspace it belongs to
using the results above. For a given column we can use the incomplete data projection residual of (7). With probability
at least 1 − (3(k − 1) + 2)δ0, the residual will be zero for the correct subspace and strictly positive for all other subspaces.
Using the span of the chosen subspace, U, we can then complete the column by using x̂ = U (UΩ^T UΩ)^{−1} UΩ^T xΩ.
We reiterate that Lemma 8 allows us to complete a single column x with probability 1 − (3(k − 1) + 2)δ0 . If we
wish to complete the entire matrix, we will need another union bound over all N columns, leading to a log N factor in
our requirement on p0. Since N may be quite large in applications, we prefer to state our result in terms of a per-column
completion bound.
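The per-column assignment-and-completion step described above can be sketched as follows (numpy; `subspaces` is a hypothetical list of orthonormal bases, one per estimated subspace):

```python
import numpy as np

def complete_column(x_omega, omega, subspaces):
    """Assign an incomplete column to the subspace with the smallest
    projection residual on the observed rows, then fill in the rest.

    x_omega   : observed entries of the column (length |omega|)
    omega     : indices of the observed rows
    subspaces : list of n x r orthonormal bases U, one per candidate subspace
    """
    best_U, best_a, best_res = None, None, np.inf
    for U in subspaces:
        U_om = U[omega, :]
        # Least-squares coefficients a = (U_om^T U_om)^{-1} U_om^T x_omega.
        a, *_ = np.linalg.lstsq(U_om, x_omega, rcond=None)
        res = np.linalg.norm(x_omega - U_om @ a)
        if res < best_res:
            best_U, best_a, best_res = U, a, res
    # Complete: x_hat = U (U_om^T U_om)^{-1} U_om^T x_omega = U a.
    return best_U @ best_a
```

Per the theory above, when enough entries are observed the residual is exactly zero for the correct subspace and strictly positive for all others, so the minimum-residual rule picks the right one.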
The confidence level stated in Theorem 2.1 is the result of applying the union bound to all the steps required in
Sections 3, 4, and 6. All hold simultaneously with probability at least
1 − (6 + 3(k − 1) + 12s0 ) δ0 < 1 − (6 + 15s0 )δ0 ,
which proves the theorem.
7 Experiments
The following experiments evaluate the performance of the proposed high-rank matrix completion procedure and
compare results with standard low-rank matrix completion based on nuclear norm minimization.
7.1 Numerical Simulations
We begin by examining a highly synthetic experiment in which the data exactly matches the assumptions of our high-
rank matrix completion procedure. The key parameters were chosen as follows: n = 100, N = 5000, k = 10, and
r = 5. The k subspaces were r-dimensional, and each was generated by drawing r vectors from the N (0, In) distribution
and taking their span. The resulting subspaces are highly incoherent with the canonical basis for Rn . For each
subspace, we generate 500 points drawn from a N (0, U U^T) distribution, where U is an n×r matrix whose orthonormal
columns span the subspace. Our procedure was implemented using ⌈3k log k⌉ seeds. The matrix completion software
GROUSE [15] was used both in our procedure and to implement the standard low-rank matrix
completions. We ran 50 independent trials of our procedure and compared it to standard low-rank matrix completion.
The results are summarized in the figures below. The key message is that our new procedure can provide accurate
completions from far fewer observations compared to standard low-rank completion, which is precisely what our main
result predicts.
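The synthetic data described above can be generated along the following lines (a sketch only; the GROUSE solver itself is not reproduced here):

```python
import numpy as np

def generate_union_of_subspaces(n=100, N=5000, k=10, r=5, seed=0):
    """Sample N points from a union of k random r-dimensional subspaces of R^n."""
    rng = np.random.default_rng(seed)
    per = N // k
    cols, labels = [], []
    for j in range(k):
        # Orthonormal basis of a random r-dimensional subspace.
        U = np.linalg.qr(rng.standard_normal((n, r)))[0]
        # 'per' points from N(0, U U^T): U times standard Gaussian coefficients.
        cols.append(U @ rng.standard_normal((r, per)))
        labels += [j] * per
    return np.hstack(cols), np.array(labels)
```

Random Gaussian bases of this kind are highly incoherent with the canonical basis, matching the setup in the text.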
Figure 2: The number of correctly completed columns (with tolerances shown above, 10⁻⁵ or 0.01), versus the average number
of observations per column. As expected, our procedure (termed high rank MC in the plot) provides accurate completion with only
about 50 samples per column. Note that r log n ≈ 23 in this simulation, so this is quite close to our bound. On the other hand, since
the rank of the full matrix is rk = 50, the standard low-rank matrix completion bound requires m > 50 log n ≈ 230. Therefore, it
is not surprising that the standard method (termed low rank MC above) requires almost all samples in each column.
Figure 3: Internet topology example of subnets sending traffic to passive monitors through the Internet core and common border
routers.
Figure 4: Hop count imputation results, using a synthetic network with k = 12 subnets, n = 75 passive monitors, and N = 2700
IP addresses. The cumulative distribution of estimation error is shown with respect to observing 40% of the total elements.
Finally, using real-world Internet delay measurements (courtesy of [19]) from n = 100 monitors to N = 22550
IP addresses, we test imputation performance when the underlying subnet structure is not known. Using the estimate
k = 15, in Figure 5 we find a significant performance increase using the high-rank matrix completion technique.
References
[1] B. Recht, “A Simpler Approach to Matrix Completion,” Journal of Machine Learning Research, to appear,
arXiv:0910.0651v2.
[2] E. J. Candès and T. Tao, “The Power of Convex Relaxation: Near-Optimal Matrix Completion,” IEEE Trans-
actions on Information Theory, vol. 56, May 2010, pp. 2053–2080.
[3] R. Vidal, “A Tutorial on Subspace Clustering,” in Johns Hopkins Technical Report, 2010.
[4] K. Kanatani, “Motion Segmentation by Subspace Separation and Model Selection,” in Computer Vision, 2001.
ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 2, 2001, pp. 586–591.
[Plot omitted: cumulative distribution vs. approximation error (in ms), comparing High Rank MC and Standard MC.]
Figure 5: Real-world delay imputation results, using a network of n = 100 monitors, N = 22550 IP addresses, and an unknown
number of subnets. The cumulative distribution of estimation error is shown with respect to observing 40% of the total delay
elements.
[5] R. Vidal, Y. Ma, and S. Sastry, “Generalized Principal Component Analysis (GPCA),” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 27, December 2005.
[6] G. Lerman and T. Zhang, “Robust Recovery of Multiple Subspaces by Lp Minimization,” 2011, Preprint at
http://arxiv.org/abs/1104.3770.
[7] A. Gruber and Y. Weiss, “Multibody Factorization with Uncertainty and Missing Data using the EM Algorithm,”
in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), vol. 1, June 2004.
[8] R. Vidal, R. Tron, and R. Hartley, “Multiframe Motion Segmentation with Missing Data Using Power Factoriza-
tion and GPCA,” International Journal of Computer Vision, vol. 79, pp. 85–105, 2008.
[9] B. Eriksson, P. Barford, and R. Nowak, “Network Discovery from Passive Measurements,” in Proceedings of
ACM SIGCOMM Conference, Seattle, WA, August 2008.
[10] E. Candès and B. Recht, “Exact Matrix Completion via Convex Optimization,” Foundations of Computational
Mathematics, vol. 9, 2009, pp. 717–772.
[11] B. Eriksson, P. Barford, J. Sommers, and R. Nowak, “DomainImpute: Inferring Unseen Components in the
Internet,” in Proceedings of IEEE INFOCOM Mini-Conference, Shanghai, China, April 2011, pp. 171–175.
[12] L. Balzano, B. Recht, and R. Nowak, “High-Dimensional Matched Subspace Detection When Data are
Missing,” in Proceedings of the International Conference on Information Theory, June 2010, available at
http://arxiv.org/abs/1002.0852.
[13] G. Chen and M. Maggioni, “Multiscale Geometric and Spectral Analysis of Plane Arrangements,” in IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, June 2011.
[14] L. Balzano, R. Nowak, A. Szlam, and B. Recht, “k-Subspaces with missing data,” University of Wisconsin,
Madison, Tech. Rep. ECE-11-02, February 2011.
[15] L. Balzano and B. Recht, GROUSE software, 2010, http://sunbeam.ece.wisc.edu/grouse/.
[16] N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies with Rocketfuel,” in Proceedings of ACM
SIGCOMM, Pittsburgh, PA, August 2002.
[17] B. Eriksson, P. Barford, R. Nowak, and M. Crovella, “Learning Network Structure from Passive Measurements,”
in Proceedings of ACM Internet Measurement Conference, San Diego, CA, October 2007.
[18] L. Li, D. Alderson, W. Willinger, and J. Doyle, “A First-Principles Approach to Understanding the Internet’s
Router-Level Topology,” in Proceedings of ACM SIGCOMM Conference, August 2004.
[19] J. Ledlie, P. Gardner, and M. Seltzer, “Network Coordinates in the Wild,” in Proceedings of NSDI Conference,
April 2007.
A New Theory for Matrix Completion
B-DAT, School of Information & Control, Nanjing Univ Informat Sci & Technol
NO 219 Ningliu Road, Nanjing, Jiangsu, China, 210044
{gcliu,qsliu,xtyuan}@nuist.edu.cn
Abstract
Prevalent matrix completion theories rely on an assumption that the locations of
the missing data are distributed uniformly and randomly (i.e., uniform sampling).
Nevertheless, the reason for observations being missing often depends on the unseen
observations themselves, and thus the missing data in practice usually occurs in a
nonuniform and deterministic fashion rather than randomly. To break through the
limits of random sampling, this paper introduces a new hypothesis called isomeric
condition, which is provably weaker than the assumption of uniform sampling and
arguably holds even when the missing data is placed irregularly. Equipped with
this new tool, we prove a series of theorems for missing data recovery and matrix
completion. In particular, we prove that the exact solutions that identify the target
matrix are included as critical points by the commonly used nonconvex programs.
Unlike the existing theories for nonconvex matrix completion, which are built
upon the same condition as convex programs, our theory shows that nonconvex
programs have the potential to work with a much weaker condition. Compared to
the existing studies on nonuniform sampling, our setup is more general.
1 Introduction
Missing data is a common occurrence in modern applications such as computer vision and image
processing, reducing significantly the representativeness of data samples and therefore distorting
seriously the inferences about data. Given this pressing situation, it is crucial to study the problem
of recovering the unseen data from a sampling of observations. Since the data in reality is often
organized in matrix form, it is of considerable practical significance to study the well-known problem
of matrix completion [1] which is to fill in the missing entries of a partially observed matrix.
Problem 1.1 (Matrix Completion). Denote the (i, j)th entry of a matrix as [·]ij . Let L0 ∈ Rm×n be
an unknown matrix of interest; in particular, even the rank of L0 is unknown. Given a sampling of
the entries in L0 and a 2D index set Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n} consisting of the locations
of the observed entries, i.e., given
{[L0 ]ij |(i, j) ∈ Ω} and Ω,
can we restore the missing entries whose indices are not included in Ω, in an exact and scalable
fashion? If so, under which conditions?
Due to its unique role in a broad range of applications, e.g., structure from motion and magnetic
resonance imaging, matrix completion has received extensive attention in the literature, e.g., [2–13].
∗ The work of Guangcan Liu is supported in part by the National Natural Science Foundation of China (NSFC)
under Grant 61622305 and Grant 61502238, and in part by the Natural Science Foundation of Jiangsu Province of China
(NSFJPC) under Grant BK20160040.
† The work of Qingshan Liu is supported by NSFC under Grant 61532009.
‡ The work of Xiao-Tong Yuan is supported in part by NSFC under Grant 61402232 and Grant 61522308, and in
part by NSFJPC under Grant BK20141003.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
A-new-theory-for-matrix-completion-Paper 117
Figure 1: Left and Middle: Typical configurations for the locations of the observed entries. Right: A
real example from the Oxford motion database. The black areas correspond to the missing entries.
In general, given no presumption about the nature of the matrix entries, it is virtually impossible to
restore L0, as the missing entries can take arbitrary values. That is, some assumptions are necessary
for solving Problem 1.1. Given the high-dimensional and massive nature of today’s data, it is
arguable that the target matrix L0 we wish to recover is often low rank [23]. Hence,
one may perform matrix completion by seeking a matrix with the lowest rank that also satisfies the
constraints given by the observed entries:
min_L rank(L), s.t. [L]ij = [L0]ij, ∀(i, j) ∈ Ω. (1)
Unfortunately, this idea is of little practical use because the problem above is NP-hard and cannot be
solved in polynomial time [15]. To achieve practical matrix completion, Candès and Recht [4]
suggested considering an alternative that instead minimizes the nuclear norm, which is a convex
envelope of the rank function [12]. Namely,
min_L ‖L‖∗, s.t. [L]ij = [L0]ij, ∀(i, j) ∈ Ω, (2)
where k · k∗ denotes the nuclear norm, i.e., the sum of the singular values of a matrix. Rather
surprisingly, it is proved in [4] that the missing entries, with high probability, can be exactly restored
by the convex program (2), as long as the target matrix L0 is low rank and incoherent and the set Ω of
locations corresponding to the observed entries is sampled uniformly at random. This pioneering
work provides several useful tools to investigate matrix completion and many other related
problems. Those assumptions, including low-rankness, incoherence and uniform sampling, are now
standard and widely used in the literature, e.g., [14, 17, 22, 24, 28, 33, 34, 36]. In particular, the
analyses in [17, 33, 36] show that, in terms of theoretical completeness, many nonconvex optimization
based methods are as powerful as the convex program (2). Unfortunately, these theories still depend
on the assumption of uniform sampling, and thus they cannot explain why there are many nonconvex
methods which often do better than the convex program (2) in practice.
The missing data in practice, however, often occurs in a nonuniform and deterministic fashion instead
of randomly. This is because the reason for an observation being missing usually depends on the
unseen observations themselves. For example, in structure from motion and magnetic resonance
imaging, the locations of the observed entries are typically concentrated around the main diagonal of
a matrix⁴, as shown in Figure 1. Moreover, as pointed out by [19, 21, 23], the incoherence condition
is indeed not so consistent with the mixture structure of multiple subspaces, which is also a ubiquitous
phenomenon in practice. There has been sparse research in the direction of nonuniform sampling,
e.g., [18, 25–27, 31]. In particular, Negahban and Wainwright [26] studied the case of weighted
entrywise sampling, which is more general than the setup of uniform sampling but still a special
form of random sampling. Király et al. [18] considered deterministic sampling; their work is the most
closely related to ours. However, they only established conditions to decide whether a particular entry of the
matrix can be restored. In other words, the setup of [18] may not handle well the dependence among
the missing entries. In summary, although matrix completion has seen considerable progress in recent
years, it still calls for more practical theories and methods.
To break through the limits of the setup of random sampling, in this paper we introduce a new
hypothesis called isomeric condition, which is a mixed concept that combines together the rank and
coherence of L0 with the locations and amount of the observed entries.
(⁴ This statement means that the observed entries are concentrated around the main diagonal after a permutation
of the sampling pattern Ω.)
In general, isomerism (the noun form of isomeric) is a very mild hypothesis and only a little bit more strict than the well-known oracle
assumption; that is, the number of observed entries in each row and column of L0 is not smaller than
the rank of L0 . It is arguable that the isomeric condition can hold even when the missing entries have
irregular locations. In particular, it is provable that the widely used assumption of uniform sampling
is sufficient to ensure isomerism, not necessary. Equipped with this new tool, isomerism, we prove a
set of theorems pertaining to missing data recovery [35] and matrix completion. For example, we
prove that, under the condition of isomerism, the exact solutions that identify the target matrix are
included as critical points by the commonly used bilinear programs. This result helps to explain the
widely observed phenomenon that there are many nonconvex methods performing better than the
convex program (2) on real-world matrix completion tasks. In summary, the contributions of this
paper mainly include:
• We invent a new hypothesis called the isomeric condition, which provably holds given the
standard assumptions of uniform sampling, low-rankness and incoherence. In addition,
we also exemplify that the isomeric condition can hold even if the target matrix L0 is not
incoherent and the missing entries are placed irregularly. Compared to the existing studies
on nonuniform sampling, our setup is more general.
• Equipped with the isomeric condition, we prove that the exact solutions that identify L0
are included as critical points by the commonly used bilinear programs. Compared to the
existing theories for nonconvex matrix completion, our theory is built upon a much weaker
assumption and can therefore partially reveal the superiority of nonconvex programs over
the convex methods based on (2).
• We prove that the isomeric condition is sufficient and necessary for the column and row
projectors of L0 to be invertible given the sampling pattern Ω. This result implies that
the isomeric condition is necessary for ensuring that the minimal rank solution to (1) can
identify the target L0.
The rest of this paper is organized as follows. Section 2 summarizes the mathematical notations used
in the paper. Section 3 introduces the proposed isomeric condition, along with some theorems for
matrix completion. Section 4 shows some empirical results and Section 5 concludes this paper. The
detailed proofs to all the proposed theorems are presented in the Supplementary Materials.
2 Notations
Capital and lowercase letters are used to represent matrices and vectors, respectively, except that the
lowercase letters, i, j, k, m, n, l, p, q, r, s and t, are used to denote some integers, e.g., the location of
an observation, the rank of a matrix, etc. For a matrix M , [M ]ij is its (i, j)th entry, [M ]i,: is its ith row
and [M ]:,j is its jth column. Let ω1 and ω2 be two 1D index sets; namely, ω1 = {i1 , i2 , · · · , ik } and
ω2 = {j1 , j2 , · · · , js }. Then [M ]ω1 ,: denotes the submatrix of M obtained by selecting the rows with
indices i1 , i2 , · · · , ik , [M ]:,ω2 is the submatrix constructed by choosing the columns j1 , j2 , · · · , js ,
and similarly for [M ]ω1 ,ω2 . For a 2D index set Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}, we imagine it
as a sparse matrix and, accordingly, define its “rows”, “columns” and “transpose” as follows: The
ith row Ωi = {j1 |(i1 , j1 ) ∈ Ω, i1 = i}, the jth column Ωj = {i1 |(i1 , j1 ) ∈ Ω, j1 = j} and the
transpose ΩT = {(j1 , i1 )|(i1 , j1 ) ∈ Ω}.
The special symbol (·)⁺ is reserved to denote the Moore-Penrose pseudo-inverse of a matrix. More
precisely, for a matrix M with Singular Value Decomposition (SVD)⁵ M = UM ΣM VM^T, its pseudo-
inverse is given by M⁺ = VM ΣM^{−1} UM^T. For convenience, we adopt the conventions of using
span{M } to denote the linear space spanned by the columns of a matrix M , using y ∈ span{M } to
denote that a vector y belongs to the space span{M }, and using Y ∈ span{M } to denote that all the
column vectors of a matrix Y belong to span{M }.
Capital letters U , V , Ω and their variants (complements, subscripts, etc.) are reserved for left singular
vectors, right singular vectors and index set, respectively. For convenience, we shall abuse the
notation U (resp. V ) to denote the linear space spanned by the columns of U (resp. V ), i.e., the
column space (resp. row space). The orthogonal projection onto the column space U , is denoted by
PU and given by PU (M ) = U U T M , and similarly for the row space PV (M ) = M V V T . The same
⁵ In this paper, SVD always refers to the skinny SVD. For a rank-r matrix M ∈ Rm×n, its SVD is of the form
UM ΣM VM^T, where UM ∈ Rm×r, ΣM ∈ Rr×r and VM ∈ Rn×r.
notation is also used to represent a subspace of matrices (i.e., the image of an operator), e.g., we say
that M ∈ PU for any matrix M which satisfies PU (M ) = M . We shall also abuse the notation Ω
to denote the linear space of matrices supported on Ω. Then the symbol PΩ denotes the orthogonal
projection onto Ω, namely,
[PΩ(M)]ij = [M]ij if (i, j) ∈ Ω, and [PΩ(M)]ij = 0 otherwise.
Similarly, the symbol PΩ⊥ denotes the orthogonal projection onto the complement space of Ω. That
is, PΩ + PΩ⊥ = I, where I is the identity operator.
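In code, PΩ and its complement are simple masking operators; a minimal numpy sketch, representing Ω as a boolean mask:

```python
import numpy as np

def P_Omega(M, mask):
    """Orthogonal projection onto matrices supported on Omega."""
    return np.where(mask, M, 0.0)

def P_Omega_perp(M, mask):
    """Projection onto the complement support, so P_Omega + P_Omega_perp = I."""
    return np.where(mask, 0.0, M)
```

Both operators are idempotent and their sum reproduces the input matrix, matching PΩ + PΩ⊥ = I.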
Three types of matrix norms are used in this paper, and they are all functions of the singular values:
1) The operator norm or 2-norm (i.e., largest singular value) denoted by kM k, 2) the Frobenius norm
(i.e., square root of the sum of squared singular values) denoted by kM kF and 3) the nuclear norm
or trace norm (i.e., sum of singular values) denoted by kM k∗ . The only used vector norm is the `2
norm, which is denoted by k · k2 . The symbol | · | is reserved for the cardinality of an index set.
3.1.1 Definitions
For ease of understanding, we shall begin with a concept called k-isomerism (or k-isomeric in
adjective form), which can be regarded as an extension of low-rankness.
Definition 3.1 (k-isomeric). A matrix M ∈ Rm×l is called k-isomeric if and only if any k rows of
M can linearly represent all rows in M . That is,
rank ([M ]ω,: ) = rank (M ) , ∀ω ⊆ {1, 2, · · · , m}, |ω| = k,
where | · | is the cardinality of an index set.
In general, k-isomerism is somewhat similar to the Spark [37], which is defined by the smallest linearly
dependent subset of the rows of a matrix. For a matrix M to be k-isomeric, it is necessary that
rank(M) ≤ k, but not sufficient. In fact, k-isomerism is also related to the concept of
coherence [4, 21]. When the coherence of a matrix M ∈ Rm×l is not too high, the rows of M are
sufficiently spread, and thus M can be k-isomeric with a small k, e.g., k = rank(M). Whenever
the coherence of M is very high, one may need a large k to satisfy the k-isomeric property. For
example, consider an extreme case where M is a rank-1 matrix with one row being 1 and everywhere
else being 0. In this case, we need k = m to ensure that M is k-isomeric.
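Definition 3.1 can be checked directly, albeit at combinatorial cost, by enumerating all row subsets of size k; a brute-force sketch (practical only for small m):

```python
import numpy as np
from itertools import combinations

def is_k_isomeric(M, k):
    """Check Definition 3.1: every set of k rows of M spans the row space of M."""
    r = np.linalg.matrix_rank(M)
    if k < r:
        return False  # necessary condition: rank(M) <= k
    return all(np.linalg.matrix_rank(M[list(w), :]) == r
               for w in combinations(range(M.shape[0]), k))
```

On the extreme example from the text (a rank-1 matrix nonzero in a single row), this check returns False for every k < m and True only at k = m.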
While Definition 3.1 involves all 1D index sets of cardinality k, we often need the isomeric property
to be associated with a certain 2D index set Ω. To this end, we define below a concept called
Ω-isomerism (or Ω-isomeric in adjective form).
Definition 3.2 (Ω-isomeric). Let M ∈ Rm×l and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Suppose
that Ωj 6= ∅ (empty set), ∀1 ≤ j ≤ n. Then the matrix M is called Ω-isomeric if and only if
rank([M]Ωj,:) = rank(M), ∀j = 1, 2, · · · , n.
Note here that only the number of rows in M is required to coincide with the row indices included in
Ω, and thereby l 6= n is allowable.
Generally, Ω-isomerism is less strict than k-isomerism. Provided that |Ωj| ≥ k, ∀1 ≤ j ≤ n, if a matrix
M is k-isomeric then M is Ω-isomeric as well, but not vice versa. For the extreme example
where M is nonzero in only one row, interestingly, M can be Ω-isomeric as long as the locations of
the nonzero elements are included in Ω.
With the notation of ΩT = {(j1 , i1 )|(i1 , j1 ) ∈ Ω}, the isomeric property could be also defined on
the column vectors of a matrix, as shown in the following definition.
Definition 3.3 (Ω/ΩT -isomeric). Let M ∈ Rm×n and Ω ⊆ {1, 2, · · · , m}×{1, 2, · · · , n}. Suppose
Ωi 6= ∅ and Ωj 6= ∅, ∀i = 1, · · · , m, j = 1, · · · , n. Then the matrix M is called Ω/ΩT -isomeric if
and only if M is Ω-isomeric and M T is ΩT -isomeric as well.
To solve Problem 1.1 without the imperfect assumption of missing at random, as will be shown later,
we need to assume that L0 is Ω/ΩT-isomeric. This condition excludes the unidentifiable cases
where some rows or columns of L0 are wholly missing. In fact, whenever L0 is Ω/ΩT-isomeric, the
number of observed entries in each row and column of L0 has to be greater than or equal to the rank
of L0; this is consistent with the results in [20]. Moreover, Ω/ΩT-isomerism actually handles well
the cases where L0 is of high coherence. For example, consider an extreme case where L0 is 1 at only
one element and 0 everywhere else. In this case, L0 cannot be Ω/ΩT-isomeric unless the nonzero
element is observed. So, generally, it is possible to restore the missing entries of a highly coherent
matrix, as long as the Ω/ΩT-isomeric condition is obeyed.
It is easy to see that the above lemma is still valid even when the condition of Ω-isomerism is replaced
by k-isomerism. Thus, hereafter, we may say that a space is isomeric (k-isomeric, Ω-isomeric or
ΩT -isomeric) as long as its basis matrix is isomeric. In addition, the isomeric property is subspace
successive, as shown in the next lemma.
Lemma 3.2. Let Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n} and U0 ∈ Rm×r be the basis matrix of a
Euclidean subspace embedded in Rm . Suppose that U is a subspace of U0 , i.e., U = U0 U0T U . If U0
is Ω-isomeric then U is Ω-isomeric as well.
The above lemma states, in short, that any subspace of an isomeric space is isomeric.
to ensure that the global minimum of (1) can identify L0, it is essentially necessary to show that
U0 ∩ Ω⊥ = {0} (resp. V0 ∩ Ω⊥ = {0}), which is equivalent to the operator PU0 PΩ PU0 (resp.
PV0 PΩ PV0) being invertible (see Lemma 6.8 of the Supplementary Materials). Interestingly, the isomeric
condition is indeed a sufficient and necessary condition for the operators PU0 PΩ PU0 and PV0 PΩ PV0
to be invertible, as shown in the following theorem.
Theorem 3.1. Let L0 ∈ Rm×n and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Let the SVD of L0 be
U0 Σ0 V0T . Denote PU0 (·) = U0 U0T (·) and PV0 (·) = (·)V0 V0T . Then we have the following:
1. The linear operator PU0 PΩ PU0 is invertible if and only if U0 is Ω-isomeric.
2. The linear operator PV0 PΩ PV0 is invertible if and only if V0 is ΩT -isomeric.
The necessity stated above implies that the isomeric condition is actually a very mild hypothesis. In
general, there are numerous reasons for the target matrix L0 to be isomeric. Particularly, the widely
used assumptions of low-rankness, incoherence and uniform sampling are indeed sufficient (but not
necessary) to ensure isomerism, as shown in the following theorem.
Theorem 3.2. Let L0 ∈ Rm×n and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Denote n1 = max(m, n)
and n2 = min(m, n). Suppose that L0 is incoherent and Ω is a 2D index set sampled uniformly
at random, namely Pr((i, j) ∈ Ω) = ρ0 and Pr((i, j) ∉ Ω) = 1 − ρ0. For any δ > 0, if ρ0 > δ
is obeyed and rank(L0) < δn2/(c log n1) holds for some numerical constant c, then, with probability
at least 1 − n1^{−10}, L0 is Ω/ΩT-isomeric.
It is worth noting that the isomeric condition can be obeyed in numerous circumstances other than
the case of uniform sampling plus incoherence. For example, take
Ω = {(1, 1), (1, 2), (1, 3), (2, 1), (3, 1)} and L0 = [1 0 0; 0 0 0; 0 0 0],
where L0 is a 3×3 matrix with 1 at (1, 1) and 0 everywhere else. In this example, L0 is not incoherent
and the sampling is not uniform either, but it can be verified that L0 is Ω/ΩT-isomeric.
3.2 Results
In this subsection, we shall show how the isomeric condition can take effect in the context of
nonuniform sampling, establishing some theorems pertaining to missing data recovery [35] as well
as matrix completion.
As we can now see, the unseen data yu can be restored, as long as the representation x0 is retrieved
by accessing only the available observations in yb. In general, there are infinitely many
representations that satisfy y0 = Ax0, e.g., x0 = A⁺y0, where (·)⁺ is the pseudo-inverse of a matrix.
Since A⁺y0 is the representation of minimal ℓ2 norm, we revisit the traditional ℓ2 program:
min_x (1/2)‖x‖₂², s.t. yb = Ab x, (6)
where ‖ · ‖2 is the ℓ2 norm of a vector. Under some verifiable conditions, the above ℓ2 program
is indeed consistently successful, in the following sense: for any y0 ∈ S0 with an arbitrary
partition y0 = [yb; yu] (i.e., arbitrarily missing), the desired representation x0 = A⁺y0 is the unique
minimizer of the problem in (6). That is, the unseen data yu is exactly recovered by first computing
the minimizer x∗ of problem (6) and then calculating yu = Au x∗.
Theorem 3.3. Let y0 = [yb ; yu ] ∈ Rm be an authentic sample drawn from some low-dimensional
subspace S0 embedded in Rm , A ∈ Rm×p be a given dictionary and k be the number of available
observations in yb . Then the convex program (6) is consistently successful, provided that S0 ⊆
span{A} and the dictionary A is k-isomeric.
Unlike the theory in [35], whose condition is unverifiable, our k-isomeric condition can be
verified in finite time. Notice that the problem of missing data recovery is closely related to matrix
completion, which in effect restores the missing entries of multiple data vectors simultaneously.
Hence, Theorem 3.3 can be naturally generalized to the case of matrix completion, as will be shown
in the next subsection.
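A sketch of the recovery procedure behind Theorem 3.3: solve (6) via the pseudo-inverse of the observed block Ab, then reconstruct the unseen block with Au (the function and variable names below are illustrative):

```python
import numpy as np

def recover_missing(A, y_b, obs_idx):
    """Recover the unseen part of y0 = A x0 from observations y_b = A_b x0.

    A       : m x p dictionary whose span contains the sample y0
    y_b     : observed entries of y0
    obs_idx : row indices of the observed entries
    """
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[obs_idx] = True
    A_b, A_u = A[mask, :], A[~mask, :]
    # Minimal-l2-norm solution of y_b = A_b x, i.e., program (6).
    x_star = np.linalg.pinv(A_b) @ y_b
    return A_u @ x_star  # estimate of y_u
```

When A is k-isomeric with k observed entries (so A_b retains full rank), the minimal-norm solution coincides with the desired representation and the unseen entries are recovered exactly.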
Theorem 3.4 tells us that, in general, even when the locations of the missing entries are interrelated
and nonuniformly distributed, the target matrix L0 can be restored as long as we have found a proper
dictionary A. This motivates us to consider the commonly used bilinear program that seeks both A
and X simultaneously:
min_{A,X} (1/2)‖A‖_F^2 + (1/2)‖X‖_F^2, s.t. PΩ(AX − L0) = 0, (8)
where A ∈ R^{m×p} and X ∈ R^{p×n}. The problem above is bilinear and therefore nonconvex, so it would be hard to obtain performance guarantees as strong as those established for convex programs, e.g., [4, 21].
Interestingly, under a very mild condition, the problem in (8) is proved to include the exact solutions that identify the target matrix L0 among its critical points.
Theorem 3.5. Let L0 ∈ Rm×n and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Denote the rank and SVD
of L0 as r0 and U0 Σ0 V0T , respectively. If L0 is Ω/ΩT -isomeric then the exact solution, denoted by
(A0 , X0 ) and given by
A0 = U0 Σ0^{1/2} Q^T,  X0 = Q Σ0^{1/2} V0^T,  ∀Q ∈ R^{p×r0}, Q^T Q = I,
is a critical point to the problem in (8).
To exhibit the power of program (8), however, the parameter p, which indicates the number of
columns in the dictionary matrix A, must be close to the true rank of the target matrix L0 . This is
Figure 2: Comparing the bilinear program (9) (p = m) with the convex method (2). Four panels, from left to right: convex (nonuniform), nonconvex (nonuniform), convex (uniform), nonconvex (uniform); each plots the success rate over 20 random trials against rank(L0) and the observation fraction. The white and black points mean "succeed" and "fail", respectively. Here success means PSNR ≥ 40dB, where PSNR stands for peak signal-to-noise ratio.
impractical in the cases where the rank of L0 is unknown. Notice that the Ω-isomeric condition imposed on A requires
rank(A) ≤ |Ωj|, ∀j = 1, 2, · · · , n.
This, together with the condition L0 ∈ span{A}, essentially requires us to solve a low-rank matrix recovery problem [14]. Hence, we suggest combining the formulation (7) with the popular idea of nuclear norm minimization, resulting in a bilinear program that jointly estimates both the dictionary matrix A and the representation matrix X by
min_{A,X} ‖A‖_* + (1/2)‖X‖_F^2, s.t. PΩ(AX − L0) = 0, (9)
which, coincidentally, has been mentioned in a paper on optimization [32]. Similar to (8), the program in (9) enjoys the following performance guarantee.
Theorem 3.6. Let L0 ∈ Rm×n and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Denote the rank and SVD
of L0 as r0 and U0 Σ0 V0T , respectively. If L0 is Ω/ΩT -isomeric then the exact solution, denoted by
(A0 , X0 ) and given by
A0 = U0 Σ0^{2/3} Q^T,  X0 = Q Σ0^{1/3} V0^T,  ∀Q ∈ R^{p×r0}, Q^T Q = I,
is a critical point to the problem in (9).
Unlike (8), which performs well only if p is close to rank(L0) and the initial solution is chosen carefully, the bilinear program in (9) can work well by simply choosing p = m and using A = I as the initial solution. To see why, one would essentially need to characterize the conditions under which a specific optimization procedure produces an optimal solution that coincides with an exact solution. This requires extensive justification, and we leave it as future work.
4 Simulations
To verify the advantages of the nonconvex matrix completion methods over the convex program (2), we experiment with randomly generated matrices. We generate a collection of m × n (m = n = 100) target matrices according to the model L0 = BC, where B ∈ R^{m×r0} and C ∈ R^{r0×n} are N(0, 1) matrices. The rank of L0, i.e., r0, is configured as r0 = 1, 5, 10, · · · , 90, 95. Regarding the index set Ω consisting of the locations of the observed entries, we consider two settings: one creates Ω by using a Bernoulli model to randomly sample a subset of {1, · · · , m} × {1, · · · , n} (referred to as "uniform"); the other, as in Figure 1, concentrates the locations of the observed entries around the main diagonal of the matrix (referred to as "nonuniform"). The observation fraction is set to |Ω|/(mn) = 0.01, 0.05, · · · , 0.9, 0.95. For each pair (r0, |Ω|/(mn)), we run 20 trials, resulting in 8000 simulations in total.
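The data-generation protocol can be reproduced along the following lines. This is a sketch: the band half-width used for the nonuniform mask is our own heuristic, since the paper only states that the observed entries concentrate around the main diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 100
r0 = 10
B = rng.standard_normal((m, r0))
C = rng.standard_normal((r0, n))
L0 = B @ C                                    # target matrix of rank r0

rho = 0.3                                     # observation fraction |Omega|/(mn)

# "uniform": Bernoulli sampling of the index set Omega.
mask_uniform = rng.random((m, n)) < rho

# "nonuniform": keep entries within a band around the main diagonal;
# the half-width is chosen so that roughly a rho fraction is observed.
i, j = np.meshgrid(np.arange(m), np.arange(n), indexing="ij")
band = int(rho * m / 2)
mask_nonuniform = np.abs(i - j) <= band
```

A completion method would then only be given `L0[mask]` for either mask, and success can be declared when the PSNR of the reconstruction exceeds 40dB, as in Figure 2.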
When p = m and the identity matrix is used to initialize the dictionary A, we have empirically found that program (8) has the same performance as (2). This is not surprising, because it has been proven in [16] that ‖L‖_* = min_{A,X} (1/2)(‖A‖_F^2 + ‖X‖_F^2), s.t. L = AX. Figure 2 compares the bilinear
program (9) to the convex method (2). It can be seen that (9) works distinctly better than (2): when handling nonuniformly missing data, the number of matrices successfully restored by the bilinear program (9) is 102% greater than that of the convex program (2). Even when the missing entries are chosen uniformly at random, the bilinear program (9) still outperforms the convex method (2) by 44% in terms of the number of successfully restored matrices. These results illustrate that, even when the rank of L0 is unknown, the bilinear program (9) can do much better than the convex optimization based method (2).
Acknowledgment
We would like to thank the anonymous reviewers and meta-reviewers for providing many valuable comments that helped us refine this paper.
References
[1] Emmanuel Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion.
IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[2] Emmanuel Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
[3] William E. Bishop and Byron M. Yu. Deterministic symmetric positive semidefinite matrix completion.
In Neural Information Processing Systems, pages 2762–2770, 2014.
[4] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations
of Computational Mathematics, 9(6):717–772, 2009.
[5] Eyal Heiman, Gideon Schechtman, and Adi Shraibman. Deterministic algorithms for matrix completion.
Random Structures and Algorithms, 45(2):306–317, 2014.
[6] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries.
IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
[7] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries.
Journal of Machine Learning Research, 11:2057–2078, 2010.
[8] Akshay Krishnamurthy and Aarti Singh. Low-rank matrix and tensor completion via adaptive sampling.
In Neural Information Processing Systems, pages 836–844, 2013.
[9] Troy Lee and Adi Shraibman. Matrix completion from any given set of observations. In Neural Information
Processing Systems, pages 1781–1787, 2013.
[10] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning
large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.
[11] Karthik Mohan and Maryam Fazel. New restricted isometry results for noisy low-rank recovery. In IEEE
International Symposium on Information Theory, pages 1573–1577, 2010.
[12] B. Recht, W. Xu, and B. Hassibi. Necessary and sufficient conditions for success of the nuclear norm
heuristic for rank minimization. Technical report, CalTech, 2008.
[13] Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, and Alex J. Smola. Cofi rank - maximum margin
matrix factorization for collaborative ranking. In Neural Information Processing Systems, 2007.
[14] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis?
Journal of the ACM, 58(3):1–37, 2011.
[15] Alexander L. Chistov and Dima Grigoriev. Complexity of quantifier elimination in the theory of alge-
braically closed fields. In Proceedings of the Mathematical Foundations of Computer Science, pages
17–31, 1984.
[16] Maryam Fazel, Haitham Hindi, and Stephen P. Boyd. A rank minimization heuristic with application to
minimum order system approximation. In American Control Conference, pages 4734–4739, 2001.
[17] Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Neural
Information Processing Systems, pages 2973–2981, 2016.
[18] Franz J. Király, Louis Theran, and Ryota Tomioka. The algebraic combinatorial approach for low-rank
matrix completion. J. Mach. Learn. Res., 16(1):1391–1436, January 2015.
[19] Guangcan Liu and Ping Li. Recovery of coherent data via low-rank dictionary pursuit. In Neural
Information Processing Systems, pages 1206–1214, 2014.
[20] Daniel L. Pimentel-Alarcón and Robert D. Nowak. The information-theoretic requirements of subspace clustering with missing data. In International Conference on Machine Learning, pages 802–810, 2016.
[21] Guangcan Liu and Ping Li. Low-rank matrix completion in the presence of high coherence. IEEE
Transactions on Signal Processing, 64(21):5623–5633, 2016.
[22] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171–184, 2013.
[23] Guangcan Liu, Qingshan Liu, and Ping Li. Blessing of dimensionality: Recovering mixture data via dictionary pursuit. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):47–60, 2017.
[24] Guangcan Liu, Huan Xu, Jinhui Tang, Qingshan Liu, and Shuicheng Yan. A deterministic analysis for LRR. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):417–430, 2016.
[25] Raghu Meka, Prateek Jain, and Inderjit S. Dhillon. Matrix completion from power-law distributed samples.
In Neural Information Processing Systems, pages 1258–1266, 2009.
[26] Sahand Negahban and Martin J. Wainwright. Restricted strong convexity and weighted matrix completion:
Optimal bounds with noise. Journal of Machine Learning Research, 13:1665–1697, 2012.
[27] Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, and Rachel Ward. Completing any low-rank matrix, provably. Journal of Machine Learning Research, 16:2999–3034, 2015.
[28] Praneeth Netrapalli, U. N. Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain. Non-
convex robust PCA. In Neural Information Processing Systems, pages 1107–1115, 2014.
[29] Yuzhao Ni, Ju Sun, Xiaotong Yuan, Shuicheng Yan, and Loong-Fah Cheong. Robust low-rank subspace
segmentation with semidefinite guarantees. In International Conference on Data Mining Workshops, pages
1179–1188, 2013.
[30] R. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, USA, 1970.
[31] Ruslan Salakhutdinov and Nathan Srebro. Collaborative filtering in a non-uniform world: Learning with
the weighted trace norm. In Neural Information Processing Systems, pages 2056–2064, 2010.
[32] Fanhua Shang, Yuanyuan Liu, and James Cheng. Scalable algorithms for tractable schatten quasi-norm
minimization. In AAAI Conference on Artificial Intelligence, pages 2016–2022, 2016.
[33] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE
Transactions on Information Theory, 62(11):6535–6579, 2016.
[34] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions
on Information Theory, 58(5):3047–3064, 2012.
[35] Yin Zhang. When is missing data recoverable? CAAM Technical Report TR06-15, 2006.
[36] Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix
estimation. In Neural Information Processing Systems, pages 559–567, 2015.
[37] David L. Donoho and Michael Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003.
arXiv:1907.11705v1 [cs.DS] 27 Jul 2019
Notice: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Abstract
As a paradigm for recovering the unknown entries of a matrix from partial observations, low-rank matrix completion (LRMC) has generated a great deal of interest. Over the years, there have been many works on this topic, but it might not be easy to grasp the essential knowledge from these studies, mainly because many of them are highly theoretical or propose a new LRMC technique. In this paper, we give a contemporary survey on LRMC. To provide a better view, insight, and understanding of the potentials and limitations of LRMC, we present early scattered results in a structured and accessible way. Specifically, we classify the state-of-the-art LRMC techniques into two main categories and then explain each category in detail. We next discuss the issues to consider when using LRMC techniques, including the intrinsic properties required for matrix recovery and how to exploit a special structure in LRMC design. We also discuss convolutional neural network (CNN) based LRMC algorithms that exploit the graph structure of a low-rank matrix. Further, we present the recovery performance and the computational complexity of the state-of-the-art LRMC techniques. Our hope is that this survey article will serve as a useful guide for practitioners and non-experts to catch the gist of LRMC.
I. INTRODUCTION
In the era of big data, the low-rank matrix has become a useful and popular tool for expressing two-dimensional information. One well-known example is the rating matrix in recommendation systems, which represents users' tastes on products [1]. Since users expressing similar ratings on multiple products tend to have the same interest in a new product, the columns associated with users sharing the same interest are highly likely to be the same, resulting in the low-rank structure
of the rating matrix (see Fig. 1).
Fig. 1. Netflix rating matrix example: (a) the rating matrix, with each entry an integer from 1 to 5 and zero for unknown; (b) a submatrix M of size 50 × 50; (c) the observed matrix Mo (70% of the known entries of M); (d) the matrix M̂ reconstructed via LRMC using Mo.
Another example is the Euclidean distance matrix formed by
the pairwise distances of a large number of sensor nodes. Since the rank of a Euclidean distance
matrix in the k-dimensional Euclidean space is at most k + 2 (if k = 2, then the rank is 4), this
matrix can be readily modeled as a low-rank matrix [2], [3], [4].
One major benefit of the low-rank model is that the essential information in a matrix, expressed in terms of degrees of freedom, is much smaller than the total number of entries. Therefore, even though the number of observed entries is small, we still have a good chance to recover the whole matrix. There are a variety of scenarios where the number of observed entries of a matrix is tiny. In recommendation systems, for example, users are asked to submit feedback in the form of a rating, e.g., 1 to 5 for a purchased product. However, users often do not want to leave feedback, and thus the rating matrix will have many missing entries. Also, in an internet of things (IoT) network, sensor nodes may have a limited radio communication range or suffer a power outage, so that only a small portion of the entries of the Euclidean distance matrix is available.
When there is no restriction on the rank of a matrix, the problem of recovering its unknown entries from partially observed entries is ill-posed. This is because any value can be assigned to an unknown entry, which in turn means that an infinite number of matrices agree with the observed entries. As a simple example, consider the following 2 × 2 matrix with one unknown entry marked ?:

M = [ 1  5
      2  ? ].  (1)
If M is full rank, i.e., the rank of M is two, then any value except 10 can be assigned to ?. Whereas if M is a low-rank matrix (rank one in this trivial example), the two columns differ only by a constant factor, and hence the unknown element ? can be easily determined using the linear relationship between the two columns (? = 10). This example is obviously simple, but the fundamental principle for recovering a large-dimensional matrix is not much different, and the low-rank constraint plays a central role in recovering the unknown entries of the matrix.
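The toy example can be checked directly in code: under the rank-one assumption the second column is a scalar multiple of the first, which pins down the unknown entry. This is merely a numerical check of the text.

```python
import numpy as np

# Known entries of M = [[1, 5], [2, ?]].
m11, m12, m21 = 1.0, 5.0, 2.0

# Rank one means column 2 = alpha * column 1, so alpha = m12 / m11.
alpha = m12 / m11
m22 = alpha * m21          # the unknown entry ? = 10
assert m22 == 10.0

# The completed matrix indeed has rank one.
M = np.array([[m11, m12], [m21, m22]])
assert np.linalg.matrix_rank(M) == 1
```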
Before we proceed, we discuss a few notable applications where the underlying matrix is
modeled as a low-rank matrix.
1) Recommendation system: In 2006, the online DVD rental company Netflix announced a contest to improve the quality of its movie recommendation system. The company released a training set from half a million customers, containing ratings on more than ten thousand movies, each rated on a scale from 1 to 5 [1]. The training data can be represented as a large-dimensional matrix in which each column represents the ratings of one customer for the movies. The primary goal of the recommendation system is to estimate users' interests in products using the sparsely
Fig. 2. Localization via LRMC [4]: (a) partially observed distances of sensor nodes due to the limited radio communication range r; (b) RSSI-based observation errors of 1000 sensor nodes in a 100m × 100m area; (c) reconstruction error. The Euclidean distance matrix can be recovered with 92% of distance errors below 0.5m using 30% of the observed distances.
sampled¹ rating matrix.² Often, users sharing the same interests in key factors such as the type, the price, and the appearance of a product tend to provide the same ratings on the movies. The ratings of such users might form a low-rank column space, resulting in the low-rank model of the rating matrix (see Fig. 1).
2) Phase retrieval: The problem of recovering a signal, not necessarily sparse, from the magnitude of its observations is referred to as phase retrieval. Phase retrieval is an important problem in X-ray crystallography and quantum mechanics, since only the magnitude of
¹ The Netflix dataset consists of ratings of more than 17,000 movies by more than 2.5 million users. The number of known entries is only about 1% [1].
² Customers might not necessarily rate all of the movies.
Fig. 3. Image reconstruction via LRMC. Recovered images achieve peak SNR ≥ 32dB.
the Fourier transform is measured in these applications [5]. Suppose the unknown time-domain signal m = [m0 · · · mn−1] is acquired in the form of the measured magnitude of its Fourier transform. That is,

|zω| = (1/√n) | Σ_{t=0}^{n−1} mt e^{−j2πωt/n} |, ω ∈ Ω, (2)
where Ω is the set of sampled frequencies. Further, let

fω = (1/√n) [1, e^{−j2πω/n}, · · · , e^{−j2πω(n−1)/n}]^H, (3)
M = mm^H, where m^H is the conjugate transpose of m. Then, (2) can be rewritten as

|zω|^2 = ⟨M, Fω⟩, (4)

where Fω = fω fω^H is the rank-1 matrix of the waveform fω. Using this simple transform, we can express the quadratic magnitude |zω|^2 as a linear measurement of M. In essence, the phase retrieval problem can be converted into the problem of reconstructing the rank-1 matrix M in the positive semi-definite (PSD) cone³ [5]:

min_X rank(X)
subject to ⟨X, Fω⟩ = |zω|^2, ω ∈ Ω,
X ⪰ 0.
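The lifting identity behind this formulation is easy to verify numerically: for any frequency ω, the squared magnitude of the Fourier measurement equals the linear functional ⟨M, Fω⟩ of the rank-one matrix M = mm^H. The snippet below is only a sanity check of the identity on an arbitrary test signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
m = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # arbitrary complex signal
M = np.outer(m, m.conj())                                  # M = m m^H, rank one

for omega in range(n):
    # f_w from (3); the entrywise conjugation comes from the ^H in its definition.
    f = np.exp(2j * np.pi * omega * np.arange(n) / n) / np.sqrt(n)
    F = np.outer(f, f.conj())                              # F_w = f_w f_w^H (Hermitian)
    z = np.vdot(f, m)                                      # f_w^H m, the measurement in (2)
    # <M, F_w> = tr(F_w^H M) is real and equals |z_w|^2.
    assert np.isclose(abs(z) ** 2, np.trace(F @ M).real)
```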
3) Localization in IoT networks: In recent years, the internet of things (IoT) has received much attention for its plethora of applications, such as healthcare, automatic metering, environmental monitoring (temperature, pressure, moisture), and surveillance [6], [7], [2]. Since actions in IoT networks, such as fire alarms, energy transfer, and emergency requests, are made primarily at the data center, the data center should figure out the location of all devices in the network. In this scheme, called network localization (a.k.a. cooperative localization), each sensor node measures the distance information of adjacent nodes and then sends it to the data center. The data center then constructs a map of the sensor nodes using the collected distance information [8]. For various reasons, such as the power outage of a sensor node or the limited radio communication range (see Fig. 1), only a small amount of distance information is available at the data center. Also, in vehicular networks, it is not easy to measure the distances of all adjacent vehicles when a vehicle is located in a dead zone. An example of the observed Euclidean distance
matrix is

Mo = [ 0      d12^2  d13^2  ?      ?
       d21^2  0      ?      ?      ?
       d31^2  ?      0      d34^2  d35^2
       ?      ?      d43^2  0      d45^2
       ?      ?      d53^2  d54^2  0     ],
where dij is the pairwise distance between sensor nodes i and j. Since the rank of a Euclidean distance matrix M is at most k + 2 in the k-dimensional Euclidean space (k = 2 or k = 3) [3], [4], the problem of reconstructing M can be well modeled as an LRMC problem.
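The rank bound is easy to confirm: squared pairwise distances expand as ‖pi‖² + ‖pj‖² − 2pi^T pj, a sum of two rank-1 terms and one rank-k term. The quick check below uses random planar points; the positions and node count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
k, num = 2, 50                               # dimension and number of sensor nodes
P = rng.random((num, k)) * 100               # node positions in a 100m x 100m area

# Squared Euclidean distance matrix: d_ij^2 = ||p_i - p_j||^2
#                                           = ||p_i||^2 + ||p_j||^2 - 2 p_i^T p_j.
sq = (P ** 2).sum(axis=1)
M = sq[:, None] + sq[None, :] - 2 * P @ P.T

# rank(M) <= k + 2 (= 4 for planar nodes), far below its size of 50.
assert np.linalg.matrix_rank(M, tol=1e-6) <= k + 2
```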
³ If M is recovered, then the time-domain vector m can be computed by the eigenvalue decomposition of M.
each column of Φ is the pilot signal from one antenna at the BS [10], [11]. Since the number of resolvable paths P is limited in most cases, one can readily assume that rank(H) ≤ P [12]. In massive MIMO systems, P is often much smaller than the dimension of H due to the limited number of clusters around the BS. Thus, the problem of recovering H at the BS can be solved via the rank minimization problem subject to the linear constraint Y = HΦ [11].
Other than these, there is a bewildering variety of applications of LRMC in wireless communication, such as millimeter wave (mmWave) channel estimation [13], [14], topological interference management (TIM) [15], [16], [17], [18], and mobile edge caching in fog radio access networks (Fog-RAN) [19], [20].
The paradigm of LRMC has received much attention ever since the works of Fazel [21], Candes and Recht [22], and Candes and Tao [23]. Over the years, there have been many works on this topic [5], [57], [48], [49], but it might not be easy to grasp the essentials of LRMC from these studies. One reason is that many of these works are highly theoretical, building on random matrix theory, graph theory, manifold analysis, and convex optimization. Another reason is that most of these works propose a new LRMC technique, so it is difficult to catch the general idea and big picture of LRMC from them.
The primary goal of this paper is to provide a contemporary survey on LRMC, a new paradigm
to recover unknown entries of a low-rank matrix from partial observations. To provide better
view, insight, and understanding of the potentials and limitations of LRMC to researchers and
practitioners in a friendly way, we present early scattered results in a structured and accessible
way. Firstly, we classify the state-of-the-art LRMC techniques into two main categories and
then explain each category in detail. Secondly, we present issues to be considered when using
LRMC techniques. Specifically, we discuss the intrinsic properties required for low-rank matrix
recovery and explain how to exploit a special structure, such as positive semidefinite-based
structure, Euclidean distance-based structure, and graph structure, in LRMC design. Thirdly, we
compare the recovery performance and the computational complexity of LRMC techniques via
numerical simulations. We conclude the paper by commenting on the choice of LRMC techniques
and providing future research directions.
Recently, there have been a few overview papers on LRMC. An overview of LRMC algorithms
and their performance guarantees can be found in [73]. A survey with an emphasis on first-order
LRMC techniques together with their computational efficiency is presented in [74]. Our work
is distinct from the previous studies in several aspects. Firstly, we categorize the state-of-the-
art LRMC techniques into two classes and then explain the details of each class, which can
help researchers to easily determine which technique can be used for the given problem setup.
Secondly, we provide a comprehensive survey of LRMC techniques and also provide extensive
simulation results on the recovery quality and the running time complexity from which one can
easily see the pros and cons of each LRMC technique and also gain a better insight into the
choice of LRMC algorithms. Finally, we discuss how to exploit a special structure of a low-
rank matrix in the LRMC algorithm design. In particular, we introduce the CNN-based LRMC
algorithm that exploits the graph structure of a low-rank matrix.
We briefly summarize notations used in this paper.
• For a vector a ∈ Rn , diag(a) ∈ Rn×n is the diagonal matrix formed by a.
• For a matrix A ∈ Rn1 ×n2 , ai ∈ Rn1 is the i-th column of A.
• rank(A) is the rank of A.
• AT ∈ Rn2 ×n1 is the transpose of A.
• For A, B ∈ Rn1 ×n2 , hA, Bi = tr(AT B) and A⊙B are the inner product and the Hadamard
product (or element-wise multiplication) of two matrices A and B, respectively, where
tr(·) denotes the trace operator.
• kAk, kAk∗ , and kAkF stand for the spectral norm (i.e., the largest singular value), the
nuclear norm (i.e., the sum of singular values), and the Frobenius norm of A, respectively.
• σi (A) is the i-th largest singular value of A.
• 0d1 ×d2 and 1d1 ×d2 are (d1 × d2 )-dimensional matrices with entries being zero and one,
respectively.
• Id is the d-dimensional identity matrix.
• If A is a square matrix (i.e., n1 = n2 = n), diag(A) ∈ Rn is the vector formed by the
diagonal entries of A.
• vec(X) is the vectorization of X.
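A few of these conventions translate directly into numpy, for readers following along in code (a small illustration with arbitrary matrices):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

# <A, B> = tr(A^T B) coincides with summing the Hadamard product A ⊙ B.
inner = np.trace(A.T @ B)
assert inner == (A * B).sum()

# ||A||_* is the sum of singular values; ||A|| (spectral norm) is the largest one.
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A, "nuc"), s.sum())
assert np.isclose(np.linalg.norm(A, 2), s[0])
```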
In this section, we discuss the principle to recover a low-rank matrix from partial observations.
Basically, the desired low-rank matrix M can be recovered by solving the rank minimization
problem
min_X rank(X)
subject to x_ij = m_ij, (i, j) ∈ Ω, (9)
where Ω is the index set of observed entries (e.g., Ω = {(1, 1), (1, 2), (2, 1)} in the example
in (1)). One can alternatively express the problem using the sampling operator PΩ . The sampling
operation PΩ (A) of a matrix A is defined as
[PΩ(A)]_ij = { a_ij  if (i, j) ∈ Ω
             { 0     otherwise.
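In code, PΩ is simply an elementwise mask; a two-line helper suffices, with Ω represented as a boolean matrix:

```python
import numpy as np

def P_omega(A, mask):
    """Sampling operator: keep entries of A where mask is True, zero elsewhere."""
    return np.where(mask, A, 0.0)

A = np.array([[1.0, 5.0], [2.0, 10.0]])
mask = np.array([[True, True], [True, False]])   # Omega = {(1,1),(1,2),(2,1)}
assert np.array_equal(P_omega(A, mask), np.array([[1.0, 5.0], [2.0, 0.0]]))
```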
Using this operator, the problem (9) can be equivalently formulated as
min_X rank(X)
subject to PΩ(X) = PΩ(M). (10)
A naive way to solve the rank minimization problem (10) is combinatorial search. Specifically, we first assume that rank(M) = 1. Then any two columns of M are linearly dependent, and thus we have the system of equations mi = αi,j mj for some αi,j ∈ R. If this system has no solution, the rank-one assumption is rejected, and we move on to the assumption rank(M) = 2. In this case, we solve the new system of equations mi = αi,j mj + αi,k mk. This procedure is repeated until a solution is found. Clearly, the combinatorial search strategy is infeasible for most practical scenarios, since its complexity is exponential in the problem size [76]. For example, when M is an n × n matrix, it can be shown that the number of systems of equations to be solved is O(n 2^n).
As a cost-effective alternative, various low-rank matrix completion (LRMC) algorithms have been proposed over the years. Roughly speaking, depending on how the rank information is used, LRMC algorithms can be classified into two main categories: 1) those without the rank information and 2) those exploiting the rank information. In this section, we provide an in-depth discussion of the two categories (see the outline of LRMC algorithms in Fig. 3).
In this subsection, we explain the LRMC algorithms that do not require the rank information
of the original low-rank matrix.
1) Nuclear Norm Minimization (NNM): Since the rank minimization problem (10) is NP-hard [21], it is computationally intractable when the dimension of the matrix is large. One common trick to avoid this computational issue is to replace the nonconvex objective function with its convex surrogate, converting the combinatorial search problem into a convex optimization problem.
There are two clear advantages in solving the convex optimization problem: 1) a local optimum
solution is globally optimal and 2) there are many efficient polynomial-time convex optimization
solvers (e.g., interior point method [77] and semi-definite programming (SDP) solver).
In the LRMC problem, the nuclear norm ‖X‖_*, the sum of the singular values of X, has been widely used as a convex surrogate of rank(X) [22]:

min_X ‖X‖_*
subject to PΩ(X) = PΩ(M). (11)
Indeed, it has been shown that the nuclear norm is the convex envelope (the "best" convex approximation) of the rank function on the set {X ∈ R^{n1×n2} : ‖X‖ ≤ 1} [21].⁴ Note that the relaxation from the rank function to the nuclear norm is conceptually analogous to the relaxation from the ℓ0-norm to the ℓ1-norm in compressed sensing (CS) [39], [40], [41].
Now, a natural question one might ask is whether the NNM problem in (11) offers a solution comparable to that of the rank minimization problem in (10). In [22], it has been shown that if the observed entries of a rank-r matrix M (∈ R^{n×n}) are suitably random and the number of observed entries exceeds a threshold (stated in [22]) that grows with the rank r, log n, and the largest coherence µ0 of M (see the definition in Subsection III-A2), then M is the unique solution of the NNM problem (11) with overwhelming probability (see Appendix B).
⁴ For any function f : C → R, where C is a convex set, the convex envelope of f is the largest convex function g such that f(x) ≥ g(x) for all x ∈ C. Note that the convex envelope of rank(X) on the set {X ∈ R^{n1×n2} : ‖X‖ ≤ 1} is the nuclear norm ‖X‖_* [21].
It is worth mentioning that the NNM problem in (11) can also be recast as a semidefinite program (SDP) (see Appendix A):

min_Y tr(Y)
subject to ⟨A_k, Y⟩ = b_k, k = 1, · · · , |Ω|, (13)
Y ⪰ 0,

where Y = [W1 X; X^T W2] ∈ R^{(n1+n2)×(n1+n2)}, {A_k}_{k=1}^{|Ω|} is the sequence of linear sampling matrices, and {b_k}_{k=1}^{|Ω|} are the observed entries. The problem (13) can be solved by off-the-shelf SDP solvers such as SDPT3 [24] and SeDuMi [25] using interior-point methods [26], [27], [28], [31], [30], [29]. It has been shown that the computational complexity of SDP techniques is O(n³), where n = max(n1, n2) [30]. Also, it has been shown that under suitable conditions, the output M̂ of SDP satisfies ‖M̂ − M‖_F ≤ ε in at most O(n^ω log(1/ε)) iterations, where ω is a positive constant [29]. Alternatively, one can reconstruct M by solving the equivalent nonconvex quadratic optimization form of the NNM problem [32]. Note that this approach has a computational benefit, since the number of primal variables of NNM is reduced from n1 n2 to r(n1 + n2) (r ≤ min(n1, n2)). Interested readers may refer to [32] for more details.
2) Singular Value Thresholding (SVT): While the solution of the NNM problem in (11) can
be obtained by solving (13), this procedure is computationally burdensome when the size of the
matrix is large.
As an effort to mitigate the computational burden, the singular value thresholding (SVT)
algorithm has been proposed [33]. The key idea of this approach is to put the regularization
term into the objective function of the NNM problem:

min_X τ‖X‖_* + (1/2)‖X‖_F^2
subject to PΩ(X) = PΩ(M), (14)

where τ is the regularization parameter. In [33, Theorem 3.1], it has been shown that the solution to the problem (14) converges to the solution of the NNM problem as τ → ∞.⁵
⁵ In practice, a large value of τ has been suggested (e.g., τ = 5n for an n × n low-rank matrix) for fast convergence of SVT. For example, when τ = 5000, it requires 177 iterations to reconstruct a 1000 × 1000 matrix of rank 10 [33].
max_Y min_X L(X, Y) = L(X̂, Ŷ) = min_X max_Y L(X, Y). (16)

The SVT algorithm finds X̂ and Ŷ in an iterative fashion. Specifically, starting with Y0 = 0_{n1×n2}, SVT updates Xk and Yk as
Theorem 1 ([33, Theorem 2.1]). Let Z be a matrix whose singular value decomposition (SVD) is Z = UΣV^T. Define t+ = max{t, 0} for t ∈ R. Then

D_τ(Z) = argmin_X τ‖X‖_* + (1/2)‖X − Z‖_F^2, (19)

where D_τ is the singular value thresholding operator defined as

D_τ(Z) = U diag({(σ_i(Σ) − τ)+}_i) V^T. (20)
Table I. The SVT algorithm:

  While T = false do
    k = k + 1
    [U_{k−1}, Σ_{k−1}, V_{k−1}] = svd(Y_{k−1})
    X_k = U_{k−1} diag({(σ_i(Σ_{k−1}) − τ)₊}_i) V^T_{k−1}   (using (20))
    Y_k = Y_{k−1} + δ_k (P_Ω(M) − P_Ω(X_k))
  End
  Output X_k
By Theorem 1, the right-hand side of (18) is D_τ(Y_{k−1}). To conclude, the update equations for X_k and Y_k are given by

  X_k = D_τ(Y_{k−1}),   (21a)
  Y_k = Y_{k−1} + δ_k (P_Ω(M) − P_Ω(X_k)),   (21b)

where δ_k is the stepsize.
One can notice from (21a) and (21b) that the SVT algorithm is computationally efficient since
we only need the truncated SVD and elementary matrix operations in each iteration. Indeed,
let rk be the number of singular values of Yk−1 being greater than the threshold τ . Also,
we suppose {rk } converges to the rank of the original matrix, i.e., limk→∞ rk = r. Then the
computational complexity of SVT is O(r n1 n2). Note also that the number of iterations to achieve the ε-approximation⁶ is O(1/√ε) [33]. In Table I, we summarize the SVT algorithm. For the details of the stopping criterion of SVT, see [33, Section 5].
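To make the update equations (21a) and (21b) concrete, here is a minimal NumPy sketch of the SVT iteration. The function name and stopping rule are our own choices, and the defaults (τ = 5n and δ ≈ 1.2/p, following the rules of thumb suggested in [33]) are assumptions rather than a reference implementation.

```python
import numpy as np

def svt_complete(M_obs, mask, tau=None, max_iter=500, tol=1e-4):
    """Sketch of the SVT iteration: X_k = D_tau(Y_{k-1}) followed by
    Y_k = Y_{k-1} + delta * (P_Omega(M) - P_Omega(X_k))."""
    n1, n2 = M_obs.shape
    if tau is None:
        tau = 5 * max(n1, n2)               # rule of thumb from [33]
    delta = 1.2 * mask.size / mask.sum()    # stepsize ~ 1.2 / sampling ratio
    Y = np.zeros((n1, n2))
    X = np.zeros((n1, n2))
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt   # D_tau: shrink singular values
        R = mask * (M_obs - X)                    # residual on observed entries
        if np.linalg.norm(R) <= tol * np.linalg.norm(mask * M_obs):
            break
        Y = Y + delta * R                         # dual ascent step (21b)
    return X
```

In a production setting one would replace the full SVD by a truncated SVD, which is what makes the per-iteration cost O(r n1 n2) as discussed above.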
Over the years, various SVT-based techniques have been proposed [35], [78], [79]. In [78], an iterative matrix completion algorithm using an SVT-based operator called the proximal operator has been proposed. Similar algorithms inspired by the iterative hard thresholding (IHT) algorithm in
CS have also been proposed [35], [79].
⁶ By ε-approximation, we mean ‖M̂ − M*‖_F ≤ ε, where M̂ is the reconstructed matrix and M* is the optimal solution of SVT.
3) Iteratively Reweighted Least Squares (IRLS) Minimization: Yet another simple and computationally efficient way to solve the NNM problem is the IRLS minimization technique [36], [37]. In essence, the NNM problem can be recast as the least squares minimization

  min_{X,W}  ‖W^{1/2} X‖²_F
  subject to P_Ω(X) = P_Ω(M),   (22)

where W = (XX^T)^{−1/2}. It can be shown that (22) is equivalent to the NNM problem (11) since we have [36]

  ‖X‖_* = tr((XX^T)^{1/2}) = ‖W^{1/2} X‖²_F.   (23)
The key idea of the IRLS technique is to find X and W in an iterative fashion. The update expressions are

  X_k = arg min_{P_Ω(X)=P_Ω(M)} ‖W^{1/2}_{k−1} X‖²_F,   (24a)
  W_k = (X_k X_k^T)^{−1/2}.   (24b)
Note that the weighted least squares subproblem (24a) can be easily solved by updating each column of X_k separately [36]. In order to compute W_k, we need the matrix inversion in (24b). To avoid ill-behavior when some of the singular values of X_k approach zero, an approach using a perturbation of the singular values has been proposed [36], [37]. Similar to SVT, the per-iteration computational complexity of the IRLS-based technique is O(r n1 n2). Also, IRLS requires O(log(1/ε)) iterations to achieve the ε-approximation solution. We summarize the IRLS minimization technique in Table II.
In many applications such as localization in IoT networks, recommendation systems, and image restoration, we encounter situations where the rank of the desired matrix is known in advance. As mentioned, the rank of a Euclidean distance matrix in a localization problem is at most k + 2 (k is the dimension of the Euclidean space). In this situation, the LRMC problem can be formulated as a Frobenius norm minimization (FNM) problem:
  min_X  (1/2)‖P_Ω(M) − P_Ω(X)‖²_F
  subject to rank(X) ≤ r.   (25)
Table II. The IRLS minimization technique:

  Input: a constant q ≥ r, a scaling parameter γ > 0, and a stopping criterion T
  Initialize: iteration counter k = 0, a regularizing sequence ε₀ = 1, and W₀ = I
  While T = false do
    k = k + 1
    X_k = arg min_{P_Ω(X)=P_Ω(M)} ‖W^{1/2}_{k−1} X‖²_F
    ε_k = min(ε_{k−1}, γ σ_{q+1}(X_k))
    Compute an SVD-perturbed version X̃_k of X_k [36]
    W_k = (X̃_k X̃_k^T)^{−1/2}
  End
  Output X_k
Since the rank constraint is an inequality, an approach using approximate rank information (e.g., an upper bound on the rank) has been proposed [43]. The FNM problem has two main advantages: 1) the problem is well-posed in the noisy scenario, and 2) the cost function is differentiable, so that various gradient-based optimization techniques (e.g., gradient descent, conjugate gradient, Newton methods, and manifold optimization) can be used to solve the problem.
Over the years, various techniques to solve the FNM problem in (25) have been proposed [43],
[44], [45], [46], [47], [48], [49], [50], [51], [57]. Performance guarantees for the FNM-based techniques have also been provided [59], [60], [61]. It has been shown that under suitable conditions on the sampling ratio p = |Ω|/(n1 n2) and the largest coherence µ0 of M (see the definition in Subsection III-A2), the gradient-based algorithms globally converge to M with high probability [60]. Well-known FNM-based LRMC techniques include greedy techniques
[43], alternating projection techniques [45], and optimization over Riemannian manifold [50].
In this subsection, we explain these techniques in detail.
1) Greedy Techniques: In recent years, greedy algorithms have been popularly used for LRMC due to their computational simplicity. In a nutshell, they solve the LRMC problem by making a heuristic decision at each iteration in the hope of finding the right solution in the end.
Let r be the rank of a desired low-rank matrix M ∈ R^{n×n} and M = UΣV^T be its singular value decomposition. Then M can be expressed as a linear combination of r rank-one matrices. The main task of greedy techniques is to investigate the atom set A_M = {φ_i = u_i v_i^T}^r_{i=1} of rank-one matrices representing M. Once the atom set A_M is found, the singular values σ_i(M) = σ_i can be computed easily by solving the following problem:
  (σ₁, · · · , σ_r) = arg min_{α_i} ‖P_Ω(M) − P_Ω(Σ^r_{i=1} α_i φ_i)‖_F.   (27)

To be specific, let A = [vec(P_Ω(φ₁)) · · · vec(P_Ω(φ_r))], α = [α₁ · · · α_r]^T, and b = vec(P_Ω(M)). Then we have (σ₁, · · · , σ_r) = arg min_α ‖b − Aα‖₂ = A†b.
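The pseudo-inverse computation above takes only a few lines of NumPy; the helper name and the full-observation demo below are ours.

```python
import numpy as np

def fit_singular_values(M_obs, mask, atoms):
    """Solve (27): refit the weights of given rank-one atoms by least
    squares on the observed entries, i.e., alpha = A^dagger b."""
    idx = mask.astype(bool)
    A = np.column_stack([phi[idx] for phi in atoms])   # vec(P_Omega(phi_i))
    b = M_obs[idx]                                     # vec(P_Omega(M))
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha

# demo: with all entries observed, the weights recover the singular values
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
atoms = [np.outer(U[:, i], Vt[i]) for i in range(3)]
alpha = fit_singular_values(M, np.ones_like(M), atoms)
print(np.allclose(alpha, s[:3]))   # True
```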
One popular greedy technique is atomic decomposition for minimum rank approximation (ADMiRA) [43], which can be viewed as an extension of the compressive sampling matching pursuit (CoSaMP) algorithm in CS [38], [39], [40], [41]. ADMiRA employs a strategy of adding as well as pruning to identify the atom set A_M. In the adding stage, ADMiRA identifies the 2r rank-one matrices that best represent the residual and then adds them to the pre-chosen atom set. Specifically, if X_{i−1} is the output matrix generated in the (i−1)-th iteration and A_{i−1} is its atom set, then ADMiRA computes the residual R_i = P_Ω(M) − P_Ω(X_{i−1}) and adds the 2r leading principal components of R_i to A_{i−1}. In other words, the enlarged atom set Ψ_i is given by
  Ψ_i = A_{i−1} ∪ {u_{R_i,j} v^T_{R_i,j} : 1 ≤ j ≤ 2r},   (28)

where u_{R_i,j} and v_{R_i,j} are the j-th principal left and right singular vectors of R_i, respectively.
Note that Ψ_i contains at most 3r elements. In the pruning stage, ADMiRA refines Ψ_i into a set of r atoms. To be specific, if X̃_i is the best rank-3r approximation of M, i.e.,⁷

  X̃_i = arg min_{X∈span(Ψ_i)} ‖P_Ω(M) − P_Ω(X)‖_F,   (29)

then the pruned atom set A_i consists of the r leading rank-one atoms of X̃_i, i.e., A_i = {u_{X̃_i,j} v^T_{X̃_i,j} : 1 ≤ j ≤ r}.   (30)

⁷ Note that the solution to (29) can be computed in a similar way as in (27).
Table III. The ADMiRA algorithm:

  While T = false do
    R_k = P_Ω(M) − P_Ω(X_k)
    [U_{R_k}, Σ_{R_k}, V_{R_k}] = svds(R_k, 2r)
    (Augment) Ψ_{k+1} = A_k ∪ {u_{R_k,j} v^T_{R_k,j} : 1 ≤ j ≤ 2r}
    X̃_{k+1} = arg min_{X∈span(Ψ_{k+1})} ‖P_Ω(M) − P_Ω(X)‖_F   (using (27))
    [U_{X̃_{k+1}}, Σ_{X̃_{k+1}}, V_{X̃_{k+1}}] = svds(X̃_{k+1}, r)
    (Prune) A_{k+1} = {u_{X̃_{k+1},j} v^T_{X̃_{k+1},j} : 1 ≤ j ≤ r}
    X_{k+1} = arg min_{X∈span(A_{k+1})} ‖P_Ω(M) − P_Ω(X)‖_F   (using (27))
    k = k + 1
  End
  Output A_k, X_k
Here, u_{X̃_i,j} and v_{X̃_i,j} are the j-th principal left and right singular vectors of X̃_i, respectively.
The computational complexity of ADMiRA is mainly due to two operations: the least squares operation in (27) and the SVD-based operation to find the leading atoms of the required matrices (e.g., R_k and X̃_{k+1}). First, since (27) involves the pseudo-inverse of A (of size |Ω| × O(r)), its computational cost is O(r|Ω|). Second, the computational cost of performing a truncated SVD of O(r) atoms is O(r n1 n2). Since |Ω| < n1 n2, the computational complexity of ADMiRA per iteration is O(r n1 n2). Also, the number of iterations of ADMiRA to achieve the ε-approximation is O(log(1/ε)) [43]. In Table III, we summarize the ADMiRA algorithm.
Yet another well-known greedy method is the rank-one matrix pursuit algorithm [44], an extension of the orthogonal matching pursuit algorithm in CS [42]. In this approach, instead of choosing multiple atoms at a time, a single atom corresponding to the largest singular value of the residual matrix R_k is chosen in each iteration.
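A compact sketch of this greedy loop is given below; the function name and the choice to refit all weights via (27) after every new atom are our own, and the full SVD stands in for the truncated SVD a practical implementation would use.

```python
import numpy as np

def rank_one_pursuit(M_obs, mask, r):
    """Greedy sketch: pick the top singular pair of the residual on Omega,
    then refit all atom weights by the least squares problem (27)."""
    idx = mask.astype(bool)
    X = np.zeros_like(M_obs)
    atoms = []
    for _ in range(r):
        R = np.where(idx, M_obs - X, 0.0)                # residual on Omega
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        atoms.append(np.outer(U[:, 0], Vt[0]))           # new rank-one atom
        A = np.column_stack([a[idx] for a in atoms])     # vec(P_Omega(phi_i))
        alpha, *_ = np.linalg.lstsq(A, M_obs[idx], rcond=None)
        X = sum(w * a for w, a in zip(alpha, atoms))
    return X
```

With full observation, r iterations reproduce the best rank-r approximation of a rank-r matrix exactly, which is a useful sanity check.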
2) Alternating Minimization Techniques: Many LRMC algorithms [33], [43] require the computation of a (partial) SVD to obtain the singular values and vectors (at a cost of O(rn²)). As an effort to further reduce the computational burden of the SVD, alternating minimization techniques have been proposed [45], [46], [47]. The basic premise behind the alternating minimization techniques is that a low-rank matrix M ∈ R^{n1×n2} of rank r can be factorized into tall and fat matrices, i.e., M = XY where X ∈ R^{n1×r} and Y ∈ R^{r×n2} (r ≪ n1, n2). The key idea of this approach is to find X and Y minimizing the residual, defined as the difference between the original matrix and its estimate on the sampling space. In other words, they recover X and
Y by solving
  min_{X,Y}  (1/2)‖P_Ω(M) − P_Ω(XY)‖²_F.   (31)

Power factorization, a simple alternating minimization algorithm, finds the solution to (31) by updating X and Y alternately as [45]

  X_k = arg min_X ‖P_Ω(M) − P_Ω(X Y_{k−1})‖²_F,
  Y_k = arg min_Y ‖P_Ω(M) − P_Ω(X_k Y)‖²_F.
Alternating steepest descent (ASD) is another alternating method for finding the solution [46]. The key idea of ASD is to update X and Y by applying the steepest gradient descent method to the objective function f(X, Y) = (1/2)‖P_Ω(M) − P_Ω(XY)‖²_F in (31). Specifically, ASD first computes the gradient of f(X, Y) with respect to X and then updates X along the steepest descent direction:

  X_{i+1} = X_i − t_{x_i} ∇f_{Y_i}(X_i),

where the gradient is ∇f_{Y_i}(X_i) = −(P_Ω(M) − P_Ω(X_i Y_i)) Y_i^T and the exact line-search stepsize is t_{x_i} = ‖∇f_{Y_i}(X_i)‖²_F / ‖P_Ω(∇f_{Y_i}(X_i) Y_i)‖²_F [46]. Y is then updated in the same manner with the roles of X and Y exchanged.
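A minimal NumPy sketch of the ASD iteration follows; the random initialization and fixed iteration count are our own choices, while the stepsizes are the exact line-search values of [46].

```python
import numpy as np

def asd_complete(M_obs, mask, r, max_iter=500, seed=0):
    """Sketch of alternating steepest descent on
    f(X, Y) = 0.5 * ||P_Omega(M - X Y)||_F^2 with exact line search."""
    n1, n2 = M_obs.shape
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n1, r))
    Y = rng.standard_normal((r, n2))
    for _ in range(max_iter):
        R = mask * (M_obs - X @ Y)
        gX = -R @ Y.T                                  # gradient w.r.t. X
        t = np.sum(gX**2) / max(np.sum((mask * (gX @ Y))**2), 1e-32)
        X -= t * gX                                    # exact line-search step
        R = mask * (M_obs - X @ Y)
        gY = -X.T @ R                                  # gradient w.r.t. Y
        t = np.sum(gY**2) / max(np.sum((mask * (X @ gY))**2), 1e-32)
        Y -= t * gY
    return X @ Y
```

Each iteration costs only matrix products with the r-column factors, which is what makes ASD attractive relative to SVD-based methods.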
The low-rank matrix fitting (LMaFit) algorithm finds the solution in a different way by solving [47]

  arg min_{X,Y,Z} {‖XY − Z‖²_F : P_Ω(Z) = P_Ω(M)}.   (37)

With arbitrary inputs X₀ ∈ R^{n1×r} and Y₀ ∈ R^{r×n2} and Z₀ = P_Ω(M), the variables X, Y, and Z are updated in the i-th iteration by alternating least squares: X and Y are updated to fit the current Z, and Z is then refreshed as XY on the unobserved entries while being kept equal to M on Ω [47].
3) Optimization over Smooth Riemannian Manifold: In many applications where the rank of a matrix is known a priori (i.e., rank(M) = r), one can strengthen the constraint of (25) by defining the feasible set, denoted by F, as

  F = {X ∈ R^{n1×n2} : rank(X) = r}.   (38)

Note that F is not a vector space⁸ and thus conventional optimization techniques cannot be used to solve the problem defined over F. While this is bad news, a remedy for this is

⁸ This is because if rank(X) = r and rank(Y) = r, then rank(X + Y) = r is not necessarily true (and thus X + Y need not belong to F).
that F is a smooth Riemannian manifold [53], [48]. Roughly speaking, a smooth manifold is a generalization of R^{n1×n2} on which a notion of differentiability exists. For a more rigorous definition, see, e.g., [55], [56]. A smooth manifold equipped with an inner product, often called a Riemannian metric, forms a smooth Riemannian manifold. Since the smooth Riemannian manifold is a differentiable structure equipped with an inner product, one can use all the ingredients needed to solve an optimization problem with a quadratic cost function, such as the Riemannian gradient, Hessian matrix, exponential map, and parallel translation [55]. Therefore, optimization techniques in R^{n1×n2} (e.g., steepest descent, Newton method, conjugate gradient method) can be used to solve (25) over the smooth Riemannian manifold F.
In recent years, many efforts have been made to solve matrix completion over smooth Riemannian manifolds. These works are classified by their specific choice of Riemannian manifold
structure. One well-known approach is to solve (25) over the Grassmann manifold⁹ of orthogonal matrices [49]. In this approach, the feasible set can be expressed as F = {QR^T : Q^T Q = I, Q ∈ R^{n1×r}, R ∈ R^{n2×r}}, and thus solving (25) amounts to finding an n1 × r orthonormal matrix Q satisfying

  f(Q) = min_{R∈R^{n2×r}} ‖P_Ω(M) − P_Ω(QR^T)‖²_F = 0.   (39)

In [49], an approach to solve (39) over the Grassmann manifold has been proposed.
Recently, it has been shown that the original matrix can be reconstructed by unconstrained optimization over the smooth Riemannian manifold F [50]. Often, F is expressed using the singular value decomposition as

  F = {UΣV^T : U ∈ R^{n1×r}, V ∈ R^{n2×r}, U^T U = V^T V = I_r, Σ = diag(σ₁, · · · , σ_r), σ₁ ≥ · · · ≥ σ_r > 0}.   (40)

The FNM problem (25) can then be reformulated as an unconstrained optimization over F:

  min_{X∈F}  (1/2)‖P_Ω(M) − P_Ω(X)‖²_F.   (41)
One can easily obtain closed-form expressions of the required ingredients, such as tangent spaces, the Riemannian metric, the Riemannian gradient, and the Hessian matrix, for this unconstrained optimization [53], [55], [56]. In fact, the major benefits of the Riemannian optimization-based LRMC techniques are the simplicity of implementation and fast convergence. Similar to ASD, the computational complexity per iteration of these techniques is O(r|Ω| + r²n1 + r²n2), and they require O(log(1/ε)) iterations to achieve the ε-approximation solution [50].

⁹ The Grassmann manifold is defined as the set of linear subspaces of a vector space [55].
4) Truncated NNM: Truncated NNM is a variation of the NNM-based technique requiring the rank information r.¹⁰ While the NNM technique takes into account all the singular values of a desired matrix, truncated NNM considers only the n − r smallest singular values [57]. Specifically, truncated NNM finds a solution to

  min_X  ‖X‖_r
  subject to P_Ω(X) = P_Ω(M),   (42)

where ‖X‖_r = Σ^n_{i=r+1} σ_i(X). We recall that σ_i(X) is the i-th largest singular value of X. Using [57]

  Σ^r_{i=1} σ_i = max_{U^T U = V^T V = I_r} tr(U^T X V),   (43)

we have ‖X‖_r = ‖X‖_* − max_{U^T U = V^T V = I_r} tr(U^T X V).

¹⁰ Although truncated NNM is a variant of NNM, we put it into the second category since it exploits the rank information of a low-rank matrix.
To solve the resulting problem, optimization techniques such as the alternating direction method of multipliers (ADMM) [82] and the accelerated proximal gradient line search method (APGL) [83] can be employed. Note also that the dominant operation is the truncated SVD, whose complexity is O(r n1 n2); this is much smaller than that of the NNM technique (see Table V) as long as r ≪ min(n1, n2). Similar to SVT, the iteration complexity of truncated NNM to achieve the ε-approximation is O(1/√ε) [57]. Alternatively, the difference of two convex functions (DC) based algorithm can be used to solve (45) [58]. In Table IV, we summarize the truncated NNM algorithm.
In this section, we study the main principles that make the recovery of a low-rank matrix
possible and discuss how to exploit a special structure of a low-rank matrix in algorithm design.
A. Intrinsic Properties
There are two key properties characterizing the LRMC problem: 1) sparsity of the observed
entries and 2) incoherence of the matrix. Sparsity indicates that an accurate recovery of the
undersampled matrix is possible even when the number of observed entries is very small.
Incoherence indicates that nonzero entries of the matrix should be spread out widely for the
efficient recovery of a low-rank matrix. In this subsection, we go over these issues in detail.
1) Sparsity of Observed Entries: Sparsity expresses the idea that when a matrix has a low-rank property, it can be recovered using only a small number of observed entries. A natural question arising from this is: how many entries do we need to observe for accurate recovery of the matrix? In order to answer this question, we need the notion of the degrees of freedom (DOF). The DOF of a matrix is the number of freely chosen variables in the matrix. One can easily see that the DOF of the rank-one matrix in (1) is 3, since the remaining entry is determined after observing three. As another example, consider the following rank-one matrix:
  M = [ 1  3  5  7
        2  6 10 14
        3  9 15 21
        4 12 20 28 ].   (47)
One can easily see that if we observe all entries of one column and one row, then the rest can be determined by the simple linear relationships between them, since M is a rank-one matrix. Specifically, suppose we observe the first row and the first column. The second column is the first column scaled by three, so once we know one entry of the second column, the rest of that column is recovered; the same argument applies to the remaining columns. Thus, the DOF of M is 4 + 4 − 1 = 7. The following lemma generalizes this observation.
Lemma 2. The DOF of a square n × n matrix with rank r is 2nr − r². Also, the DOF of an n1 × n2 matrix with rank r is (n1 + n2)r − r².

Proof: Since the rank of the matrix is r, we can freely choose the values of all entries of r columns, resulting in nr degrees of freedom for the first r columns. Once r independent columns, say m₁, · · · , m_r, are constructed, each of the remaining n − r columns can be expressed as a linear combination of the first r columns (e.g., m_{r+1} = α₁m₁ + · · · + α_r m_r), so that r linear coefficients (α₁, · · · , α_r) can be freely chosen for each of these columns. By adding nr and (n − r)r, we obtain the desired result. The generalization to an n1 × n2 matrix is straightforward.
This lemma says that if n is large and r is small enough (e.g., r = O(1)), essential information
in a matrix is just in the order of n, DOF= O(n), which is clearly much smaller than the total
number of entries of the matrix. Interestingly, the DOF is the minimum number of observed
entries required for the recovery of a matrix. If this condition is violated, that is, if the number of
observed entries is less than the DOF (i.e., m < 2nr − r 2 ), no algorithm whatsoever can recover
the matrix. In Fig. 5, we illustrate how to recover the matrix when the number of observed entries equals the DOF.

Fig. 5. LRMC with colored entries being observed. The dotted boxes are used to compute: (a) the linear coefficients and (b) the unknown entries.

In this figure, we assume that the blue-colored entries are observed.¹¹ In a nutshell, the unknown entries of the matrix are found in a two-step process. First, we identify the
linear relationship between the first r columns and the rest. For example, the (r + 1)-th column
can be expressed as a linear combination of the first r columns. That is,
mr+1 = α1 m1 + · · · + αr mr . (48)
Since the first r entries of m₁, · · · , m_{r+1} are observed (see Fig. 5(a)), we have r unknowns (α₁, · · · , α_r) and r equations, so we can identify the linear coefficients α₁, · · · , α_r at the O(r³) computational cost of an r × r matrix inversion. Once these coefficients are identified, we can recover the unknown entries of m_{r+1} using the linear relationship in (48) (see Fig. 5(b)). By repeating this step for the rest of the columns, we can identify all unknown entries with O(rn²) computational complexity.¹²
¹¹ Since we observe the first r rows and columns, we have 2nr − r² observations in total.
¹² For each unknown entry, we need r multiplications and r − 1 additions. Since the number of unknown entries is (n − r)², the computational cost is (2r − 1)(n − r)². Recall that O(r³) is the cost of computing (α₁, · · · , α_r) in (48). Thus, the total cost is O(r³ + (2r − 1)(n − r)²) = O(rn²).
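The two-step procedure above can be sketched directly; the function and demo below are illustrative (names ours), and we assume the leading r × r block of observed entries is invertible.

```python
import numpy as np

def recover_from_cross(M_obs, r):
    """Two-step recovery when the first r rows and columns are fully
    observed: solve for the linear coefficients (48), then fill in the
    unknown entries of each remaining column."""
    n1, n2 = M_obs.shape
    X = M_obs.copy()
    A = M_obs[:r, :r]                 # observed r x r corner (assumed invertible)
    for j in range(r, n2):
        alpha = np.linalg.solve(A, M_obs[:r, j])   # coefficients in (48)
        X[r:, j] = M_obs[r:, :r] @ alpha           # unknown entries of column j
    return X

# demo: rank-2 matrix, only the first 2 rows/columns observed (DOF entries)
rng = np.random.default_rng(1)
M = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))
M_obs = np.zeros_like(M)
M_obs[:2, :] = M[:2, :]
M_obs[:, :2] = M[:, :2]
print(np.allclose(recover_from_cross(M_obs, 2), M))   # True
```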
Fig. 6. LRMC when the (r, l)-th entry is unobserved.
Now, an astute reader might notice that this strategy will not work if even one entry of the column (or row) is unobserved. As illustrated in Fig. 6, if only one entry in the r-th row, say the (r, l)-th entry, is unobserved, then one cannot recover the l-th column, simply because the matrix in Fig. 6 cannot be converted to the matrix form in Fig. 5(b). It is clear from this discussion that a measurement size equal to the DOF is not enough in most cases; in fact, it is just a necessary condition for the accurate recovery of a rank-r matrix. This seems like depressing news. However, the DOF is in any case important since it is a fundamental limit (lower bound) on the number of observed entries needed to ensure the exact recovery of the matrix. Recent results show that the DOF is not much different from the number of measurements ensuring the recovery of the matrix [22], [75].¹³
2) Coherence: If nonzero elements of a matrix are concentrated in a certain region, we
generally need a large number of observations to recover the matrix. On the other hand, if
the matrix is spread out widely, then the matrix can be recovered with a relatively small number
¹³ In [75], it has been shown that the required number of entries to recover the matrix using nuclear-norm minimization is on the order of n^1.2 when the rank is O(1).
Fig. 7. Coherence of the matrices in (52) and (53): (a) maximum and (b) minimum.
of entries. For example, consider the following two rank-one matrices in R^{n×n}:

  M₁ = [ 1 1 0 · · · 0
         1 1 0 · · · 0
         0 0 0 · · · 0
         · · ·
         0 0 0 · · · 0 ],

  M₂ = [ 1 1 1 · · · 1
         1 1 1 · · · 1
         · · ·
         1 1 1 · · · 1 ].
The matrix M₁ has only four nonzero entries, located in its top-left corner. Suppose n is large, say n = 1000, and all entries but the four elements in the top-left corner are observed (99.99% of the entries are known). In this case, even though the rank of the matrix is just one, there is no way to recover it, since the information-bearing entries are missing. This tells us that even when the rank of a matrix is very small, one might not be able to recover it if the nonzero entries are concentrated in a certain area.

In contrast to M₁, one can accurately recover the matrix M₂ with only 2n − 1 (= DOF) known entries; in other words, one row and one column are enough to recover M₂. One can deduce from this example that the spread of the entries is important for the identification of the unknown entries.
In order to quantify this, we need to measure the concentration of a matrix. Since a matrix has a two-dimensional structure, we need to check the concentration in both the row and column directions. This can be done by checking the concentration of the left and right singular vectors. Recall that the SVD of a matrix is

  M = UΣV^T = Σ^r_{i=1} σ_i u_i v_i^T,   (49)

where U = [u₁ · · · u_r] and V = [v₁ · · · v_r] are the matrices constructed from the left and right singular vectors, respectively, and Σ is the diagonal matrix whose diagonal entries are σ_i.
From (49), we see that the concentration in the vertical direction (concentration in the rows) is determined by u_i, and that in the horizontal direction (concentration in the columns) is determined by v_i. For example, if one of the standard basis vectors e_i, say e₁ = [1 0 · · · 0]^T, lies in the space spanned by u₁, · · · , u_r while the others (e₂, e₃, · · · ) are orthogonal to this space, then it is clear that the nonzero entries of the matrix are only in the first row. In this case, one clearly cannot infer the entries of the first row from samples of the other rows. That is, it is not possible to recover the matrix without observing the entire first row.
The coherence, a measure of the concentration of a matrix, is formally defined as [75]

  µ(U) = (n/r) max_{1≤i≤n} ‖P_U e_i‖²,   (50)

where e_i is the i-th standard basis vector and P_U is the projection onto the range space of U. Since the columns of U = [u₁ · · · u_r] are orthonormal, we have

  P_U = UU^T.   (51)

Note that both µ(U) and µ(V) should be computed to check the concentration in the vertical and horizontal directions. One can show that 1 ≤ µ(U) ≤ n/r.
Proof: The upper bound follows by noting that the ℓ₂-norm of a projection is not greater than that of the original vector (‖P_U e_i‖² ≤ ‖e_i‖² = 1). The lower bound holds because

  max_i ‖P_U e_i‖² ≥ (1/n) Σ^n_{i=1} ‖P_U e_i‖²
                   = (1/n) Σ^n_{i=1} e_i^T P_U e_i
                   = (1/n) Σ^n_{i=1} e_i^T UU^T e_i
                   = (1/n) Σ^n_{i=1} Σ^r_{j=1} |u_{ij}|²
                   = r/n,

where the first equality is due to the idempotency of P_U (i.e., P_U^T P_U = P_U) and the last equality is because Σ^n_{i=1} |u_{ij}|² = 1.
The coherence is maximized when the nonzero entries of a matrix are concentrated in one row (or column). For example, consider the matrix whose nonzero entries are concentrated in the first row:

  M = [ 3 2 1
        0 0 0
        0 0 0 ].   (52)

The SVD of M is

  M = σ₁ u₁ v₁^T = 3.7417 [1 0 0]^T [0.8018 0.5345 0.2673].

Then U = [1 0 0]^T, and thus ‖P_U e₁‖² = 1 and ‖P_U e₂‖² = ‖P_U e₃‖² = 0. As shown in Fig. 7(a), the standard basis vector e₁ lies in the space spanned by U while the others are orthogonal to this space, so the maximum coherence is achieved (max_i ‖P_U e_i‖² = 1 and µ(U) = 3).
In contrast, the coherence is minimized (µ(U) = 1) when the nonzero entries of the matrix are spread out evenly, as in the all-ones matrix M₂ above (see Fig. 7(b)).
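As a quick numerical check of (50) (the helper name is ours):

```python
import numpy as np

def coherence(U):
    """mu(U) = (n/r) * max_i ||P_U e_i||^2; with orthonormal columns,
    ||P_U e_i||^2 = ||U^T e_i||^2 is the squared norm of the i-th row of U."""
    n, r = U.shape
    return (n / r) * np.max(np.sum(U**2, axis=1))

# matrix (52): all mass in one row -> maximum coherence mu(U) = n/r = 3
U, _, _ = np.linalg.svd(np.array([[3., 2., 1.], [0., 0., 0.], [0., 0., 0.]]))
print(coherence(U[:, :1]))               # ~3.0 (maximum)

# evenly spread mass (all-ones structure) -> minimum coherence mu(U) = 1
print(coherence(np.full((4, 1), 0.5)))   # ~1.0 (minimum)
```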
In many practical situations the matrix has a certain structure, and we want to make the most of that structure to improve performance and reduce computational complexity. We go over several cases, including LRMC of the PSD matrix [54], the Euclidean distance matrix [4], and the recommendation matrix [67], and discuss how the special structure can be exploited in algorithm design.
1) Low-Rank PSD Matrix Completion: In some applications, a desired matrix M ∈ R^{n×n} not only has a low-rank structure but also is positive semidefinite (i.e., M = M^T and z^T Mz ≥ 0 for any vector z). In this case, the problem of recovering M can be formulated as

  min_X  rank(X)
  subject to P_Ω(X) = P_Ω(M),
             X = X^T, X ⪰ 0.   (54)

Similar to the rank minimization problem (10), the problem (54) can be relaxed using the nuclear norm, and the relaxed problem can be solved via SDP solvers.
The problem (54) can be simplified if the rank of the desired matrix is known in advance. Let rank(M) = k. Then, since M is positive semidefinite, there exists a matrix Z ∈ R^{n×k} such that M = ZZ^T. Using this, the problem (54) can be concisely expressed as

  min_{Z∈R^{n×k}}  (1/2)‖P_Ω(M) − P_Ω(ZZ^T)‖²_F.   (55)
Since (55) is an unconstrained optimization problem with a differentiable cost function, many
gradient-based techniques such as steepest descent, conjugate gradient, and Newton methods can
be applied. It has been shown that under suitable conditions of the coherence property of M and
the number of the observed entries |Ω|, the global convergence of gradient-based algorithms is
guaranteed [59].
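A gradient descent sketch for (55) follows; the initialization scale and the spectral-norm-based stepsize rule are our own heuristics, not taken from a specific reference.

```python
import numpy as np

def psd_complete(M_obs, mask, k, max_iter=3000, seed=0):
    """Sketch of gradient descent on f(Z) = 0.5*||P_Omega(M - Z Z^T)||_F^2;
    for a symmetric mask the gradient is -2 * (mask * (M - Z Z^T)) @ Z."""
    n = M_obs.shape[0]
    rng = np.random.default_rng(seed)
    Z = 0.1 * rng.standard_normal((n, k))     # small random initialization
    step = 0.25 / np.linalg.norm(M_obs, 2)    # heuristic stepsize ~ 1/sigma_1
    for _ in range(max_iter):
        R = mask * (M_obs - Z @ Z.T)          # residual on observed entries
        Z = Z + 2.0 * step * R @ Z            # descend along -grad f
    return Z @ Z.T
```

By construction the output ZZ^T is symmetric and positive semidefinite, so the structural constraints of (54) are satisfied automatically.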
  subject to rank(D) ≤ k + 2,
             D = D^T,   (57)
             −(I_n − (1/n)hh^T) D (I_n − (1/n)hh^T) ⪰ 0,

where h is the n-dimensional all-ones vector.
Let Y = ZZ^T, where Z = [z₁ · · · z_n]^T ∈ R^{n×k} is the matrix of sensor locations. Then one can easily check that

  D = diag(Y)h^T + h diag(Y)^T − 2Y,   (58)

where diag(Y) is the column vector of the diagonal entries of Y. Thus, by letting g(Y) = diag(Y)h^T + h diag(Y)^T − 2Y, the problem in (57) can be equivalently formulated as

  min_Y  ‖P_Ω(g(Y)) − P_Ω(M)‖²_F
  subject to Y = Y^T, Y ⪰ 0.   (59)

Since the feasible set associated with the problem in (59) is a smooth Riemannian manifold [53], [54], an extension of the Euclidean space on which a notion of differentiation exists [55], [56], various gradient-based optimization techniques such as steepest descent, the Newton method, and conjugate gradient algorithms can be applied to solve (59) [3], [4], [55].
The adjacency matrix W_r ∈ R^{n1×n1} of the row graph G_r is defined in a similar way.
CNN-based LRMC: Let U ∈ R^{n1×r} and V ∈ R^{n2×r} be matrices such that M = UV^T. The primary task of the CNN-based approach is to find functions f_r and f_c mapping the vertex sets of the row and column graphs G_r and G_c of M to U and V, respectively. Here, each vertex of G_r (resp. G_c) is mapped to a row of U (resp. V) by f_r (resp. f_c). Since it is difficult to express f_r and f_c explicitly, we learn these nonlinear mappings using CNN-based models. In the CNN-based LRMC approach, U and V are initialized at random and updated in
each iteration. Specifically, U and V are updated to minimize the following loss function [67]:
  l(U, V) = Σ_{(i,j): w^r_{ij}=1} ‖u_i − u_j‖² + Σ_{(i,j): w^c_{ij}=1} ‖v_i − v_j‖²
            + (τ/2) ‖P_Ω(Σ^r_{i=1} u_i v_i^T) − P_Ω(M)‖²_F,   (61)

where τ is a regularization parameter. In other words, we find U and V such that the Euclidean distance between connected vertices is minimized (see the terms ‖u_i − u_j‖² (w^r_{ij} = 1) and ‖v_i − v_j‖² (w^c_{ij} = 1) in (61)). The update procedures for U and V are [67]:
1) Initialize U and V at random and assign each row of U and V to each vertex of the row
graph Gr and the column graph Gc , respectively.
2) Extract the feature matrices ∆U and ∆V by performing a graph-based convolution
operation on Gr and Gc , respectively.
3) Update U and V using the feature matrices ∆U and ∆V, respectively.
4) Compute the loss function in (61) using the updated U and V and perform backpropagation to update the filter parameters.
5) Repeat the above procedures until the value of the loss function is smaller than a pre-chosen threshold.
One important issue in the CNN-based LRMC approach is how to define a graph-based convolution operation to extract the feature matrices ∆U and ∆V (see the second step). Note that the input data G_r and G_c do not lie on regular lattices like images, and thus a classical CNN cannot be directly applied to G_r and G_c. One possible option is to define the convolution operation in the Fourier domain of the graph. In recent years, CNN models based on the Fourier transform of graph-structured data have been proposed [68], [69], [70], [71], [72]. In [68], an approach using the eigendecomposition of the Laplacian has been proposed. To further reduce the model complexity, CNN models using polynomial filters have been proposed [70], [69], [71]. In essence, the
Fourier transform of a graph can be computed using the (normalized) graph Laplacian. Let R_r be the graph Laplacian of G_r (i.e., R_r = I − D_r^{−1/2} W_r D_r^{−1/2}, where D_r = diag(W_r 1_{n1×1})) [63], and let R_r = Q_r Λ_r Q_r^T be its eigendecomposition. Then the graph Fourier transform F_r(u) of a vertex assigned the vector u is defined as¹⁴ F_r(u) = Q_r^T u, with inverse F_r^{−1}(u′) = Q_r u′.
Let z be the filter used in the convolution; then the output ∆u of the graph-based convolution on a vertex assigned the vector u is defined as [63], [70]

  ∆u = F_r^{−1}(F_r(z) ⊙ F_r(u)) = Q_r diag(F_r(z)) Q_r^T u = Q_r G Q_r^T u,   (65)

where ⊙ is the elementwise product and G = diag(F_r(z)) is the matrix of filter parameters defined in the graph Fourier domain.
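The spectral filtering in (65) can be sketched in a few lines; the helper and the toy 4-cycle graph below are ours.

```python
import numpy as np

def graph_conv(W, u, g):
    """Spectral graph convolution (65): Delta_u = Q diag(g) Q^T u, where Q
    holds the eigenvectors of R = I - D^(-1/2) W D^(-1/2)."""
    d = W.sum(axis=1)
    Dm = np.diag(1.0 / np.sqrt(d))
    R = np.eye(len(d)) - Dm @ W @ Dm      # normalized graph Laplacian
    _, Q = np.linalg.eigh(R)              # graph Fourier basis Q_r
    return Q @ np.diag(g) @ Q.T @ u       # filter g acts in the Fourier domain

# an all-pass filter (g = 1) leaves the signal unchanged since Q Q^T = I
W = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])          # adjacency of a 4-cycle
u = np.array([1., 2., 3., 4.])
print(np.allclose(graph_conv(W, u, np.ones(4)), u))   # True
```

In the CNN models discussed above, the vector g plays the role of the learnable filter parameters G.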
We next update U and V using the feature matrices ∆U and ∆V. In [67], a cascade of a multi-graph CNN followed by a long short-term memory (LSTM) recurrent neural network has been proposed. The computational cost of this approach is O(r|Ω| + r²n1 + r²n2), which is much lower than that of the SVD-based LRMC techniques (i.e., O(r n1 n2)) as long as r ≪ min(n1, n2). Finally, we compute the loss function l(U_i, V_i) in (61) and then update the filter parameters using backpropagation. Suppose {U_i}_i and {V_i}_i converge to Û and V̂, respectively; then the estimate of M obtained by the CNN-based LRMC is M̂ = ÛV̂^T.
¹⁴ One can easily check that F_r^{−1}(F_r(u)) = u and F_r(F_r^{−1}(u′)) = u′.
is the steering vector and b_i ∈ C^{n2} is the vector of normalized coefficients (i.e., ‖b_i‖₂ = 1). We denote the set of such atoms H_i by H. Using H, the atomic norm of X is defined as

  ‖X‖_H = inf{ Σ_i α_i : X = Σ_i α_i H_i, α_i > 0, H_i ∈ H }.   (67)

Note that the atomic norm ‖X‖_H is a generalization of the ℓ₁-norm and also of the nuclear norm to the space of sinusoidal signals [86], [39]. Let X_o be the observation of X; then the problem of reconstructing X can be modeled as the ANM problem:

  min_Z  (1/2)‖Z − X_o‖²_F + τ‖Z‖_H.   (68)
In this section, we study the performance of the LRMC algorithms. In our experiments, we focus on the algorithms listed in Table V. The original matrix is generated as the product of two random matrices A ∈ R^{n1×r} and B ∈ R^{n2×r}, i.e., M = AB^T. The entries a_ij and b_pq of these two matrices are independent and identically distributed random variables sampled from the normal distribution N(0, 1). Sampled elements are also chosen at random. The sampling ratio p is defined as

  p = |Ω| / (n1 n2),

where |Ω| is the cardinality (number of elements) of Ω. In the noisy scenario, we use the additive noise model in which the observed matrix M_o is expressed as M_o = M + N, where the noise matrix N is formed by i.i.d. random entries sampled from the Gaussian distribution N(0, σ²). For a given SNR, σ² = (1/(n1 n2))‖M‖²_F 10^{−SNR/10}. Note that the parameters of each LRMC algorithm are chosen from the reference paper. For each point of each algorithm, we run 1,000 independent trials and then plot the average value.
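The synthetic setup described above can be sketched as follows (the function name is ours):

```python
import numpy as np

def make_problem(n1, n2, r, p, snr_db=None, seed=0):
    """Generate M = A B^T with N(0,1) factors, a uniform random mask with
    sampling ratio p, and optionally add Gaussian noise at the given SNR."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n1, r)) @ rng.standard_normal((n2, r)).T
    mask = (rng.random((n1, n2)) < p).astype(float)
    noisy = M
    if snr_db is not None:
        # sigma^2 = ||M||_F^2 / (n1 n2) * 10^(-SNR/10)
        sigma2 = np.linalg.norm(M, 'fro')**2 / (n1 * n2) * 10 ** (-snr_db / 10)
        noisy = M + np.sqrt(sigma2) * rng.standard_normal((n1, n2))
    return M, mask * noisy, mask
```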
In the performance evaluation of the LRMC algorithms, we use the mean square error (MSE) and the exact recovery ratio, defined respectively as

  MSE = (1/(n1 n2))‖M − M̂‖²_F,
  R = (number of successful trials) / (total trials),

where M̂ is the reconstructed low-rank matrix. We say a trial is successful if the MSE is less than the threshold ε. In our experiments, we set ε = 10⁻⁶. Here, R can be used to represent the probability of successful recovery.
We first examine the exact recovery ratio of the LRMC algorithms in terms of the sampling ratio and the rank of M. In our experiments, we set n1 = n2 = 100 and compute the phase transition [90] of the LRMC algorithms. Note that the phase transition is a contour plot of the success probability P (we set P = 0.5), where the sampling ratio (x-axis) and the rank (y-axis) form a regular grid on the x-y plane. The contour separates the plane into two areas: the area above the curve is the region where P < 0.5, and the area below the curve is the region where P > 0.5 [90] (see Fig. 8). The higher the curve, therefore, the better the algorithm.
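A phase-transition experiment is simply a success-probability estimate on a (sampling ratio, rank) grid. The sketch below (function name and defaults ours) accepts any completion routine `solver(Mo, mask)` and returns the empirical grid; the P = 0.5 contour of this grid is the phase-transition curve.

```python
import numpy as np

def phase_transition(solver, n=100, ranks=(10, 20, 30),
                     ratios=(0.3, 0.5, 0.7), trials=10, eps=1e-6):
    """Empirical success probability P on a (rank, sampling-ratio) grid.
    `solver(Mo, mask)` must return a completed n x n matrix."""
    rng = np.random.default_rng(1)
    P = np.zeros((len(ranks), len(ratios)))
    for i, r in enumerate(ranks):
        for j, p in enumerate(ratios):
            wins = 0
            for _ in range(trials):
                M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
                mask = rng.random((n, n)) < p  # each entry observed w.p. p
                M_hat = solver(np.where(mask, M, 0.0), mask)
                wins += np.linalg.norm(M_hat - M, "fro") ** 2 / n**2 < eps
            P[i, j] = wins / trials
    return P
```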
In general, the LRMC algorithms perform poorly when the matrix has a small number of observed entries and the rank is large. Overall, the NNM-based algorithms perform better than the FNM-based algorithms. In particular, the NNM technique using the SDPT3 solver outperforms the rest because the convex optimization technique always finds a global optimum, while the other techniques often converge to a local optimum.

Fig. 8. Phase transition of the LRMC algorithms in the noiseless scenario (NNM using SDPT3, SVT, ADMiRA, TNNR-APGL, and TNNR-ADMM); x-axis: sampling ratio, y-axis: rank.
In order to investigate the computational efficiency of the LRMC algorithms, we measure the running time of each algorithm as a function of the rank (see Fig. 9). The running time is measured in seconds, using a 64-bit PC with an Intel i5-4670 CPU running at 3.4 GHz. We observe that the convex algorithms have relatively long running times.
We next examine the efficiency of the LRMC algorithms for different problem sizes (see Table VI). For the iterative LRMC algorithms, we set the maximum number of iterations to 300. We see that LRMC algorithms such as SVT, IRLS-M, ASD, ADMiRA, and LRGeomCG run fast. For example, it takes less than a minute for these algorithms to reconstruct a 1000 × 1000 matrix, while the running time of the SDPT3 solver is more than 5 minutes. A further reduction of the running time can be achieved using alternating projection-based algorithms such as LMaFit. For example, it takes about one second to reconstruct a (1000 × 1000)-dimensional matrix with rank 5 using LMaFit. Therefore, when exact recovery of the original matrix is unnecessary, the FNM-based technique would be a good choice.
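As a concrete illustration of the FNM-with-rank-constraint approach, the sketch below runs a plain alternating least-squares loop over the factors X and Y. It is our simplified stand-in, not the LMaFit update itself (LMaFit adds a nonlinear successive over-relaxation step on top of this basic idea).

```python
import numpy as np

def als_complete(Mo, mask, r, iters=100, seed=0):
    """Alternating least squares for min ||P_Omega(X Y^T - M)||_F^2.
    Each row of X (resp. Y) is the least-squares fit to its observed entries."""
    n1, n2 = Mo.shape
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n1, r))
    Y = rng.standard_normal((n2, r))
    ridge = 1e-10 * np.eye(r)  # tiny regularizer for numerical safety
    for _ in range(iters):
        for i in range(n1):  # update each row of X with Y fixed
            w = mask[i]
            X[i] = np.linalg.solve(Y[w].T @ Y[w] + ridge, Y[w].T @ Mo[i, w])
        for j in range(n2):  # update each row of Y with X fixed
            w = mask[:, j]
            Y[j] = np.linalg.solve(X[w].T @ X[w] + ridge, X[w].T @ Mo[w, j])
    return X @ Y.T
```

On an exactly rank-r matrix with enough observed entries per row and column, this loop typically drives the reconstruction error toward numerical zero, which is why factorization-based methods dominate the large-scale regime in Table VI.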
Fig. 9. Running times of LRMC algorithms in noiseless scenario (40% of entries are observed).
In the noisy scenario, we also observe that the FNM-based algorithms perform well (see Figs. 10 and 11). In this experiment, we compute the MSE of the LRMC algorithms against the rank of the original low-rank matrix for different SNR settings (SNR = 20 and 50 dB). We observe that in the low and mid SNR regime (e.g., SNR = 20 dB), TNNR-ADMM performs comparably to the NNM-based algorithms since the FNM-based cost function is robust to noise. In the high SNR regime (e.g., SNR = 50 dB), the convex algorithm (NNM with SDPT3) exhibits the best performance in terms of the MSE, and the performance of TNNR-ADMM is notably better than that of the remaining LRMC algorithms. For example, given rank(M) = 20, the MSE of TNNR-ADMM is around 0.04, while the MSE of the rest is higher than 1.
Finally, we apply the LRMC techniques to recover images corrupted by impulse noise. In this experiment, we use 256 × 256 standard grayscale images (boat, cameraman, lena, and pepper) and the salt-and-pepper noise model with noise densities ρ = 0.3, 0.5, and 0.7. For the FNM-based LRMC techniques, the rank is given by the number of singular values σi greater than a relative threshold ε > 0, i.e., σi > ε max_i σi. From the simulation results, we observe that the peak SNR (pSNR), defined as the ratio of the maximum pixel value of the image to the noise variance, of all the LRMC techniques is at least 52 dB when ρ = 0.3 (see Table VII). In particular, NNM using SDPT3, SVT, and IRLS-M outperform the rest, achieving pSNR ≥ 57 dB.
Fig. 10. MSE performance of LRMC algorithms in noisy scenario with SNR = 20 dB (70% of entries are observed).
Fig. 11. MSE performance of LRMC algorithms in noisy scenario with SNR = 50 dB (70% of entries are observed).
V. CONCLUDING REMARKS
• When the dimension of a low-rank matrix increases and the computational complexity thus grows significantly, we want an algorithm with a good recovery guarantee whose complexity scales linearly with the problem size. Without doubt, in real-time applications such as IoT localization and massive MIMO, low complexity and short running time are of great importance. The development of implementation-friendly algorithms and architectures would accelerate the dissemination of LRMC techniques.
• Most LRMC techniques assume that the original low-rank matrix is a random matrix whose entries are randomly generated. In many practical situations, however, the entries of the matrix are not purely random but chosen from a finite set of integers. In a recommendation system, for example, each entry (the rating for a product) is chosen from an integer scale (e.g., 1 to 5). Unfortunately, there is no well-known practical guideline or efficient algorithm for the case where the entries of a matrix are chosen from a discrete set. It would be useful to come up with a simple and effective LRMC technique suited for such applications.
• As mentioned, the CNN-based LRMC technique is a useful tool to reconstruct a low-rank matrix. In essence, the unknown entries of a low-rank matrix are recovered based on a graph model of the matrix. Since the observed entries can be considered as labeled training data, this approach can be classified as supervised learning. In many practical scenarios, however, it might not be easy to express the graph model of the matrix precisely since there are various criteria to define the graph edges. In addressing this problem, new deep learning techniques such as generative adversarial networks (GANs) [91], consisting of a generator and a discriminator, would be useful.
APPENDIX A
PROOF OF THE SDP FORM OF NNM
The standard SDP form is

min_Y ⟨C, Y⟩
subject to ⟨Ak, Y⟩ = bk, k = 1, ..., l,   (71)
Y ⪰ 0,

where C is a given matrix, and {Ak}^l_{k=1} and {bk}^l_{k=1} are given sequences of matrices and constants, respectively. To convert the NNM problem in (11) into the standard SDP form in (71), we need a few steps. First, we convert the NNM problem in (11) into the epigraph form

min_{X,t} t
subject to ‖X‖∗ ≤ t,   (72)
P_Ω(X) = P_Ω(M).
Next, we transform the constraints in (72) to generate the standard form in (71). We first consider the inequality constraint ‖X‖∗ ≤ t. Note that ‖X‖∗ ≤ t if and only if there are symmetric matrices W1 ∈ R^{n1×n1} and W2 ∈ R^{n2×n2} such that [21, Lemma 2]

tr(W1) + tr(W2) ≤ 2t   and   [ W1  X ; X^T  W2 ] ⪰ 0.   (73)
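The "if" direction of this characterization can be checked numerically at the tight point t = ‖X‖∗: taking W1 = UΣU^T and W2 = VΣV^T (our choice of feasible witnesses, built from the SVD X = UΣV^T) attains the trace bound with equality and makes the block matrix positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

W1 = U @ np.diag(s) @ U.T      # W1 = U Sigma U^T
W2 = Vt.T @ np.diag(s) @ Vt    # W2 = V Sigma V^T
Y = np.block([[W1, X], [X.T, W2]])

nuclear = s.sum()  # ||X||_* is the sum of singular values
# The trace condition in (73) holds with equality at t = ||X||_*.
assert np.isclose(np.trace(W1) + np.trace(W2), 2 * nuclear)
# Y = [U; V] Sigma [U; V]^T, hence positive semidefinite.
assert np.linalg.eigvalsh(Y).min() >= -1e-8
```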
Then, by denoting

Y = [ W1  X ; X^T  W2 ] ∈ R^{(n1+n2)×(n1+n2)}   and   M̃ = [ 0_{n1×n1}  M ; M^T  0_{n2×n2} ],

where 0_{s×t} is the (s × t)-dimensional zero matrix, and letting {e1, ..., e_{n1+n2}} be the standard ordered basis of R^{n1+n2}, Ak = e_i e^T_{j+n1}, and bk = ⟨M̃, e_i e^T_{j+n1}⟩ for the k-th observed index (i, j) ∈ Ω, the problem in (72) can be reformulated as

min_{Y,t} 2t
subject to tr(Y) ≤ 2t,
Y ⪰ 0,   (77)
⟨Y, Ak⟩ = bk, k = 1, ..., |Ω|.
For example, consider the case where the desired matrix M is given by M = [ 1  2 ; 2  4 ] and the index set of observed entries is Ω = {(2, 1), (2, 2)}. In this case, A1 = e2 e3^T with b1 = m21 = 2, and A2 = e2 e4^T with b2 = m22 = 4.
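The construction of M̃, Ak, and bk for this example can be verified mechanically (variable names ours; indices in Ω are 1-based as in the text):

```python
import numpy as np

n1 = n2 = 2
M = np.array([[1.0, 2.0], [2.0, 4.0]])
Omega = [(2, 1), (2, 2)]  # observed positions (1-indexed)

# M_tilde = [[0, M], [M^T, 0]]
Mt = np.block([[np.zeros((n1, n1)), M], [M.T, np.zeros((n2, n2))]])

E = np.eye(n1 + n2)  # columns are the standard basis e_1, ..., e_4
A, b = [], []
for (i, j) in Omega:
    Ak = np.outer(E[:, i - 1], E[:, j - 1 + n1])  # A_k = e_i e_{j+n1}^T
    A.append(Ak)
    b.append(np.sum(Mt * Ak))  # b_k = <M_tilde, A_k> = m_ij
```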
One can express (77) in a concise form as (13), which is the desired result.
APPENDIX B
PERFORMANCE GUARANTEE OF NNM
Sketch of proof: Exact recovery of the desired low-rank matrix M can be guaranteed under the uniqueness condition of the NNM problem [22], [23], [75]. To be specific, let M = UΣV^T be the SVD of M, where U ∈ R^{n1×r}, Σ ∈ R^{r×r}, and V ∈ R^{n2×r}. Also, let R^{n1×n2} = T ⊕ T^⊥ be the orthogonal decomposition in which T^⊥ is defined as the subspace of matrices whose row and column spaces are orthogonal to the row and column spaces of M, respectively. Here, T is the orthogonal complement of T^⊥. It has been shown that M is the unique solution of the NNM problem if the following conditions hold true [22, Lemma 3.1]:
1) there exists a matrix Y = UV^T + W such that P_Ω(Y) = Y, W ∈ T^⊥, and ‖W‖ < 1;
2) the restriction of the sampling operator P_Ω to T is an injective (one-to-one) mapping.
The construction of Y obeying 1) and 2) is in turn conditioned on the observation model of M and its intrinsic coherence property.
Under a uniform sampling model of M, suppose the coherence property of M satisfies

max(μ(U), μ(V)) ≤ μ0   and   max_{i,j} |eij| ≤ μ1 √(r / (n1 n2)),

where μ0 and μ1 are some constants, eij is the (i, j)-th entry of E = UV^T, and μ(U) and μ(V) are the coherences of the column and row spaces of M, respectively.
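The coherence used in this condition is μ(U) = (n/r) max_i ‖P_U e_i‖², where P_U is the orthogonal projection onto the column space. A small helper (function name ours) computes it from an orthonormal basis:

```python
import numpy as np

def coherence(U):
    """mu(U) = (n / r) * max_i ||P_U e_i||^2 for U with orthonormal columns.
    Ranges from 1 (maximally incoherent) to n / r (maximally coherent)."""
    n, r = U.shape
    leverage = np.sum(U * U, axis=1)  # ||U^T e_i||^2 = ||P_U e_i||^2
    return (n / r) * leverage.max()
```

A basis aligned with the coordinate axes attains the upper bound n/r, which is exactly the regime where uniform sampling is likely to miss the informative entries.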
Theorem 4 ([22, Theorem 1.3]). There exist constants α and β such that if the number of observed entries m = |Ω| satisfies

m ≥ α max(μ1², μ0^{1/2} μ1, μ0 n^{1/4}) γ n r log n,   (80)

where γ > 2 is some constant and n1 = n2 = n, then M is the unique solution of the NNM problem with probability at least 1 − βn^{−γ}. Further, if r ≤ μ0^{−1} n^{1/5}, (80) can be improved to m ≥ C μ0 γ n^{1.2} r log n with the same success probability.
One direct interpretation of this theorem is that the desired low-rank matrix can be recon-
structed exactly using NNM with overwhelming probability even when m is much less than
n1 n2 .
REFERENCES
[1] Netflix Prize. http://www.netflixprize.com
[2] A. Pal, “Localization algorithms in wireless sensor networks: Current approaches and future challenges,” Netw. Protocols
Algorithms, vol. 2, no. 1, pp. 45–74, 2010.
[3] L. Nguyen, S. Kim, and B. Shim, “Localization in Internet of things network: Matrix completion approach,” in Proc.
Inform. Theory Appl. Workshop, San Diego, CA, USA, 2016, pp. 1–4.
[4] L. T. Nguyen, J. Kim, S. Kim, and B. Shim, “Localization of IoT Networks Via Low-Rank Matrix Completion,” to
appear in IEEE Trans. Commun., 2019.
[5] E. J. Candes, Y. C. Eldar, and T. Strohmer, “Phase retrieval via matrix completion,” SIAM Rev., vol. 57, no. 2, pp. 225–251, May 2015.
[6] M. Delamom, S. Felici-Castell, J. J. Perez-Solano, and A. Foster, “Designing an open source maintenance-free
environmental monitoring application for wireless sensor networks,” J. Syst. Softw., vol. 103, pp. 238–247, May 2015.
[7] G. Hackmann, W. Guo, G. Yan, Z. Sun, C. Lu, and S. Dyke, “Cyber-physical codesign of distributed structural health
monitoring with wireless sensor networks,” IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 1, pp. 63–72, Jan. 2014.
[8] W. S. Torgerson, “Multidimensional scaling: I. Theory and method,” Psychometrika, vol. 17, no. 4, pp. 401–419, Dec.
1952.
[9] H. Ji, Y. Kim, J. Lee, E. Onggosanusi, Y. Nam, J. Zhang, B. Lee, and B. Shim, “Overview of full-dimension MIMO in
LTE-advanced pro,” IEEE Commun. Mag., vol. 55, no. 2, pp. 176–184, Feb. 2017.
[10] T. L. Marzetta and B. M. Hochwald, “Fast transfer of channel state information in wireless systems,” IEEE Trans. Signal
Process., vol. 54, no. 4, pp. 1268–1278, Apr. 2006.
[11] F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors, and F. Tufvesson, “Scaling up MIMO:
Opportunities and challenges with very large arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 40–60, Jan. 2013.
[12] W. Shen, L. Dai, B. Shim, S. Mumtaz, and Z. Wang, “Joint CSIT acquisition based on low-rank matrix completion for
FDD massive MIMO systems,” IEEE Commun. Lett., vol. 19, no. 12, pp. 2178–2181, Dec. 2015.
[13] T. S. Rappaport et al., “Millimeter wave mobile communications for 5G cellular: It will work!,” IEEE Access, vol. 1,
no. 1, pp. 335–349, May 2013.
[14] X. Li, J. Fang, H. Li, H. Li, and P. Wang, “Millimeter wave channel estimation via exploiting joint sparse and low-rank
structures,” IEEE Trans. Wireless Commun., vol. 17, no. 2, pp. 1123–1133, Feb. 2018.
[15] Y. Shi, J. Zhang, and K. B. Letaief, “Low-rank matrix completion for topological interference management by Riemannian
pursuit,” IEEE Trans. Wireless Commun., vol. 15, no. 7, pp. 4703-4717, Jul. 2016.
[16] Y. Shi, B. Mishra, and W. Chen, “Topological interference management with user admission control via Riemannian
optimization,” IEEE Trans. Wireless Commun., vol. 16, no. 11, pp. 7362-7375, Nov. 2017.
[17] Y. Shi, J. Zhang, W. Chen, and K. B. Letaief, “Generalized sparse and low-rank optimization for ultra-dense networks,”
IEEE Commun. Mag., vol. 56, no. 6, pp. 42-48, Jun., 2018.
[18] G. Sridharan and W. Yu, “Linear Beamforming Design for Interference Alignment via Rank Minimization,” IEEE Trans.
Signal Process., vol. 63, no. 22, pp. 5910-5923, Nov. 2015.
[19] M. Peng, S. Yan, K. Zhang, and C. Wang, “Fog-computing-based radio access networks: issues and challenges,” IEEE
Network, vol. 30, pp. 46-53, July 2016.
[20] K. Yang, Y. Shi, and Z. Ding, “Low-rank matrix completion for mobile edge caching in Fog-RAN via Riemannian
optimization,” in Proc. IEEE Global Communications Conf. (GLOBECOM), Washington, DC, Dec. 2016.
[21] M. Fazel, “Matrix rank minimization with applications,” Ph.D. dissertation, Elec. Eng. Dept., Stanford Univ., Stanford, CA, 2002.
[22] E. J. Candes and B. Recht, “Exact matrix completion via convex optimization,” Found. Comput. Math., vol. 9, no. 6, pp.
717–772, Dec. 2009.
[23] E. J. Candes and T. Tao, “The power of convex relaxation: Near-optimal matrix completion,” IEEE Trans. Inform. Theory,
vol. 56, no. 5, pp. 2053–2080, May 2010.
[24] K. C. Toh, M. J. Todd, and R. H. Tutuncu, “SDPT3 — a MATLAB software package for semidefinite programming,”
Optim. Methods Softw., vol. 11, pp. 545–581, 1999.
[25] J. F. Sturm, “Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones,” Optim. Methods Softw.,
vol. 11, pp. 625–653, 1999.
[26] L. Vandenberghe and S. Boyd, “Semidefinite programming,” SIAM Rev., vol. 38, no. 1, pp. 49–95, 1996.
[27] Y. Zhang, “On extending some primal-dual interior-point algorithms from linear programming to semidefinite program-
ming,” SIAM J. Optim., vol. 8, no. 2, pp. 365–386, 1998.
[28] Y. E. Nesterov and M. Todd, “Primal-dual interior-point methods for self-scaled cones,” SIAM J. Optim., vol. 8, no. 2,
pp. 324–364, 1998.
[29] F. A. Potra and S. J. Wright, “Interior-point methods,” J. Comput. Appl. Math., vol. 124, no. 1-2, pp.281–302, 2000.
[30] L. Vandenberghe, V. R. Balakrishnan, R. Wallin, A. Hansson, and T. Roh, “Interior-point algorithms for semidefinite
programming problems derived from the KYP lemma,” In Positive polynomials in control (pp. 195-238). Berlin,
Heidelberg: Springer, 2005.
[31] F. A. Potra and R. Sheng, “A superlinearly convergent primal-dual infeasible-interior-point algorithm for semidefinite
programming,” SIAM J. Optim., vol. 8, no. 4, pp.1007–1028, 1998.
[32] B. Recht, M. Fazel, and P. A. Parillo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm
minimization,” SIAM Rev., vol. 52, no. 3, pp. 471–501, 2010.
[33] J. F. Cai, E. J. Candes, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM J. Optim.,
vol. 20, no. 4, pp. 1956–1982, Mar. 2010.
[34] T. Blumensath and M. E. Davies, “Iterative hard thresholding for compressed sensing,” Appl. Comput. Harmon. Anal.,
vol. 27, no. 3, pp. 265–274, Nov. 2009.
[35] J. Tanner and K. Wei, “Normalized iterative hard thresholding for matrix completion,” SIAM J. Sci. Comput., vol. 35,
no. 5, pp. S104–S125, Oct. 2013.
[36] M. Fornasier, H. Rauhut, and R. Ward, “Low-rank matrix recovery via iteratively reweighted least squares minimization,”
SIAM J. Optim., vol. 21, no. 4, pp. 1614–1640, Dec. 2011.
[37] K. Mohan, and M. Fazel, “Iterative reweighted algorithms for matrix rank minimization,” J. Mach. Learning Research,
no. 13, pp. 3441–3473, Nov. 2012.
[38] D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inaccurate samples,” Appl. Comput.
Harmon. Anal., vol. 26, no. 3, pp. 301–321, May 2009.
[39] J. W. Choi, B. Shim, Y. Ding, B. Rao, and D. I. Kim, “Compressed sensing for wireless communications: Useful tips
and tricks,” IEEE Commun. Surveys Tuts., vol. 19, no. 3, pp. 1527–1550, Feb. 2017.
[40] S. Kwon, J. Wang, and B. Shim, “Multipath matching pursuit,” IEEE Trans. Inform. Theory, vol. 60, no. 5, pp. 2986–3001,
Mar. 2014.
[41] J. Wang, S. Kwon, and B. Shim, “Generalized orthogonal matching pursuit,” IEEE Trans. Signal Process., vol. 60, no.
12, pp. 6202–6216, Sep. 2012.
[42] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE
Trans. Inform. Theory, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.
[43] K. Lee and Y. Bresler, “ADMiRA: Atomic decomposition for minimum rank approximation,” IEEE Trans. Inform. Theory,
vol. 56, no. 9, pp. 4402–4416, Sept. 2010.
[44] Z. Wang, M-J. Lai, Z. Lu, W. Fan, H. Davulcu, and J. Ye, “Rank-one matrix pursuit for matrix completion,” in Proc.
Int. Conf. Mach. Learn., Beijing, China, 2014, pp. 91–99.
[45] J. P. Haldar and D. Hernando, “Rank-constrained solutions to linear matrix equations using power factorization,” IEEE
Signal Process. Lett., vol. 16, no. 7, pp. 584–587, Jul. 2009.
[46] J. Tanner and K. Wei, “Low rank matrix completion by alternating steepest descent methods,” Appl. Comput. Harmon.
Anal., vol. 40, no. 2, pp. 417–429, Mar. 2016.
[47] Z. Wen, W. Yin, and Y. Zhang, “Solving a low-rank factorization model for matrix completion by a nonlinear successive
over-relaxation algorithm,” Math. Prog. Comput., vol. 4, no. 4, pp. 333–361, Dec. 2012.
[48] B. Mishra, G. Meyer, S. Bonnabel, and R. Sepulchre, “Fixed-rank matrix factorizations and Riemannian low-rank optimization,” Comput. Statist., vol. 29, no. 3–4, pp. 591–621, Jun. 2014.
[49] W. Dai and O. Milenkovic, “SET: An algorithm for consistent matrix completion,” in Proc. Int. Conf. Acoust., Speech,
Signal Process., Dallas, Texas, USA, 2010, pp. 3646–3649.
[50] B. Vandereycken, “Low-rank matrix completion by Riemannian optimization,” SIAM J. Optim., vol. 23, no. 2, pp.
1214–1236, Jun. 2013.
[51] T. Ngo and Y. Saad, “Scaled gradients on Grassmann manifolds for matrix completion,” in Proc. Adv. Neural Inform.
Process. Syst. Conf., Lake Tahoe, Nevada, USA, 2012, pp. 1412–1420.
[52] J. Dattorro, Convex optimization and Euclidean distance geometry. USA: Meboo, 2005.
[53] U. Helmke and J. B. Moore, Optimization and Dynamical Systems. New York, NY, USA: Springer, 1994.
[54] B. Vandereycken, P.-A. Absil, and S. Vandewalle, “Embedded geometry of the set of symmetric positive semidefinite
matrices of fixed rank,” in Proc. IEEE Workshop Stat. Signal Process., Cardiff, UK, 2009, pp. 389–392.
[55] P. A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds. Princeton, NJ, USA: Princeton
Univ., 2009.
[56] J. M. Lee, Smooth manifolds. New York, NY, USA: Springer, 2003.
[57] Y. Hu, D. Zhan, J. Ye, X. Li, and X. He, “Fast and accurate matrix completion via truncated nuclear norm regularization,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 9, pp. 2117–2130, Sept. 2013.
[58] J. Y. Gotoh, A. Takeda, and K. Tono, “DC formulations and algorithms for sparse optimization problems,” Math.
Programming, pp. 1-36, May 2018.
[59] R. Ge, J. D. Lee, and T. Ma, “Matrix completion has no spurious local minimum,” in Advances Neural Inform. Process.
Syst., pp. 2973-2981, 2016.
[60] R. Ge, C. Jin, and Y. Zheng, “No spurious local minima in nonconvex low rank problems: A unified geometric analysis,”
In Proc. 34th Int. Conf. on Machine Learning, JMLR. org., Aug. 2017, vol. 70, pp. 1233–1242.
[61] S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, A. Singh, and B. Poczos, “Gradient descent can take exponential time to escape
saddle points,” in Advances Neural Inform. Process. Syst., pp. 1067-1077, 2017.
[62] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Trans.
Neural Netw., vol. 20, no. 1, pp. 61–80, Jan. 2009.
[63] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Appl. Comput.
Harmon. Anal., vol. 30, no. 2, pp. 129–150, Mar. 2011.
[64] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, “Autorec: Autoencoders meet collaborative filtering,” in Proc. Int. Conf.
World Wide Web, Florence, Italy, 2015, pp. 111–112.
[65] Y. Zheng, B. Tang, W. Ding, and H. Zhou, “A neural autoregressive approach to collaborative filtering,” in Proc. Int.
Conf. Mach. Learn., New York, NY, USA, 2016, pp. 764–773.
[66] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. S. Chua, “Neural collaborative filtering,” in Proc. Int. Conf. World
Wide Web, Perth, Australia, 2017, pp. 173–182.
[67] F. Monti, M. Bronstein, and X. Bresson, “Geometric matrix completion with recurrent multi-graph neural networks,” in
Proc. Adv. Neural Inform. Process. Syst., Long Beach, CA, USA, 2017, pp. 3700–3710.
[68] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun, “Spectral networks and locally connected networks on graphs,” in Proc.
Int. Conf. Learn. Representations, Banff, Canada, 2014, pp. 1–14.
[69] M. Henaff, J. Bruna, and Y. Lecun, “Deep convolutional networks on graph-structured data,” arXiv:1506.05163, 2015.
[70] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral
filtering,” in Proc. Adv. Neural Inform. Process. Syst., Barcelona, Spain, 2016, pp. 3844–3852.
[71] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint
arXiv:1609.02907, 2016.
[72] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and
manifolds using mixture model cnns,” in Proc. IEEE Conf. Comput. Vision Pattern Recognition, pp. 5115–5124, 2017.
[73] M. A. Davenport and J. Romberg, “An overview of low-rank matrix recovery from incomplete observations,” IEEE J.
Sel. Topics Signal Process., vol. 10, no. 4, pp. 608–622, Jun. 2016.
[74] Y. Chen and Y. Chi, “Harnessing structures in big data via guaranteed low-rank matrix estimation,” IEEE Signal Process.
Mag., vol. 35, no. 4, pp. 14–31, Jul. 2018.
[75] B. Recht, “A simple approach to matrix completion,” J. Mach. Learn. Res., vol. 12, pp. 3413–3430, Dec. 2011.
[76] C. R. Berger, “Double Exponential,” IEEE Trans. Signal Process., vol. 56, no. 5, pp. 1708–1721, 2010.
[77] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[78] P. Combettes and J. C. Pesquet, Proximal splitting methods in signal processing. New York, NY, USA: Springer, 2011.
[79] P. Jain, R. Meka, and I. Dhillon, “Guaranteed rank minimization via singular value projection,” in Proc. Neural Inform.
Process. Syst. Conf., Vancouver, Canada, 2010, pp. 937–945.
[80] M. Tao and X. Yuan, “Recovering low-rank and sparse components of matrices from incomplete and noisy observations,”
SIAM J. Optim., vol. 21, no. 1, pp. 57–81, Jan. 2011.
[81] Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low-rank representation,”
in Proc. Adv. Neural Inform. Process. Syst., Montreal, Canada, 2011, pp. 612–620.
[82] B. S. He, H. Yang, and S. L. Wang, “Alternating direction method with self-adaptive penalty parameters for monotone
variational inequalities,” J. Optim. Theory Appl., vol. 106, no. 2, pp. 337–356, Aug. 2000.
[83] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J.
Imaging Sci., vol. 2, no. 1, pp. 183–202, Mar. 2009.
[84] R. Escalante and M. Raydan, Alternating projection methods. Philadelphia, PA, USA: SIAM, 2011.
[85] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of linear inverse problems,” Found.
Comput. Math., vol. 12, no. 6, pp. 805–849, Dec. 2012.
[86] B. N. Bhaskar, G. Tang, and B. Recht, “Atomic norm denoising with applications to line spectral estimation,” IEEE
Trans. Signal Process., vol. 61, no. 23, pp. 5987–5999, Dec. 2013.
[87] Y. Li and Y. Chi, “Off-the-grid line spectrum denoising and estimation with multiple measurement vectors,” IEEE Trans.
Signal Process., vol. 64, no. 5, pp. 1257–1269, Mar. 2016.
[88] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM Rev., vol. 43, no. 1,
pp. 129–159, Feb. 2001.
[89] N. Rao, P. Shah, and S. Wright, “Forward-backward greedy algorithms for atomic norm regularization,” IEEE Trans.
Signal Process., vol. 63, no. 21, pp. 5798–5811, Nov. 2015.
[90] D. L. Donoho, A. Maleki, and A. Montanari, “Message passing algorithms for compressed sensing,” Proc. Nat. Acad.
Sci., vol. 106, no. 45, pp. 18914–18919, Nov. 2009.
[91] I. Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.
TABLE V. SUMMARY OF THE LRMC ALGORITHMS.

NNM
- Convex Optimization: SDPT3 (CVX) [24]. A solver for conic programming problems. Computational complexity: O(n³); iteration complexity*: O(n^ω log(1/ε))**.
- NNM via Singular Value Thresholding: SVT [33]. An extension of the iterative soft thresholding technique in compressed sensing for LRMC, based on a Lagrange multiplier method. Computational complexity: O(rn²); iteration complexity: O(1/√ε).
- NNM via Singular Value Thresholding: NIHT [35]. An extension of the iterative hard thresholding technique [34] in compressed sensing for LRMC. Computational complexity: O(rn²); iteration complexity: O(log(1/ε)).
- IRLS Minimization: IRLS-M algorithm [36]. Solves the NNM problem by computing the solution of a weighted least squares subproblem in each iteration. Computational complexity: O(rn²); iteration complexity: O(log(1/ε)).

FNM with Rank Constraint
- Greedy Technique: ADMiRA [43]. An extension of the greedy algorithm CoSaMP [38], [39] in compressed sensing for LRMC; uses greedy projection to identify a set of rank-one matrices that best represents the original matrix. Computational complexity: O(rn²); iteration complexity: O(log(1/ε)).
- Alternating Minimization: LMaFit [47]. A nonlinear successive over-relaxation LRMC algorithm based on the nonlinear Gauss-Seidel method. Computational complexity: O(r|Ω| + r²n); iteration complexity: O(log(1/ε)).
- Alternating Minimization: ASD [46]. A steepest descent algorithm for the FNM-based LRMC problem (25). Computational complexity: O(r|Ω| + r²n); iteration complexity: O(log(1/ε)).
- Manifold Optimization: SET [49]. A gradient-based algorithm to solve the FNM problem on a Grassmann manifold. Computational complexity: O(r|Ω| + r²n); iteration complexity: O(log(1/ε)).
- Manifold Optimization: LRGeomCG [50]. A conjugate gradient algorithm over a Riemannian manifold of fixed-rank matrices. Computational complexity: O(r|Ω| + r²n); iteration complexity: O(log(1/ε)).

Truncated NNM
- TNNR-APGL [57]. Solves the truncated NNM problem via an accelerated proximal gradient line search method [83]. Computational complexity: O(rn²); iteration complexity: O(1/√ε).
- TNNR-ADMM [57]. Solves the truncated NNM problem via an alternating direction method of multipliers [80]. Computational complexity: O(rn²); iteration complexity: O(1/√ε).

CNN-based Technique
- CNN-based LRMC algorithm [67]. A gradient-based algorithm to express a low-rank matrix as a graph structure and then apply a CNN to the constructed graph to recover the desired matrix. Computational complexity: O(r|Ω| + r²n); iteration complexity: O(log(1/ε)).

* The number of iterations needed to obtain a reconstructed matrix M̂ satisfying ‖M̂ − M*‖_F ≤ ε, where M* is the optimal solution.
** ω is some positive constant controlling the iteration complexity.
TABLE VI. MSE RESULTS FOR DIFFERENT PROBLEM SIZES, WHERE RANK(M) = 5 AND p = 2 × DOF. Each cell lists MSE / running time (s) / number of iterations.

Algorithm       | n1 = n2 = 50       | n1 = n2 = 500      | n1 = n2 = 1000
NNM using SDPT3 | 0.0072 / 0.6 / 13  | 0.0017 / 74 / 16   | 0.0010 / 354 / 16
SVT             | 0.0154 / 0.4 / 300 | 0.4564 / 10 / 300  | 0.2110 / 32 / 300
NIHT            | 0.0008 / 0.2 / 253 | 0.0039 / 21 / 300  | 0.0019 / 93 / 300
IRLS-M          | 0.0009 / 0.2 / 60  | 0.0033 / 2 / 60    | 0.0025 / 8 / 60
ADMiRA          | 0.0075 / 0.3 / 300 | 0.0029 / 49 / 300  | 0.0016 / 52 / 300
ASD             | 0.0003 / 0.01 / 227| 0.0006 / 2 / 300   | 0.0005 / 8 / 300
LMaFit          | 0.0002 / 0.01 / 241| 0.0002 / 0.5 / 300 | 0.0500 / 1 / 300
SET             | 0.0678 / 11 / 9    | 0.0260 / 136 / 8   | 0.0108 / 270 / 8
LRGeomCG        | 0.0287 / 0.1 / 108 | 0.0333 / 12 / 300  | 0.0165 / 40 / 300
TNNR-ADMM       | 0.0221 / 0.3 / 300 | 0.0042 / 22 / 300  | 0.0021 / 94 / 300
TNNR-APGL       | 0.0055 / 0.3 / 300 | 0.0011 / 21 / 300  | 0.0009 / 95 / 300
TABLE VII. IMAGE RECOVERY VIA LRMC FOR DIFFERENT NOISE LEVELS ρ.