Completion: Matrix
A New Theory for Matrix Completion 3
Noisy Tensor Completion via the Sum-of-Squares Hierarchy 13
The Power of Convex Relaxation: Near-Optimal Matrix Completion 37
Matrix Completion with Noise 89
High-Rank Matrix Completion and Subspace Clustering with Missing Data 101
Low-Rank Matrix Completion Survey 127
A New Theory for Matrix Completion
B-DAT, School of Information & Control, Nanjing Univ Informat Sci & Technol
NO 219 Ningliu Road, Nanjing, Jiangsu, China, 210044
{gcliu,qsliu,xtyuan}@nuist.edu.cn
Abstract
Prevalent matrix completion theories rely on the assumption that the locations of
the missing data are distributed uniformly and randomly (i.e., uniform sampling).
Nevertheless, the reason for observations being missing often depends on the unseen
observations themselves, and thus the missing data in practice usually occurs in a
nonuniform and deterministic fashion rather than randomly. To break through the
limits of random sampling, this paper introduces a new hypothesis called isomeric
condition, which is provably weaker than the assumption of uniform sampling and
arguably holds even when the missing data is placed irregularly. Equipped with
this new tool, we prove a series of theorems for missing data recovery and matrix
completion. In particular, we prove that the exact solutions that identify the target
matrix are included as critical points by the commonly used nonconvex programs.
Unlike the existing theories for nonconvex matrix completion, which are built
upon the same condition as convex programs, our theory shows that nonconvex
programs have the potential to work with a much weaker condition. Compared to the existing studies on nonuniform sampling, our setup is more general.
1 Introduction
Missing data is a common occurrence in modern applications such as computer vision and image processing, significantly reducing the representativeness of data samples and thereby seriously distorting inferences about the data. Given this pressing situation, it is crucial to study the problem
of recovering the unseen data from a sampling of observations. Since the data in reality is often
organized in matrix form, it is of considerable practical significance to study the well-known problem
of matrix completion [1] which is to fill in the missing entries of a partially observed matrix.
Problem 1.1 (Matrix Completion). Denote the (i, j)th entry of a matrix as [·]ij . Let L0 ∈ Rm×n be
an unknown matrix of interest. In particular, the rank of L0 is also unknown. Given a sampling of
the entries in L0 and a 2D index set Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n} consisting of the locations
of the observed entries, i.e., given
{[L0 ]ij |(i, j) ∈ Ω} and Ω,
can we restore the missing entries whose indices are not included in Ω, in an exact and scalable
fashion? If so, under which conditions?
Due to its unique role in a broad range of applications, e.g., structure from motion and magnetic
resonance imaging, matrix completion has received extensive attention in the literature, e.g., [2–13].
∗ The work of Guangcan Liu is supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61622305 and Grant 61502238, and in part by the Natural Science Foundation of Jiangsu Province of China (NSFJPC) under Grant BK20160040.
† The work of Qingshan Liu is supported by NSFC under Grant 61532009.
‡ The work of Xiao-Tong Yuan is supported in part by NSFC under Grant 61402232 and Grant 61522308, and in part by NSFJPC under Grant BK20141003.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
In general, given no presumption about the nature of matrix entries, it is virtually impossible to
restore L0 as the missing entries can be of arbitrary values. That is, some assumptions are necessary
for solving Problem 1.1. Given the high-dimensional and massive nature of today's data, it is arguable that the target matrix L0 we wish to recover is often low rank [23]. Hence,
one may perform matrix completion by seeking a matrix with the lowest rank that also satisfies the
constraints given by the observed entries:
min_L rank(L), s.t. [L]ij = [L0]ij, ∀(i, j) ∈ Ω. (1)
Unfortunately, this idea is of little practical use because the problem above is NP-hard and cannot be
solved in polynomial time [15]. To achieve practical matrix completion, Candès and Recht [4]
suggested to consider an alternative that minimizes instead the nuclear norm which is a convex
envelope of the rank function [12]. Namely,
min_L ‖L‖∗, s.t. [L]ij = [L0]ij, ∀(i, j) ∈ Ω, (2)
where ‖·‖∗ denotes the nuclear norm, i.e., the sum of the singular values of a matrix. Rather
surprisingly, it is proved in [4] that the missing entries, with high probability, can be exactly restored
by the convex program (2), as long as the target matrix L0 is low rank and incoherent and the set Ω of
locations corresponding to the observed entries is a set sampled uniformly at random. This pioneering
work provides people several useful tools to investigate matrix completion and many other related
problems. Those assumptions, including low-rankness, incoherence and uniform sampling, are now
standard and widely used in the literature, e.g., [14, 17, 22, 24, 28, 33, 34, 36]. In particular, the
analyses in [17, 33, 36] show that, in terms of theoretical completeness, many nonconvex optimization
based methods are as powerful as the convex program (2). Unfortunately, these theories still depend
on the assumption of uniform sampling, and thus they cannot explain why there are many nonconvex
methods which often do better than the convex program (2) in practice.
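As a concrete illustration of how the convex program (2) is typically handled in practice, the following is a minimal sketch of an iterative soft-thresholded SVD, in the spirit of the SoftImpute algorithm of [10]. It solves a nuclear-norm-regularized surrogate rather than the exact equality-constrained program (2), and the shrinkage parameter and iteration count are our illustrative choices, not values from this paper.

```python
import numpy as np

def soft_impute(M, mask, lam=1.0, n_iter=300):
    """SoftImpute-style sketch for nuclear-norm matrix completion:
    repeatedly fill the unobserved entries with the current estimate,
    then soft-threshold the singular values of the filled-in matrix.
    `mask` is a boolean array marking the observed entries of M."""
    L = np.zeros_like(M)
    for _ in range(n_iter):
        Z = np.where(mask, M, L)                      # keep observed entries fixed
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)                  # shrink singular values
        L = (U * s) @ Vt                              # low-rank re-estimate
    return L
```

On a random rank-2 matrix with 60% of entries observed uniformly, this sketch recovers the target to within a few percent relative error, consistent with the recovery guarantees discussed above.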
The missing data in practice, however, often occurs in a nonuniform and deterministic fashion instead
of randomly. This is because the reason for an observation being missing usually depends on the
unseen observations themselves. For example, in structure from motion and magnetic resonance
imaging, typically the locations of the observed entries are concentrated around the main diagonal of
a matrix4 , as shown in Figure 1. Moreover, as pointed out by [19, 21, 23], the incoherence condition
is indeed not so consistent with the mixture structure of multiple subspaces, which is also a ubiquitous
phenomenon in practice. There has been sparse research in the direction of nonuniform sampling,
e.g., [18, 25–27, 31]. In particular, Negahban and Wainwright [26] studied the case of weighted
entrywise sampling, which is more general than the setup of uniform sampling but still a special
form of random sampling. Király et al. [18] considered deterministic sampling; their work is the most closely related to ours. However, they only established conditions for deciding whether a particular entry of the matrix can be restored. In other words, the setup of [18] may not handle well the dependence among the missing entries. In summary, matrix completion still lacks practical theories and methods, despite the considerable progress made in recent years.
To break through the limits of the setup of random sampling, in this paper we introduce a new
hypothesis called isomeric condition, which is a mixed concept that combines together the rank and
coherence of L0 with the locations and amount of the observed entries. In general, isomerism (noun
4 This statement means that the observed entries are concentrated around the main diagonal after a permutation of the sampling pattern Ω.
• We introduce a new hypothesis called the isomeric condition, which provably holds under the standard assumptions of uniform sampling, low-rankness and incoherence. In addition, we exemplify that the isomeric condition can hold even if the target matrix L0 is not incoherent and the missing entries are placed irregularly. Compared to the existing studies on nonuniform sampling, our setup is more general.
• Equipped with the isomeric condition, we prove that the exact solutions that identify L0 are included as critical points by the commonly used bilinear programs. Compared to the existing theories for nonconvex matrix completion, our theory is built upon a much weaker assumption and can therefore partially reveal the superiority of nonconvex programs over the convex methods based on (2).
• We prove that the isomeric condition is necessary and sufficient for the column and row projectors of L0 to be invertible given the sampling pattern Ω. This result implies that the isomeric condition is necessary for ensuring that the minimal-rank solution to (1) can identify the target L0.
The rest of this paper is organized as follows. Section 2 summarizes the mathematical notations used
in the paper. Section 3 introduces the proposed isomeric condition, along with some theorems for
matrix completion. Section 4 shows some empirical results and Section 5 concludes this paper. The
detailed proofs to all the proposed theorems are presented in the Supplementary Materials.
2 Notations
Capital and lowercase letters are used to represent matrices and vectors, respectively, except that the
lowercase letters, i, j, k, m, n, l, p, q, r, s and t, are used to denote some integers, e.g., the location of
an observation, the rank of a matrix, etc. For a matrix M , [M ]ij is its (i, j)th entry, [M ]i,: is its ith row
and [M ]:,j is its jth column. Let ω1 and ω2 be two 1D index sets; namely, ω1 = {i1 , i2 , · · · , ik } and
ω2 = {j1 , j2 , · · · , js }. Then [M ]ω1 ,: denotes the submatrix of M obtained by selecting the rows with
indices i1 , i2 , · · · , ik , [M ]:,ω2 is the submatrix constructed by choosing the columns j1 , j2 , · · · , js ,
and similarly for [M ]ω1 ,ω2 . For a 2D index set Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}, we imagine it
as a sparse matrix and, accordingly, define its “rows”, “columns” and “transpose” as follows: The
ith row Ωi = {j1 |(i1 , j1 ) ∈ Ω, i1 = i}, the jth column Ωj = {i1 |(i1 , j1 ) ∈ Ω, j1 = j} and the
transpose ΩT = {(j1 , i1 )|(i1 , j1 ) ∈ Ω}.
The special symbol (·)^+ is reserved to denote the Moore–Penrose pseudo-inverse of a matrix. More precisely, for a matrix M with Singular Value Decomposition (SVD)^5 M = U_M Σ_M V_M^T, its pseudo-inverse is given by M^+ = V_M Σ_M^{−1} U_M^T. For convenience, we adopt the conventions of using
span{M } to denote the linear space spanned by the columns of a matrix M , using y ∈ span{M } to
denote that a vector y belongs to the space span{M }, and using Y ∈ span{M } to denote that all the
column vectors of a matrix Y belong to span{M }.
Capital letters U , V , Ω and their variants (complements, subscripts, etc.) are reserved for left singular
vectors, right singular vectors and index set, respectively. For convenience, we shall abuse the
notation U (resp. V ) to denote the linear space spanned by the columns of U (resp. V ), i.e., the
column space (resp. row space). The orthogonal projection onto the column space U , is denoted by
PU and given by PU(M) = U U^T M, and similarly for the row space, PV(M) = M V V^T. The same
5 In this paper, SVD always refers to the skinny SVD. For a rank-r matrix M ∈ R^{m×n}, its SVD is of the form U_M Σ_M V_M^T, where U_M ∈ R^{m×r}, Σ_M ∈ R^{r×r} and V_M ∈ R^{n×r}.
3.1.1 Definitions
For the ease of understanding, we shall begin with a concept called k-isomerism (or k-isomeric in
adjective form), which could be regarded as an extension of low-rankness.
Definition 3.1 (k-isomeric). A matrix M ∈ Rm×l is called k-isomeric if and only if any k rows of
M can linearly represent all rows in M . That is,
rank ([M ]ω,: ) = rank (M ) , ∀ω ⊆ {1, 2, · · · , m}, |ω| = k,
where | · | is the cardinality of an index set.
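Since Definition 3.1 quantifies over all 1D index sets of cardinality k, it can be checked directly, albeit in time exponential in m. A brute-force sketch (the function name is ours, for illustration only):

```python
import numpy as np
from itertools import combinations

def is_k_isomeric(M, k):
    """Brute-force check of Definition 3.1: M is k-isomeric iff every
    k-row submatrix of M has the same rank as M itself.  The number of
    subsets is C(m, k), so this is only feasible for small matrices."""
    m = M.shape[0]
    r = np.linalg.matrix_rank(M)
    return all(
        np.linalg.matrix_rank(M[list(omega), :]) == r
        for omega in combinations(range(m), k)
    )
```

This reproduces the extreme example from the text: a rank-1 matrix with a single nonzero row is only m-isomeric, since any row subset missing the nonzero row has rank 0.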
In general, k-isomerism is somewhat similar to the Spark [37], which is the size of the smallest linearly dependent subset of the rows of a matrix. For a matrix M to be k-isomeric, it is necessary that rank(M) ≤ k, but not sufficient. In fact, k-isomerism is also related to the concept of
coherence [4, 21]. When the coherence of a matrix M ∈ R^{m×l} is not too high, the rows of M are sufficiently spread out, and thus M can be k-isomeric with a small k, e.g., k = rank(M). Whenever
the coherence of M is very high, one may need a large k to satisfy the k-isomeric property. For
example, consider an extreme case where M is a rank-1 matrix with one row being 1 and everywhere
else being 0. In this case, we need k = m to ensure that M is k-isomeric.
While Definition 3.1 involves all 1D index sets of cardinality k, we often need the isomeric property
to be associated with a certain 2D index set Ω. To this end, we define below a concept called
Ω-isomerism (or Ω-isomeric in adjective form).
Definition 3.2 (Ω-isomeric). Let M ∈ R^{m×l} and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Suppose that Ωj ≠ ∅ (the empty set), ∀1 ≤ j ≤ n. Then the matrix M is called Ω-isomeric if and only if
rank([M]Ωj,:) = rank(M), ∀j = 1, 2, · · · , n.
Note here that only the number of rows in M is required to be compatible with the row indices included in Ω; hence l ≠ n is allowed.
Generally, Ω-isomerism is less strict than k-isomerism. Provided that |Ωj| ≥ k, ∀1 ≤ j ≤ n, a matrix M being k-isomeric ensures that M is Ω-isomeric as well, but not vice versa. For the extreme example
where M is nonzero at only one row, interestingly, M can be Ω-isomeric as long as the locations of
the nonzero elements are included in Ω.
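Definition 3.2, by contrast, only inspects the n column index sets of Ω, so it can be verified in polynomial time given M and Ω. A small sketch (the function name and the input encoding of Ω as a set of (i, j) pairs are our own choices):

```python
import numpy as np

def is_omega_isomeric(M, Omega, n):
    """Check Definition 3.2: for every column j of the 2D index set
    Omega, the rows of M indexed by Omega^j = {i : (i, j) in Omega}
    must have the same rank as M itself.  Omega is a set of (i, j)
    pairs and n is the number of columns of the sampling pattern."""
    r = np.linalg.matrix_rank(M)
    for j in range(n):
        rows = sorted({i for (i, jj) in Omega if jj == j})
        if not rows:                  # Definition 3.2 requires Omega^j nonempty
            return False
        if np.linalg.matrix_rank(M[rows, :]) < r:
            return False
    return True
```

For the extreme example in the text (M nonzero at only one row), this confirms that M is Ω-isomeric exactly when every column of Ω contains the nonzero row.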
With the notation of ΩT = {(j1 , i1 )|(i1 , j1 ) ∈ Ω}, the isomeric property could be also defined on
the column vectors of a matrix, as shown in the following definition.
To solve Problem 1.1 without the imperfect assumption of missing at random, as will be shown later,
we need to assume that L0 is Ω/ΩT -isomeric. This condition has excluded the unidentifiable cases
where any rows or columns of L0 are wholly missing. In fact, whenever L0 is Ω/ΩT -isomeric, the
number of observed entries in each row and column of L0 has to be greater than or equal to the rank
of L0; this is consistent with the results in [20]. Moreover, Ω/ΩT-isomerism in fact handles well the cases where L0 has high coherence. For example, consider an extreme case where L0 is 1 at only
one element and 0 everywhere else. In this case, L0 cannot be Ω/ΩT -isomeric unless the nonzero
element is observed. So, generally, it is possible to restore the missing entries of a highly coherent
matrix, as long as the Ω/ΩT -isomeric condition is obeyed.
It is easy to see that the above lemma is still valid even when the condition of Ω-isomerism is replaced
by k-isomerism. Thus, hereafter, we may say that a space is isomeric (k-isomeric, Ω-isomeric or
ΩT -isomeric) as long as its basis matrix is isomeric. In addition, the isomeric property is subspace
successive, as shown in the next lemma.
Lemma 3.2. Let Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n} and U0 ∈ Rm×r be the basis matrix of a
Euclidean subspace embedded in Rm . Suppose that U is a subspace of U0 , i.e., U = U0 U0T U . If U0
is Ω-isomeric then U is Ω-isomeric as well.
In short, the above lemma states that any subspace of an isomeric space is itself isomeric.
3.2 Results
In this subsection, we shall show how the isomeric condition can take effect in the context of
nonuniform sampling, establishing some theorems pertaining to missing data recovery [35] as well
as matrix completion.
Unlike the theory in [35], whose condition is unverifiable, our k-isomeric condition can be verified in finite time. Notice that the problem of missing data recovery is closely related to matrix completion, which in effect restores the missing entries in multiple data vectors simultaneously.
Hence, Theorem 3.3 can be naturally generalized to the case of matrix completion, as will be shown
in the next subsection.
Theorem 3.4 tells us that, in general, even when the locations of the missing entries are interrelated
and nonuniformly distributed, the target matrix L0 can be restored as long as we have found a proper
dictionary A. This motivates us to consider the commonly used bilinear program that seeks both A
and X simultaneously:
min_{A,X} (1/2)‖A‖_F^2 + (1/2)‖X‖_F^2, s.t. PΩ(AX − L0) = 0, (8)
where A ∈ Rm×p and X ∈ Rp×n . The problem above is bilinear and therefore nonconvex. So, it
would be hard to obtain a strong performance guarantee as done in the convex programs, e.g., [4, 21].
Interestingly, under a very mild condition, the problem in (8) is proved to include the exact solutions
that identify the target matrix L0 as the critical points.
Theorem 3.5. Let L0 ∈ Rm×n and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Denote the rank and SVD
of L0 as r0 and U0 Σ0 V0T , respectively. If L0 is Ω/ΩT -isomeric then the exact solution, denoted by
(A0 , X0 ) and given by
A0 = U0 Σ0^{1/2} Q^T, X0 = Q Σ0^{1/2} V0^T, ∀Q ∈ R^{p×r0}, Q^T Q = I,
is a critical point to the problem in (8).
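The construction in Theorem 3.5 can be sanity-checked numerically: the pair (A0, X0) is feasible for any Ω (since A0 X0 = L0 exactly), and its objective value in (8) equals ‖L0‖∗, consistent with the variational characterization of the nuclear norm proven in [16]. The check below verifies only the stated construction, not criticality itself; the dimensions are arbitrary small values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r0, p = 8, 6, 2, 4
L0 = rng.normal(size=(m, r0)) @ rng.normal(size=(r0, n))   # rank-r0 target

# Skinny SVD of L0: keep only the r0 nonzero singular values.
U0, s0, V0t = np.linalg.svd(L0, full_matrices=False)
U0, s0, V0t = U0[:, :r0], s0[:r0], V0t[:r0, :]

# Any Q in R^{p x r0} with Q^T Q = I works; take one from a QR factorization.
Q = np.linalg.qr(rng.normal(size=(p, r0)))[0]

A0 = U0 @ np.diag(np.sqrt(s0)) @ Q.T                       # m x p
X0 = Q @ np.diag(np.sqrt(s0)) @ V0t                        # p x n

assert np.allclose(A0 @ X0, L0)   # feasible for any sampling pattern Omega
obj = 0.5 * np.linalg.norm(A0, "fro") ** 2 + 0.5 * np.linalg.norm(X0, "fro") ** 2
assert np.isclose(obj, s0.sum())  # objective of (8) equals ||L0||_*
```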
To exhibit the power of program (8), however, the parameter p, which indicates the number of
columns in the dictionary matrix A, must be close to the true rank of the target matrix L0 . This is
[Figure 2: four panels plotting success rate against rank(L0); axis ticks omitted.]
Figure 2: Comparing the bilinear program (9) (p = m) with the convex method (2). The numbers plotted in the figures are the success rates over 20 random trials. The white and black points mean "succeed" and "fail", respectively. Here a trial counts as a success if PSNR ≥ 40dB, where PSNR stands for peak signal-to-noise ratio.
impractical in the cases where the rank of L0 is unknown. Notice that the Ω-isomeric condition imposed on A requires
rank(A) ≤ |Ωj|, ∀j = 1, 2, · · · , n.
This, together with the condition L0 ∈ span{A}, essentially requires solving a low-rank matrix recovery problem [14]. Hence, we suggest combining the formulation (7) with the popular idea of
nuclear norm minimization, resulting in a bilinear program that jointly estimates both the dictionary
matrix A and the representation matrix X by
min_{A,X} ‖A‖∗ + (1/2)‖X‖_F^2, s.t. PΩ(AX − L0) = 0, (9)
which, by coincidence, has been mentioned in a paper about optimization [32]. Similar to (8), the
program in (9) has the following theorem to guarantee its performance.
Theorem 3.6. Let L0 ∈ Rm×n and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Denote the rank and SVD
of L0 as r0 and U0 Σ0 V0T , respectively. If L0 is Ω/ΩT -isomeric then the exact solution, denoted by
(A0 , X0 ) and given by
A0 = U0 Σ0^{2/3} Q^T, X0 = Q Σ0^{1/3} V0^T, ∀Q ∈ R^{p×r0}, Q^T Q = I,
is a critical point to the problem in (9).
Unlike (8), which possesses superior performance only if p is close to rank (L0 ) and the initial
solution is chosen carefully, the bilinear program in (9) can work well by simply choosing p = m
and using A = I as the initial solution. To see why, one essentially needs to figure out the conditions
under which a specific optimization procedure can produce an optimal solution that meets an exact
solution. This requires extensive justifications and we leave it as future work.
4 Simulations
To verify the superiority of the nonconvex matrix completion methods over the convex program (2), we experiment with randomly generated matrices. We generate a collection of m × n
(m = n = 100) target matrices according to the model of L0 = BC, where B ∈ Rm×r0 and
C ∈ Rr0 ×n are N (0, 1) matrices. The rank of L0 , i.e., r0 , is configured as r0 = 1, 5, 10, · · · , 90, 95.
Regarding the index set Ω consisting of the locations of the observed entries, we consider two settings: one creates Ω by using a Bernoulli model to randomly sample a subset from {1, · · · , m} × {1, · · · , n} (referred to as "uniform"); the other, as in Figure 1, concentrates the locations of the observed entries around the main diagonal of a matrix (referred to as "nonuniform"). The observation fraction is set to |Ω|/(mn) = 0.01, 0.05, · · · , 0.9, 0.95. For each
pair of (r0 , |Ω|/(mn)), we run 20 trials, resulting in 8000 simulations in total.
When p = m and the identity matrix is used to initialize the dictionary A, we have empirically found
that program (8) has the same performance as (2). This is not strange, because it has been proven in [16] that ‖L‖∗ = min_{A,X} (1/2)(‖A‖_F^2 + ‖X‖_F^2), s.t. L = AX. Figure 2 compares the bilinear
Acknowledgment
We would like to thank the anonymous reviewers and meta-reviewers for providing many valuable comments that helped us refine this paper.
References
[1] Emmanuel Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[2] Emmanuel Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
[3] William E. Bishop and Byron M. Yu. Deterministic symmetric positive semidefinite matrix completion.
In Neural Information Processing Systems, pages 2762–2770, 2014.
[4] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations
of Computational Mathematics, 9(6):717–772, 2009.
[5] Eyal Heiman, Gideon Schechtman, and Adi Shraibman. Deterministic algorithms for matrix completion.
Random Structures and Algorithms, 45(2):306–317, 2014.
[6] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries.
IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
[7] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries.
Journal of Machine Learning Research, 11:2057–2078, 2010.
[8] Akshay Krishnamurthy and Aarti Singh. Low-rank matrix and tensor completion via adaptive sampling.
In Neural Information Processing Systems, pages 836–844, 2013.
[9] Troy Lee and Adi Shraibman. Matrix completion from any given set of observations. In Neural Information
Processing Systems, pages 1781–1787, 2013.
[10] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning
large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.
[11] Karthik Mohan and Maryam Fazel. New restricted isometry results for noisy low-rank recovery. In IEEE
International Symposium on Information Theory, pages 1573–1577, 2010.
arXiv:1501.06521v3 [cs.LG] 18 Feb 2016
Abstract
In the noisy tensor completion problem we observe m entries (whose locations are chosen uniformly at random) from an unknown n1 × n2 × n3 tensor T. We assume that T is entry-wise close to being rank r. Our goal is to fill in its missing entries using as few observations as possible. Let n = max(n1, n2, n3). We show that if m = n^{3/2} r then there is a polynomial time algorithm based on the sixth level of the sum-of-squares hierarchy for completing it. Our estimate agrees with almost all of T's entries almost exactly and works even when our observations are corrupted by noise. This is also the first algorithm for tensor completion that works in the overcomplete case when r > n, and in fact it works all the way up to r = n^{3/2−ε}.
Our proofs are short and simple and are based on establishing a new connection between noisy tensor
completion (through the language of Rademacher complexity) and the task of refuting random constant
satisfaction problems. This connection seems to have gone unnoticed even in the context of matrix
completion. Furthermore, we use this connection to show matching lower bounds. Our main technical
result is in characterizing the Rademacher complexity of the sequence of norms that arise in the sum-of-
squares relaxations to the tensor nuclear norm. These results point to an interesting new direction: Can
we explore computational vs. sample complexity tradeoffs through the sum-of-squares hierarchy?
∗ Harvard John A. Paulson School of Engineering and Applied Sciences. Email: b@boazbarak.org
† Massachusetts Institute of Technology, Department of Mathematics and the Computer Science and Artificial Intelligence Lab. Email: moitra@mit.edu. This work is supported in part by a grant from the MIT NEC Corporation and a Google Research Award.
There are extensions to non-uniform sampling models [55, 24], as well as various efficiency improvements
[47, 40]. What is particularly remarkable about these guarantees is that the number of observations needed
is within a logarithmic factor of the number of parameters — (n1 + n2 )r — that define the model.
In fact, there are benefits to working with even higher-order structure but so far there has been little
progress on natural extensions to the tensor setting. To motivate this problem, consider the Groupon
Problem (which we introduce here to illustrate this point) where the goal is to predict user-activity ratings.
The challenge is that which activities we should recommend (and how much a user liked a given activity)
depends on time as well — weekday/weekend, day/night, summer/fall/winter/spring, etc. or even some
combination of these. As above, we can cast this problem as a large, partially observed tensor where the
first index represents a user, the second index represents an activity and the third index represents the time
period. It is again natural to model it as being close to low rank, under the assumption that a much smaller
number of (latent) factors about the interests of the user, the type of activity and the time period should
contribute to the rating. How many entries of the tensor do we need to observe in order to fill in its missing
entries? This problem is emblematic of a larger issue: Can we always solve linear inverse problems when
the number of observations is comparable to the number of parameters in the mode, or is computational
intractability an obstacle?
In fact, one of the advantages of working with tensors is that their decompositions are unique in important
ways that matrix decompositions are not. There has been a groundswell of recent work that uses tensor
decompositions for exactly this reason for parameter learning in phylogenetic trees [60], HMMs [60], mixture
models [46], topic models [2] and to solve community detection [3]. In these applications, one assumes access
to the entire tensor (up to some sampling noise). But given that the underlying tensors are low-rank, can
we observe fewer of their entries and still utilize tensor methods?
A wide range of approaches to solving tensor completion have been proposed [56, 35, 70, 73, 61, 52, 48, 14, 74]. However, in terms of provable guarantees, none¹ of them improves upon the following naïve algorithm. If the unknown tensor T is n1 × n2 × n3, we can treat it as a collection of n1 matrices, each of size n2 × n3. It
1 Most of the existing approaches rely on computing the tensor nuclear norm, which is hard to compute [39, 41]. The only
other algorithms we are aware of [48, 14] require that the factors be orthogonal. This is a rather strong assumption. First,
orthogonality requires the rank to be at most n. Second, even when r ≤ n, most tensors need to be “whitened” to be put in this
form and then a random sample from the “whitened” tensor would correspond to a (dense) linear combination of the entries of
the original tensor, which would be quite a different sampling model.
observations. Moreover, our algorithm works even when the observations are corrupted by noise. When n = n1 = n2 = n3, this amounts to about n^{1/2} r observations per slice, which is much smaller than what we would need to apply matrix completion on each slice separately. Our algorithm needs to leverage the structure between the various slices.
where σℓ is a scalar and aℓ, bℓ and cℓ are vectors of length n1, n2 and n3 respectively. Here ∆ is a tensor that represents noise. Its entries can be thought of as representing model misspecification, because T is not exactly low rank, or noise in our observations, or both. We will only make assumptions about the average and maximum absolute value of the entries in ∆. The vectors aℓ, bℓ and cℓ are called factors, and we will assume that their norms are roughly √n_i for reasons that will become clear later. Moreover, we will assume that the magnitude of each of their entries is bounded by C, in which case we call the vectors C-incoherent². (Note that a random vector of dimension n_i and norm √n_i will be O(√(log n_i))-incoherent with high probability.) The advantage of these conventions is that a typical entry in T does not become vanishingly small as we increase the dimensions of the tensor. This will make it easier to state and interpret the error bounds of our algorithm.
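To make these conventions concrete, here is a small sketch that samples a noiseless tensor of the form (1) with N(0,1) factors (whose rows then have norm ≈ √n_i, as in the text) and reports its empirical incoherence; the function name and parameters are our own choices for illustration.

```python
import numpy as np

def low_rank_tensor(n1, n2, n3, r, seed=0):
    """Sample a noiseless tensor of the form (1),
        T = sum_l sigma_l * (a_l outer b_l outer c_l),
    with N(0,1) coefficients and factors, and return T together with
    the largest factor entry in absolute value (the empirical C)."""
    rng = np.random.default_rng(seed)
    sigma = rng.normal(size=r)
    a = rng.normal(size=(r, n1))      # each row has norm ~ sqrt(n1)
    b = rng.normal(size=(r, n2))
    c = rng.normal(size=(r, n3))
    T = np.einsum("l,li,lj,lk->ijk", sigma, a, b, c)
    C = max(np.abs(a).max(), np.abs(b).max(), np.abs(c).max())
    return T, C
```

With these scalings, the entries of T stay O(√r) rather than shrinking as the dimensions grow, which is exactly the point of the convention above.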
Let Ω represent the locations of the entries that we observe, which (as is standard) are chosen uniformly
at random and without replacement. Set |Ω| = m. Our goal is to output a hypothesis X that has small
entry-wise error, defined as:
err(X) = (1/(n1 n2 n3)) Σ_{i,j,k} |X_{i,j,k} − T_{i,j,k}|.
This measures the error on both the observed and unobserved entries of T. Our goal is to give algorithms that achieve vanishing error as the size of the problem increases. Moreover, we will want algorithms that need as few observations as possible. Here and throughout, let n1 ≤ n2 ≤ n3 and n = max{n1, n2, n3}. Our main result is:
main result is:
Theorem 1.1 (Main theorem). Suppose we are given m observations whose locations are chosen uniformly at random (and without replacement) from a tensor T of the form (1), where each of the factors aℓ, bℓ and cℓ is C-incoherent. Let δ = (1/(n1 n2 n3)) Σ_{i,j,k} |∆_{i,j,k}|, and let r* = Σ_{ℓ=1}^{r} |σℓ|. Then there is a polynomial time algorithm that outputs a hypothesis X that with probability 1 − ε satisfies
err(X) ≤ 4C³ r* √( ((n1)^{1/2} (n2 + n3) log⁴ n + log(2/ε)) / m ) + 2δ.
2 Incoherence is often defined based on the span of the factors, but we will allow the number of factors to be larger than any of the dimensions of the tensor, so we will need an alternative way to ensure that the non-zero entries of the factors are spread out.
Since the error bound above is quite involved, let us dissect the terms in it. In fact, having an additive
δ in the error bound is unavoidable. We have not assumed anything about ∆ in (1) except a bound on
the average and maximum magnitude of its entries. If ∆ were a random tensor whose entries are +δ and
−δ then no matter how many entries of T we observe, we cannot hope to obtain error less than δ on the
unobserved entries³. The crucial point is that the remaining term in the error bound becomes o(1) when m = Ω̃((r*)² n^{3/2}), which for polylogarithmic r* improves over the naïve algorithm for tensor completion by a polynomial factor in terms of the number of observations. Moreover, our algorithm works without any constraints that the factors aℓ, bℓ and cℓ be orthogonal or even have low inner-product.
In non-degenerate cases we can even remove another factor of r* from the number of observations we need. Suppose that T is a tensor as in (1), but let the σℓ be Gaussian random variables with mean zero and variance one. The factors aℓ, bℓ and cℓ are still fixed, but because of the randomness in the coefficients σℓ, the entries of T are now random variables.
Corollary 1.2. Suppose we are given m observations whose locations are chosen uniformly at random (and without replacement) from a tensor T of the form (1), where each coefficient σℓ is a Gaussian random variable with mean zero and variance one, and each of the factors aℓ, bℓ and cℓ is C-incoherent. Further, suppose that for a 1 − o(1) fraction of the entries of T we have var(T_{i,j,k}) ≥ r/polylog(n) = V, and that ∆ is a tensor where each entry is a Gaussian with mean zero and variance o(V). Then there is a polynomial time algorithm that outputs a hypothesis X that satisfies
Xi,j,k = (1 ± o(1)) Ti,j,k
for a 1 − o(1) fraction of the entries. The algorithm succeeds with probability at least 1 − o(1) over the randomness of the locations of the observations, and the realizations of the random variables σℓ and the entries of ∆. Moreover, the algorithm uses m = C⁶ n^{3/2} r polylog(n) observations.
In the setting above, it is enough that the coefficients σℓ are random and that the non-zero entries in the factors are spread out to ensure that the typical entry in T has variance about r. Consequently, the typical entry in T is about √r. This fact combined with the error bounds in Theorem 1.1 immediately yields the above corollary. Remarkably, the guarantee is interesting even when r = n^{3/2−ε} (the so-called overcomplete
case). In this setting, if we observe a subpolynomial fraction of the entries of T we are able to recover almost
all of the remaining entries almost entirely, even though there are no known algorithms for decomposing
an overcomplete, third-order tensor even if we are given all of its entries, at least without imposing much
stronger conditions that the factors be nearly orthogonal [36].
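To make the scaling concrete, here is a small numerical sketch (illustrative, not from the paper) of the random model in Corollary 1.2: with 1-incoherent factors (entries ±1, norm √n) and Gaussian coefficients σ_ℓ, the typical entry of T has variance about r, hence magnitude about √r.

```python
import numpy as np

# Sketch of the random model: T = sum_l sigma_l * a_l ⊗ b_l ⊗ c_l,
# with sign factors (C = 1 incoherent: entries ±1, norm sqrt(n))
# and Gaussian coefficients sigma_l. All sizes are illustrative.
rng = np.random.default_rng(0)
n, r = 20, 50

A = rng.choice([-1.0, 1.0], size=(r, n))
B = rng.choice([-1.0, 1.0], size=(r, n))
C = rng.choice([-1.0, 1.0], size=(r, n))
sigma = rng.standard_normal(r)

# T[i,j,k] = sum_l sigma_l * A[l,i] * B[l,j] * C[l,k]
T = np.einsum('l,li,lj,lk->ijk', sigma, A, B, C)

# Each entry is a sum of r independent ±sigma_l terms, so its
# variance is r and its typical magnitude is about sqrt(r).
mean_sq = (T ** 2).mean()
print(mean_sq, r)  # mean squared entry is on the order of r
```

The point of the sketch is only the scaling: the average of T² concentrates near r, so thresholding errors at the √r scale (as in Corollary 1.2) is meaningful.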
We believe that this work is a natural first step in designing practically efficient algorithms for tensor
completion. Our algorithms manage to leverage the structure across the slices through the tensor, instead
of treating each slice as an independent matrix completion problem. Now that we know this is possible,
a natural follow-up question is to get more efficient algorithms. Our algorithms are based on the sixth
level of the sum-of-squares hierarchy and run in polynomial time, but are quite far from being practically
efficient as stated. Recent work of Hopkins et al. [44] shows how to speed up sum-of-squares and obtain
nearly linear time algorithms for a number of problems where the only previously known algorithms ran
in prohibitively large polynomial time. Another approach would be to obtain similar
guarantees for alternating minimization. Currently, the only known approaches [48] require that the factors
are orthonormal and only work in the undercomplete case. Finally, it would be interesting to get algorithms
that recover a low rank tensor exactly when there is no noise.
3 The factor of 2 is not important, and comes from needing a bound on the empirical error of how well the low rank part of
T itself agrees with our observations so far. We could replace it with any other constant factor that is larger than 1.
Organization
In Section 2 we introduce Rademacher complexity, the tensor nuclear norm and strong refutation. We
connect these concepts by showing that any norm that can be computed in polynomial time and has good
Rademacher complexity yields an algorithm for strongly refuting random 3-SAT. In Section 3 we show
how a particular algorithm for strong refutation can be embedded into the sum-of-squares hierarchy and
directly leads to a norm that can be computed in polynomial time and has good Rademacher complexity.
Recall that err(X) is the average entry-wise error between X and T, over all (observed and unobserved)
entries. Also recall that among the candidate X's that have low empirical error, the convex program finds
the one that minimizes ‖X‖_K for some polynomial time computable norm. The way we will choose the norm
‖ · ‖_K and our bound on the maximum magnitude of an entry of ∆ will guarantee that the low rank part
of T will with high probability be a feasible solution. This ensures that ‖X‖_K for the X we find is not too
large either. One way to bound err(X) is to show that no hypothesis in the unit norm ball can have too
large a gap between its error and its empirical error (and then dilate the unit norm ball so that it contains
X). With this in mind, we define:
Definition 2.2. For a norm ‖ · ‖_K and a set Ω of observations, the generalization error is

sup_{‖X‖_K ≤ 1} ( err(X) − emp-err(X) )
It turns out that one can bound the generalization error via the Rademacher complexity.
Definition 2.3. Let Ω = {(i₁, j₁, k₁), (i₂, j₂, k₂), ..., (i_m, j_m, k_m)} be a set of m locations chosen uniformly
at random (and without replacement) from [n₁] × [n₂] × [n₃], and let σ₁, σ₂, ..., σ_m be independent random
±1 variables. The Rademacher complexity of (the unit ball of) the norm ‖ · ‖_K is defined as

R^m(‖ · ‖_K) = E_{Ω,σ} [ sup_{‖X‖_K ≤ 1} | (1/m) Σ_{ℓ=1}^m σ_ℓ X_{i_ℓ,j_ℓ,k_ℓ} | ]
It follows from a standard symmetrization argument from empirical process theory [51, 11] that the
Rademacher complexity does indeed bound the generalization error.
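As a toy illustration (not from the paper), the Rademacher complexity of a simple norm ball can be estimated by Monte Carlo. The tensor norms studied here have no closed-form supremum, so we use the entrywise ℓ1 ball instead, for which the supremum of a linear functional is just the largest absolute signed count landing on any entry.

```python
import numpy as np

# Monte Carlo estimate of the Rademacher complexity in Definition 2.3
# for a toy norm: the entrywise l1 ball {X : sum |X_ijk| <= 1}.
# For a linear functional, the sup over the l1 ball is the max absolute
# coefficient, i.e. max over entries of |sum of the sigmas landing there|.
rng = np.random.default_rng(1)
n1 = n2 = n3 = 10
m = 200
trials = 300

vals = []
for _ in range(trials):
    # m locations with replacement (negligibly different when m << n^3)
    locs = rng.integers(0, [n1, n2, n3], size=(m, 3))
    sigma = rng.choice([-1.0, 1.0], size=m)
    counts = np.zeros((n1, n2, n3))
    np.add.at(counts, (locs[:, 0], locs[:, 1], locs[:, 2]), sigma)
    vals.append(np.abs(counts).max() / m)

R_hat = np.mean(vals)
print(R_hat)  # small: the l1 ball has very low Rademacher complexity
```

The estimate is tiny because the ℓ1 ball is very small; the interesting norms in this paper (tensor nuclear norm, SOS norms) have much larger unit balls, and the whole game is bounding their complexity.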
Theorem 2.4. Let δ ∈ (0, 1) and suppose each X with ‖X‖_K ≤ 1 has bounded loss — i.e. |X_{i,j,k} − T_{i,j,k}| ≤ a
— and that the locations (i, j, k) are chosen uniformly at random and without replacement. Then with probability
at least 1 − δ, for every X with ‖X‖_K ≤ 1, we have

err(X) ≤ emp-err(X) + 2 R^m(‖ · ‖_K) + 2a √( ln(1/δ) / m )
where the last line follows by the concavity of sup(·). Now we can use the Rademacher (random ±1) variables
{σ_ℓ}_ℓ and rewrite the right hand side of the above expression as follows:

(∗) ≤ E_{Ω,Ω′,σ} [ sup_{‖X‖_K ≤ 1} (1/m) Σ_{ℓ=1}^m σ_ℓ ( |X_{i_ℓ,j_ℓ,k_ℓ} − T_{i_ℓ,j_ℓ,k_ℓ}| − |X_{i′_ℓ,j′_ℓ,k′_ℓ} − T_{i′_ℓ,j′_ℓ,k′_ℓ}| ) ]

≤ E_{Ω,Ω′,σ} [ sup_{‖X‖_K ≤ 1} (1/m) Σ_{ℓ=1}^m σ_ℓ |X_{i_ℓ,j_ℓ,k_ℓ} − T_{i_ℓ,j_ℓ,k_ℓ}| + (1/m) Σ_{ℓ=1}^m σ_ℓ |X_{i′_ℓ,j′_ℓ,k′_ℓ} − T_{i′_ℓ,j′_ℓ,k′_ℓ}| ]

≤ 2 E_{Ω,σ} [ sup_{‖X‖_K ≤ 1} (1/m) Σ_{ℓ=1}^m σ_ℓ |X_{i_ℓ,j_ℓ,k_ℓ} − T_{i_ℓ,j_ℓ,k_ℓ}| ]

≤ 2 E_{Ω,σ} [ sup_{‖X‖_K ≤ 1} (1/m) Σ_{ℓ=1}^m σ_ℓ ( |X_{i_ℓ,j_ℓ,k_ℓ}| + |T_{i_ℓ,j_ℓ,k_ℓ}| ) ]

≤ 2 E_{Ω,σ} [ (1/m) Σ_{ℓ=1}^m σ_ℓ |T_{i_ℓ,j_ℓ,k_ℓ}| ] + 2 E_{Ω,σ} [ sup_{‖X‖_K ≤ 1} (1/m) Σ_{ℓ=1}^m σ_ℓ |X_{i_ℓ,j_ℓ,k_ℓ}| ]

= 2 E_{Ω,σ} [ (1/m) Σ_{ℓ=1}^m σ_ℓ T_{i_ℓ,j_ℓ,k_ℓ} ] + 2 E_{Ω,σ} [ sup_{‖X‖_K ≤ 1} (1/m) Σ_{ℓ=1}^m σ_ℓ X_{i_ℓ,j_ℓ,k_ℓ} ]

where the second, fourth and fifth inequalities use the triangle inequality. The equality uses the fact that the
σ_ℓ's are random signs and hence can absorb the absolute value around the terms that they multiply.
second term above in the last expression is exactly the Rademacher complexity that we defined earlier. This
argument only shows that the Rademacher complexity bounds the expected generalization error. However
it turns out that we can also use the Rademacher complexity to bound the generalization error with high
probability by applying McDiarmid’s inequality. See for example [5]. We also remark that generalization
bounds are often stated in the setting where samples are drawn i.i.d., but here the locations of our observations
are sampled without replacement. Nevertheless for the settings of m we are interested in, the fraction of
our observations that are repeats is o(1) — in fact it is subpolynomial — and we can move back and forth
between both sampling models at negligible loss in our bounds.
In much of what follows it will be convenient to think of Ω = {(i₁, j₁, k₁), (i₂, j₂, k₂), ..., (i_m, j_m, k_m)} and
{σ_ℓ}_ℓ as being represented by a sparse tensor Z, defined below.

Definition 2.5. Let Z be an n₁ × n₂ × n₃ tensor such that

Z_{i,j,k} = 0 if (i, j, k) ∉ Ω, and Z_{i,j,k} = Σ_{ℓ : (i,j,k)=(i_ℓ,j_ℓ,k_ℓ)} σ_ℓ otherwise.
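A direct sketch of this construction (illustrative sizes, not from the paper):

```python
import numpy as np

# Build the sparse sign tensor Z of Definition 2.5 from observed
# locations and Rademacher signs. Sizes are illustrative.
rng = np.random.default_rng(2)
n1, n2, n3, m = 8, 9, 10, 40

# m locations sampled without replacement from [n1] x [n2] x [n3]
flat = rng.choice(n1 * n2 * n3, size=m, replace=False)
locs = np.stack(np.unravel_index(flat, (n1, n2, n3)), axis=1)
sigma = rng.choice([-1.0, 1.0], size=m)

Z = np.zeros((n1, n2, n3))
# locations are distinct, so each observed entry is just its sign
Z[locs[:, 0], locs[:, 1], locs[:, 2]] = sigma

# Z is zero off the observed set and ±1 on it
print(int(np.count_nonzero(Z)))  # m
```

Because the locations are sampled without replacement, no two signs ever land on the same entry; with replacement one would instead accumulate the signs, as in the definition.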
The tensor nuclear norm⁴ of X, which is denoted by ‖X‖_A, is the infimum over α such that X/α ∈ A.
In particular ‖T − ∆‖_A ≤ r∗. Finally we give an elementary bound on the Rademacher complexity of the
tensor nuclear norm. Recall that n = max(n₁, n₂, n₃).

Lemma 2.8. R^m(‖ · ‖_A) = O( C³ √(n/m) )
Proof. Recall the definition of Z given in Definition 2.5. With this we can write

E_{Ω,σ} [ sup_{‖X‖_A ≤ 1} Σ_{ℓ=1}^m σ_ℓ X_{i_ℓ,j_ℓ,k_ℓ} ] = E_{Ω,σ} [ sup_{C-incoherent a,b,c} |⟨Z, a ⊗ b ⊗ c⟩| ]
We can now adapt the discretization approach in [33], although our task is considerably simpler because
we are constrained to C-incoherent a's. In particular, let

S = { a | a is C-incoherent and a ∈ (ϵ/√n) Zⁿ }

By standard bounds on the size of an ϵ-net [58], we get that |S| ≤ O(C/ϵ)ⁿ. Suppose that |⟨Z, a ⊗ b ⊗ c⟩| ≤ M
for all a, b, c ∈ S. Then for an arbitrary, but C-incoherent a we can expand it as a = Σ_i ϵⁱ a_i where each
a_i ∈ S, and similarly for b and c. And now

|⟨Z, a ⊗ b ⊗ c⟩| ≤ Σ_i Σ_j Σ_k ϵ^{i+j+k} |⟨Z, a_i ⊗ b_j ⊗ c_k⟩| ≤ (1 − ϵ)⁻³ M

Moreover since each entry in a ⊗ b ⊗ c has magnitude at most C³ we can apply a Chernoff bound to conclude
that for any particular a, b, c ∈ S we have

|⟨Z, a ⊗ b ⊗ c⟩| ≤ O( C³ √( m log 1/γ ) )
4 The usual definition of the tensor nuclear norm has no constraints that the vectors a, b and c be C-incoherent. However,
adding this additional requirement only serves to further restrict the unit norm ball, while ensuring that the low rank part of T
(when scaled down) is still in it, since the factors of T are anyways assumed to be C-incoherent. This makes it easier to prove
recovery guarantees because we do not need to worry about sparse vectors behaving very differently than incoherent ones, and
since we are not going to compute this norm anyways this modification will make our analysis easier.
R^m(‖ · ‖_A) ≤ ( (1 − ϵ)⁻³ / m ) max_{a,b,c∈S} |⟨Z, a ⊗ b ⊗ c⟩| = O( C³ √(n/m) )

and this completes the proof.
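The Chernoff step above can be seen numerically (an illustrative sketch, not from the paper): for fixed sign vectors a, b, c (1-incoherent), the inner product ⟨Z, a ⊗ b ⊗ c⟩ is a sum of m bounded random signs and so concentrates at scale √m rather than m.

```python
import numpy as np

# For fixed sign vectors a, b, c (1-incoherent: entries ±1, norm sqrt(n))
# and the random sign tensor Z with m nonzeros, <Z, a⊗b⊗c> is a sum of
# m independent ±1 terms, hence of typical size sqrt(m), not m.
rng = np.random.default_rng(3)
n, m = 15, 2000  # m <= n^3 observed locations

a = rng.choice([-1.0, 1.0], size=n)
b = rng.choice([-1.0, 1.0], size=n)
c = rng.choice([-1.0, 1.0], size=n)

flat = rng.choice(n ** 3, size=m, replace=False)
i, j, k = np.unravel_index(flat, (n, n, n))
sigma = rng.choice([-1.0, 1.0], size=m)

# <Z, a⊗b⊗c> = sum over observed locations of sigma_l * a_i b_j c_k
inner = np.sum(sigma * a[i] * b[j] * c[k])
print(abs(inner), np.sqrt(m))  # |inner| is O(sqrt(m)) with high probability
```

Taking a union bound over the |S|³ net points then gives the M = O(C³√(mn)) bound used in the proof.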
The important point is that the Rademacher complexity of the tensor nuclear norm is o(1) whenever
m = ω(n). In the next subsection we will connect this to refutation in a way that allows us to strengthen
known hardness results for computing the tensor nuclear norm [39, 41] and show that it is even hard to
compute in an average-case sense based on some standard conjectures about the difficulty of refuting random
3-SAT.
The right hand side is exactly alg(φ) and is 1/2 + o(1) with high probability, which implies that both
conditions in the definition of strong refutation hold, and this completes the proof.
We can now combine Theorem 2.11 with the bound on the Rademacher complexity of the tensor nuclear
norm given in Lemma 2.8 to conclude that if we could compute the tensor nuclear norm we would also obtain
an algorithm for strongly refuting random 3-XOR with only m = Ω(n log n) clauses. It is not obvious, but
it turns out that any algorithm for strongly refuting random 3-XOR implies one for 3-SAT. Let us define
strong refutation for 3-SAT. We will refer to any variable v_i or its negation v̄_i as a literal. We will use the
term random 3-SAT formula to refer to a formula where each clause is generated by choosing an ordered
triple of literals (y_i, y_j, y_k) uniformly at random (and without replacement) and setting y_i ∨ y_j ∨ y_k = 1.
Definition 2.12. An algorithm for strongly refuting random 3-SAT takes as input a 3-SAT formula φ and
outputs a quantity alg(φ) that satisfies

(1) opt(φ) ≤ alg(φ) for any 3-SAT formula φ, and

(2) if φ is a random 3-SAT formula, then alg(φ) = 7/8 + o(1) with probability 1 − o(1),

where opt(φ) denotes the maximum fraction of clauses of φ that can be satisfied simultaneously.
Corollary 2.13. Suppose that ‖ · ‖_K is computable in polynomial time and satisfies ‖X‖_K ≤ 1 whenever
X = a ⊗ a ⊗ a and a is a vector with ±1 entries. Suppose further that for any X with ‖X‖_K ≤ 1 its entries
are bounded by C³ in absolute value and that R^m(‖ · ‖_K) = o(1). Then there is a polynomial time algorithm
for strongly refuting a random 3-SAT formula with O(C⁶ m log n) clauses.
Now we can get a better understanding of the obstacles to noisy tensor completion by connecting it to the
literature on refuting random 3-SAT. Despite a long line of work on refuting random 3-SAT [37, 32, 31, 30, 25],
there is no known polynomial time algorithm that works with m = n^{3/2−ϵ} clauses for any ϵ > 0. Feige [29]
conjectured that for any constant C, there is no polynomial time algorithm for refuting random 3-SAT with
m = Cn clauses⁵. Daniely et al. [26] conjectured that there is no polynomial time algorithm for m = n^{3/2−ϵ}
for any ϵ > 0. What we have shown above is that any norm that is a relaxation of the tensor nuclear
norm, can be computed in polynomial time, and has Rademacher complexity R^m(‖ · ‖_K) = o(1) for
m = n^{3/2−ϵ} would disprove the conjecture of Daniely et al. [26] and would yield much better algorithms for
refuting random 3-SAT than we currently know, despite fifteen years of work on the subject.

⁵ In Feige's paper [29] there was no need to make the conjecture any stronger because it was already strong enough for all of
the applications in inapproximability.
(1) Ẽ[1] = 1 (normalization)

The SOS_k norm of X ∈ R^{n₁×n₂×n₃}, which is denoted by ‖X‖_{K_k}, is the infimum over α such that X/α ∈ K_k.
The constraints in Definition 3.1 can be expressed as an O(n^k)-sized semidefinite program. This implies
that given any set of polynomial constraints of the form {p = 0}, {p ≥ 0}, one can efficiently find a degree
k pseudo-distribution satisfying those constraints if one exists. This is often called the degree k Sum-of-
Squares algorithm [69, 62, 53, 63]. Hence we can compute the norm ‖X‖_{K_k} of any tensor X to within
arbitrary accuracy in polynomial time. And because it is a relaxation of the tensor nuclear norm, which is
defined analogously but over a distribution on C-incoherent vectors instead of a pseudo-distribution over
them, we have that ‖X‖_{K_k} ≤ ‖X‖_A for every tensor X. Throughout most of this paper, we will be interested
in the case k = 6.
which we will use repeatedly. If d is even then any degree d pseudo-expectation operator satisfies the
constraint (Ẽ[p])² ≤ Ẽ[p²] for every polynomial p of degree at most d/2 (e.g., see Lemma A.4 in [6]). Hence
the right hand side of (4) can be bounded as:

( Σ_i Ẽ[ Y_i^{(1)} Q_{i,Z}(Y^{(2)}, Y^{(3)}) ] )² ≤ n₁ Σ_i Ẽ[ ( Y_i^{(1)} Q_{i,Z}(Y^{(2)}, Y^{(3)}) )² ]    (5)
It turns out that bounding the right-hand side of (5) boils down to bounding the spectral norm of the
following matrix.
Definition 3.3. Let A be the n₂n₃ × n₂n₃ matrix whose rows and columns are indexed over ordered pairs
(j, k′) and (j′, k) respectively, defined as

A_{(j,k′),(j′,k)} = Σ_i Z_{i,j,k} Z_{i,j′,k′}
We can now make the connection to resolution more explicit: We can think of a pair of observations
Z_{i,j,k}, Z_{i,j′,k′} as a pair of 3-XOR constraints, as usual. Resolving them (i.e. multiplying them) we obtain a
4-XOR constraint

x_j · x_k · x_{j′} · x_{k′} = Z_{i,j,k} Z_{i,j′,k′}

A captures the effect of resolving certain pairs of 3-XOR constraints into 4-XOR constraints. The challenge
is that the entries in A are not independent, so bounding its maximum singular value will require some care.
It is important that the rows of A are indexed by (j, k′) and the columns are indexed by (j′, k), so that j
and j′ come from different 3-XOR clauses, as do k and k′; otherwise the spectral bounds that we will
want to prove about A would simply not be true! This is perhaps the key insight in [25].
It will be more convenient to decompose A and reason about its two types of contributions separately.
To that end, we let R be the n₂n₃ × n₂n₃ matrix whose non-zero entries are of the form

R_{(j,k),(j,k)} = Σ_i Z_{i,j,k} Z_{i,j,k}

and all of its other entries are set to zero. Then let B be the n₂n₃ × n₂n₃ matrix whose entries are of the
form

B_{(j,k′),(j′,k)} = 0 if j = j′ and k = k′, and B_{(j,k′),(j′,k)} = Σ_i Z_{i,j,k} Z_{i,j′,k′} otherwise.
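As a quick sanity check (an illustrative sketch, not from the paper), one can build A, R and B from a random sign tensor Z and confirm that, in the (j, k′)/(j′, k) indexing, R is exactly the diagonal of A and A = R + B:

```python
import numpy as np

# Build A (Definition 3.3) and the decomposition A = R + B from a
# random sparse sign tensor Z. Sizes are illustrative.
rng = np.random.default_rng(4)
n1, n2, n3, m = 6, 7, 8, 60

Z = np.zeros((n1, n2, n3))
flat = rng.choice(n1 * n2 * n3, size=m, replace=False)
i, j, k = np.unravel_index(flat, (n1, n2, n3))
Z[i, j, k] = rng.choice([-1.0, 1.0], size=m)

# A[(j,k'),(j',k)] = sum_i Z[i,j,k] * Z[i,j',k']
A = np.zeros((n2 * n3, n2 * n3))
for jj in range(n2):          # j  (row, first coordinate)
    for kp in range(n3):      # k' (row, second coordinate)
        for jp in range(n2):  # j' (col, first coordinate)
            for kk in range(n3):  # k (col, second coordinate)
                A[jj * n3 + kp, jp * n3 + kk] = np.dot(Z[:, jj, kk], Z[:, jp, kp])

# A diagonal position (row = col) forces j = j' and k = k', which is
# exactly where R lives; B is everything else, so A = R + B.
R = np.diag(np.diag(A))
B = A - R
print(np.allclose(A, R + B))  # True
```

Note that A is symmetric in this indexing (swapping the roles of the two clauses gives the same product), while the factor-wise structure of B is what the spectral argument in Section 4 exploits.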
Proof. The pseudo-expectation operator satisfies the constraints {(Y_i^{(1)})² ≤ C²} for all i, and hence we have

Σ_i Ẽ[ ( Y_i^{(1)} Q_{i,Z}(Y^{(2)}, Y^{(3)}) )² ] ≤ C² Σ_i Ẽ[ Q_{i,Z}(Y^{(2)}, Y^{(3)})² ] = C² Σ_i Σ_{j,k,j′,k′} Z_{i,j,k} Z_{i,j′,k′} Ẽ[ Y_j^{(2)} Y_k^{(3)} Y_{j′}^{(2)} Y_{k′}^{(3)} ]
Now let Y^{(2)} ∈ R^{n₂} be a vector of variables whose jth entry is Y_j^{(2)}, and similarly for Y^{(3)}. Then we can
re-write the right hand side as a matrix inner-product:

C² Σ_i Σ_{j,k,j′,k′} Z_{i,j,k} Z_{i,j′,k′} Ẽ[ Y_j^{(2)} Y_k^{(3)} Y_{j′}^{(2)} Y_{k′}^{(3)} ] = C² ⟨B, Ẽ[(Y^{(2)} ⊗ Y^{(3)})(Y^{(2)} ⊗ Y^{(3)})ᵀ]⟩ + C² ⟨R, Ẽ[(Y^{(2)} ⊗ Y^{(3)})(Y^{(2)} ⊗ Y^{(3)})ᵀ]⟩

and Tr( Ẽ[(Y^{(2)} ⊗ Y^{(3)})(Y^{(2)} ⊗ Y^{(3)})ᵀ] ) = n₂n₃, where the last equality follows because the pseudo-expectation
operator satisfies the constraints { Σ_{i=1}^{n₂} (Y_i^{(2)})² = n₂ } and { Σ_{i=1}^{n₃} (Y_i^{(3)})² = n₃ }.

Hence we can bound the contribution of the first term as C² ⟨B, Ẽ[(Y^{(2)} ⊗ Y^{(3)})(Y^{(2)} ⊗ Y^{(3)})ᵀ]⟩ ≤
C² n₂n₃ ‖B‖. Now we proceed to bound the contribution of the second term:
Claim 3.6. Ẽ[ (Y_j^{(2)})² (Y_k^{(3)})² ] ≤ C⁴

Proof. It is easy to verify by direct computation that the following equality holds:

C⁴ − (Y_j^{(2)})² (Y_k^{(3)})² = ( C² − (Y_j^{(2)})² )( C² − (Y_k^{(3)})² ) + ( C² − (Y_k^{(3)})² )(Y_j^{(2)})² + ( C² − (Y_j^{(2)})² )(Y_k^{(3)})²

Moreover the pseudo-expectation of each of the three terms above is nonnegative, by construction. This
implies the claim.
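The polynomial identity in the proof can be checked symbolically (an illustrative sketch using sympy, not part of the paper):

```python
import sympy as sp

# Verify the identity behind Claim 3.6:
# C^4 - y^2 z^2 = (C^2 - y^2)(C^2 - z^2) + (C^2 - z^2) y^2 + (C^2 - y^2) z^2
C, y, z = sp.symbols('C y z')

lhs = C**4 - y**2 * z**2
rhs = (C**2 - y**2) * (C**2 - z**2) + (C**2 - z**2) * y**2 + (C**2 - y**2) * z**2

print(sp.expand(lhs - rhs))  # 0
```

Each summand on the right is a product of expressions that the pseudo-distribution certifies to be nonnegative, which is exactly what makes the claim a valid sum-of-squares argument.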
Moreover each entry in Z is in the set {−1, 0, +1} and there are precisely m non-zeros. Thus the sum of
the absolute values of all entries in R is at most m. Now we have:

C² ⟨R, Ẽ[(Y^{(2)} ⊗ Y^{(3)})(Y^{(2)} ⊗ Y^{(3)})ᵀ]⟩ ≤ C² Σ_{j,k} R_{(j,k),(j,k)} Ẽ[ (Y_j^{(2)})² (Y_k^{(3)})² ] ≤ C⁶ m
4 Spectral Bounds

Recall the definition of B given in the previous section. In fact, for our spectral bounds it will be more
convenient to relabel the variables (but keeping the definition intact):

B_{(j,k),(j′,k′)} = 0 if j = j′ and k = k′, and B_{(j,k),(j′,k′)} = Σ_i Z_{i,j,k′} Z_{i,j′,k} otherwise.
Z_{i,j′,k} if (i, j′, k) ∈ T_r and zero otherwise. Also let E_{i,j,j′,k,k′,r} be the event that there is no r′ < r where
the pair is already covered (i.e. both U^{r′}_{i,j,k′} and V^{r′}_{i,j′,k} are non-zero). Then set

B^r_{(j,k),(j′,k′)} = Σ_i U^r_{i,j,k′} V^r_{i,j′,k} 1_E

where 1_E is short-hand for the indicator function of the event E_{i,j,j′,k,k′,r}. The idea behind this construction
is that each pair of triples (i, j, k′) and (i, j′, k) that contributes to B will contribute to some B^r with high
probability. Moreover it will not contribute to any later matrix in the ensemble. Hence with high probability

B = Σ_{r=1}^{O(log n)} B^r
Throughout the rest of this section, we will suppress the superscript r and work with a particular matrix
in the ensemble, B. Now let ℓ be even and consider

Tr( BBᵀ BBᵀ ⋯ BBᵀ )

with ℓ factors of B and Bᵀ in total.
As is standard, we are interested in bounding E[Tr(BBᵀBBᵀ⋯BBᵀ)] in order to bound ‖B‖. But note that
B is not symmetric. Also note that the random variables U and V are not independent; however, whether or
not they are non-zero is non-positively correlated and their signs are mutually independent. Expanding the
trace above we have

Tr(BBᵀBBᵀ⋯BBᵀ) = Σ_{j₁,k₁} Σ_{j₂,k₂} ⋯ Σ_{j_ℓ,k_ℓ} B_{j₁,k₁,j₂,k₂} B_{j₃,k₃,j₂,k₂} ⋯ B_{j₁,k₁,j_ℓ,k_ℓ}

= Σ_{j₁,k₁} Σ_{i₁} Σ_{j₂,k₂} Σ_{i₂} ⋯ Σ_{j_ℓ,k_ℓ} Σ_{i_ℓ} U_{i₁,j₁,k₂} V_{i₁,j₂,k₁} 1_{E₁} U_{i₂,j₃,k₂} V_{i₂,j₂,k₃} 1_{E₂} ⋯ U_{i_ℓ,j₁,k_ℓ} V_{i_ℓ,j_ℓ,k₁} 1_{E_ℓ}

where 1_{E₁} is the indicator for the event that the entry B_{j₁,k₁,j₂,k₂} is not covered by an earlier matrix in the
ensemble, and similarly for 1_{E₂}, ..., 1_{E_ℓ}.
Notice that there are 2ℓ random variables in the above sum (ignoring the indicator variables). Moreover
if any U or V random variable appears an odd number of times, then the contribution of the term to
E[Tr(BBᵀBBᵀ⋯BBᵀ)] is zero. We will give an encoding for each term that has a non-zero contribution, and
we will prove that it is injective.

Fix a particular term in the above sum where each random variable appears an even number of times.
Let s be the number of distinct values for i. Moreover let i₁, i₂, ..., i_s be the order in which these indices first
appear. Now let r₁ʲ denote the number of distinct values for j that appear with i₁ in U terms — i.e. r₁ʲ is the
number of distinct j's that appear as U_{i₁,j,∗}. Let r₁ᵏ denote the number of distinct values for k that appear
with i₁ in U terms — i.e. r₁ᵏ is the number of distinct k's that appear as U_{i₁,∗,k}. Similarly let q₁ʲ denote
the number of distinct values for j that appear with i₁ in V terms — i.e. q₁ʲ is the number of distinct j's
that appear as V_{i₁,j,∗}. And finally let q₁ᵏ denote the number of distinct values for k that appear with i₁ in
V terms — i.e. q₁ᵏ is the number of distinct k's that appear as V_{i₁,∗,k}.
We give our encoding below. It is more convenient to think of the encoding as any way to answer the
following questions about the term.

(a) What is the order i₁, i₂, ..., i_s of the first appearance of each distinct value of i?

(b) For each i that appears, what is the order of each of the distinct values of j's and k's that appear along
with it in U? Similarly, what is the order of each of the distinct values of j's and k's that appear along
with it in V?

(c) At each of the ℓ steps, is each of the indices encountered new or previously visited, and if previously
visited, which earlier value is it?
Let r_j = r₁ʲ + r₂ʲ + ... + r_sʲ and r_k = r₁ᵏ + r₂ᵏ + ... + r_sᵏ. Similarly let q_j = q₁ʲ + q₂ʲ + ... + q_sʲ and q_k = q₁ᵏ + q₂ᵏ + ... + q_sᵏ.
Then the number of possible answers to (a) and (b) is at most n₁ˢ and n₂^{r_j} n₃^{r_k} n₂^{q_j} n₃^{q_k} respectively. It is also easy
to see that the number of answers to (c) that arise over the sequence of ℓ steps is at most 8^ℓ ( s (r_j + r_k)(q_j + q_k) )^ℓ.
We remark that much of the work on bounding the maximum eigenvalue of a random matrix is in removing
any ℓ^ℓ type terms, and so one needs to encode re-visiting indices more compactly. However such terms will
only cost us polylogarithmic factors in our bound on ‖B‖.
It is easy to see that this encoding is injective, since given the answers to the above questions one can
simulate each step and recover the sequence of random variables. Next we establish some easy facts that
allow us to bound E[Tr(BBᵀBBᵀ⋯BBᵀ)].
Claim 4.1. For any term that has a non-zero contribution to E[Tr(BBᵀBBᵀ⋯BBᵀ)], we must have s ≤ ℓ/2
and r_j + q_j + r_k + q_k ≤ ℓ.

Proof. Recall that there are 2ℓ random variables in the product, and precisely ℓ of them correspond to U
variables and ℓ of them to V variables. Suppose that s > ℓ/2. Then there must be at least one U variable
and at least one V variable that occur exactly once, which implies that its expectation is zero because the
signs of the non-zero entries are mutually independent. Similarly suppose r_j + q_j + r_k + q_k > ℓ. Then there
must be at least one U or V variable that occurs exactly once, which also implies that its expectation is
zero.
Claim 4.2. For any valid encoding, s ≤ r_j + q_j and s ≤ r_k + q_k.

Proof. This holds because in each step where the i variable is new and has not been visited before, by
definition the j variable is new too (for the current i) and similarly for the k variable.
Finally, if s, r_j, q_j, r_k and q_k are defined as above then for any contributing term

U_{i₁,j₁,k₂} V_{i₁,j₂,k₁} U_{i₂,j₃,k₂} V_{i₂,j₂,k₃} ⋯ U_{i_ℓ,j₁,k_ℓ} V_{i_ℓ,j_ℓ,k₁}

its expectation is at most p^{r_j+r_k} p^{q_j+q_k} where p = m/(n₁n₂n₃), because there are exactly r_j + r_k distinct U
variables and q_j + q_k distinct V variables whose values are in the set {−1, 0, +1}, and whether or not a
variable is non-zero is non-positively correlated and the signs are mutually independent.
This now implies the main lemma:

Lemma 4.3. E[Tr(BBᵀBBᵀ⋯BBᵀ)] ≤ n₁^{ℓ/2} (max(n₂, n₃))^ℓ p^ℓ (ℓ)^{3ℓ+3}

Proof. Note that the indicator variables only have the effect of zeroing out some terms that could otherwise
contribute to E[Tr(BBᵀBBᵀ⋯BBᵀ)]. Returning to the task at hand, we have

E[Tr(BBᵀBBᵀ⋯BBᵀ)] ≤ Σ_{s,r_j,r_k,q_j,q_k} n₁ˢ n₂^{r_j} n₃^{r_k} n₂^{q_j} n₃^{q_k} p^{r_j+r_k} p^{q_j+q_k} 8^ℓ ( s (r_j + r_k)(q_j + q_k) )^ℓ

where the sum is over all valid tuples s, r_j, r_k, q_j, q_k, and hence s, r_j, r_k, q_j, q_k ≤ ℓ/2 and s ≤ r_j + q_j and
s ≤ r_k + q_k using Claim 4.1 and Claim 4.2. We can upper bound the above as

E[Tr(BBᵀBBᵀ⋯BBᵀ)] ≤ Σ_{s,r_j,r_k,q_j,q_k} n₁ˢ (p n₂)^{r_j+q_j} (p n₃)^{r_k+q_k} (ℓ)^{3ℓ+3}

≤ Σ_{s,r_j,r_k,q_j,q_k} n₁ˢ ( p max(n₂, n₃) )^{r_j+q_j+r_k+q_k} (ℓ)^{3ℓ+3}

Now if p max(n₂, n₃) ≤ 1 then using Claim 4.2 followed by the first half of Claim 4.1 we have:

E[Tr(BBᵀBBᵀ⋯BBᵀ)] ≤ n₁ˢ ( p max(n₂, n₃) )^{2s} (ℓ)^{3ℓ+3} ≤ n₁^{ℓ/2} ( p max(n₂, n₃) )^ℓ (ℓ)^{3ℓ+3}
Pr[ ‖B‖ ≥ n₁^{1/2} max(n₂, n₃) p (2ℓ)³ ] = Pr[ ‖B‖^ℓ ≥ ( n₁^{1/2} max(n₂, n₃) p (2ℓ)³ )^ℓ ] ≤ E[Tr(BBᵀBBᵀ⋯BBᵀ)] / ( n₁^{ℓ/2} max(n₂, n₃)^ℓ p^ℓ (2ℓ)^{3ℓ} ) ≤ ℓ³ / 2^{3ℓ}

and hence setting ℓ = Θ(log n) we conclude that ‖B‖ ≤ 8 n₁^{1/2} max(n₂, n₃) p log³ n holds with high probability.
Moreover B = Σ_{r=1}^{O(log n)} B^r also holds with high probability. If this equality holds and each B^r satisfies
‖B^r‖ ≤ 8 n₁^{1/2} max(n₂, n₃) p log³ n, we have

‖B‖ ≤ max_r O( ‖B^r‖ log n ) = O( m log⁴ n / ( n₁^{1/2} min(n₂, n₃) ) )

where we have used the fact that p = m/(n₁n₂n₃). This completes the proof of the theorem.
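As a rough numerical illustration (a sketch, not from the paper), one can build B from a random sign tensor and compare its spectral norm to the m/(√n₁ min(n₂, n₃)) scale in the bound, ignoring the polylog factor:

```python
import numpy as np

# Empirically compare ||B|| to the scale m / (sqrt(n1) * min(n2, n3))
# from the spectral bound (up to polylog factors). Sizes illustrative;
# the ensemble/indicator bookkeeping of the proof is omitted.
rng = np.random.default_rng(5)
n1 = n2 = n3 = 12
m = 400

Z = np.zeros((n1, n2, n3))
flat = rng.choice(n1 * n2 * n3, size=m, replace=False)
i, j, k = np.unravel_index(flat, (n1, n2, n3))
Z[i, j, k] = rng.choice([-1.0, 1.0], size=m)

# B[(j,k),(j',k')] = sum_i Z[i,j,k'] * Z[i,j',k], zero when (j,k)=(j',k')
B = np.einsum('iab,icd->adcb', Z, Z).reshape(n2 * n3, n2 * n3)
np.fill_diagonal(B, 0.0)

norm_B = np.linalg.norm(B, 2)  # largest singular value
scale = m / (np.sqrt(n1) * min(n2, n3))
print(norm_B, scale)  # same order of magnitude, up to log factors
```

The observed ratio between the two quantities stays polylogarithmic as the sizes grow, which is the content of the theorem.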
Proof. Consider any X with ‖X‖_{K₆} ≤ 1. Then using Lemma 3.4 and Theorem 4.4 we have

⟨Z, X⟩² ≤ n₁ Σ_i ( Σ_{j,k} Z_{i,j,k} X_{i,j,k} )² ≤ C² n₁ n₂ n₃ ‖B‖ + C⁶ m n₁ = O( m n₁^{1/2} max(n₂, n₃) log⁴ n + m n₁ )

Recall that Z was defined in Definition 2.5. The Rademacher complexity can now be bounded as

(1/m) ⟨Z, X⟩ ≤ O( √( (n₁)^{1/2} (n₂ + n₃) log⁴ n / m ) )
We can now invoke Theorem 1.1, which guarantees that the hypothesis X that results from solving (2)
satisfies err(X) = o(1/log n) with probability 1 − o(1) provided that m = Ω̃(n^{3/2} r). This bound on the error
immediately implies that |R′| = o(n₁n₂n₃) and so |R \ R′| = (1 − o(1)) n₁n₂n₃. This completes the proof of
the corollary.
u_S ≡ v_g − v_f
Ẽ[p²] = Σ_{S,T} c_S c_T ⟨u_∅, u_{S∆T}⟩ = Σ_{S,T} c_S c_T ⟨u_S, u_T⟩ = ‖ Σ_S c_S u_S ‖² ≥ 0
Theorem 5.5. [38, 68] Let φ be a random 3-XOR formula on n variables with m = n^{3/2−ϵ} clauses. Then
for any ϵ > 0 and any c < 2, the k = Ω(n^c) round Lasserre hierarchy given in Definition 5.1 permits a
feasible solution, with probability 1 − o(1).

Note that the constant in the Ω(·) depends on ϵ and c. Then using the above reductions, we have the
following as an immediate corollary:

Corollary 5.6. For any ϵ > 0 and any c < 2 and k = Ω(n^c), if m = n^{3/2−ϵ} then the Rademacher complexity
satisfies R^m(‖ · ‖_{K_k}) = 1 − o(1).
Thus there is a sharp phase transition (as a function of the number of observations) in the Rademacher
complexity of the norms derived from the sum-of-squares hierarchy. At level six, R^m(‖ · ‖_{K₆}) = o(1) whenever
m = ω(n^{3/2} log⁴ n). In contrast, R^m(‖ · ‖_{K_k}) = 1 − o(1) when m = n^{3/2−ϵ}, even for very strong relaxations
derived from nearly n² rounds of the sum-of-squares hierarchy. These norms require time 2^{n^{Ω(1)}} to compute
but still achieve essentially no better bounds on their Rademacher complexity.
References
[1] S. Allen, R. O’Donnell and D. Witmer. How to refute a random CSP. FOCS 2015, to appear.
[2] A. Anandkumar, D. Foster, D. Hsu, S. Kakade, Y. Liu. A spectral algorithm for latent Dirichlet
allocation. NIPS, pages 926–934, 2012.
[3] A. Anandkumar, R. Ge, D. Hsu and S. Kakade. A tensor spectral approach to learning mixed member-
ship community models. COLT, pages 867–881, 2013.
[4] A. Anandkumar, D. Hsu and S. Kakade. A method of moments for mixture models and hidden Markov
models. COLT, pages 1–33, 2012.
[5] N. Balcan. Machine Learning Theory Notes. http://www.cc.gatech.edu/~ninamf/ML11/lect1115.pdf
[6] B. Barak, F. Brandao, A. Harrow, J. Kelner, D. Steurer and Y. Zhou. Hypercontractivity, sum-of-
squares proofs, and their applications. STOC, pages 307–326, 2012.
[7] B. Barak, J. Kelner and D. Steurer. Rounding sum-of-squares relaxations. STOC, pages 31–40, 2014.
[8] B. Barak, J. Kelner and D. Steurer. Dictionary learning and tensor decomposition via the sum-of-squares
method. STOC, pages 143–151, 2015.
[9] B. Barak, G. Kindler and D. Steurer. On the optimality of semidefinite relaxations for average-case and
generalized constraint satisfaction. ITCS, pages 197–214, 2013.
[10] B. Barak and D. Steurer. Sum-of-squares proofs and the quest toward optimal algorithms. Proceedings
of the ICM, 2014.
[11] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural
results. Journal of Machine Learning Research, 3:463–482, 2003.
[12] Q. Berthet and P. Rigollet. Computational lower bounds for sparse principal component detection.
COLT, pages 1046–1066, 2013.
[13] A. Bhaskara, M. Charikar, A. Moitra and A. Vijayaraghavan. Smoothed analysis of tensor decomposi-
tions. STOC, pages 594–603, 2014.
[14] S. Bhojanapalli and S. Sanghavi. A new sampling technique for tensors. arXiv:1502.05023
[15] E. Candes, Y. Eldar, T. Strohmer and V. Voroninski. Phase retrieval via matrix completion. SIAM
Journal on Imaging Sciences, 6(1):199–225, 2013.
[16] E. Candes and C. Fernandez-Granda. Towards a mathematical theory of super-resolution. Communi-
cations on Pure and Applied Mathematics, 67(6):906–956, 2014.
[17] E. Candes, X. Li, Y. Ma and J. Wright. Robust principal component analysis? Journal of the ACM,
58(3):1–37, 2011.
[18] E. Candes and Y. Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
[19] E. Candes and B. Recht. Exact matrix completion via convex optimization. Foundations of Computa-
tional Math., 9(6):717–772, 2008.
[20] E. Candes and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE
Transactions on Information Theory, 56(5):2053–2080, 2010.
[22] V. Chandrasekaran and M. Jordan. Computational and statistical tradeoffs via convex relaxation.
Proceedings of the National Academy of Sciences, 110(13):E1181–E1190, 2013.
[23] V. Chandrasekaran, B. Recht, P. Parrilo and A. Willsky. The convex geometry of linear inverse problems.
Foundations of Computational Math., 12(6):805–849, 2012.
[24] Y. Chen, S. Bhojanapalli, S. Sanghavi and R. Ward. Coherent matrix completion. ICML, pages 674–682,
2014.
[25] A. Coja-Oghlan, A. Goerdt and A. Lanka. Strong refutation heuristics for random k-SAT. Combina-
torics, Probability and Computing, 16(1):5–28, 2007.
[26] A. Daniely, N. Linial and S. Shalev-Shwartz. More data speeds up training time in learning half spaces
over sparse vectors. NIPS, pages 145–153, 2013.
[27] A. Daniely, N. Linial and S. Shalev-Shwartz. From average case complexity to improper learning
complexity. STOC, pages 441–448, 2014.
[28] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.
[29] U. Feige. Relations between average case complexity and approximation complexity. STOC, pages
534–543, 2002.
[30] U. Feige, J.H. Kim and E. Ofek. Witnesses for non-satisfiability of dense random 3CNF formulas. In
Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages
497–508, 2006.
[31] U. Feige and E. Ofek. Easily refutable subformulas of large random 3-CNF formulas. Theory of Com-
puting 3:25–43, 2007.
[32] J. Friedman, A. Goerdt and M. Krivelevich. Recognizing more unsatisfiable random k-SAT instances
efficiently. SIAM Journal on Computing 35(2):408–430, 2005.
[33] J. Friedman, J. Kahn and E. Szemerédi. On the second eigenvalue of random regular graphs. STOC,
pages 534–543, 1989.
[34] Z. Füredi and J. Komlós. The eigenvalues of random symmetric matrices. Combinatorica, 1:233–241,
1981.
[35] S. Gandy, B. Recht and I. Yamada. Tensor completion and low-n-rank tensor recovery via convex
optimization. Inverse Problems, 27(2):1–19, 2011.
[36] R. Ge and T. Ma. Decomposing overcomplete 3rd order tensors using sum-of-squares algorithms.
RANDOM, pages 829–849, 2015.
[37] A. Goerdt and M. Krivelevich. Efficient recognition of random unsatisfiable k-SAT instances by spectral
methods. In Annual Symposium on Theoretical Aspects of Computer Science, pages 294–304, 2001.
[38] D. Grigoriev. Linear lower bound on degrees of Positivstellensatz calculus proofs for the parity. Theo-
retical Computer Science 259(1-2):613–622, 2001.
[39] L. Gurvits. Classical deterministic complexity of Edmonds’ problem and quantum entanglement. STOC,
pages 10–19, 2003.
[40] M. Hardt. Understanding alternating minimization for matrix completion. FOCS, pages 651–660, 2014.
[41] A. Harrow and A. Montanaro. Testing product states, quantum Merlin-Arthur games and tensor opti-
mization. Journal of the ACM, 60(1):1–43, 2013.
[66] B. Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–
3430, 2011.
[67] B. Recht, M. Fazel and P. Parrilo. Guaranteed minimum rank solutions of matrix equations via nuclear
norm minimization. SIAM Review, 52(3):471–501, 2010.
[68] G. Schoenebeck. Linear level Lasserre lower bounds for certain k-CSPs. FOCS, pages 593–602, 2008.
[71] N. Srebro and A. Shraibman. Rank, trace-norm and max-norm. COLT, pages 545–560, 2005.
[72] G. Tang, B. Bhaskar and B. Recht. Compressed sensing off the grid. IEEE Transactions on Information
Theory, 59(11):7465–7490, 2013.
[73] R. Tomioka, K. Hayashi and H. Kashima. Estimation of low-rank tensors via convex optimization.
arXiv:1010.0789, 2011.
[74] M. Yuan and C.H. Zhang. On tensor completion via nuclear norm minimization. Foundations of
Computational Mathematics, to appear.
S = [ 0   Mᵀ ]
    [ M   0  ].
We have not precisely defined the notion of incoherence that is used in the matrix completion literature, but
it turns out to be easy to see that S is low rank and incoherent as well.
The important point is that given m samples generated uniformly at random from M, we can generate
random samples from S too. It will be more convenient to think of these random samples as being generated
with replacement, but this reduction works just as well without replacement too. Let M ∈ R^{n₁×n₂}. Now
for each sample from S, with probability p = (n₁² + n₂²)/(n₁ + n₂)² we reveal a uniformly random entry in
one of the two diagonal blocks of zeros. And with probability 1 − p we reveal a uniformly random entry from M.
Each entry in M appears exactly twice in S, and we choose to reveal this entry of M with probability 1/2 from
the top-right block, and otherwise from the bottom-left block. Thus given m samples from M, we can generate
m samples from S (in fact we can generate even more, because some of the revealed entries will be zeros). It is
easy to see that this approach works for the case of sampling without replacement too, in that m samples
without replacement from M can be used to generate at least m samples without replacement from S.
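A minimal sketch (illustrative, not from the paper) of the symmetric embedding itself:

```python
import numpy as np

# Embed an asymmetric matrix M into the symmetric matrix
# S = [[0, M^T], [M, 0]]; S stays low rank.
rng = np.random.default_rng(6)
n1, n2, r = 8, 5, 2

M = rng.standard_normal((n1, r)) @ rng.standard_normal((r, n2))
S = np.block([
    [np.zeros((n2, n2)), M.T],
    [M, np.zeros((n1, n1))],
])

print(np.allclose(S, S.T))       # True: S is symmetric
print(np.linalg.matrix_rank(S))  # 2 * rank(M) for generic M
```

The singular values of S are those of M, each repeated twice, which is why rank doubles and why incoherence of M carries over to S.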
Now let us proceed to the tensor case. Let us introduce the following definition, for ease of notation:
Definition A.1. Let m(n, r, ε, f, C) be such that there is an algorithm that, on a rank-r, order-d, size n × n × ... × n symmetric tensor where each factor has norm at most C, returns an estimate X with err(X) ≤ f with probability 1 − ε when it is given m(n, r, ε, f, C) samples chosen uniformly at random (and without replacement).
where each factor is unit norm. There is an algorithm that, with probability at least 1 − ε, returns an estimate Y with

err(Y) \le \frac{\left(\sum_{j=1}^{d} n_j\right)^{d}}{d!\,2^{d-1} \prod_{j=1}^{d} n_j}\, f.
Proof. Our goal is to symmetrize an asymmetric tensor in such a way that each entry of the symmetrized tensor is either zero or corresponds to an entry of the original tensor. Our reduction works for any odd order d. In particular, let

T = \sum_{i=1}^{r} a_i^1 \otimes a_i^2 \otimes \dots \otimes a_i^d

be an order-d tensor where the dimension of a_i^j is n_j. Also let n = \sum_{j=1}^{d} n_j. We will construct a symmetric order-d tensor as follows. Let σ_1, σ_2, ..., σ_d be a collection of d random ±1 variables chosen uniformly at random from the 2^{d-1} configurations where \prod_{j=1}^{d} σ_j = 1. Then we consider the following random vector

a_i(σ_1, σ_2, \dots, σ_d) = [σ_1 a_i^1, σ_2 a_i^2, \dots, σ_d a_i^d].

Here a_i(σ_1, σ_2, ..., σ_d) is the n-dimensional vector that results from concatenating the vectors a_i^1, a_i^2, ..., a_i^d after flipping some of their signs according to σ_1, σ_2, ..., σ_d. Then we set

S = \sum_{i=1}^{r} \mathop{\mathbb{E}}_{σ_1, σ_2, \dots, σ_d}\left[ a_i(σ_1, σ_2, \dots, σ_d)^{\otimes d} \right].
It is immediate that S is symmetric and has rank at most 2^{d-1} r, by expanding the expectation into a sum over the valid sign configurations. Moreover, each rank-one term in the decomposition is of the form a^{\otimes d} where ‖a‖_2^2 = d, because a is the concatenation of d unit vectors.
For each fixed i, each entry of a_i(σ_1, ..., σ_d)^{\otimes d} is a degree-d monomial in the σ_j variables. By our construction of the σ_j variables, and because d is odd so there are no terms where every variable appears to an even power, all terms vanish in expectation except those containing a factor of \prod_{j=1}^{d} σ_j; these are exactly the terms corresponding to some permutation π : [d] → [d], of the form

\sum_{i=1}^{r} a_i^{π(1)} \otimes a_i^{π(2)} \otimes \dots \otimes a_i^{π(d)}.
Hence all of the entries in S are either zero or are 2^{d-1} times an entry in T. As before, we can generate m uniformly random samples from S given m uniformly random samples from T, by simply choosing to sample an entry from one of the blocks of zeros with the appropriate probability, or else revealing an entry of T and choosing where in S to reveal this entry uniformly at random. Hence:
\frac{1}{\left(\sum_{j=1}^{d} n_j\right)^{d}} \sum_{(i_1, i_2, \dots, i_d) \in \Gamma} |Y_{i_1, i_2, \dots, i_d} - S_{i_1, i_2, \dots, i_d}| \;\le\; \frac{1}{\left(\sum_{j=1}^{d} n_j\right)^{d}} \sum_{i_1, i_2, \dots, i_d} |Y_{i_1, i_2, \dots, i_d} - S_{i_1, i_2, \dots, i_d}|,

where Γ denotes the locations in S where an entry of T appears. The right-hand side above is at most f with probability 1 − ε. Moreover, each entry of T appears in exactly d! locations in S, and when it does appear it is scaled by 2^{d-1}. Hence if we multiply the left-hand side by

\frac{\left(\sum_{j=1}^{d} n_j\right)^{d}}{d!\,2^{d-1} \prod_{j=1}^{d} n_j}

we obtain err(Y). This completes the reduction.
Note that in the case where n_1 = n_2 = \dots = n_d, the error and the rank in this reduction increase only by at most an e^d and a 2^d factor, respectively.
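For concreteness, the construction can be checked numerically for d = 3 (a toy sketch with our own choice of dimensions; note we sum over the valid sign configurations rather than averaging, so the nonzero entries of S come out as exactly 2^{d−1} = 4 times entries of T, matching the scaling used above):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
d, r = 3, 2
dims = (2, 3, 4)
n = sum(dims)                   # 9
offs = np.cumsum((0,) + dims)   # block offsets [0, 2, 5, 9]

# Asymmetric rank-r tensor T = sum_i a_i^1 (x) a_i^2 (x) a_i^3.
A = [rng.standard_normal((r, nj)) for nj in dims]
T = np.einsum('ix,iy,iz->xyz', A[0], A[1], A[2])

# Symmetrize: sum over the 2^{d-1} sign patterns with sigma_1*...*sigma_d = 1.
S = np.zeros((n, n, n))
for sigma in product([1, -1], repeat=d):
    if np.prod(sigma) != 1:
        continue
    for i in range(r):
        v = np.concatenate([s * A[j][i] for j, s in enumerate(sigma)])
        S += np.einsum('x,y,z->xyz', v, v, v)
```

S is fully symmetric, its (block-1, block-2, block-3) corner equals 4T, and since each entry of T occupies d! = 6 locations at scale 4, the absolute entries of S sum to 24 times those of T.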
March 9, 2009
Abstract
This paper is concerned with the problem of recovering an unknown matrix from a small
fraction of its entries. This is known as the matrix completion problem, and comes up in a
great number of applications, including the famous Netflix Prize and other similar questions in
collaborative filtering. In general, accurate recovery of a matrix from a small number of entries
is impossible; but the knowledge that the unknown matrix has low rank radically changes this
premise, making the search for solutions meaningful.
This paper presents optimality results quantifying the minimum number of entries needed to
recover a matrix of rank r exactly by any method whatsoever (the information theoretic limit).
More importantly, the paper shows that, under certain incoherence assumptions on the singular
vectors of the matrix, recovery is possible by solving a convenient convex program as soon as the
number of entries is on the order of the information theoretic limit (up to logarithmic factors).
This convex program simply finds, among all matrices consistent with the observed entries, the one with minimum nuclear norm. As an example, we show that on the order of nr log(n) samples are needed to recover a random n × n matrix of rank r by any method, and, to be sure, nuclear norm minimization succeeds as soon as the number of entries is of the form nr polylog(n).
1 Introduction
1.1 Motivation
Imagine we have an n_1 × n_2 array of real numbers and that we are interested in knowing the value of each of the n_1 n_2 entries in this array.¹ Suppose, however, that we only get to see a small number of the entries, so that most of the elements about which we wish information are simply missing. Is it possible from the available entries to guess the many entries that we have not seen? This problem is now known as the matrix completion problem [7], and comes up in a great number of applications, including the famous Netflix Prize and other similar questions in collaborative filtering.

¹Much of the discussion below, as well as our main results, applies also to the case of complex matrix completion, with some minor adjustments in the absolute constants; but for simplicity we restrict attention to the real case.
where σ_1, ..., σ_r ≥ 0 are the singular values, and the singular vectors u_1, ..., u_r ∈ R^{n_1} = R^n and v_1, ..., v_r ∈ R^{n_2} = R^n are two sets of orthonormal vectors, is useful to reveal these degrees of freedom. Informally, the singular values σ_1 ≥ ... ≥ σ_r depend on r degrees of freedom, the left singular vectors u_k on (n − 1) + (n − 2) + ... + (n − r) = nr − r(r + 1)/2 degrees of freedom, and similarly for the right singular vectors v_k. If m < 2nr − r², then no matter which entries are available, recovery is impossible by any method whatsoever. One could attempt to recover the unknown matrix by solving

minimize rank(X)
subject to P_Ω(X) = P_Ω(M). (1.2)
Knowing when this happens is a delicate question which shall be addressed later. For the moment,
note that attempting recovery via (1.2) is not practical as rank minimization is in general an NP-
hard problem for which there are no known algorithms capable of solving problems in practical
time once, say, n ≥ 10.
In [7], it was proved 1) that matrix completion is not as ill-posed as previously thought and
2) that exact matrix completion is possible by convex programming. The authors of [7] proposed
recovering the unknown matrix by solving the nuclear norm minimization problem

minimize ‖X‖_*
subject to P_Ω(X) = P_Ω(M), (1.3)

where the nuclear norm ‖X‖_* of a matrix X is defined as the sum of its singular values,

\|X\|_* := \sum_i \sigma_i(X). (1.4)
(The problem (1.3) is a semidefinite program [11].) They proved that if Ω is sampled uniformly at random among all subsets of cardinality m and M obeys a low coherence condition which we will review later, then with large probability the unique solution to (1.3) is exactly M, provided that the number of samples obeys

m \ge C\, n^{6/5}\, r \log n (1.5)

(to be completely exact, there is a restriction on the range of values that r can take on).
In (1.5), the number of samples per degree of freedom is not logarithmic or polylogarithmic in the dimension, and one would like to know whether better results approaching the nr log n limit are possible. This paper provides a positive answer. In detail, this work develops many useful matrix models for which nuclear norm minimization is guaranteed to succeed as soon as the number of entries is of the form nr polylog(n).
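To give a concrete feel for the completion problem, here is a small numerical sketch. It replaces the semidefinite program (1.3) with a simple rank-projection/imputation heuristic (our own stand-in, not the paper's method) on a synthetic rank-2 matrix with roughly half the entries observed:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 30, 2
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-2 target
mask = rng.random((n, n)) < 0.5                                # observed set Omega

# Alternate between projecting onto rank-r matrices (truncated SVD) and
# re-imposing the observed entries; a heuristic surrogate for (1.3).
X = np.where(mask, M, 0.0)
start_err = np.linalg.norm(X - M) / np.linalg.norm(M)
for _ in range(300):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X = (U[:, :r] * s[:r]) @ Vt[:r]   # nearest rank-r matrix
    X[mask] = M[mask]                 # enforce P_Omega(X) = P_Omega(M)

rel_err = np.linalg.norm(X - M) / np.linalg.norm(M)
```

With this many observations per degree of freedom, the iteration drives the relative error well below that of the zero-filled initialization; it is of course only an illustration, with none of the guarantees proved for nuclear-norm minimization.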
We observe that E interacts well with PU and PV , in particular obeying the identities
PU E = E = EPV ; E ∗ E = PV ; EE ∗ = PU .
One can view E as a sort of matrix-valued “sign pattern” for M (compare (1.7) with (1.1)); it is also closely related to the subgradient ∂‖M‖_* of the nuclear norm at M (see (3.2)).
It is clear that some assumptions on the singular vectors u_i, v_i (or on the spaces U, V) are needed in order to have any hope of efficient matrix completion. For instance, if u_1 and v_1 are Kronecker delta functions at positions i and j respectively, then the singular value σ_1 can only be recovered if one actually samples the (i, j) coordinate, which is only likely if one is sampling a significant fraction of the entire matrix. Thus we need the vectors u_i, v_i to be “spread out” or “incoherent” in some sense. In our arguments, it will be convenient to phrase these incoherence assumptions using the projection matrices P_U, P_V and the sign pattern matrix E. More precisely, our assumptions are as follows.
A1 There exists µ_1 > 0 such that for all pairs (a, a′) ∈ [n_1] × [n_1] and (b, b′) ∈ [n_2] × [n_2],

\left| \langle e_a, P_U e_{a'} \rangle - \frac{r}{n_1} 1_{a=a'} \right| \le \mu_1 \frac{\sqrt{r}}{n_1}, (1.8a)

\left| \langle e_b, P_V e_{b'} \rangle - \frac{r}{n_2} 1_{b=b'} \right| \le \mu_1 \frac{\sqrt{r}}{n_2}. (1.8b)

A2 There exists µ_2 > 0 such that for all (a, b) ∈ [n_1] × [n_2],

|E_{ab}| \le \mu_2 \frac{\sqrt{r}}{\sqrt{n_1 n_2}}. (1.9)
We will say that the matrix M obeys the strong incoherence property with parameter µ if one can take µ_1 and µ_2 both less than or equal to µ. (This property is related to, but slightly different from, the incoherence property, which will be discussed in Section 1.6.1.)
Remark. Our assumptions only involve the singular vectors u1 , . . . , ur , v1 , . . . , vr of M ; the
singular values σ1 , . . . , σr are completely unconstrained. This lack of dependence on the singular
values is a consequence of the geometry of the nuclear norm (and in particular, the fact that the
subgradient ∂kXk∗ of this norm is independent of the singular values, see (3.2)).
Theorem 1.1 (Matrix completion I) Let M ∈ Rn1 ×n2 be a fixed matrix of rank r = O(1)
obeying the strong incoherence property with parameter µ. Write n := max(n1 , n2 ). Suppose we
observe m entries of M with locations sampled uniformly at random. Then there is a positive
numerical constant C such that if
m \ge C\, \mu^4\, n (\log n)^2, (1.10)
then M is the unique solution to (1.3) with probability at least 1 − n−3 . In other words: with high
probability, nuclear-norm minimization recovers all the entries of M with no error.
This result is noteworthy for two reasons. The first is that the matrix model is deterministic
and only needs the strong incoherence assumption. The second is more substantial. Consider the
class of bounded rank matrices obeying µ = O(1). We shall see that no method whatsoever can
recover those matrices unless the number of entries obeys m ≥ c0 n log n for some positive numerical
constant c0 ; this is the information theoretic limit. Thus Theorem 1.1 asserts that exact recovery by
nuclear-norm minimization occurs nearly as soon as it is information theoretically possible. Indeed,
if the number of samples is slightly larger, by a logarithmic factor, than the information theoretic
limit, then (1.3) fills in the missing entries with no error.
We stated Theorem 1.1 for bounded ranks, but our proof gives a result for all values of r. Indeed, the argument will establish that the recovery is exact with high probability provided that

m \ge C\, \mu^4\, n r^2 (\log n)^2.

When r = O(1), this is Theorem 1.1. We will prove a stronger and near-optimal result below (Theorem 1.2), in which we replace the quadratic dependence on r with a linear one:

m \ge C\, \mu^2\, n r \log^6 n. (1.12)

The reason why we state Theorem 1.1 first is that its proof is somewhat simpler than that of Theorem 1.2, and we hope that it will provide the reader with a useful lead-in to the claims and proof of our main result.
1.4 A surprise
We find it unexpected that nuclear-norm minimization works so well, for reasons we now pause to discuss. For simplicity, consider matrices with a strong incoherence parameter µ polylogarithmic in the dimension. We know that for the rank minimization program (1.2) to succeed, or equivalently for the problem to be well posed, the number of samples must exceed a constant times nr log n. However, Theorem 1.2 proves that the convex relaxation is rigorously exact nearly as soon as our problem has a unique low-rank solution. There is a priori no good reason to suspect that convex relaxation might work so well, nor that the gap between what combinatorial and convex optimization can do is this small. In this sense, we find these findings a little unexpected.
The reader will note an analogy with the recent literature on compressed sensing, which shows
that under some conditions, the sparsest solution to an underdetermined system of linear equations
is that with minimum `1 norm.
1. Select r left singular vectors uα(1) , . . . , uα(r) at random with replacement from the first family,
and r right singular vectors vβ(1) , . . . , vβ(r) from the second family, also at random. We do
not require that the β are chosen independently from the α; for instance one could have
β(k) = α(k) for all k ∈ [r].
We emphasize that the only assumptions about the families [u_1, ..., u_n] and [v_1, ..., v_n] are that they have small components. For example, the two families may be identical. Note also that this model allows any kind of dependence between the selected left and right singular vectors; for instance, one may select the same columns so as to obtain a symmetric matrix, as in the case where the two families are the same. Thus, one can think of our model as producing a generic matrix with uniformly bounded singular vectors.
We now show that P_U, P_V and E obey (1.8) and (1.9), with µ_1, µ_2 = O(µ_B \sqrt{\log n}), with large probability. For (1.9), observe that

E = \sum_{k \in [r]} \epsilon_k\, u_{\alpha(k)} v_{\beta(k)}^*,

P\left( \left| \langle P_U e_a, P_U e_{a'} \rangle - 1_{a=a'}\, r/n \right| \ge \lambda \mu_B \frac{\sqrt{r}}{n} \right) \le 2 e^{-\lambda^2/2}.

Taking λ proportional to \sqrt{\log n} and applying the union bound over a, a′ ∈ [n] proves (1.8) with probability at least 1 − n^{-3} (say), with µ_1 = O(µ_B \sqrt{\log n}).
Combining this computation with Theorems 1.1 and 1.2, we have established the following corollary:

Corollary 1.4 (Matrix completion, uniformly bounded model) Let M be a matrix sampled from a uniformly bounded model. Under the hypotheses of Theorem 1.1, if

m \ge C\, \mu_B^2\, n r \log^7 n,

then M is the unique solution to (1.3) with probability at least 1 − n^{-3}. As we shall see below, when r = O(1), it suffices to have

m \ge C\, \mu_B^2\, n \log^2 n.
\langle P_U e_1, P_U e_2 \rangle = r/n.

Obviously, this does not scale like \sqrt{r}/n. Similarly, the sign flip (step 2) is also necessary, as otherwise we could have E = P_U, as in the case where [u_1, ..., u_n] = [v_1, ..., v_n] and the same columns are selected. Here,

\max_a E_{aa} = \max_a \|P_U e_a\|^2 \ge \frac{1}{n} \sum_a \|P_U e_a\|^2 = \frac{r}{n},

which does not scale like \sqrt{r}/n either.
Corollary 1.6 (Matrix completion, random orthogonal model) Let M be a matrix sampled from the random orthogonal model. Under the hypotheses of Theorem 1.1, if

m \ge C\, n r \log^8 n,

then M is the unique solution to (1.3) with probability at least 1 − n^{-3}. The exponent 8 can be lowered to 7 when r ≥ log n, and to 6 when r = O(1).
As mentioned earlier, we have a lower bound m ≥ 2nr − r² for matrix completion, which can be improved to m ≥ C nr log n under reasonable hypotheses on the matrix M. Thus, the hypothesis on m in Corollary 1.6 cannot be substantially improved. However, it is likely that by specializing the proofs of our general results (Theorems 1.1 and 1.2) to this special case, one may be able to improve the power of the logarithm here, though it seems that a substantial effort would be needed to reach the optimal level of nr log n even in the bounded-rank case.
Speaking of logarithmic improvements, we have shown that µ = O(log n), which is sharp since for r = 1 one cannot hope for better estimates. For r much larger than log n, however, one can improve this to µ = O(\sqrt{\log n}). As far as µ_1 is concerned, this is essentially a consequence of the Johnson–Lindenstrauss lemma. For a ≠ a′, write

\langle P_U e_a, P_U e_{a'} \rangle = \frac{1}{4} \left( \|P_U(e_a + e_{a'})\|^2 - \|P_U(e_a - e_{a'})\|^2 \right).

We claim that for each a ≠ a′,

\left| \|P_U(e_a \pm e_{a'})\|^2 - \frac{2r}{n} \right| \le C\, \frac{\sqrt{r \log n}}{n} (1.17)

with probability at least 1 − n^{-5}, say. This inequality is indeed well known. Observe that ‖P_U x‖ has the same distribution as the Euclidean norm of the first r components of a vector uniformly distributed on the (n − 1)-dimensional sphere of radius ‖x‖. Then we have [4]:

P\left( (1-\varepsilon)\sqrt{\tfrac{r}{n}}\, \|x\| \le \|P_U x\| \le (1-\varepsilon)^{-1}\sqrt{\tfrac{r}{n}}\, \|x\| \right) \ge 1 - 2e^{-\varepsilon^2 r/4} - 2e^{-\varepsilon^2 n/4}.

Choosing x = e_a ± e_{a'} and ε = C_0 \sqrt{\log n / r}, and applying the union bound, proves the claim as long as r is sufficiently larger than log n. Finally, since a bound on the diagonal term ‖P_U e_a‖^2 − r/n in (1.8) follows from the same inequality by simply choosing x = e_a, we have µ_1 = O(\sqrt{\log n}). Similar arguments for µ_2 exist, but we forgo the details.
Theorem 1.7 (Lower bound, Bernoulli model) Fix 1 ≤ m, r ≤ n and µ_0 ≥ 1, let 0 < δ < 1/2, and suppose that we do not have the condition

-\log\left(1 - \frac{m}{n^2}\right) \ge \frac{\mu_0 r}{n} \log\left(\frac{n}{2\delta}\right). (1.20)

Then there exist infinitely many pairs of distinct n × n matrices M ≠ M′ of rank at most r and obeying the incoherence property (1.18) with parameter µ_0 such that P_Ω(M) = P_Ω(M′) with probability at least δ. Here, each entry is observed with probability p = m/n² independently from the others.
Clearly, even if one knows the rank and the coherence of a matrix ahead of time, no algorithm can be guaranteed to succeed based on the knowledge of P_Ω(M) only, since there are many candidates consistent with these data. We prove this theorem in Section 2. Informally, Theorem 1.7 asserts that (1.20) is a necessary condition for matrix completion to work with high probability if all we know about the matrix M is that it has rank at most r and the incoherence property with parameter µ_0. When the right-hand side of (1.20) is less than ε < 1, this implies

m \ge (1 - \varepsilon/2)\, \mu_0\, n r \log\frac{n}{2\delta}. (1.21)
Recall that the number of degrees of freedom of a rank-r matrix is 2nr(1 − r/2n). Hence,
to recover an arbitrary rank-r matrix with the incoherence property with parameter µ0 with any
decent probability by any method whatsoever, the minimum number of samples must be about
the number of degrees of freedom times µ0 log n; in other words, the oversampling factor is directly
proportional to the coherence. Since µ0 ≥ 1, this justifies our earlier assertions that nr log n samples
are really needed.
In the Bernoulli model used in Theorem 1.7, the number of entries is a binomial random variable
sharply concentrating around its mean m. There is very little difference between this model and
the uniform model which assumes that Ω is sampled uniformly at random among all subsets of
cardinality m. Results holding for one hold for the other with only very minor adjustments. Because
we are concerned with essential difficulties, not technical ones, we will often prove our results using
the Bernoulli model, and indicate how the results may easily be adapted to the uniform model.
n_1 = n_2 = n.

The results for non-square matrices (with n = max(n_1, n_2)) are proven in exactly the same fashion, but would add more subscripts to a notational system which is already quite complicated, so we leave the details to the interested reader. We will also assume that n ≥ C for some sufficiently large absolute constant C, as our results are vacuous in the regime n = O(1). Throughout, we will always assume that m is at least as large as 2nr; thus m ≥ 2nr.
A variety of norms on matrices X ∈ R^{n×n} will be discussed. The spectral norm (or operator norm) of a matrix is denoted by ‖X‖. The Euclidean inner product between two matrices is defined by the formula

\langle X, Y \rangle := \mathrm{trace}(X^* Y),

and the corresponding Euclidean norm, called the Frobenius norm or Hilbert–Schmidt norm, is denoted

\|X\|_F := \langle X, X \rangle^{1/2} = \Big( \sum_{j=1}^{n} \sigma_j(X)^2 \Big)^{1/2}.

For vectors, we will only consider the usual Euclidean ℓ_2 norm, which we simply write as ‖x‖. Further, we will also manipulate linear transformations which act on the space R^{n×n} of matrices, such as P_Ω, and we will use calligraphic letters for these operators, as in A(X). In particular, the identity operator on this space will be denoted by I : R^{n×n} → R^{n×n}, and should not be confused with the identity matrix I ∈ R^{n×n}. The only norm we will consider for these operators is their spectral norm (the top singular value)
\sum_{a_1, \dots, a_k \in [n]} \prod_{i=1}^{k} f(a_i) = \Big( \sum_{a \in [n]} f(a) \Big)^{k}
is valid both for positive integers k and for k = 0 (and both for non-zero f and for zero f , recalling
of course that 00 = 1). We will refer to sums over the empty tuple as trivial sums to distinguish
them from empty sums.
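The convention can be sanity-checked in a few lines (a throwaway illustration; the function f is an arbitrary choice):

```python
from itertools import product

def lhs(f, n, k):
    """Sum over all tuples (a_1, ..., a_k) in [n]^k of prod_i f(a_i)."""
    total = 0
    for tup in product(range(1, n + 1), repeat=k):
        prod_val = 1
        for a in tup:
            prod_val *= f(a)
        total += prod_val
    return total

def rhs(f, n, k):
    # The k = 0 case relies on Python's 0**0 == 1, matching the text's convention.
    return sum(f(a) for a in range(1, n + 1)) ** k

f = lambda a: a * a - 3
n = 4
```

For k = 0 the left-hand side is a trivial sum over the single empty tuple, whose empty product is 1, agreeing with the right-hand side even for the zero function.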
2 Lower bounds
This section proves Theorem 1.7, which asserts that no method can recover an arbitrary n × n
matrix of rank r and coherence at most µ0 unless the number of random samples obeys (1.20). As
stated in the theorem, we establish lower bounds for the Bernoulli model, which then apply to the
model where exactly m entries are selected uniformly at random, see the Appendix for details.
It may be best to consider a simple example first to understand the main idea behind the proof of Theorem 1.7. Suppose that r = 1 and µ_0 > 1, in which case M = xy^*. For simplicity, suppose that y is fixed, say y = (1, ..., 1), and x is chosen arbitrarily from the cube [1, \sqrt{\mu_0}]^n of R^n. One easily verifies that M obeys the coherence property with parameter µ_0 (and in fact also obeys the strong incoherence property with a comparable parameter). Then to recover M, we need to see at least one entry per row. For instance, if the first row is unsampled, one has no information about the first coordinate x_1 of x other than that it lies in [1, \sqrt{\mu_0}], and so the claim follows in this case by varying x_1 along the infinite set [1, \sqrt{\mu_0}].
Now under the Bernoulli model, the number of observed entries in the first row—and in any fixed row or column—is a binomial random variable with n trials and success probability p. Therefore, the probability π_0 that any given row is unsampled equals π_0 = (1 − p)^n. By independence, the probability that all rows are sampled at least once is (1 − π_0)^n, and any method succeeding with probability greater than 1 − δ would need

(1 − π_0)^n ≥ 1 − δ,

or −nπ_0 ≥ n log(1 − π_0) ≥ log(1 − δ). When δ < 1/2, log(1 − δ) ≥ −2δ, and thus any method would need

\pi_0 \le \frac{2\delta}{n}.

This is the desired conclusion when µ_0 > 1, r = 1.
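The counting argument can be checked numerically (a toy computation with our own choice of n and δ):

```python
import math

n, delta = 1000, 0.1

# Each row of an n x n matrix is missed with probability pi0 = (1-p)^n
# under the Bernoulli(p) model; all rows are hit with probability (1-pi0)^n.
def all_rows_hit(p):
    pi0 = (1.0 - p) ** n
    return (1.0 - pi0) ** n

# Find the smallest per-entry probability p (on a coarse grid) for which
# every row is sampled with probability at least 1 - delta.
p = 0.0
while all_rows_hit(p) < 1.0 - delta:
    p += 1e-4

pi0 = (1.0 - p) ** n
```

At the resulting p, the per-row miss probability indeed obeys π_0 ≤ 2δ/n, and the expected sample count m = p n² is on the order of n log(n/2δ), as (1.21) predicts.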
where the σ_k are drawn arbitrarily from [0, 1] (say), and the singular vectors u_1, ..., u_r are defined as follows:

u_k := \frac{1}{\sqrt{\ell}} \sum_{i \in B_k} e_i, \qquad B_k = \{(k-1)\ell + 1, (k-1)\ell + 2, \dots, k\ell\};
that is to say, uk vanishes everywhere except on a support of ` consecutive indices. Clearly, this
matrix is incoherent with parameter µ0 . Because the supports of the singular vectors are disjoint,
M is a block-diagonal matrix with diagonal blocks of size ` × `. We now argue as before. Recovery
with positive probability is impossible unless we have sampled at least one entry per row of each
diagonal block, since otherwise we would be forced to guess at least one of the σk based on no
information (other than that σk lies in [0, 1]), and the theorem will follow by varying this singular
value. Now the probability π_1 that the first row of the first block—and any fixed row of any fixed block—is unsampled is equal to (1 − p)^ℓ. Therefore, any method succeeding with probability greater than 1 − δ would need

(1 − π_1)^n ≥ 1 − δ,

which implies π_1 ≤ 2δ/n just as before. With π_1 = (1 − p)^ℓ, this gives (1.20) under the Bernoulli model. The second part of the theorem, namely (1.21), follows from the equivalent characterization

m \ge n^2 \left( 1 - e^{-\frac{\mu_0 r}{n} \log(n/2\delta)} \right).
3.1 Duality
We begin by recalling some calculations from [7, Section 3]. From standard duality theory, we know that the correct matrix M ∈ R^{n×n} is a solution to (1.3) if and only if there exists a dual certificate Y ∈ R^{n×n} with the property that P_Ω(Y) is a subgradient of the nuclear norm at M, which we write as

P_\Omega(Y) \in \partial \|M\|_*. (3.1)
\partial \|M\|_* = \left\{ E + W : W \in \mathbb{R}^{n \times n},\ P_U W = 0,\ W P_V = 0,\ \|W\| \le 1 \right\}. (3.2)
There is a more compact way to write (3.2). Let T ⊂ R^{n×n} be the span of matrices of the form u_k y^* and x v_k^*, and let T^⊥ be its orthogonal complement. Let P_T : R^{n×n} → T be the orthogonal projection onto T; one easily verifies the explicit formula. In particular, P_{T^⊥} is a contraction:

\|P_{T^\perp}\| \le 1. (3.5)

In this language, Y is a valid dual certificate when

(a) P_Ω(Y) = Y,
(b) P_T(Y) = E, and
(c) ‖P_{T^⊥}(Y)‖ < 1.
Theorem 3.2 (Rudelson selection estimate) [7, Theorem 4.1] Suppose Ω is sampled according to the Bernoulli model and put n := max(n_1, n_2). Assume that M obeys (1.18). Then there is a numerical constant C_R such that for all β > 1, we have the bound with probability at least 1 − 3n^{-β}, provided that a < 1, where a is the quantity

a := C_R \sqrt{ \frac{\mu_0\, n r\, (\beta \log n)}{m} }. (3.7)
m \ge C_0\, \mu_0\, n r \log n (3.8)

for a suitably large constant C_0. But this follows from the hypotheses in either Theorem 1.1 or Theorem 1.2, for reasons that we now pause to explain. In either of these theorems we have that the operator

P_T P_\Omega P_T : T \to T, \qquad X \mapsto P_T P_\Omega P_T(X)

is invertible, and we denote its inverse by (P_T P_Ω P_T)^{-1} : T → T. Introduce the dual matrix Y ∈ P_Ω(R^{n×n}) ⊂ R^{n×n} defined via

minimize ‖Z‖_F
subject to P_T P_Ω(Z) = E.
where I : R^{n×n} → R^{n×n} is the identity operator on matrices (not the identity matrix I ∈ R^{n×n}!). Note that with the Bernoulli model for selecting Ω, Q_Ω has expectation zero. From (3.12) we have P_T P_Ω P_T = p P_T(I + Q_Ω) P_T, and owing to Theorem 3.2 one can write (P_T P_Ω P_T)^{-1} as the convergent Neumann series

p\, (P_T P_\Omega P_T)^{-1} = \sum_{k \ge 0} (-1)^k (P_T Q_\Omega P_T)^k,

where we have used P_T^2 = P_T and P_T(E) = E. By the triangle inequality and (3.5), it thus suffices to show that

\sum_{k \ge 0} \|(Q_\Omega P_T)^k Q_\Omega(E)\| < 1.
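The Neumann-series inversion can be illustrated on a generic small operator (a toy numerical check, with a 6 × 6 matrix standing in for P_T Q_Ω P_T; the norm bound 0.4 is our choice):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
Q = rng.standard_normal((d, d))
Q *= 0.4 / np.linalg.norm(Q, 2)   # ensure spectral norm ||Q|| = 0.4 < 1

# Neumann series: (I + Q)^{-1} = sum_{k>=0} (-1)^k Q^k, convergent for ||Q|| < 1,
# mirroring p (P_T P_Omega P_T)^{-1} = sum_k (-1)^k (P_T Q_Omega P_T)^k.
I = np.eye(d)
approx = np.zeros((d, d))
term = I.copy()
for k in range(60):
    approx += (-1) ** k * term
    term = term @ Q               # term is now Q^{k+1}

exact = np.linalg.inv(I + Q)
err = np.linalg.norm(approx - exact, 2)
```

Truncating at 60 terms leaves an error of at most ‖Q‖^60/(1 − ‖Q‖), which is negligible here.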
Second, this theorem also bounds ‖Q_Ω P_T‖ (recall that this is the spectral norm), since
provided that a < 1/2. With p = m/n² and a defined by (3.7) with β = 4, we have

\sum_{k \ge k_0} \|(Q_\Omega P_T)^k Q_\Omega(E)\|_F \le \sqrt{n} \times O\left( \left( \frac{\mu_0\, n r \log n}{m} \right)^{\frac{k_0+1}{2}} \right)

with probability at least 1 − n^{-4}. When k_0 + 1 ≥ log n, n^{\frac{1}{k_0+1}} \le n^{\frac{1}{\log n}} = e, and thus for each such k_0,

\sum_{k \ge k_0} \|(Q_\Omega P_T)^k Q_\Omega(E)\|_F \le O\left( \left( \frac{\mu_0\, n r \log n}{m} \right)^{\frac{k_0+1}{2}} \right). (3.14)
3.4 Centering
We have already normalised P_Ω to have “mean zero” in some sense by replacing it with Q_Ω. We now perform a similar operation for the projection P_T : X ↦ P_U X + X P_V − P_U X P_V. The eigenvalues of P_T are centered around ρ′ := 2ρ − ρ², where ρ := r/n; this follows from the fact that P_T is an orthogonal projection onto a space of dimension 2nr − r². Therefore, we simply split P_T as

P_T = Q_T + \rho' I, (3.17)

so that the eigenvalues of Q_T are centered around zero. From now on, ρ and ρ′ will always be the numbers defined above.
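A quick numerical check of the centering value (a toy verification with our own small dimensions): the trace of the operator P_T equals dim T = 2nr − r², so its eigenvalue average over the n²-dimensional matrix space is (2nr − r²)/n² = 2ρ − ρ².

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 8, 3
# Random orthonormal bases for the r-dimensional column and row spaces.
U = np.linalg.qr(rng.standard_normal((n, r)))[0]
V = np.linalg.qr(rng.standard_normal((n, r)))[0]
PU, PV = U @ U.T, V @ V.T

def P_T(X):
    """Orthogonal projection onto T (column space of U / row space of V)."""
    return PU @ X + X @ PV - PU @ X @ PV

# Trace of the operator P_T: sum of <e_ab, P_T(e_ab)> over the standard basis.
tr = 0.0
for a in range(n):
    for b in range(n):
        E = np.zeros((n, n)); E[a, b] = 1.0
        tr += P_T(E)[a, b]

rho = r / n
eig_avg = tr / n**2
```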
From (3.19) and the geometric series formula we obtain the corollary

\sum_{k=0}^{k_0 - 1} \|(Q_\Omega P_T)^k Q_\Omega(E)\| \le 5\sqrt{\sigma}\, \frac{1}{1 - 4\sqrt{\sigma}}. (3.20)

Let σ_0 be such that the right-hand side is less than 1/4, say. Applying this with σ = σ_0, we conclude that to prove (3.15) with probability at least 1 − n^{-3}/4, it suffices by the union bound to show that (3.18) holds for this value of σ. (Note that the hypothesis 8nr/m < σ^{3/2} follows from the hypotheses in either Theorem 1.1 or Theorem 1.2.)
Lemma 3.3, which is proven in the Appendix, is useful because the operator Q_T is easier to work with than P_T, in the sense that it is more homogeneous and obeys better estimates. If we split the projections P_U, P_V as

P_U = \rho I + Q_U, \qquad P_V = \rho I + Q_V, (3.21)

then Q_T obeys

Q_T(X) = (1-\rho) Q_U X + (1-\rho) X Q_V - Q_U X Q_V.

Let U_{a,a'}, V_{b,b'} denote the matrix elements of Q_U, Q_V:

c_{ab,a'b'} := \langle e_a e_b^*, Q_T(e_{a'} e_{b'}^*) \rangle = (1-\rho)\, 1_{b=b'}\, U_{a,a'} + (1-\rho)\, 1_{a=a'}\, V_{b,b'} - U_{a,a'} V_{b,b'}. (3.23)
Theorem 3.4 (Moment bound I) Set A = (Q_Ω Q_T)^k Q_Ω(E) for a fixed k ≥ 0. Under the assumptions of Theorem 1.1, we have that for each j > 0,

\mathbb{E}\, \mathrm{trace}(A^* A)^j = O\big( j(k+1) \big)^{2j(k+1)} \left( \frac{n r_\mu}{m} \right)^{j(k+1)} n, \qquad r_\mu := \mu^2 r, (3.25)

provided that m ≥ n r_µ and n ≥ c_0\, j(k+1) for some numerical constant c_0.
By Markov’s inequality, this result automatically estimates the norm of (QΩ QT )k QΩ (E) and im-
mediately gives the following corollary.
Corollary 3.5 (Existence of dual certificate I) Under the assumptions of Theorem 1.1, the matrix Y (3.10) is a dual certificate and obeys ‖P_{T^⊥}(Y)‖ ≤ 1/2 with probability at least 1 − n^{-3}, provided that m obeys (1.10).
Proof. Set A = (Q_Ω Q_T)^k Q_Ω(E) with k ≤ log n, and set σ ≤ σ_0. By Markov's inequality,

\mathbb{P}\left( \|A\| \ge \sigma^{\frac{k+1}{2}} \right) \le \frac{\mathbb{E}\, \|A\|^{2j}}{\sigma^{j(k+1)}}.
Now choose j > 0 to be the smallest integer such that j(k + 1) ≥ log n. Since
for some
\gamma = O\left( \frac{(j(k+1))^2\, n r_\mu}{a m} \right)
where we have used the fact that n^{\frac{1}{j(k+1)}} \le n^{\frac{1}{\log n}} = e. Hence, if
Therefore, the union

\bigcup_{0 \le k < \log n} \left\{ \|(Q_\Omega Q_T)^k Q_\Omega(E)\| \ge a^{\frac{k+1}{2}} \right\}
Theorem 3.6 (Moment bound II) Set A = (Q_Ω Q_T)^k Q_Ω(E) for a fixed k ≥ 0. Under the assumptions of Theorem 1.2, we have that for each j > 0 (r_µ is given in (3.25)),

\mathbb{E}\, \mathrm{trace}(A^* A)^j \le \left( \frac{(j(k+1))^6\, n r_\mu}{m} \right)^{j(k+1)}, (3.27)

provided that n ≥ c_0\, j(k+1) for some numerical constant c_0.
3.6 Novelty
As explained earlier, this paper derives near-optimal sampling results which are stronger than those in [7]. One of the reasons underlying this improvement is that we use completely different techniques. In detail, [7] constructs the dual certificate (3.10) and proceeds by showing that ‖P_{T^⊥}(Y)‖ < 1, by bounding each term in the series \sum_{k \ge 0} \|(Q_\Omega P_T)^k Q_\Omega(E)\|. Further, to prove that the early terms (small values of k) are appropriately small, the authors employ a sophisticated array of tools from asymptotic geometric analysis, including noncommutative Khintchine inequalities [16], decoupling techniques of Bourgain and Tzafriri and of de la Peña [10], and large deviations inequalities [14]. They bound each term individually up to k = 4 and use the same argument as that in Section 3.3 to bound the rest of the series. Since the tail starts at k_0 = 5, this gives that a sufficient condition is that the number of samples exceeds a constant times µ_0 n^{6/5} r log n. Bounding each term ‖(Q_Ω P_T)^k Q_Ω(E)‖ with the tools put forth in [7] for larger values of k becomes increasingly delicate because of the coupling between the indicator variables defining the random set Ω. In addition, the noncommutative Khintchine inequality seems less effective in higher dimensions, that is, for large values of k. Informally speaking, the reason seems to be that the types of random sums that appear in the moments (Q_Ω P_T)^k Q_Ω(E) for large k involve complicated combinations of the coefficients of P_T that are not simply components of some product matrix, and which do not simplify substantially after a direct application of the Khintchine inequality.
In this paper, we use a very different strategy to estimate the spectral norm of (Q_Ω Q_T)^k Q_Ω(E), and employ moment methods, which have a long history in random matrix theory, dating back at
(the largest element dominates the sum). We then need to compute the expectation of the right-hand side, and reduce matters to a purely combinatorial question involving the statistics of various types of paths in a plane. It is rather remarkable that carrying out these combinatorial calculations nearly gives the quantitatively correct answer; the moment method seems to come close to the ultimate limit of performance one can expect from nuclear-norm minimization.
As we shall shortly see, the expression trace(A∗ A)j expands as a sum over “paths” of products
of various coefficients of the operators QΩ , QT and the matrix E. These paths can be viewed as
complicated variants of Dyck paths. However, it does not seem that one can simply invoke standard
moment method calculations in the literature to compute this sum, as in order to obtain efficient
bounds, we will need to take full advantage of identities such as PT PT = PT (which capture certain
cancellation properties of the coefficients of PT or QT ) to simplify various components of this sum.
It is only after performing such simplifications that one can afford to estimate all the coefficients
by absolute values and count paths to conclude the argument.
4 Moments
Let j ≥ 0 be a fixed integer. The goal of this section is to develop a formula for
This will clearly be of use in the proofs of the moment bounds (Theorems 3.4, 3.6).
First, write $A$ in coefficients as
$$A = \sum_{a,b \in [n]} A_{ab}\, e_{ab}$$
for some scalars $A_{ab}$, where $e_{ab}$ is the standard basis for the $n \times n$ matrices and $A_{ab}$ is the $(a, b)$th entry of $A$. Then
$$\operatorname{trace}(A^* A)^j = \sum_{\substack{a_1, \dots, a_j \in [n] \\ b_1, \dots, b_j \in [n]}} \prod_{i \in [j]} A_{a_i b_i}\, A_{a_{i+1} b_i},$$
with the cyclic convention $a_{j+1} := a_1$. Equivalently,
$$\operatorname{trace}(A^* A)^j = \sum \prod_{i \in [j]} \prod_{\mu=0}^{1} A_{a_{i,\mu} b_{i,\mu}}, \qquad (4.2)$$
where the sum is over all $a_{i,\mu}, b_{i,\mu} \in [n]$ for $i \in [j]$, $\mu \in \{0, 1\}$ obeying the compatibility conditions
$$a_{i,1} = a_{i+1,0}, \qquad b_{i,1} = b_{i,0} \qquad (i \in [j]),$$
with the cyclic convention $a_{j+1,0} := a_{1,0}$.
Example. If $j = 2$, then
$$\operatorname{trace}(A^* A)^2 = \sum_{a_1, a_2, b_1, b_2 \in [n]} A_{a_1 b_1}\, A_{a_2 b_1}\, A_{a_2 b_2}\, A_{a_1 b_2},$$
or equivalently as
$$\sum \prod_{i=1}^{2} \prod_{\mu=0}^{1} A_{a_{i,\mu} b_{i,\mu}},$$
where the sum is over all $a_{1,0}, a_{1,1}, a_{2,0}, a_{2,1}, b_{1,0}, b_{1,1}, b_{2,0}, b_{2,1} \in [n]$ obeying the compatibility conditions
$$a_{1,1} = a_{2,0}; \qquad a_{2,1} = a_{1,0}; \qquad b_{1,1} = b_{1,0}; \qquad b_{2,1} = b_{2,0}.$$
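For real matrices (so that $A^* = A^\top$), the $j = 2$ case of this expansion is easy to sanity-check numerically. The following sketch is an illustration only, not part of the proof; the random matrix is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n))

# operator side: trace((A^T A)^2)
lhs = np.trace(np.linalg.matrix_power(A.T @ A, 2))

# path-sum side: sum over a1, a2, b1, b2 of A[a1,b1] A[a2,b1] A[a2,b2] A[a1,b2]
rhs = 0.0
for a1 in range(n):
    for a2 in range(n):
        for b1 in range(n):
            for b2 in range(n):
                rhs += A[a1, b1] * A[a2, b1] * A[a2, b2] * A[a1, b2]

assert abs(lhs - rhs) < 1e-9
```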
Remark. The sum in (4.2) can be viewed as over all closed paths of length 2j in [n] × [n],
where the edges of the paths alternate between “horizontal rook moves” and “vertical rook moves”
respectively; see Figure 1.
Second, write $Q_T$ and $Q_\Omega$ in coefficients as
$$Q_T(e_{a_0 b_0}) = \sum_{ab} c_{ab,\, a_0 b_0}\, e_{ab}, \qquad Q_\Omega(e_{ab}) = \xi_{ab}\, e_{ab}.$$
We then have
$$A_{a_0, b_0} := \sum_{a_1, b_1, \dots, a_k, b_k \in [n]} \Big[\prod_{l \in [k]} c_{a_{l-1} b_{l-1},\, a_l b_l}\Big] \Big[\prod_{l=0}^{k} \xi_{a_l b_l}\Big]\, E_{a_k b_k} \qquad (4.3)$$
for any $a_0, b_0 \in [n]$. Note that this formula is even valid in the base case $k = 0$, where it simplifies to just $A_{a_0 b_0} = \xi_{a_0 b_0} E_{a_0 b_0}$ due to our conventions on trivial sums and empty products.
Example. If $k = 2$, then
$$A_{a_0, b_0} = \sum_{a_1, a_2, b_1, b_2 \in [n]} \xi_{a_0 b_0}\, c_{a_0 b_0,\, a_1 b_1}\, \xi_{a_1 b_1}\, c_{a_1 b_1,\, a_2 b_2}\, \xi_{a_2 b_2}\, E_{a_2 b_2}.$$
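The coefficient formula (4.3) can be checked against a direct operator computation. In the sketch below, $Q_\Omega$ acts entrywise through an array ξ and $Q_T$ is an arbitrary linear map with coefficient tensor c; the particular random values are illustrative only, since only linearity is used:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 3, 2
c = rng.standard_normal((n, n, n, n))   # c[a, b, a', b']: coefficients of Q_T
xi = rng.standard_normal((n, n))        # Q_Omega acts entrywise by xi
E = rng.standard_normal((n, n))

def QT(M):
    # (Q_T M)_{ab} = sum_{a', b'} c[a, b, a', b'] M[a', b']
    return np.einsum('abcd,cd->ab', c, M)

def QO(M):
    return xi * M

# operator form: A = (Q_Omega Q_T)^k Q_Omega (E)
A = QO(E)
for _ in range(k):
    A = QO(QT(A))

# coefficient form (4.3) at a single entry (a0, b0), k = 2
a0, b0 = 0, 1
s = 0.0
for a1 in range(n):
    for b1 in range(n):
        for a2 in range(n):
            for b2 in range(n):
                s += (xi[a0, b0] * c[a0, b0, a1, b1] * xi[a1, b1]
                      * c[a1, b1, a2, b2] * xi[a2, b2] * E[a2, b2])

assert abs(A[a0, b0] - s) < 1e-9
```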
Remark. One can view the right-hand side of (4.3) as the sum over paths of length k + 1 in
[n] × [n] starting at the designated point (a0 , b0 ) and ending at some arbitrary point (ak , bk ). Each
edge (from $(a_i, b_i)$ to $(a_{i+1}, b_{i+1})$) may be a horizontal or vertical "rook move" (in that at least one of the a or b coordinates does not change), or a "non-rook move" in which both the a and b
coordinates change. It will be important later on to keep track of which edges are rook moves and
which ones are not, basically because of the presence of the delta functions 1a=a0 , 1b=b0 in (3.23).
Each edge in this path is weighted by a c factor, and each vertex in the path is weighted by a ξ
factor, with the final vertex also weighted by an additional E factor. It is important to note that
the path is allowed to cross itself, in which case weights such as $\xi^2, \xi^3$, etc. may appear; see Figure 2.
Inserting (4.3) into (4.2), we see that X can thus be expanded as
$$X = \sum_{*} \mathbb{E} \prod_{i \in [j]} \prod_{\mu=0}^{1} \Big[\prod_{l \in [k]} c_{a_{i,\mu,l-1} b_{i,\mu,l-1},\, a_{i,\mu,l} b_{i,\mu,l}}\Big] \Big[\prod_{l=0}^{k} \xi_{a_{i,\mu,l} b_{i,\mu,l}}\Big]\, E_{a_{i,\mu,k} b_{i,\mu,k}}, \qquad (4.4)$$
where the sum $\sum_*$ is over all combinations of $a_{i,\mu,l}, b_{i,\mu,l} \in [n]$ for $i \in [j]$, $\mu \in \{0, 1\}$ and $0 \le l \le k$ obeying the compatibility conditions.
Note that despite the small values of j and k, this is already a rather complicated sum, ranging over $n^{2j(2k+1)} = n^{20}$ summands, each of which is the product of $4j(k+1) = 24$ terms.
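The quoted counts determine the (implicit) example parameters: $2j(2k+1) = 20$ and $4j(k+1) = 24$ force $j = k = 2$, which one can confirm with a line of arithmetic:

```python
# the counts quoted in the text correspond to j = k = 2
j, k = 2, 2
summand_exponent = 2 * j * (2 * k + 1)   # number of summands is n**summand_exponent
terms_per_summand = 4 * j * (k + 1)      # factors appearing in each summand

assert summand_exponent == 20
assert terms_per_summand == 24
```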
Remark. The expansion (4.4) is the sum over a sort of combinatorial “spider”, whose “body”
is a closed path of length 2j in [n] × [n] of alternating horizontal and vertical rook moves, and
whose 2j “legs” are paths of length k, emanating out of each vertex of the body. The various
“segments” of the legs (which can be either rook or non-rook moves) acquire a weight of c, and
the “joints” of the legs acquire a weight of ξ, with an additional weight of E at the tip of each leg.
To complicate things further, it is certainly possible for a vertex of one leg to overlap with another
vertex from either the same leg or a different leg, introducing weights such as ξ 2 , ξ 3 , etc.; see Figure
3. As one can see, the set of possible configurations that this “spider” can be in is rather large and
complicated.
$$X = \sum_{(s,t)} \sum_{\alpha,\beta} \mathbb{E} \prod_{i \in [j]} \prod_{\mu=0}^{1} \Big[\prod_{l \in [k]} c_{\alpha(s_{i,\mu,l-1})\beta(t_{i,\mu,l-1}),\, \alpha(s_{i,\mu,l})\beta(t_{i,\mu,l})}\Big] \Big[\prod_{l=0}^{k} \xi_{\alpha(s_{i,\mu,l})\beta(t_{i,\mu,l})}\Big]\, E_{\alpha(s_{i,\mu,k})\beta(t_{i,\mu,k})},$$
where the outer sum is over all admissible pairs (s, t), and the inner sum is over all injections.
Remark. As with preceding identities, the above formula is also valid when $k = 0$ (with our conventions on trivial sums and empty products), in which case it simplifies to
$$X = \sum_{(s,t)} \sum_{\alpha,\beta} \mathbb{E} \prod_{i \in [j]} \prod_{\mu=0}^{1} \xi_{\alpha(s_{i,\mu,0})\beta(t_{i,\mu,0})}\, E_{\alpha(s_{i,\mu,0})\beta(t_{i,\mu,0})}.$$
and hence
$$\Big|\mathbb{E}\Big(\frac{1}{p}\delta - 1\Big)^s\Big| \le p^{1-s}.$$
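The moment bound above can be verified exactly, since $\frac{1}{p}\delta - 1$ takes only two values; a small check over a range of $p$ and $s$:

```python
# xi = delta/p - 1 with delta ~ Bernoulli(p) takes the value (1-p)/p with
# probability p, and -1 with probability 1-p, so its s-th moment is exact.
ratios = []
for p in (0.05, 0.3, 0.5, 0.9):
    for s in range(1, 9):
        moment = p * ((1 - p) / p) ** s + (1 - p) * (-1.0) ** s
        ratios.append(abs(moment) / p ** (1 - s))

max_ratio = max(ratios)
assert max_ratio <= 1 + 1e-12   # |E(delta/p - 1)^s| <= p^(1-s)
```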
The value of the expectation $\mathbb{E}\,\Xi$ does not depend on the choice of $\alpha$ or $\beta$, and the calculation above shows that $\Xi$ obeys
$$|\mathbb{E}\,\Xi| \le (1/p)^{2j(k+1) - |\Omega|},$$
where
$$\Omega := \{(s_{i,\mu,l}, t_{i,\mu,l}) : (i,\mu,l) \in [j] \times \{0,1\} \times \{0,\dots,k\}\} \subset J \times K. \qquad (4.9)$$
Applying this estimate and the triangle inequality, we can thus bound X by
$$X \le \sum_{(s,t)\ \text{strongly admissible}} (1/p)^{2j(k+1) - |\Omega|} \sum_{\alpha,\beta} \prod_{i \in [j]} \prod_{\mu=0}^{1} \Big[\prod_{l \in [k]} c_{\alpha(s_{i,\mu,l-1})\beta(t_{i,\mu,l-1}),\, \alpha(s_{i,\mu,l})\beta(t_{i,\mu,l})}\Big]\, E_{\alpha(s_{i,\mu,k})\beta(t_{i,\mu,k})}, \qquad (4.10)$$
where the sum is over those admissible (s, t) such that each element of Ω is visited at least twice
by the sequence (si,µ,l , ti,µ,l ); we shall call such (s, t) strongly admissible. We will use the bound
(4.10) as a starting point for proving the moment estimates (3.25) and (3.27).
Example. The pair (s, t) in the Example in Section 4.2 is admissible but not strongly admissible, because not every element of the set Ω (which, in this example, is {(1, 1), (2, 2), (3, 1), (2, 3), (3, 4), (1, 2), (1, 4)}) is visited twice by (s, t).
Remark. Once again, the formula (4.10) is valid when k = 0, with the usual conventions on
empty products (in particular, the factor involving the c coefficients can be deleted in this case).
where we recall that $r_\mu = \mu^2 r$, and Q is the set of all $(i,\mu,l) \in [j] \times \{0,1\} \times [k]$ such that $s_{i,\mu,l-1} \neq s_{i,\mu,l}$ and $t_{i,\mu,l-1} \neq t_{i,\mu,l}$. Thinking of the sequence $\{(s_{i,\mu,l}, t_{i,\mu,l})\}$ as a path in $J \times K$, we have that $(i,\mu,l) \in Q$ if and only if the move from $(s_{i,\mu,l-1}, t_{i,\mu,l-1})$ to $(s_{i,\mu,l}, t_{i,\mu,l})$ is neither horizontal nor vertical; per our earlier discussion, this is a "non-rook" move.
Example. The example in Section 4.2 is admissible, but not strongly admissible. Nevertheless,
the above definitions can still be applied, and we see that Q = {(0, 0, 1), (0, 1, 1), (1, 0, 1), (1, 1, 1)}
in this case, because all of the four associated moves are non-rook moves.
As the number of injections $\alpha$, $\beta$ is at most $n^{|J|}$, $n^{|K|}$ respectively, we thus have
$$X \le O(1)^{j(k+1)} \sum_{(s,t)\ \text{str. admiss.}} (1/p)^{2j(k+1) - |\Omega|}\, n^{|J|+|K|}\, \big(\sqrt{r_\mu}/n\big)^{2jk + |Q| + 2j}.$$
Since (s, t) is strongly admissible and every point in Ω needs to be visited at least twice, we see
that
|Ω| ≤ j(k + 1).
Also, since Q ⊂ [j] × {0, 1} × [k], we have the trivial bound
|Q| ≤ 2jk.
Remark. In the case where $k = 0$, in which $Q = \emptyset$, one can easily obtain a better estimate, namely (if $np \ge r_\mu$)
$$X \le \sum_{(s,t)\ \text{str. admiss.}} O\Big(\frac{r_\mu}{np}\Big)^{j} n^{|J| + |K| - |\Omega|}.$$
(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)
are all recycled (because they either reuse an existing value of s or t or both), while the triple
(1, 1, 1) is totally recycled (it visits the same location as the earlier triple (0, 0, 1)). Thus in this
case, we have $Q' = \{(0, 1, 1), (1, 0, 1), (1, 1, 1)\}$.
We observe that if (i, µ, l) ∈ [j] × {0, 1} × [k] is not recycled, then it must have been reached
from (i, µ, l − 1) by a non-rook move, and thus (i, µ, l) lies in Q.
Lemma 5.1 (Exponent bound) For any admissible tuple, we have $|J| + |K| - |Q| - |\Omega| \le -|Q'| + 1$.
Proof We let $(i,\mu,l)$ increase from $(1,0,0)$ to $(j,1,k)$ and see how each $(i,\mu,l)$ influences the quantity $|J| + |K| - |Q \setminus Q'| - |\Omega|$.
Firstly, we see that the triple $(1,0,0)$ initialises $|J|, |K|, |\Omega| = 1$ and $|Q \setminus Q'| = 0$, so $|J| + |K| - |Q \setminus Q'| - |\Omega| = 1$ at this initial stage. Now we see how each subsequent $(i,\mu,l)$ adjusts this quantity.
If $(i,\mu,l)$ is totally recycled, then $J, K, \Omega, Q \setminus Q'$ are unchanged by the addition of $(i,\mu,l)$, and so $|J| + |K| - |Q \setminus Q'| - |\Omega|$ does not change.
If $(i,\mu,l)$ is recycled but not totally recycled, then one of $J, K$ increases in size by at most one, as does $\Omega$, but the other set of $J, K$ remains unchanged, as does $Q \setminus Q'$, and so $|J| + |K| - |Q \setminus Q'| - |\Omega|$ does not increase.
If $(i,\mu,l)$ is not recycled at all, then (by (4.6)) we must have $l > 0$, and then (by definition of $Q, Q'$) we have $(i,\mu,l) \in Q \setminus Q'$, and so $|Q \setminus Q'|$ and $|\Omega|$ both increase by one. Meanwhile, $|J|$ and $|K|$ increase by 1, and so $|J| + |K| - |Q \setminus Q'| - |\Omega|$ does not change. Putting all this together we obtain the claim.
This lemma gives
$$X \le \sum_{(s,t)\ \text{str. admiss.}} O\Big(\frac{r_\mu^2}{np}\Big)^{j(k+1)} n^{-|Q'| + 1}.$$
To estimate the above sum, we need to count strongly admissible pairs. This is achieved by the
following lemma.
Lemma 5.2 (Pair counting) For fixed $q \ge 0$, the number of strongly admissible pairs (s, t) with $|Q'| = q$ is at most $O(j(k+1))^{2j(k+1)+q}$.
Under the assumption n ≥ c0 j(k + 1) for some numerical constant c0 , we can sum the series and
obtain Theorem 3.4.
Remark. When $k = 0$, we have the better bound
$$X \le O(j)^{2j}\, n\, \Big(\frac{r_\mu}{np}\Big)^{j}.$$
after one writes the projection identity $P_U^2 = P_U$ in terms of $Q_U$ using (3.21), and similarly for the second identity.
In a similar vein, we also have the identities
$$\sum_{a'} U_{a,a'}\, E_{a',b} = (1 - \rho)\, E_{a,b} = \sum_{b'} E_{a,b'}\, V_{b',b}, \qquad (6.3)$$
which simply come from $Q_U E = P_U E - \rho E = (1-\rho)E$ together with $E Q_V = E P_V - \rho E = (1-\rho)E$.
Finally, we observe the two equalities
$$\sum_{b} E_{a,b}\, E_{a',b} = U_{a,a'} + \rho\, 1_{a=a'}, \qquad \sum_{a} E_{a,b}\, E_{a,b'} = V_{b,b'} + \rho\, 1_{b=b'}. \qquad (6.4)$$
The first identity follows from the fact that $\sum_b E_{a,b} E_{a',b}$ is the $(a,a')$th element of $EE^* = P_U = Q_U + \rho I$, and the second one similarly follows from the identity $E^*E = P_V = Q_V + \rho I$.
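These identities follow from $EE^* = P_U$ and $E^*E = P_V$ alone, so they can be checked numerically for any orthonormal factors; a sketch with illustrative dimensions $n = 8$, $r = 3$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 8, 3
rho = r / n

# orthonormal singular-vector bases (reduced QR gives n x r with orthonormal columns)
Uhat = np.linalg.qr(rng.standard_normal((n, r)))[0]
Vhat = np.linalg.qr(rng.standard_normal((n, r)))[0]

E = Uhat @ Vhat.T                      # E E^T = P_U and E^T E = P_V
U = Uhat @ Uhat.T - rho * np.eye(n)    # coefficients of Q_U = P_U - rho I
V = Vhat @ Vhat.T - rho * np.eye(n)    # coefficients of Q_V = P_V - rho I

# (6.3): sum_a' U[a,a'] E[a',b] = (1 - rho) E[a,b] = sum_b' E[a,b'] V[b',b]
assert np.allclose(U @ E, (1 - rho) * E)
assert np.allclose(E @ V, (1 - rho) * E)

# (6.4): sum_b E[a,b] E[a',b] = U[a,a'] + rho 1_{a=a'}, and its column analogue
assert np.allclose(E @ E.T, U + rho * np.eye(n))
assert np.allclose(E.T @ E, V + rho * np.eye(n))
```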
$$X \le (1 - \rho)^{2j(k+1) - |L_U \cap L_V|} \sum_{(s,t,L_U,L_V)} (1/p)^{2j(k+1) - |\Omega|}\, |X_{s,t,L_U,L_V}|,$$
where the sum ranges over all strongly admissible quadruplets, and
$$X_{s,t,L_U,L_V} := \sum_{\alpha,\beta} \Big[\prod_{i \in [j]} \prod_{\mu=0}^{1} E_{\alpha(s_{i,\mu,k})\beta(t_{i,\mu,k})}\Big] \Big[\prod_{(i,\mu,l) \in L_U} U_{\alpha(s_{i,\mu,l-1}),\,\alpha(s_{i,\mu,l})}\Big] \Big[\prod_{(i,\mu,l) \in L_V} V_{\beta(t_{i,\mu,l-1}),\,\beta(t_{i,\mu,l})}\Big].$$
Remark. A strongly admissible quadruplet can be viewed as the configuration of a “spider” with
several additional constraints. Firstly, the spider must visit each of its vertices at least twice (strong
admissibility). When (i, µ, l) ∈ [j] × {0, 1} × [k] lies out of LU , then only horizontal rook moves are
allowed when reaching (i, µ, l) from (i, µ, l − 1); similarly, when (i, µ, l) lies out of LV , then only
vertical rook moves are allowed from (i, µ, l − 1) to (i, µ, l). In particular, non-rook moves are only
allowed inside $L_U \cap L_V$; in the notation of the previous section, we have $Q \subset L_U \cap L_V$. Note though that while one is allowed to execute a non-rook move inside $L_U \cap L_V$, it is not mandatory; it could still be that $(s_{i,\mu,l-1}, t_{i,\mu,l-1})$ shares a common row or column (or even both) with $(s_{i,\mu,l}, t_{i,\mu,l})$.
We claim the following fundamental bound on the summand |Xs,t,LU ,LV |:
Proposition 6.1 (Summand bound) Let $(s, t, L_U, L_V)$ be a strongly admissible quadruplet. Then we have
$$|X_{s,t,L_U,L_V}| \le O(j(k+1))^{2j(k+1)}\, (r/n)^{2j(k+1) - |\Omega|}\, n,$$
and since $|\Omega| \le j(k+1)$ (by strong admissibility) and $r \le np$, and the number of $(s, t, L_U, L_V)$ can be crudely bounded by $O(j(k+1))^{4j(k+1)}$, this gives (3.27) as desired. The bound on the number of quadruplets follows from the fact that there are at most $O(j(k+1))^{4j(k+1)}$ strongly admissible pairs and that the number of $(L_U, L_V)$ per pair is at most $O(1)^{j(k+1)}$.
Remark. It seems clear that the exponent 6 can be lowered by a finer analysis, for instance
by using counting bounds such as Lemma 5.2. However, substantial effort seems to be required in
order to obtain the optimal exponent of 1 here.
• An integer j ≥ 1, and a map k : [j] × {0, 1} → {0, 1, 2, . . .}, generating a set Γ := {(i, µ, l) :
i ∈ [j], µ ∈ {0, 1}, 0 ≤ l ≤ k(i, µ)};
• Sets $L_U, L_V$ with
$$L_U \cup L_V = \Gamma_+ := \{(i, \mu, l) \in \Gamma : l > 0\}$$
and such that $s_{i,\mu,l-1} = s_{i,\mu,l}$ whenever $(i,\mu,l) \in \Gamma_+ \setminus L_U$, and $t_{i,\mu,l-1} = t_{i,\mu,l}$ whenever $(i,\mu,l) \in \Gamma_+ \setminus L_V$.
Remark. Note we do not require configurations to be strongly admissible, although for our
application to Proposition 6.1 strong admissibility is required. Similarly, we no longer require that
the segments (4.7) be initial segments. This removal of hypotheses will give us a convenient amount
of flexibility in a certain induction argument that we shall perform shortly. One can think of a
configuration as describing a “generalized spider” whose legs are allowed to be of unequal length,
but for which certain of the segments (indicated by the sets LU , LV ) are required to be horizontal
or vertical. The freedom to extend or shorten the legs of the spider separately will be of importance
when we use the identities (6.1), (6.3), (6.4) to simplify the expression Xs,t,LU ,LV , see Figure 4.
$$X_C := \sum_{\alpha,\beta} \Big[\prod_{i \in [j]} \prod_{\mu=0}^{1} E_{\alpha(s(i,\mu,k(i,\mu)))\,\beta(t(i,\mu,k(i,\mu)))}\Big] \Big[\prod_{(i,\mu,l) \in L_U} U_{\alpha(s(i,\mu,l-1)),\,\alpha(s(i,\mu,l))}\Big] \Big[\prod_{(i,\mu,l) \in L_V} V_{\beta(t(i,\mu,l-1)),\,\beta(t(i,\mu,l))}\Big], \qquad (6.5)$$
where α : J → [n], β : K → [n] range over all injections. To prove Proposition 6.1, it then suffices
to show that
$$|X_C| \le \big(C_0(1 + |J| + |K|)\big)^{|J|+|K|}\, (r_\mu/n)^{|\Gamma| - |\Omega|}\, n \qquad (6.6)$$
for some absolute constant $C_0 > 0$, where
since Proposition 6.1 then follows from the special case in which k(i, µ) = k is constant and (s, t)
is strongly admissible, in which case we have
• $j' := j$, $J' := J$, and $K' := K$.
• $(s'_{i,\mu,l}, t'_{i,\mu,l}) := (s_{i,\mu,l}, t_{i,\mu,l})$ whenever $(i,\mu) \neq (i_0, \mu_0)$, or when $(i,\mu) = (i_0, \mu_0)$ and $l < l_0$.
• $(s'_{i_0,\mu_0,l_0}, t'_{i_0,\mu_0,l_0}) := (s_{i_0,\mu_0,l_0-1},\, t_{i_0,\mu_0,l_0})$.
• We have
and
6.3.2 Second case: a low multiplicity row or column, no unguarded non-rook moves
Next, given any x ∈ J, define the row multiplicity τx to be
Remark. Informally, $\tau_x$ measures the number of times $\alpha(x)$ appears in (6.5), and similarly for $\tau_y$ and $\beta(y)$. Alternatively, one can think of $\tau_x$ as counting the number of times the spider has the opportunity to "enter" and "exit" the row $s = x$, and similarly $\tau_y$ measures the number of opportunities to enter or exit the column $t = y$.
By surjectivity we know that $\tau_x, \tau_y$ are strictly positive for each $x \in J$, $y \in K$. We also observe that $\tau_x, \tau_y$ must be even. To see this, write
$$\tau_x = \sum_{(i,\mu,l) \in L_U} \big(1_{s(i,\mu,l)=x} + 1_{s(i,\mu,l-1)=x}\big) + \sum_{(i,\mu) \in [j] \times \{0,1\}} 1_{s(i,\mu,k(i,\mu))=x}.$$
Now observe that if $(i,\mu,l) \in \Gamma_+ \setminus L_U$, then $1_{s(i,\mu,l)=x} = 1_{s(i,\mu,l-1)=x}$. Thus we have
$$\tau_x = \sum_{(i,\mu,l) \in \Gamma_+} \big(1_{s(i,\mu,l)=x} + 1_{s(i,\mu,l-1)=x}\big) + \sum_{(i,\mu) \in [j] \times \{0,1\}} 1_{s(i,\mu,k(i,\mu))=x} \pmod 2,$$
Figure 6: In (a), a multiplicity 2 row is shown. After using the identity (6.1), the contribution
of this configuration is replaced with a number of terms one of which is shown in (b), in which
the x row is deleted and replaced with another existing row x̃.
and the right-hand side vanishes by (4.6), showing that $\tau_x$ is even, and similarly $\tau_y$ is even.
In this subsection, we dispose of the case of a low-multiplicity row, or more precisely when $\tau_x = 2$ for some $x \in J$. By symmetry, the argument will also dispose of the case of a low-multiplicity column, when $\tau_y = 2$ for some $y \in K$.
Suppose that $\tau_x = 2$ for some $x \in J$. We first remark that this implies that there does not exist $(i,\mu,l) \in L_U$ with $s(i,\mu,l) = s(i,\mu,l-1) = x$. We argue by contradiction and define $l_\star$ to be the first integer larger than $l$ for which $(i,\mu,l_\star) \in L_U$. First, suppose that $l_\star$ does not exist (which, for instance, happens when $l = k(i,\mu)$). Then in this case it is not hard to see that $s(i,\mu,k(i,\mu)) = x$, since for $(i,\mu,l') \notin L_U$ we have $s(i,\mu,l') = s(i,\mu,l'-1)$. In this case, $\tau_x$ exceeds 2. Else, $l_\star$ does exist, but then $s(i,\mu,l_\star - 1) = x$, since $s(i,\mu,l') = s(i,\mu,l'-1)$ for $l < l' < l_\star$. Again, $\tau_x$ exceeds 2 and this is a contradiction. Thus, if $(i,\mu,l) \in L_U$ and $s(i,\mu,l) = x$, then $s(i,\mu,l-1) \neq x$, and similarly if $(i,\mu,l) \in L_U$ and $s(i,\mu,l-1) = x$, then $s(i,\mu,l) \neq x$.
Now let us look at the terms in (6.5) which involve $\alpha(x)$. Since $\tau_x = 2$, there are only two such terms, and each of the terms is either of the form $U_{\alpha(x),\alpha(x')}$ or $E_{\alpha(x),\beta(y)}$ for some $y \in K$ or $x' \in J \setminus \{x\}$. We now have to divide into three subcases.
Subcase 1: (6.5) contains two terms $U_{\alpha(x),\alpha(x')}$, $U_{\alpha(x),\alpha(x'')}$. See Figure 6(a) for a typical configuration in which this is the case.
The idea is to use the identity (6.1) to "delete" the row $x$, thus reducing $|J| + |K|$ and allowing us to use an induction hypothesis. Accordingly, let us define $\tilde J := J \setminus \{x\}$, and let $\tilde\alpha : \tilde J \to [n]$ be the restriction of $\alpha$ to $\tilde J$. We also write $a := \alpha(x)$ for the deleted row $a$.
We now isolate the two terms $U_{\alpha(x),\alpha(x')}$, $U_{\alpha(x),\alpha(x'')}$ from the rest of (6.5), expressing this sum as
$$\sum_{\tilde\alpha,\beta} \big[\dots\big] \Big[\sum_{a \in [n] \setminus \tilde\alpha(\tilde J)} U_{a,\tilde\alpha(x')}\, U_{a,\tilde\alpha(x'')}\Big]$$
where the $\dots$ denotes the product of all the terms in (6.5) other than $U_{\alpha(x),\alpha(x')}$ and $U_{\alpha(x),\alpha(x'')}$, but with $\alpha$ replaced by $\tilde\alpha$, and $\tilde\alpha, \beta$ ranging over injections from $\tilde J$ and $K$ to $[n]$ respectively.
From (6.1) we have
$$\sum_{a \in [n]} U_{a,\tilde\alpha(x')}\, U_{a,\tilde\alpha(x'')} = (1 - 2\rho)\, U_{\tilde\alpha(x'),\tilde\alpha(x'')} + \rho(1-\rho)\, 1_{x'=x''}$$
and thus
$$\sum_{a \in [n] \setminus \tilde\alpha(\tilde J)} U_{a,\tilde\alpha(x')}\, U_{a,\tilde\alpha(x'')} = (1 - 2\rho)\, U_{\tilde\alpha(x'),\tilde\alpha(x'')} + \rho(1-\rho)\, 1_{x'=x''} - \sum_{\tilde x \in \tilde J} U_{\tilde\alpha(\tilde x),\tilde\alpha(x')}\, U_{\tilde\alpha(\tilde x),\tilde\alpha(x'')}. \qquad (6.10)$$
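The sign of the $\rho(1-\rho)$ term can be double-checked from $Q_U = P_U - \rho I$ and $P_U^2 = P_U$, which give $Q_U^2 = (1-2\rho)Q_U + \rho(1-\rho)I$; numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 8, 3
rho = r / n
Uhat = np.linalg.qr(rng.standard_normal((n, r)))[0]   # orthonormal columns
U = Uhat @ Uhat.T - rho * np.eye(n)                   # Q_U = P_U - rho I

# entrywise: sum_a U[a,x'] U[a,x''] = (1-2rho) U[x',x''] + rho(1-rho) 1_{x'=x''}
lhs = U.T @ U
rhs = (1 - 2 * rho) * U + rho * (1 - rho) * np.eye(n)
assert np.allclose(lhs, rhs)
```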
Consider the contribution of one of the final terms $U_{\tilde\alpha(\tilde x),\tilde\alpha(x')}\, U_{\tilde\alpha(\tilde x),\tilde\alpha(x'')}$ of (6.10). This contribution is equal to $X_{C'}$, where $C'$ is formed from $C$ by replacing $J$ with $\tilde J$, and replacing every occurrence of $x$ in the range of $\alpha$ with $\tilde x$, but leaving all other components of $C$ unchanged (see Figure 6(b)). Observe that $|\Gamma'| = |\Gamma|$, $|\Omega'| \le |\Omega|$, $|J'| + |K'| < |J| + |K|$, so the contribution of these terms is acceptable by the (first) induction hypothesis (for $C_0$ large enough).
Next, we consider the contribution of the term $U_{\tilde\alpha(x'),\tilde\alpha(x'')}$ of (6.10). This contribution is equal to $X_{C''}$, where $C''$ is formed from $C$ by replacing $J$ with $\tilde J$, replacing every occurrence of $x$ in the range of $\alpha$ with $x'$, and also deleting the one element $(i_0, \mu_0, l_0)$ in $L_U$ from $\Gamma_+$ (relabeling the remaining triples $(i_0, \mu_0, l)$ for $l_0 < l \le k(i_0, \mu_0)$ by decrementing $l$ by 1) that gave rise to $U_{\alpha(x),\alpha(x')}$, unless this element $(i_0, \mu_0, l_0)$ also lies in $L_V$, in which case one removes $(i_0, \mu_0, l_0)$ from $L_U$ but leaves it in $L_V$ (and does not relabel any further triples) (see Figure 7 for an example of the former case, and Figure 8 for the latter case). One observes that $|\Gamma''| \ge |\Gamma| - 1$, $|\Omega''| \le |\Omega| - 1$ (here we use (6.8), (6.9)), $|J''| + |K''| < |J| + |K|$, and so this term also is controlled by the (first) induction hypothesis (for $C_0$ large enough).
Finally, we consider the contribution of the term $\rho(1-\rho) 1_{x'=x''}$ of (6.10), which of course is only non-trivial when $x' = x''$. This contribution is equal to $\rho(1-\rho) X_{C'''}$, where $C'''$ is formed from $C$ by deleting $x$ from $J$, replacing every occurrence of $x$ in the range of $\alpha$ with $x' = x''$, and also deleting the two elements $(i_0, \mu_0, l_0)$, $(i_1, \mu_1, l_1)$ of $L_U$ from $\Gamma_+$ that gave rise to the factors $U_{\alpha(x),\alpha(x')}$, $U_{\alpha(x),\alpha(x'')}$ in (6.5), unless these elements also lie in $L_V$, in which case one deletes them just from $L_U$ but leaves them in $L_V$ and $\Gamma_+$; one also decrements the labels of any subsequent $(i_0, \mu_0, l)$, $(i_1, \mu_1, l)$ accordingly (see Figure 9). One observes that $|\Gamma'''| - |\Omega'''| \ge |\Gamma| - |\Omega| - 1$, $|J'''| + |K'''| < |J| + |K|$, and $|J'''| + |K'''| + |L_U''' \cap L_V'''| < |J| + |K| + |L_U \cap L_V|$, and so this term also is controlled by the induction hypothesis. (Note we need to use the additional $\rho$ factor (which is less than $r_\mu/n$) in order to make up for a possible decrease in $|\Gamma| - |\Omega|$ by 1.)
This deals with the case when there are two U terms involving $\alpha(x)$.
Subcase 2: (6.5) contains a term $U_{\alpha(x),\alpha(x')}$ and a term $E_{\alpha(x),\beta(y)}$.
A typical case here is depicted in Figure 10.
The strategy here is similar to Subcase 1, except that one uses (6.3) instead of (6.1). Letting $\tilde J, \tilde\alpha, a$ be as before, we can express (6.5) as
$$\sum_{\tilde\alpha,\beta} \big[\dots\big] \Big[\sum_{a \in [n] \setminus \tilde\alpha(\tilde J)} U_{a,\tilde\alpha(x')}\, E_{a,\beta(y)}\Big]$$
where the $\dots$ denotes the product of all the terms in (6.5) other than $U_{\alpha(x),\alpha(x')}$ and $E_{\alpha(x),\beta(y)}$, but with $\alpha$ replaced by $\tilde\alpha$, and $\tilde\alpha, \beta$ ranging over injections from $\tilde J$ and $K$ to $[n]$ respectively.
From (6.3) we have
$$\sum_{a \in [n]} U_{a,\tilde\alpha(x')}\, E_{a,\beta(y)} = (1 - \rho)\, E_{\tilde\alpha(x'),\beta(y)}$$
and hence
$$\sum_{a \in [n] \setminus \tilde\alpha(\tilde J)} U_{a,\tilde\alpha(x')}\, E_{a,\beta(y)} = (1 - \rho)\, E_{\tilde\alpha(x'),\beta(y)} - \sum_{\tilde x \in \tilde J} U_{\tilde\alpha(\tilde x),\tilde\alpha(x')}\, E_{\tilde\alpha(\tilde x),\beta(y)}. \qquad (6.11)$$
Figure 10: A configuration involving a U and E factor on the left. After applying (6.3), one gets some terms associated to configurations such as those in the upper right, in which the x row has been deleted and replaced with another existing row x̃, plus a term coming from a configuration in the lower right, in which the U E terms have been collapsed to a single E term.
The contribution of the final terms in (6.11) is treated in exactly the same way as the final terms in (6.10), and the main term $E_{\tilde\alpha(x'),\beta(y)}$ is treated in exactly the same way as the term $U_{\tilde\alpha(x'),\tilde\alpha(x'')}$ in (6.10). This concludes the treatment of the case when there is one U term and one E term involving $\alpha(x)$.
Subcase 3: (6.5) contains two terms $E_{\alpha(x),\beta(y)}$, $E_{\alpha(x),\beta(y')}$.
A typical case here is depicted in Figure 11. The strategy here is similar to that in the previous two subcases, but now one uses (6.4) rather than (6.1). The combinatorics of the situation are, however, slightly different.
By considering the path from $E_{\alpha(x),\beta(y)}$ to $E_{\alpha(x),\beta(y')}$ along the spider, we see (from the hypothesis $\tau_x = 2$) that this path must be completely horizontal (with no elements of $L_U$ present), and the two legs of the spider that give rise to $E_{\alpha(x),\beta(y)}$, $E_{\alpha(x),\beta(y')}$ at their tips must be adjacent, with their bases connected by a horizontal line segment. In other words, up to interchange of $y$ and $y'$, and cyclic permutation of the $[j]$ indices, we may assume that
$$(x, y) = \big(s(1,1,k(1,1)),\, t(1,1,k(1,1))\big); \qquad (x, y') = \big(s(2,0,k(2,0)),\, t(2,0,k(2,0))\big)$$
with
$$s(1,1,l) = s(2,0,l') = x$$
for all $0 \le l \le k(1,1)$ and $0 \le l' \le k(2,0)$, where the index 2 is understood to be identified with 1 in the degenerate case $j = 1$. Also, $L_U$ cannot contain any triple of the form $(1,1,l)$ for $l \in [k(1,1)]$ or $(2,0,l')$ for $l' \in [k(2,0)]$ (and so all these triples lie in $L_V$ instead).
For technical reasons we need to deal with the degenerate case $j = 1$ separately. In this case, $s$ is identically equal to $x$, and so (6.5) simplifies to
$$\sum_{\beta} \Big[\sum_{a \in [n]} E_{a,\beta(y)}\, E_{a,\beta(y')}\Big] \prod_{\mu=0}^{1} \prod_{l \in [k(1,\mu)]} V_{\beta(t(1,\mu,l-1)),\,\beta(t(1,\mu,l))}.$$
In the extreme degenerate case when $k(1,0) = k(1,1) = 0$, the sum is just $\sum_{a,b \in [n]} E_{ab}^2 = r$, which is acceptable, so we may assume that $k(1,0) + k(1,1) > 0$. We may assume that the column multiplicity $\tau_{\tilde y} \ge 4$ for every $\tilde y \in K$, since otherwise we could use (the reflected form of) one of the previous two subcases to conclude (6.6) from the induction hypothesis. (Note when $y = y'$, it is not possible for $\tau_y$ to equal 2 since $k(1,0) + k(1,1) > 0$.)
The number of possible $\beta$ is at most $n^{|K|}$, so to establish (6.6) in this case it suffices to show that
$$n^{|K|}\, (r_\mu/n)\, \big(\sqrt{r_\mu}/n\big)^{k(1,0)+k(1,1)} \lesssim (r_\mu/n)^{|\Gamma| - |\Omega|}\, n.$$
Observe that in this degenerate case $j = 1$, we have $|\Omega| = |K|$ and $|\Gamma| = k(1,0) + k(1,1) + 2$. One then checks that the claim is true when $r_\mu = 1$, so it suffices to check the other extreme case $r_\mu = n$, i.e.
$$|K| - \frac{1}{2}\big(k(1,0) + k(1,1)\big) \le 1.$$
But as $\tau_y \ge 4$ for all $y \in K$, every element in $K$ must be visited at least twice, and the claim follows.
Now we deal with the non-degenerate case $j > 1$. Letting $\tilde J, \tilde\alpha, a$ be as in previous subcases, we can express (6.5) as
$$\sum_{\tilde\alpha,\beta} \big[\dots\big] \Big[\sum_{a \in [n] \setminus \tilde\alpha(\tilde J)} E_{a,\beta(y)}\, E_{a,\beta(y')}\Big] \qquad (6.12)$$
where the $\dots$ denotes the product of all the terms in (6.5) other than $E_{\alpha(x),\beta(y)}$ and $E_{\alpha(x),\beta(y')}$, but with $\alpha$ replaced by $\tilde\alpha$, and $\tilde\alpha, \beta$ ranging over injections from $\tilde J$ and $K$ to $[n]$ respectively.
From (6.4), we have
$$\sum_{a \in [n]} E_{a,\beta(y)}\, E_{a,\beta(y')} = V_{\beta(y),\beta(y')} + \rho\, 1_{y=y'}$$
and hence
$$\sum_{a \in [n] \setminus \tilde\alpha(\tilde J)} E_{a,\beta(y)}\, E_{a,\beta(y')} = V_{\beta(y),\beta(y')} + \rho\, 1_{y=y'} - \sum_{\tilde x \in \tilde J} E_{\tilde\alpha(\tilde x),\beta(y)}\, E_{\tilde\alpha(\tilde x),\beta(y')}. \qquad (6.13)$$
The final terms are treated here in exactly the same way as the final terms in (6.10) or (6.11).
Now we consider the main term $V_{\beta(y),\beta(y')}$. The contribution of this term will be of the form $X_{C'}$, where the configuration $C'$ is formed from $C$ by "detaching" the two legs $(i,\mu) = (1,1), (2,0)$ from the spider, "gluing them together" at the tips using the $V_{\beta(y),\beta(y')}$ term, and then "inserting" those two legs into the base of the $(i,\mu) = (1,0)$ leg. To explain this procedure more formally, observe that the $\dots$ term in (6.12) can be expanded further (isolating out the terms coming from $(i,\mu) = (1,1), (2,0)$) as
$$\Big[\prod_{l=1}^{k(2,0)} V_{\beta(t(2,0,l-1)),\,\beta(t(2,0,l))}\Big] \Big[\prod_{l=k(1,1)}^{1} V_{\beta(t(1,1,l-1)),\,\beta(t(1,1,l))}\Big] \dots$$
where the $\dots$ now denote all the terms that do not come from $(i,\mu) = (1,1)$ or $(i,\mu) = (2,0)$, and we have reversed the order of the second product for reasons that will be clearer later. Recalling
that $y = t(1,1,k(1,1))$ and $y' = t(2,0,k(2,0))$, we see that the contribution of the first term of (6.13) to (6.12) is now of the form
$$\sum_{\tilde\alpha,\beta} \Big[\prod_{l=1}^{k(2,0)} V_{\beta(t(2,0,l-1)),\,\beta(t(2,0,l))}\Big]\, V_{\beta(t(2,0,k(2,0))),\,\beta(t(1,1,k(1,1)))}\, \Big[\prod_{l=k(1,1)}^{1} V_{\beta(t(1,1,l-1)),\,\beta(t(1,1,l))}\Big] \dots.$$
But this expression is simply $X_{C'}$, where the configuration $C'$ is formed from $C$ in the following fashion:
• $j'$ is equal to $j - 1$, $J'$ is equal to $\tilde J$, and $K'$ is equal to $K$.
• $k'(1,0) := k(2,0) + 1 + k(1,1) + k(1,0)$, and $k'(i,\mu) := k(i+1,\mu)$ for $(i,\mu) \neq (1,0)$.
• The path $\{(s'(1,0,l), t'(1,0,l)) : l = 0, \dots, k'(1,0)\}$ is formed by concatenating the path $\{(s(1,0,0), t(2,0,l)) : l = 0, \dots, k(2,0)\}$, with an edge from $(s(1,0,0), t(2,0,k(2,0)))$ to $(s(1,0,0), t(1,1,k(1,1)))$, with the path $\{(s(1,0,0), t(1,1,l)) : l = k(1,1), \dots, 0\}$, with the path $\{(s(1,0,l), t(1,0,l)) : l = 0, \dots, k(1,0)\}$.
• For any $(i,\mu) \neq (1,0)$, the path $\{(s'(i,\mu,l), t'(i,\mu,l)) : l = 0, \dots, k'(i,\mu)\}$ is equal to the path $\{(s(i+1,\mu,l), t(i+1,\mu,l)) : l = 0, \dots, k(i+1,\mu)\}$.
• We have
and
We have now made the maximum use we can of the cancellation identities (6.1), (6.3), (6.4), and have no further use for them. Instead, we shall now place absolute values everywhere and estimate $X_C$ using (1.9), (1.8a), (1.8b), obtaining the bound
$$|X_C| \le n^{|J|+|K|}\, O\big(\sqrt{r_\mu}/n\big)^{|\Gamma| + |L_U \cap L_V|}.$$
Comparing this with (6.6), we see that it will suffice (by taking $C_0$ large enough) to show that
$$n^{|J|+|K|}\, \big(\sqrt{r_\mu}/n\big)^{|\Gamma| + |L_U \cap L_V|} \le (r_\mu/n)^{|\Gamma| - |\Omega|}\, n.$$
Using the extreme cases $r_\mu = 1$ and $r_\mu = n$ as test cases, we see that our task is to show that
and
$$|J| + |K| \le \frac{1}{2}\big(|\Gamma| + |L_U \cap L_V|\big) + 1. \qquad (6.17)$$
The first inequality (6.16) is proven by Lemma 5.1. The second is a consequence of the double counting identity
$$4(|J| + |K|) \le \sum_{x \in J} \tau_x + \sum_{y \in K} \tau_y = 2|\Gamma| + 2|L_U \cap L_V|,$$
where the inequality follows from (6.14)–(6.15) (and we do not even need the +1 in this case).
7 Discussion
Interestingly, there is an emerging literature on the development of efficient algorithms for solving
the nuclear-norm minimization problem (1.3) [6, 17]. For instance, in [6], the authors show that
the singular-value thresholding algorithm can solve certain problem instances in which the matrix
has close to a billion unknown entries in a matter of minutes on a personal computer. Hence, the
8 Appendix
8.1 Equivalence between the uniform and Bernoulli models
8.1.1 Lower bounds
For the sake of completeness, we explain how Theorem 1.7 implies nearly identical results for the uniform model. We have established the lower bound by showing that there are two fixed matrices $M \neq M'$ for which $P_\Omega(M) = P_\Omega(M')$ with probability greater than $\delta$ unless $m$ obeys the bound (1.20). Suppose that $\Omega$ is sampled according to the Bernoulli model with $p_0 \ge m/n^2$ and let $F$ be the event $\{P_\Omega(M) = P_\Omega(M')\}$. Then
$$\mathbb{P}(F) = \sum_{k=0}^{n^2} \mathbb{P}(F \mid |\Omega| = k)\, \mathbb{P}(|\Omega| = k) \le \sum_{k=0}^{m-1} \mathbb{P}(|\Omega| = k) + \sum_{k=m}^{n^2} \mathbb{P}(F \mid |\Omega| = k)\, \mathbb{P}(|\Omega| = k) \le \mathbb{P}(|\Omega| < m) + \mathbb{P}(F \mid |\Omega| = m),$$
where we have used the fact that for $k \ge m$, $\mathbb{P}(F \mid |\Omega| = m) \ge \mathbb{P}(F \mid |\Omega| = k)$. The conditional distribution of $\Omega$ given its cardinality is uniform and, therefore,
in which $\mathbb{P}_{\mathrm{Unif}(m)}$ and $\mathbb{P}_{\mathrm{Ber}(p_0)}$ are probabilities calculated under the uniform and Bernoulli models. If we choose $p_0 = 2m/n^2$, we have that $\mathbb{P}_{\mathrm{Ber}(p_0)}(|\Omega| < m) \le \delta/2$ provided $\delta$ is not ridiculously small. Thus if $\mathbb{P}_{\mathrm{Ber}(p_0)}(F) \ge \delta$, we have
$$\mathbb{P}_{\mathrm{Unif}(m)}(F) \ge \delta/2.$$
In short, we get a lower bound for the uniform model by applying the bound for the Bernoulli model with a value of $p = 2m/n^2$ and a probability of failure equal to $2\delta$.
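The step $\mathbb{P}_{\mathrm{Ber}(p_0)}(|\Omega| < m) \le \delta/2$ is just a binomial tail bound; with illustrative (hypothetical) sizes $n^2 = 100$ and $m = 10$, the exact left tail at rate $p_0 = 2m/n^2$ is already small:

```python
from math import comb

n2, m = 100, 10          # n^2 entries, target sample size m (illustrative values)
p0 = 2 * m / n2          # Bernoulli rate, so E|Omega| = 2m

# exact P(|Omega| < m) for |Omega| ~ Binomial(n2, p0)
tail = sum(comb(n2, k) * p0 ** k * (1 - p0) ** (n2 - k) for k in range(m))

# with mean 2m = 20, seeing fewer than m = 10 successes is very unlikely
assert tail < 0.05
```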
where, starting from $\alpha_0^{(0)} = 1$, the sequences $\{\alpha^{(k)}\}$, $\{\beta^{(k)}\}$, $\{\gamma^{(k)}\}$ and $\{\delta^{(k)}\}$ are inductively defined via
$$\alpha_j^{(k+1)} = \big[\alpha_{j-1}^{(k)} + (1-\rho_0)\gamma_{j-1}^{(k)}\big] + \frac{\rho_0(1-2p)}{p}\big[\alpha_j^{(k)} + (1-\rho_0)\gamma_j^{(k)}\big] + 1_{j=0}\,\rho_0\big[\beta_0^{(k)} + (1-\rho_0)\delta_0^{(k)}\big],$$
$$\beta_j^{(k+1)} = \big[\beta_{j-1}^{(k)} + (1-\rho_0)\delta_{j-1}^{(k)}\big] + \frac{\rho_0(1-2p)}{p}\big[\beta_j^{(k)} + (1-\rho_0)\delta_j^{(k)}\big] 1_{j>0} + 1_{j=0}\,\frac{\rho_0(1-p)}{p}\big[\alpha_0^{(k)} + (1-\rho_0)\gamma_0^{(k)}\big],$$
and
$$\gamma_j^{(k+1)} = \frac{\rho_0(1-p)}{p}\big[\alpha_{j+1}^{(k)} + (1-\rho_0)\gamma_{j+1}^{(k)}\big], \qquad \delta_j^{(k+1)} = \frac{\rho_0(1-p)}{p}\big[\beta_{j+1}^{(k)} + (1-\rho_0)\delta_{j+1}^{(k)}\big].$$
In the above recurrence relations, we adopt the convention that $\alpha_j^{(k)} = 0$ whenever $j$ is not in the range specified by (8.2), and similarly for $\beta_j^{(k)}$, $\gamma_j^{(k)}$ and $\delta_j^{(k)}$.
and
$$Q_\Omega (Q_\Omega Q_T)^j = \begin{cases} Q_\Omega, & j = 0,\\[4pt] \dfrac{1-2p}{p}\,(Q_\Omega Q_T)^j + \dfrac{1-p}{p}\, Q_T (Q_\Omega Q_T)^{j-1}, & j > 0, \end{cases}$$
which both follow from (3.13), gives the desired recurrence relations. The calculation is rather straightforward and omitted.
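The second identity reduces to $Q_\Omega^2 = \frac{1-2p}{p} Q_\Omega + \frac{1-p}{p} I$, which holds deterministically for any mask under the normalization $Q_\Omega = p^{-1} P_\Omega - I$ (assumed here, following (3.13)); a quick check:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 6, 0.4
mask = (rng.random((n, n)) < p).astype(float)   # indicator of the set Omega
M = rng.standard_normal((n, n))

def QO(X):
    # Q_Omega = p^{-1} P_Omega - I, acting entrywise (assumed normalization)
    return mask * X / p - X

lhs = QO(QO(M))
rhs = (1 - 2 * p) / p * QO(M) + (1 - p) / p * M
assert np.allclose(lhs, rhs)
```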
We note that the recurrence relations give $\alpha_k^{(k)} = 1$ for all $k \ge 0$,
$$\gamma_{k-2}^{(k)} = \frac{\rho_0(1-p)}{p}\,\alpha_{k-1}^{(k-1)} = \frac{\rho_0(1-p)}{p},$$
$$\delta_{k-3}^{(k)} = \frac{\rho_0(1-p)}{p}\,\beta_{k-2}^{(k-1)} = \Big(\frac{\rho_0(1-p)}{p}\Big)^2,$$
Lemma 8.2 Put $\lambda = \rho_0/p$ and observe that by assumption (1.22), $\lambda < 1$. Then for all $j, k \ge 0$, we have
$$\max\big(|\alpha_j^{(k)}|, |\beta_j^{(k)}|, |\gamma_j^{(k)}|, |\delta_j^{(k)}|\big) \le \lambda^{\lceil \frac{k-j}{2} \rceil}\, 4^k. \qquad (8.3)$$
Proof We prove the lemma by induction on $k$. The claim is true for $k = 0$. Suppose it is true up to $k$; we then use the recurrence relations given by Lemma 8.1 to establish the claim up to $k + 1$. In detail, since $|1 - \rho_0| < 1$, $\rho_0 < \lambda$ and $|1 - 2p| < 1$, the recurrence relation for $\alpha^{(k+1)}$ gives
$$|\alpha_j^{(k+1)}| \le |\alpha_{j-1}^{(k)}| + |\gamma_{j-1}^{(k)}| + \lambda\big[|\alpha_j^{(k)}| + |\gamma_j^{(k)}|\big] + 1_{j=0}\,\lambda\big[|\beta_0^{(k)}| + |\delta_0^{(k)}|\big]$$
$$\le 2\lambda^{\lceil \frac{k+1-j}{2} \rceil} 4^k\, 1_{j>0} + 2\lambda^{\lceil \frac{k-j}{2} \rceil + 1} 4^k + 2\lambda^{\lceil \frac{k}{2} \rceil + 1} 4^k\, 1_{j=0}$$
$$\le 2\lambda^{\lceil \frac{k+1-j}{2} \rceil} 4^k\, 1_{j>0} + 2\lambda^{\lceil \frac{k+1-j}{2} \rceil} 4^k + 2\lambda^{\lceil \frac{k+1}{2} \rceil} 4^k\, 1_{j=0}$$
$$\le \lambda^{\lceil \frac{k+1-j}{2} \rceil}\, 4^{k+1},$$
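The exponent bookkeeping in this induction step rests on the elementary inequality $\lceil \frac{k-j}{2} \rceil + 1 \ge \lceil \frac{k+1-j}{2} \rceil$ (so that, for $0 < \lambda < 1$, $\lambda^{\lceil (k-j)/2 \rceil + 1} \le \lambda^{\lceil (k+1-j)/2 \rceil}$), which can be checked exhaustively over a small range:

```python
import math

# ceil((k-j)/2) + 1 >= ceil((k+1-j)/2) for all integers k, j in range
for k in range(0, 20):
    for j in range(0, k + 2):
        assert math.ceil((k - j) / 2) + 1 >= math.ceil((k + 1 - j) / 2)
```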
QT = PT − ρ0 I = (I − PT ⊥ ) − ρ0 I = (1 − ρ0 )I − PT ⊥ ,
Now
$$\|(Q_\Omega P_T)^k Q_\Omega(E)\| \le \sum_{j=0}^{k} |\alpha_j^{(k)}|\, \|(Q_\Omega Q_T)^j Q_\Omega(E)\| + \sum_{j=0}^{k-1} |\beta_j^{(k)}|\, \|(Q_\Omega Q_T)^j(E)\| + \sum_{j=0}^{k-2} |\gamma_j^{(k)}|\, \|Q_T (Q_\Omega Q_T)^j Q_\Omega(E)\| + \sum_{j=0}^{k-3} |\delta_j^{(k)}|\, \|Q_T (Q_\Omega Q_T)^j(E)\|,$$
since QT (E) = (1 − ρ0 )(E). By using the size estimates given by Lemma 8.2 on the coefficients, we
have
$$\|(Q_\Omega P_T)^k Q_\Omega(E)\| \le \frac{1}{3}\sigma^{\frac{k+1}{2}} + \frac{1}{3}\, 4^k \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil} \sigma^{\frac{j+1}{2}} + 4^k \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil} \sigma^{\frac{j}{2}}$$
$$\le \frac{1}{3}\sigma^{\frac{k+1}{2}} + 4^k \sigma^{\frac{k+1}{2}} \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil} \sigma^{-\frac{k-j}{2}} + 4^k \sigma^{\frac{k}{2}} \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil} \sigma^{-\frac{k-j}{2}}$$
$$\le \frac{1}{3}\sigma^{\frac{k+1}{2}} + 4^k \big(\sigma^{\frac{k+1}{2}} + \sigma^{\frac{k}{2}}\big) \sum_{j=0}^{k-1} \lambda^{\lceil \frac{k-j}{2} \rceil} \sigma^{-\frac{k-j}{2}},$$
where the last inequality holds provided that $4\lambda \le \sigma^{3/2}$. The conclusion is
$$\|(Q_\Omega P_T)^k Q_\Omega(E)\| \le \big(1 + 4^{k+1}\big)\,\sigma^{\frac{k+1}{2}}.$$
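The geometric-series step can be spot-checked at the boundary of the assumption $4\lambda = \sigma^{3/2}$; the following sketch (illustrative values $\sigma = 1/2$, $k = 5$) confirms that the chain of inequalities lands below $(1 + 4^{k+1})\sigma^{(k+1)/2}$:

```python
import math

sigma, k = 0.5, 5
lam = sigma ** 1.5 / 4          # boundary case of the assumption 4*lam <= sigma^{3/2}

# last displayed bound, writing i = k - j for the summation index
lhs = (sigma ** ((k + 1) / 2) / 3
       + 4 ** k * (sigma ** ((k + 1) / 2) + sigma ** (k / 2))
         * sum(lam ** math.ceil(i / 2) * sigma ** (-i / 2) for i in range(1, k + 1)))

rhs = (1 + 4 ** (k + 1)) * sigma ** ((k + 1) / 2)
assert lhs <= rhs
```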
Acknowledgements
E. C. is supported by ONR grants N00014-09-1-0469 and N00014-08-1-0749 and by the Waterman
Award from NSF. E. C. would like to thank Xiaodong Li and Chiara Sabatti for helpful conversa-
tions related to this project. T. T. is supported by a grant from the MacArthur Foundation, by
NSF grant DMS-0649473, and by the NSF Waterman award.
References
[1] J. Abernethy, F. Bach, T. Evgeniou, and J.-P. Vert. Low-rank matrix factorization with attributes.
Technical Report N24/06/MM, Ecole des Mines de Paris, 2006.
[2] Y. Amit, M. Fink, N. Srebro, and S. Ullman. Uncovering shared structures in multiclass classification.
Proceedings of the Twenty-fourth International Conference on Machine Learning, 2007.
[3] A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. Neural Information Processing
Systems, 2007.
[4] A. Barvinok. A course in convexity, volume 54 of Graduate Studies in Mathematics. American Mathe-
matical Society, Providence, RI, 2002.
[5] P. Biswas, T-C. Lian, T-C. Wang, and Y. Ye. Semidefinite programming based algorithms for sensor
network localization. ACM Trans. Sen. Netw., 2(2):188–220, 2006.
[6] J-F. Cai, E. J. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion.
Technical report, 2008. Preprint available at http://arxiv.org/abs/0810.3286.
[7] E. J. Candès and B. Recht. Exact Matrix Completion via Convex Optimization. To appear in Found.
of Comput. Math., 2008.
[8] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from
highly incomplete frequency information. IEEE Trans. Inform. Theory, 52(2):489–509, 2006.
[9] P. Chen and D. Suter. Recovering the missing components in a large noisy low-rank matrix: application
to SFM source. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):1051–1063,
2004.
[10] V. H. de la Peña and S. J. Montgomery-Smith. Decoupling inequalities for the tail probabilities of
multivariate U -statistics. Ann. Probab., 23(2):806–816, 1995.
[11] M. Fazel, H. Hindi, and S. Boyd. Log-det heuristic for matrix rank minimization with applications to
Hankel and Euclidean distance matrices. Proc. Am. Control Conf, June 2003.
[12] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry. Using collaborative filtering to weave an information
tapestry. Communications of the ACM, 35:61–70, 1992.
arXiv:0903.3131v1 [cs.IT] 18 Mar 2009

Abstract—On the heels of compressed sensing, a remarkable new field has very recently emerged. This field addresses a broad range of problems of significant practical interest, namely, the recovery of a data matrix from what …

… of the rapidly developing field of compressed sensing, and is already changing the way engineers think about data acquisition, hence this special issue and others, see [2] for example. Concretely, if a signal has a sparse …
then M is the unique solution to (II.4) with probability at least 1 − n⁻³. In other words: with high probability, nuclear-norm minimization recovers all the entries of M with no error.

As a side remark, one can obtain a probability of success at least 1 − n⁻β for β > 0 by taking C in (II.6) of the form C′β for some universal constant C′.

An n1 × n2 matrix of rank r depends upon r(n1 + n2 − r) degrees of freedom¹. When r is small, the number of degrees of freedom is much less than n1n2 and this is the reason why subsampling is possible. (In compressed sensing, the number of degrees of freedom corresponds to the sparsity of the signal; i.e. the number of nonzero entries.) What is remarkable here is that exact recovery by nuclear norm minimization occurs as soon as the sample size exceeds the number of degrees of freedom by a couple of logarithmic factors. Further, observe that if Ω completely misses one of the rows (e.g. one has no rating about one user) or one of the columns (e.g. one has no rating about one movie), then one cannot hope to recover even a matrix of rank 1 of the form M = xy∗. Thus one needs to sample every row (and also every column) of the matrix. When Ω is sampled at random, it is well established that one needs …

These conditions do not assume anything about the singular values. As we will see, incoherent matrices with a small value of the strong incoherence parameter µ can be recovered from a minimal set of entries. Before we state this result, it is important to note that many model matrices obey the strong incoherence property with a small value of µ.

• Suppose the singular vectors obey (II.2) with µB = O(1) (which informally says that the singular vectors are not spiky); then, with the exception of a very few peculiar matrices, M obeys the strong incoherence property with µ = O(√(log n)).

• Assume that the column matrices [u1, . . . , ur] and [v1, . . . , vr] are independent random orthogonal matrices; then with high probability M obeys the strong incoherence property with µ = O(√(log n)), at least when r ≥ log n so as to avoid small-sample effects.

The sampling result below is general, nonasymptotic and optimal up to a few logarithmic factors.

Theorem 2: [12] With the same notations as in Theorem 1, there is a numerical constant C such that if

m ≥ C µ² nr log⁶ n, (II.8)

then M is the unique solution to (II.4) with probability at least 1 − n⁻³.

¹This can be seen by counting the degrees of freedom in the singular value decomposition.
III-B. To this end, Figure 2 plots three curves for varying values of n, p, and r: 1) the RMS error introduced above, 2) the RMS error achievable when the oracle reveals …

V. DISCUSSION

This paper reviewed and developed some new results about matrix completion. By and large, low-rank matrix recovery is a field in complete infancy abounding with interesting and open questions, and if the recent …

Fig. 2: Comparison between the recovery error, the oracle error times 1.68, and the estimated oracle error times 1.68. Each point on the plot corresponds to an average over 20 trials. Top: in this experiment, n = 600, r = 2 and p varies. The x-axis is the number of measurements per degree of freedom (df). Middle: n varies whereas r = 2, p = .2. Bottom: n = 600, r varies and p = .2.

⁴The number 2 is somewhat arbitrary here, although we picked it because there is a large drop-off in the size of the singular values after the second. If, for example, M10 is the best rank-10 approximation, then ‖M10 − M‖F /‖M‖F = .081.
eriksson@cs.bu.edu
Robert Nowak
University of Wisconsin - Madison
nowak@ece.wisc.edu
December 2011
Abstract
This paper considers the problem of completing a matrix with many missing entries under the assumption that
the columns of the matrix belong to a union of multiple low-rank subspaces. This generalizes the standard low-rank
matrix completion problem to situations in which the matrix rank can be quite high or even full rank. Since the
columns belong to a union of subspaces, this problem may also be viewed as a missing-data version of the subspace
clustering problem. Let X be an n × N matrix whose (complete) columns lie in a union of at most k subspaces, each
of rank ≤ r < n, and assume N ≫ kn. The main result of the paper shows that under mild assumptions each column
of X can be perfectly recovered with high probability from an incomplete version so long as at least CrN log²(n)
entries of X are observed uniformly at random, with C > 1 a constant depending on the usual incoherence conditions,
the geometrical arrangement of subspaces, and the distribution of columns over the subspaces. The result is illustrated
with numerical experiments and an application to Internet distance matrix completion and topology identification.
1 Introduction
Consider a real-valued n × N dimensional matrix X. Assume that the columns of X lie in the union of at most k
subspaces of Rn , each having dimension at most r < n and assume that N > kn. We are especially interested
in “high-rank” situations in which the total rank (the rank of the union of the subspaces) may be n. Our goal is to
complete X based on an observation of a small random subset of its entries. We propose a novel method for this
matrix completion problem. In the applications we have in mind N may be arbitrarily large, and so we will focus on
quantifying the probability that a given column is perfectly completed, rather than the probability that the whole matrix is
perfectly completed (i.e., every column is perfectly completed). Of course it is possible to translate between these two
quantifications using a union bound, but that bound becomes meaningless if N is extremely large.
Suppose the entries of X are observed uniformly at random with probability p0 . Let Ω denote the set of indices
of observed entries and let XΩ denote the observations of X. Our main result shows that under a mild set of assump-
tions each column of X can be perfectly recovered from XΩ with high probability using a computationally efficient
procedure if
p0 ≥ C (r/n) log²(n) (1)
where C > 1 is a constant depending on the usual incoherence conditions as well as the geometrical arrangement of
subspaces and the distribution of the columns in the subspaces.
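This setup is easy to simulate. The sketch below (with illustrative parameter values, not the paper's) generates an n × N matrix whose columns lie in a union of k rank-r subspaces, normalizes the columns to unit norm as in assumption A1 below, and reveals each entry independently with probability p0:

```python
import numpy as np

def union_of_subspaces_data(n=50, N=600, k=3, r=2, p0=0.3, seed=1):
    """Columns of X lie in a union of k rank-r subspaces of R^n; each entry
    is observed independently with probability p0 (mask marks Omega)."""
    rng = np.random.default_rng(seed)
    bases = [np.linalg.qr(rng.standard_normal((n, r)))[0] for _ in range(k)]
    labels = rng.integers(0, k, size=N)            # subspace of each column
    coeffs = rng.standard_normal((r, N))
    X = np.column_stack([bases[labels[j]] @ coeffs[:, j] for j in range(N)])
    X /= np.linalg.norm(X, axis=0)                 # unit-norm columns (A1)
    mask = rng.random((n, N)) < p0                 # observed entries
    return X, mask, labels
```

Note that while each subspace has rank at most r, the full matrix can have rank up to kr, which is the "high-rank" regime the paper targets.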
∗ The first two authors contributed equally to this paper.
High-Rank Matrix Completion and Subspace Clustering with Missing Data 101
1.1 Connections to Low-Rank Completion
Low-rank matrix completion theory [1] shows that an n × N matrix of rank r can be recovered from incomplete obser-
vations, as long as the number of entries observed (with locations sampled uniformly at random) exceeds rN log² N
(within a constant factor and assuming n ≤ N ). It is also known that, in the same setting, completion is impossible if
the number of observed entries is less than a constant times rN log N [2]. These results imply that if the rank of X is
close to n, then all of the entries are needed in order to determine the matrix.
Here we consider a matrix whose columns lie in the union of at most k subspaces of Rn. Restricting the rank of each subspace to at most r, the rank of the full matrix in our situation could be as large as kr, yielding the requirement krN log² N using current matrix completion theory. In contrast, the bound in (1) implies that the completion of
each column is possible from a constant times rN log² n entries sampled uniformly at random. Exact completion of
every column can be guaranteed by replacing log² n with log² N in this bound, but since we allow N to be very large
we prefer to state our result in terms of per-column completion. Our method, therefore, improves significantly upon
conventional low-rank matrix completion, especially when k is large. This does not contradict the lower bound in [2],
because the matrices we consider are not arbitrary high-rank matrices, rather the columns must belong to a union of
rank ≤ r subspaces.
Our work builds upon the results of [12], which quantifies the deviation of an incomplete vector norm with respect
to the incoherence of the sampling pattern. While this work also examines subspace detection using incomplete data,
it assumes complete knowledge of the subspaces.
While research that examines subspace learning has been presented in [13], the work in this paper differs in its concentration on learning from incomplete observations (i.e., when there are missing elements in the matrix), and in its methodological focus (i.e., nearest-neighbor clustering versus a multiscale Singular Value Decomposition approach).
Figure 1: Example of nearest-neighborhood selection of points from a single subspace. For illustration, samples from three one-
dimensional subspaces are depicted as small dots. The large dot is the seed. The subset of samples with significant observed support
in common with that of the seed are depicted by ∗’s. If the density of points is high enough, then the nearest neighbors we identify
will belong to the same subspace as the seed. In this case we depict the ball containing the 3 nearest neighbors of the seed with
significant support overlap.
2 Key Assumptions and Main Result
The notion of incoherence plays a key role in matrix completion and subspace recovery from incomplete observations.
Definition 1. The coherence of an r-dimensional subspace S ⊆ Rn is

µ(S) := (n/r) max_j ‖PS ej‖₂²,

where PS is the projection operator onto S and {ej} are the canonical unit vectors for Rn.
Note that 1 ≤ µ(S) ≤ n/r. The coherence of a single vector x ∈ Rn is µ(x) = n‖x‖∞²/‖x‖₂², which is precisely the coherence of the one-dimensional subspace spanned by x. With this definition, we can state the main assumptions we
make about the matrix X.
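As an aside, both coherence quantities are straightforward to compute from an orthonormal basis, since ‖PS ej‖₂ is simply the norm of the j-th row of the basis matrix (a minimal sketch; function names are ours):

```python
import numpy as np

def subspace_coherence(U):
    """mu(S) = (n/r) * max_j ||P_S e_j||_2^2 for S = span(U), where U is
    n x r with orthonormal columns, so P_S = U U^T and ||P_S e_j||_2 is
    the norm of the j-th row of U."""
    n, r = U.shape
    return (n / r) * np.max(np.sum(U**2, axis=1))

def vector_coherence(x):
    """mu(x) = n * ||x||_inf^2 / ||x||_2^2, the coherence of span{x}."""
    return x.size * np.max(np.abs(x))**2 / np.sum(x**2)
```

A canonical basis vector is maximally coherent (µ = n), while a perfectly flat vector attains the minimum µ = 1.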
A1. The columns of X lie in the union of at most k subspaces, with k = o(nd ) for some d > 0. The subspaces are
denoted by S1 , . . . , Sk and each has rank at most r < n. The ℓ2 -norm of each column is ≤ 1.
A2. The coherence of each subspace is bounded above by µ0 . The coherence of each column is bounded above by µ1
and for any pair of columns, x1 and x2 , the coherence of x1 − x2 is also bounded above by µ1 .
A3. The columns of X do not lie in the intersection(s) of the subspaces with probability 1, and if rank(Si ) = ri , then
any subset of ri columns from Si spans Si with probability 1. Let 0 < ǫ0 < 1 and Si,ǫ0 denote the subset of
points in Si at least ǫ0 distance away from any other subspace. There exists a constant 0 < ν0 ≤ 1, depending
on ǫ0 , such that
(i) The probability that a column selected uniformly at random belongs to Si,ǫ0 is at least ν0 /k.
(ii) If x ∈ Si,ǫ0, then the probability that a column selected uniformly at random belongs to the ball of radius ǫ0 centered at x is at least ν0 ǫ0^r / k.
The conditions of A3 are met if, for example, the columns are drawn from a mixture of continuous distributions on
each of the subspaces. The value of ν0 depends on the geometrical arrangement of the subspaces and the distribution
of the columns within the subspaces. If the subspaces are not too close to each other, and the distributions within
the subspaces are fairly uniform, then typically ν0 will not be too close to 0. We define three key quantities: the confidence parameter δ0, the required number of “seed” columns s0, and a quantity ℓ0 related to the neighborhood formation process (see Algorithm 1 in Section 3):

δ0 := n^{2−2β} log^{1/2}(n), for some β > 1, (2)

s0 := k(log k + log(1/δ0)) / ((1 − e^{−4}) ν0),

ℓ0 := ⌈ max{ 2k / (ν0 (ǫ0/√3)^r) , 8k log(s0/δ0) / (n ν0 (ǫ0/√3)^r) } ⌉.
then each column of X can be perfectly recovered with probability at least 1 − (6 + 15s0 ) δ0 , using the methodology
sketched above (and detailed later in the paper).
The requirements on sampling are essentially the same as those for standard low-rank matrix completion, apart
from requirement that the total number of columns N is sufficiently large. This is needed to ensure that each of the
subspaces is sufficiently represented in the matrix. The requirement on N is polynomial in n for fixed p0 , which is
easy to see based on the definitions of δ0 , s0 , and ℓ0 (see further discussion at the end of Section 3).
Perfect recovery of each column is guaranteed with probability that decreases linearly in s0 , which itself is linear
in k (ignoring log factors). This is expected since this problem is more difficult than k individual low-rank matrix
completions. We state our results in terms of a per-column (rather than full matrix) recovery guarantee. A full matrix
recovery guarantee can be given by replacing log² n with log² N. This is evident from the final completion step
discussed in Lemma 8, below. However, since N may be quite large (perhaps arbitrarily large) in the applications we
envision, we chose to state our results in terms of a per-column guarantee.
The details of the methodology and lemmas leading to the theorem above are developed in the subsequent sections
following the four steps of the methodology outlined above. In certain cases it will be more convenient to consider
sampling the locations of observed entries uniformly at random with replacement rather than without replacement, as
assumed above. The following lemma will be useful for translating bounds derived assuming sampling with replace-
ment to our situation (the same sort of relation is noted in Proposition 3.1 in [1]).
Lemma 1. Draw m samples independently and uniformly from {1, . . . , n} and let Ω′ denote the resulting subset of
unique values. Let Ωm be a subset of size m selected uniformly at random from {1, . . . , n}. Let E denote an event
depending on a random subset of {1, . . . , n}. If P(E(Ωm )) is a non-increasing function of m, then P(E(Ω′ )) ≥
P(E(Ωm )).
Proof. For k = 1, . . . , m, let Ωk denote a subset of size k sampled uniformly at random from {1, . . . , n}, and let
m′ = |Ω′ |.
Then

P(E(Ω′)) = Σ_{k=0}^{m} P(E(Ω′) | m′ = k) P(m′ = k) = Σ_{k=0}^{m} P(E(Ωk)) P(m′ = k) ≥ P(E(Ωm)) Σ_{k=0}^{m} P(m′ = k) = P(E(Ωm)),

where the inequality uses m′ ≤ m together with the assumption that P(E(Ωk)) is non-increasing in k, and the last step uses Σ_{k=0}^{m} P(m′ = k) = 1.
3 Local Neighborhoods
In this first step, s columns of XΩ are selected uniformly at random and a set of “nearby” columns is identified for each, constituting a local neighborhood of size n. All bounds are designed to hold with probability at least 1 − δ0, where δ0 is defined in (2) above. The s columns are called “seeds.” The required size of s is determined as follows.
Lemma 2. Assume A3 holds. If the number of chosen seeds,
k(log k + log 1/δ0 )
s ≥ ,
(1 − e−4 )ν0
then with probability greater than 1 − δ0 for each i = 1, . . . , k, at least one seed is in Si,ǫ0 and each seed column has
at least
η0 := (64 β max{µ1², µ0} / ν0) r log²(n) (3)
observed entries.
Proof. First note that from Theorem 2.1, the expected number of observed entries per column is at least
η = (128 β max{µ1², µ0} / ν0) r log²(n).
Therefore, the number of observed entries η̂ in a column selected uniformly at random is probably not significantly less. More precisely, by Chernoff's bound we have

P(η̂ ≤ η/2) ≤ exp(−η/8) < e^{−4}.
Combining this with A3, we have the probability that a randomly selected column belongs to Si,ǫ0 and has η/2 or
more observed entries is at least ν0′ /k, where ν0′ := (1 − e−4 )ν0 . Then, the probability that the set of s columns does
not contain a column from Si,ǫ0 with at least η/2 observed entries is less than (1 − ν0′ /k)s . The probability that the
set does not contain at least one column from Si,ǫ0 with η/2 or more observed entries, for i = 1, . . . , k is less than
δ0 = k(1 − ν0′/k)^s. Solving for s in terms of δ0 yields

s = (log k + log(1/δ0)) / log( (k/ν0′) / (k/ν0′ − 1) ).

The result follows by noting that log(x/(x − 1)) ≥ 1/x, for x > 1.
Next, for each seed we must find a set of n columns from the same subspace as the seed. This will be accomplished
by identifying columns that are ǫ0 -close to the seed, so that if the seed belongs to Si,ǫ0 , the columns must belong to
the same subspace. Clearly the total number of columns N must be sufficiently large so that n or more such columns
can be found. We will return to the requirement on N a bit later, after first dealing with the following challenge.
Since the columns are only partially observed, it may not be possible to determine how close each is to the seed.
We address this by showing that if a column and the seed are both observed on enough common indices, then the
incoherence assumption A2 allows us to reliably estimate the distance.
Lemma 3. Assume A2 and let y = x1 − x2 , where x1 and x2 are two columns of X. Assume there is a common set of
indices of size q ≤ n where both x1 and x2 are observed. Let ω denote this common set of indices and let yω denote
the corresponding subset of y. Then for any δ0 > 0, if the number of commonly observed elements
q ≥ 8µ1² log(2/δ0),

then with probability at least 1 − δ0,

(1/2)‖y‖₂² ≤ (n/q)‖yω‖₂² ≤ (3/2)‖y‖₂².
Proof. Note that ‖yω‖₂² is the sum of q random variables drawn uniformly at random without replacement from the set {y1², y2², . . . , yn²}, and E‖yω‖₂² = (q/n)‖y‖₂². We will prove the bound under the assumption that, instead, the q variables are sampled with replacement, so that they are independent. By Lemma 1, this will provide the desired result. Note that if one variable in the sum ‖yω‖₂² is replaced with another value, then the sum changes in value by at most 2‖y‖∞². Therefore, McDiarmid's Inequality shows that for t > 0

P( |‖yω‖₂² − (q/n)‖y‖₂²| ≥ t ) ≤ 2 exp( −t² / (2q‖y‖∞⁴) ),

or equivalently

P( |(n/q)‖yω‖₂² − ‖y‖₂²| ≥ t ) ≤ 2 exp( −qt² / (2n²‖y‖∞⁴) ).
Suppose that x1 ∈ Si,ǫ0 (for some i) and that x2 ∉ Si, and that both x1, x2 observe q ≥ 2µ0² log(2/δ0) common indices. Let yω denote the difference between x1 and x2 on the common support set. If the partial distance (n/q)‖yω‖₂² ≤ ǫ0²/2, then the result above implies that with probability at least 1 − δ0,

‖x1 − x2‖₂² ≤ 2(n/q)‖yω‖₂² ≤ ǫ0².
On the other hand, if x2 ∈ Si and ‖x1 − x2‖₂² ≤ ǫ0²/3, then with probability at least 1 − δ0,

(n/q)‖yω‖₂² ≤ (3/2)‖x1 − x2‖₂² ≤ ǫ0²/2.
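The rescaled partial distance (n/q)‖yω‖₂² used throughout Lemma 3 can be sketched as follows (the function and argument names are ours, not the paper's):

```python
import numpy as np

def partial_distance_sq(x1, x2, mask1, mask2):
    """Estimate ||x1 - x2||_2^2 as (n/q) * ||y_omega||_2^2, where omega is
    the set of q indices observed in both columns. Returns None when the
    two columns share no observed support."""
    common = mask1 & mask2
    q = int(common.sum())
    if q == 0:
        return None
    n = x1.size
    return (n / q) * np.sum((x1[common] - x2[common]) ** 2)
```

Lemma 3 guarantees that, for incoherent differences and q large enough, this estimate lies between one half and three halves of the true squared distance with high probability.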
Using these results we will proceed as follows. For each seed we find all columns that have at least t0 > 2µ0² log(2/δ0) observations at indices in common with the seed (the precise value of t0 will be specified in a moment). Assuming that this set is sufficiently large, we will select ℓn of these columns uniformly at random, for some integer ℓ ≥ 1. In particular, ℓ will be chosen so that with high probability at least n of the columns will be within ǫ0/√3 of the seed, ensuring that with probability at least 1 − δ0 the corresponding partial distance of each will be within ǫ0/√2. That is enough to guarantee with the same probability that the columns are within ǫ0 of the seed. Of course, a union bound will be needed so that the distance bounds above hold uniformly over the set of sℓn columns under consideration, which means that we will need each to have at least t0 := 2µ0² log(2sℓn/δ0) observations at indices in common with the corresponding seed. All this is predicated on N being large enough so that such columns exist in XΩ. We will return to this issue later, after determining the requirement for ℓ. For now we will simply assume that N ≥ ℓn.
Lemma 4. Assume A3 and for each seed x let Tx,ǫ0 denote the number of columns of X in the ball of radius ǫ0/√3 about x. If the number of columns selected for each seed, ℓn, is such that

ℓ ≥ max{ 2k / (ν0 (ǫ0/√3)^r) , 8k log(s/δ0) / (n ν0 (ǫ0/√3)^r) },

then

E[Tx,ǫ0] ≥ ℓn ν0 (ǫ0/√3)^r / k.
By Chernoff's bound, for any 0 < γ < 1,

P( Tx,ǫ0 ≤ (1 − γ) ℓn ν0 (ǫ0/√3)^r / k ) ≤ exp( −γ² ℓn ν0 (ǫ0/√3)^r / (2k) ).

Taking γ = 1/2, we would like to choose ℓ so that ℓn ν0 (ǫ0/√3)^r / (2k) ≥ n and so that exp( −ℓn ν0 (ǫ0/√3)^r / (8k) ) ≤ δ0/s (so that the probability that the desired result fails for one or more of the s seeds is less than δ0). The first condition leads to the requirement ℓ ≥ 2k / (ν0 (ǫ0/√3)^r). The second condition produces the requirement ℓ ≥ 8k log(s/δ0) / (n ν0 (ǫ0/√3)^r).
We can now formally state the procedure for finding local neighborhoods in Algorithm 1. Recall that the number
of observed entries in each seed is at least η0 , per Lemma 2.
Lemma 5. If N is sufficiently large and η0 > t0 , then the Local Neighborhood Procedure in Algorithm 1 produces
at least n columns within ǫ0 of each seed, and at least one seed will belong to each of Si,ǫ0 , for i = 1, . . . , k, with
probability at least 1 − 3δ0 .
Proof. Lemma 2 states that if we select s0 seeds, then with probability at least 1 − δ0 there is a seed in each Si,ǫ0 ,
i = 1, . . . , k, with at least η0 observed entries, where η0 is defined in (3). Lemma 4 implies that if ℓ0 n columns are
selected uniformly at random for each seed, then with probability at least 1 − δ0 for each seed at least n of the columns
Algorithm 1 - Local Neighborhood Procedure
Input: n, k, µ0, ǫ0, ν0, η0, δ0 > 0.

s0 := k(log k + log(1/δ0)) / ((1 − e^{−4}) ν0)

ℓ0 := ⌈ max{ 2k / (ν0 (ǫ0/√3)^r) , 8k log(s0/δ0) / (n ν0 (ǫ0/√3)^r) } ⌉

t0 := ⌈ 2µ0² log(2 s0 ℓ0 n / δ0) ⌉

Steps:
1. Select s0 “seed” columns uniformly at random and discard all with fewer than η0 observations
2. For each seed, find all columns with at least t0 observations at locations observed in the seed
3. Randomly select ℓ0 n columns from each such set
4. Form the local neighborhood for each seed by randomly selecting n columns with partial distance less than ǫ0/√2 from the seed
will be within a distance ǫ0/√3 of the seed. Each seed has at least η0 observed entries and we need to find ℓ0 n other columns with at least t0 observations at indices where the seed was observed. Provided that η0 ≥ t0, this is certainly possible if N is large enough. It follows from Lemma 3 that if ℓ0 n columns have at least t0 observations at indices where the seed was also observed, then with probability at least 1 − δ0 the partial distances will be within ǫ0/√2, which implies the true distances are within ǫ0. The result follows by the union bound.
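Algorithm 1 can be sketched in code as follows. This is a simplified rendering (it takes s0, ℓ0, t0, η0 as inputs rather than computing them, and ignores the thinning discussed later); all names are ours:

```python
import numpy as np

def local_neighborhoods(X_obs, mask, s0, l0, t0, eta0, eps0, rng):
    """Sketch of Algorithm 1. X_obs: n x N with unobserved entries set to 0;
    mask: boolean n x N marking observed locations."""
    n, N = X_obs.shape
    seeds = rng.choice(N, size=s0, replace=False)
    seeds = [j for j in seeds if mask[:, j].sum() >= eta0]        # step 1
    neighborhoods = {}
    for j in seeds:
        overlap = (mask & mask[:, [j]]).sum(axis=0)               # step 2
        cands = np.flatnonzero(overlap >= t0)
        cands = cands[cands != j]
        if cands.size > l0 * n:                                   # step 3
            cands = rng.choice(cands, size=l0 * n, replace=False)
        keep = []
        for c in cands:                                           # step 4
            com = mask[:, j] & mask[:, c]
            q = com.sum()
            if q == 0:
                continue
            d2 = (n / q) * np.sum((X_obs[com, j] - X_obs[com, c]) ** 2)
            if np.sqrt(d2) < eps0 / np.sqrt(2):
                keep.append(int(c))
        neighborhoods[int(j)] = keep[:n]
    return neighborhoods
```

On fully observed data with well-separated subspaces, every retained neighbor agrees with its seed's subspace, matching the guarantee of Lemma 5.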
Finally, we quantify just how large N needs to be. Lemma 4 also shows that we require at least

N ≥ ℓn ≥ max{ 2kn / (ν0 (ǫ0/√3)^r) , 8k log(s/δ0) / (ν0 (ǫ0/√3)^r) }.
However, we must also determine a lower bound on the probability that a column selected uniformly at random has at
least t0 observed indices in common with a seed. Let γ0 denote this probability, and let p0 denote the probability of
observing each entry in X. Note that our main result, Theorem 2.1, assumes a lower bound on p0.
This implies that the expected number of columns with t0 or more observed indices in common with a seed is at least γ0 N. If ñ is the actual number with this property, then by Chernoff's bound, P(ñ ≤ γ0 N/2) ≤ exp(−γ0 N/8). So N ≥ 2ℓ0 γ0⁻¹ n will suffice to guarantee that enough columns can be found for each seed with probability at least 1 − s0 exp(−ℓ0 n/4), which is far larger than 1 − δ0 since δ0 decays only polynomially in n.
To take this a step further, a simple lower bound on γ0 is obtained as follows. Suppose we consider only a t0-sized subset of the indices where the seed is observed. The probability that another column selected at random is observed at all t0 indices in this subset is p0^{t0}. Clearly γ0 ≥ p0^{t0} = exp(t0 log p0) ≥ (2s0 ℓ0 n/δ0)^{2µ0² log p0}. This yields the following sufficient condition on the size of N:

N ≥ ℓ0 n (2s0 ℓ0 n/δ0)^{2µ0² log(1/p0)}.
From the definitions of s0 and ℓ0, this implies that if 2µ0² log(1/p0) is a fixed constant, then a sufficient number of columns will exist if N = O(poly(kn/δ0)). For example, if µ0² = 1 and p0 = 1/2, then N = O(((kn)/δ0)^{2.4}) will suffice; i.e., N need only grow polynomially in n. On the other hand, in the extremely undersampled case p0 scales like log²(n)/n (as n grows and r and k stay constant) and N will need to grow almost exponentially in n, like n^{log n − 2 log log n}.
We wish to apply these results to our local neighbor sets, but we have three issues we must address: First, the
sampling of the matrices formed by local neighborhood sets is not uniform since the set is selected based on the
observed indices of the seed. Second, given Lemma 2 we must complete not one, but s0 (see Algorithm 1) incomplete
matrices simultaneously with high probability. Third, some of the local neighbor sets may have columns from more
than one subspace. Let us consider each issue separately.
First consider the fact that our incomplete submatrices are not sampled uniformly. The non-uniformity can be
corrected with a simple thinning procedure. Recall that the columns in the seed’s local neighborhood are identified
first by finding columns with sufficient overlap with each seed’s observations. To refer to the seed’s observations, we
will say “the support of the seed.”
Due to this selection of columns, the resulting neighborhood columns are highly sampled on the support of the
seed. In fact, if we again use the notation q for the minimum overlap between two columns needed to calculate
distance, then these columns have at least q observations on the support of the seed. Off the support, these columns
are still sampled uniformly at random with the same probability as the entire matrix. Therefore we focus only on
correcting the sampling pattern on the support of the seed.
Let t be the cardinality of the support of a particular seed. Because all entries of the entire matrix are sampled independently with probability p0, for a randomly selected column the random variable which generates t is binomial. For neighbors selected to have at least q overlap with a particular seed, we denote by t′ the number of samples overlapping with the support of the seed. The probability density for t′ is positive only for j = q, . . . , t:

P(t′ = j) = (t choose j) p0^j (1 − p0)^{t−j} / ρ ,

where ρ = Σ_{j=q}^{t} (t choose j) p0^j (1 − p0)^{t−j}.
In order to thin the common support, we need two new random variables. The first is a Bernoulli, call it Y, which takes the value 1 with probability ρ and 0 with probability 1 − ρ. The second random variable, call it Z, takes values j = 0, . . . , q − 1 with probability

P(Z = j) = (t choose j) p0^j (1 − p0)^{t−j} / (1 − ρ),
which is the Binomial(t, p0) distribution conditioned on being less than q; mixing t′ and Z with weights ρ and 1 − ρ recovers the desired binomial distribution. Thus, the thinning is accomplished as follows. For each column draw
an independent sample of Y . If the sample is 1, then the column is not altered. If the sample is 0, then a realization of
Z is drawn, which we denote by z. Select a random subset of size z from the observed entries in the seed support and
discard the remainder. We note that the seed itself should not be used in completion, because there is a dependence
between the sample locations of the seed column and its selected neighbors which cannot be eliminated.
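The thinning step can be sketched directly from the distributions of Y and Z above (a minimal rendering; argument names are ours):

```python
import math
import random

def thin_support(obs_idx, t, q, p0, rng=random):
    """Thin a neighbor's observations on the seed's support (of size t) so
    the retained count follows the unconditioned Binomial(t, p0) law.
    obs_idx: the >= q indices on the support where the neighbor is observed."""
    pmf = [math.comb(t, j) * p0**j * (1 - p0)**(t - j) for j in range(t + 1)]
    rho = sum(pmf[q:])                 # P(t' >= q): mass kept by the selection
    if rng.random() < rho:             # Y = 1: keep the column unchanged
        return list(obs_idx)
    # Y = 0: draw Z from the renormalized lower tail j = 0, ..., q - 1
    z = rng.choices(range(q), weights=pmf[:q], k=1)[0]
    return rng.sample(list(obs_idx), z)
```

Mixing the unchanged case (probability ρ) with the lower-tail redraw (probability 1 − ρ) reproduces the binomial sampling pattern of an unconditioned column.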
Now after thinning, we have the following matrix completion guarantee for each neighborhood matrix.
Lemma 7. Assume all s0 seed neighborhood matrices are thinned according to the discussion above, have rank ≤ r,
and the matrix entries are observed uniformly at random with probability,
Finally, let us consider the third issue, the possibility that one or more of the points in the neighborhood of a seed
lies in a subspace different than the seed subspace. When this occurs, the rank of the submatrix formed by the seed’s
neighbor columns will be larger than the dimension of the seed subspace. Without loss of generality assume that we
have only two subspaces represented in the neighbor set, and assume their dimensions are r′ and r′′ . First, in the case
that r′ + r′′ > r, when a rank ≥ r matrix is completed to a rank r matrix, with overwhelming probability there will
be errors with respect to the observations as long as the number of samples in each column is O(r log r), which is
assumed in our case; see [12]. Thus we can detect and discard these candidates. Secondly, in the case that r′ + r′′ ≤ r,
we still have enough samples to complete this matrix successfully with high probability. However, since we have
drawn enough seeds to guarantee that every subspace has a seed with a neighborhood entirely in that subspace, we
will find that this problem seed is redundant. This is determined in the Subspace Refinement step.
5 Subspace Refinement
Each of the matrix completion steps above yields a low-rank matrix with a corresponding column subspace, which
we will call the candidate subspaces. While the true number of subspaces will not be known in advance, since s0 = O(k(log k + log(1/δ0))), the candidate subspaces will contain the true subspaces with high probability (see Lemma 4). We must now deal with the algorithmic issue of determining the true set of subspaces.
We first note that, from Assumption A3, with probability 1 a set of points of size ≥ r all drawn from a single
subspace S of dimension ≤ r will span S. In fact, any b < r points will span a b-dimensional subspace of the
r-dimensional subspace S.
Assume that r < n, since otherwise it is clearly necessary to observe all entries. Therefore, if a seed's nearest neighborhood set is confined to a single subspace, then the columns in it span their subspace. And if the seed's nearest
neighborhood contains columns from two or more subspaces, then the matrix will have rank larger than that of any
of the constituent subspaces. Thus, if a certain candidate subspace is spanned by the union of two or more smaller
candidate subspaces, then it follows that that subspace is not a true subspace (since we assume that none of the true
subspaces are contained within another).
This observation suggests the following subspace refinement procedure. The s0 matrix completions yield s ≤ s0 candidate column subspaces; s may be less than s0 since completions that fail are discarded as described above. First sort the estimated subspaces in order of rank from smallest to largest (with arbitrary ordering of subspaces of the same rank), which we write as S(1), . . . , S(s). We will denote the final set of estimated subspaces as Ŝ1, . . . , Ŝk. The first subspace Ŝ1 := S(1), a lowest-rank subspace in the candidate set. Next, Ŝ2 = S(2) if and only if S(2) is not contained in Ŝ1. Following this simple sequential strategy, suppose that when we reach the candidate S(j) we have so far determined Ŝ1, . . . , Ŝi, i < j. If S(j) is not in the span of ∪_{ℓ=1}^{i} Ŝℓ, then we set Ŝi+1 = S(j); otherwise we move on to the next candidate. In this way, we can proceed sequentially through the rank-ordered list of candidates, and we will identify all true subspaces.
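The sequential refinement just described amounts to rank-ordered containment tests, which can be sketched as follows (assuming each candidate is given by an orthonormal basis matrix):

```python
import numpy as np

def refine_subspaces(bases, tol=1e-10):
    """Process candidate subspaces (orthonormal basis matrices) in order of
    increasing rank, keeping each one that is not contained in the span of
    the union of those already kept."""
    kept = []
    for U in sorted(bases, key=lambda B: B.shape[1]):
        if kept:
            Q, _ = np.linalg.qr(np.hstack(kept))    # basis for the kept union
            # U lies in span(kept) iff projecting onto span(Q) preserves U
            if np.linalg.norm(U - Q @ (Q.T @ U)) <= tol:
                continue                             # contained: discard
        kept.append(U)
    return kept
```

A candidate spanned by two smaller kept subspaces is discarded, exactly the situation described above for problem seeds whose neighborhoods mix subspaces.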
and for j = 2, . . . , k,

‖xΩ − PΩ,Sj xΩ‖₂² > 0 . (8)
Proof. We wish to use results from [12, 14], which require a fixed number of measurements |Ω|. By Chernoff’s bound
np0 −np0
P |Ω| ≤ ≤ exp .
2 8
Note that np0 > 16rβ log² n, therefore exp(−np0/8) < (n^{−2β})^{log n} < δ0; in other words, we observe |Ω| > np0/2
entries of x with probability 1 − δ0 . This set Ω is selected uniformly at random among all sets of size |Ω|, but using
Lemma 1 we can assume that the samples are drawn uniformly with replacement in order to apply results of [12, 14].
Now we show that |Ω| > np0 /2 samples selected uniformly with replacement implies that
|Ω| > max{ (8rµ0/3) log(2r/δ0), rµ0(1 + ξ)²/((1 − α)(1 − γ)) }, (9)
where ξ, α > 0 and γ ∈ (0, 1) are defined as α = √((2µ1²/|Ω|) log(1/δ0)), ξ = √(2µ1 log(1/δ0)), and γ = √((8rµ0/(3|Ω|)) log(2r/δ0)).
We start with the second term in the max of (9). Substituting δ0 and the bound for p0, one can show that for
n ≥ 15 both α ≤ 1/2 and γ ≤ 1/2. This makes (1 + ξ)²/((1 − α)(1 − γ)) ≤ 4(1 + ξ)² ≤ 8ξ² for ξ > 2.5, i.e., for
δ0 < 0.04.
We finish this argument by noting that 8ξ² = 16µ1 log(1/δ0) < np0/2; there is in fact an O(r log(n)) gap
between the two. Similarly for the first term in the max of (9), (8/3) rµ0 log(2r/δ0) < np0/2; here the gap is O(log(n)).
Now we prove (7), which follows from [12]. With |Ω| > (8/3) rµ0 log(2r/δ0), we have that UΩ^T UΩ is invertible with
probability at least 1 − δ0 according to Lemma 3 of [12]. This implies that
U^T x = (UΩ^T UΩ)^{−1} UΩ^T xΩ. (10)
Call a1 = U^T x. Since x ∈ S, U a1 = x, and a1 is in fact the unique solution to U a = x. Now consider the equation
UΩ a = xΩ. The assumption that UΩ^T UΩ is invertible implies that a2 = (UΩ^T UΩ)^{−1} UΩ^T xΩ exists and is the unique
solution to UΩ a = xΩ. However, UΩ a1 = xΩ as well, meaning that a1 = a2. Thus, we have
‖xΩ − PΩ,S1 xΩ‖₂² = ‖xΩ − UΩ U^T x‖₂² = 0
with probability at least 1 − δ0.
Now we prove (8), paralleling Theorem 1 in [14]. We use Assumption A3 to ensure that x ∉ Sj, j = 2, . . . , k.
This along with (9) and Theorem 1 from [12] guarantees that
‖xΩ − PΩ,Sj xΩ‖₂² ≥ [ (|Ω|(1 − α) − rµ0(1 + ξ)²/(1 − γ)) / n ] ‖x − PSj x‖₂² > 0
for each j = 2, . . . , k with probability at least 1 − 3δ0 . With a union bound this holds simultaneously for all k − 1
alternative subspaces with probability at least 1 − 3(k − 1)δ0 . When we also include the events that (7) holds and that
|Ω| > np0 /2, we get that the entire theorem holds with probability at least 1 − (3(k − 1) + 2)δ0 .
Finally, denote the column to be completed by xΩ. To complete xΩ we first determine which subspace it belongs to
using the results above. For a given column we can use the incomplete data projection residual of (7). With probability
at least 1 − (3(k − 1) + 2)δ0, the residual will be zero for the correct subspace and strictly positive for all other subspaces.
Using the span of the chosen subspace, U, we can then complete the column by using x̂ = U (UΩ^T UΩ)^{−1} UΩ^T xΩ.
We reiterate that Lemma 8 allows us to complete a single column x with probability 1 − (3(k − 1) + 2)δ0 . If we
wish to complete the entire matrix, we will need another union bound over all N columns, leading to a log N factor in
our requirement on p0. Since N may be quite large in applications, we prefer to state our result in terms of a per-column
completion bound.
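The per-column assignment-and-completion step described above can be sketched as follows (numpy; `subspaces` is a hypothetical list of orthonormal bases, one per estimated subspace):

```python
import numpy as np

def complete_column(x_omega, omega, subspaces):
    """Assign an incomplete column to the subspace with the smallest
    projection residual on the observed rows, then fill in the rest.

    x_omega   : observed entries of the column (length |omega|)
    omega     : indices of the observed rows
    subspaces : list of n x r orthonormal bases U, one per candidate subspace
    """
    best_U, best_a, best_res = None, None, np.inf
    for U in subspaces:
        U_om = U[omega, :]
        # Least-squares coefficients a = (U_om^T U_om)^{-1} U_om^T x_omega.
        a, *_ = np.linalg.lstsq(U_om, x_omega, rcond=None)
        res = np.linalg.norm(x_omega - U_om @ a)
        if res < best_res:
            best_U, best_a, best_res = U, a, res
    # Complete: x_hat = U (U_om^T U_om)^{-1} U_om^T x_omega = U a.
    return best_U @ best_a
```

Per the theory above, when enough entries are observed the residual is exactly zero for the correct subspace and strictly positive for all others, so the minimum-residual rule picks the right one.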
The confidence level stated in Theorem 2.1 is the result of applying the union bound to all the steps required in
Sections 3, 4, and 6. All hold simultaneously with probability at least
1 − (6 + 3(k − 1) + 12s0 ) δ0 < 1 − (6 + 15s0 )δ0 ,
which proves the theorem.
7 Experiments
The following experiments evaluate the performance of the proposed high-rank matrix completion procedure and
compare results with standard low-rank matrix completion based on nuclear norm minimization.
7.1 Numerical Simulations
We begin by examining a highly synthetic experiment in which the data exactly matches the assumptions of our high-
rank matrix completion procedure. The key parameters were chosen as follows: n = 100, N = 5000, k = 10, and
r = 5. The k subspaces were r-dimensional, and each was generated by drawing r vectors from the N (0, In) distribution
and taking their span. The resulting subspaces are highly incoherent with the canonical basis for Rn . For each
subspace, we generate 500 points drawn from a N (0, U U^T) distribution, where U is an n×r matrix whose orthonormal
columns span the subspace. Our procedure was implemented using ⌈3k log k⌉ seeds. The matrix completion software
GROUSE [15] was used both in our procedure and to implement the standard low-rank matrix
completions. We ran 50 independent trials of our procedure and compared it to standard low-rank matrix completion.
The results are summarized in the figures below. The key message is that our new procedure can provide accurate
completions from far fewer observations compared to standard low-rank completion, which is precisely what our main
result predicts.
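The synthetic data described above can be generated along the following lines (a sketch only; the GROUSE solver itself is not reproduced here):

```python
import numpy as np

def generate_union_of_subspaces(n=100, N=5000, k=10, r=5, seed=0):
    """Sample N points from a union of k random r-dimensional subspaces of R^n."""
    rng = np.random.default_rng(seed)
    per = N // k
    cols, labels = [], []
    for j in range(k):
        # Orthonormal basis of a random r-dimensional subspace.
        U = np.linalg.qr(rng.standard_normal((n, r)))[0]
        # 'per' points from N(0, U U^T): U times standard Gaussian coefficients.
        cols.append(U @ rng.standard_normal((r, per)))
        labels += [j] * per
    return np.hstack(cols), np.array(labels)
```

Random Gaussian bases of this kind are highly incoherent with the canonical basis, matching the setup in the text.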
Figure 2: The number of correctly completed columns (with tolerances shown above, 10⁻⁵ or 0.01), versus the average number
of observations per column. As expected, our procedure (termed high rank MC in the plot) provides accurate completion with only
about 50 samples per column. Note that r log n ≈ 23 in this simulation, so this is quite close to our bound. On the other hand, since
the rank of the full matrix is rk = 50, the standard low-rank matrix completion bound requires m > 50 log n ≈ 230. Therefore, it
is not surprising that the standard method (termed low rank MC above) requires almost all samples in each column.
Figure 3: Internet topology example of subnets sending traffic to passive monitors through the Internet core and common border
routers.
Figure 4: Hop count imputation results, using a synthetic network with k = 12 subnets, n = 75 passive monitors, and N = 2700
IP addresses. The cumulative distribution of estimation error is shown with respect to observing 40% of the total elements.
Finally, using real-world Internet delay measurements (courtesy of [19]) from n = 100 monitors to N = 22550
IP addresses, we test imputation performance when the underlying subnet structure is not known. Using the estimate
k = 15, in Figure 5 we find a significant performance increase using the high-rank matrix completion technique.
References
[1] B. Recht, “A Simpler Approach to Matrix Completion,” Journal of Machine Learning Research, to appear,
arXiv:0910.0651v2.
[2] E. J. Candès and T. Tao, “The Power of Convex Relaxation: Near-Optimal Matrix Completion,” IEEE Trans-
actions on Information Theory, vol. 56, May 2010, pp. 2053–2080.
[3] R. Vidal, “A Tutorial on Subspace Clustering,” in Johns Hopkins Technical Report, 2010.
[4] K. Kanatani, “Motion Segmentation by Subspace Separation and Model Selection,” in Computer Vision, 2001.
ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 2, 2001, pp. 586–591.
[Plot omitted: cumulative distribution vs. approximation error (in ms), comparing High Rank MC and Standard MC.]
Figure 5: Real-world delay imputation results, using a network of n = 100 monitors, N = 22550 IP addresses, and an unknown
number of subnets. The cumulative distribution of estimation error is shown with respect to observing 40% of the total delay
elements.
[5] R. Vidal, Y. Ma, and S. Sastry, “Generalized Principal Component Analysis (GPCA),” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 27, December 2005.
[6] G. Lerman and T. Zhang, “Robust Recovery of Multiple Subspaces by Lp Minimization,” 2011, Preprint at
http://arxiv.org/abs/1104.3770.
[7] A. Gruber and Y. Weiss, “Multibody Factorization with Uncertainty and Missing Data using the EM Algorithm,”
in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), vol. 1, June 2004.
[8] R. Vidal, R. Tron, and R. Hartley, “Multiframe Motion Segmentation with Missing Data Using Power Factoriza-
tion and GPCA,” International Journal of Computer Vision, vol. 79, pp. 85–105, 2008.
[9] B. Eriksson, P. Barford, and R. Nowak, “Network Discovery from Passive Measurements,” in Proceedings of
ACM SIGCOMM Conference, Seattle, WA, August 2008.
[10] E. Candès and B. Recht, “Exact Matrix Completion via Convex Optimization,” Foundations of Computational
Mathematics, vol. 9, 2009, pp. 717–772.
[11] B. Eriksson, P. Barford, J. Sommers, and R. Nowak, “DomainImpute: Inferring Unseen Components in the
Internet,” in Proceedings of IEEE INFOCOM Mini-Conference, Shanghai, China, April 2011, pp. 171–175.
[12] L. Balzano, B. Recht, and R. Nowak, “High-Dimensional Matched Subspace Detection When Data are
Missing,” in Proceedings of the International Conference on Information Theory, June 2010, available at
http://arxiv.org/abs/1002.0852.
[13] G. Chen and M. Maggioni, “Multiscale Geometric and Spectral Analysis of Plane Arrangements,” in IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, June 2011.
[14] L. Balzano, R. Nowak, A. Szlam, and B. Recht, “k-Subspaces with missing data,” University of Wisconsin,
Madison, Tech. Rep. ECE-11-02, February 2011.
[15] L. Balzano and B. Recht, GROUSE software, 2010, http://sunbeam.ece.wisc.edu/grouse/.
[16] N. Spring, R. Mahajan, and D. Wetherall, “Measuring ISP Topologies with Rocketfuel,” in Proceedings of ACM
SIGCOMM, Pittsburgh, PA, August 2002.
[17] B. Eriksson, P. Barford, R. Nowak, and M. Crovella, “Learning Network Structure from Passive Measurements,”
in Proceedings of ACM Internet Measurement Conference, San Diego, CA, October 2007.
[18] L. Li, D. Alderson, W. Willinger, and J. Doyle, “A First-Principles Approach to Understanding the Internet’s
Router-Level Topology,” in Proceedings of ACM SIGCOMM Conference, August 2004.
[19] J. Ledlie, P. Gardner, and M. Seltzer, “Network Coordinates in the Wild,” in Proceedings of NSDI Conference,
April 2007.
A New Theory for Matrix Completion
B-DAT, School of Information & Control, Nanjing Univ Informat Sci & Technol
NO 219 Ningliu Road, Nanjing, Jiangsu, China, 210044
{gcliu,qsliu,xtyuan}@nuist.edu.cn
Abstract
Prevalent matrix completion theories rely on an assumption that the locations of
the missing data are distributed uniformly and randomly (i.e., uniform sampling).
Nevertheless, the reason for observations being missing often depends on the unseen
observations themselves, and thus the missing data in practice usually occurs in a
nonuniform and deterministic fashion rather than randomly. To break through the
limits of random sampling, this paper introduces a new hypothesis called isomeric
condition, which is provably weaker than the assumption of uniform sampling and
arguably holds even when the missing data is placed irregularly. Equipped with
this new tool, we prove a series of theorems for missing data recovery and matrix
completion. In particular, we prove that the exact solutions that identify the target
matrix are included as critical points by the commonly used nonconvex programs.
Unlike the existing theories for nonconvex matrix completion, which are built
upon the same condition as convex programs, our theory shows that nonconvex
programs have the potential to work with a much weaker condition. Compared to
the existing studies on nonuniform sampling, our setup is more general.
1 Introduction
Missing data is a common occurrence in modern applications such as computer vision and image
processing, reducing significantly the representativeness of data samples and therefore distorting
seriously the inferences about data. Given this pressing situation, it is crucial to study the problem
of recovering the unseen data from a sampling of observations. Since the data in reality is often
organized in matrix form, it is of considerable practical significance to study the well-known problem
of matrix completion [1] which is to fill in the missing entries of a partially observed matrix.
Problem 1.1 (Matrix Completion). Denote the (i, j)th entry of a matrix as [·]ij . Let L0 ∈ Rm×n be
an unknown matrix of interest; in particular, even the rank of L0 is unknown. Given a sampling of
the entries in L0 and a 2D index set Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n} consisting of the locations
of the observed entries, i.e., given
{[L0 ]ij |(i, j) ∈ Ω} and Ω,
can we restore the missing entries whose indices are not included in Ω, in an exact and scalable
fashion? If so, under which conditions?
Due to its unique role in a broad range of applications, e.g., structure from motion and magnetic
resonance imaging, matrix completion has received extensive attention in the literature, e.g., [2–13].
∗ The work of Guangcan Liu is supported in part by the National Natural Science Foundation of China (NSFC)
under Grant 61622305 and Grant 61502238, and in part by the Natural Science Foundation of Jiangsu Province of China
(NSFJPC) under Grant BK20160040.
† The work of Qingshan Liu is supported by NSFC under Grant 61532009.
‡ The work of Xiao-Tong Yuan is supported in part by NSFC under Grant 61402232 and Grant 61522308, and in
part by NSFJPC under Grant BK20141003.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
A-new-theory-for-matrix-completion-Paper 117
Figure 1: Left and Middle: Typical configurations for the locations of the observed entries. Right: A
real example from the Oxford motion database. The black areas correspond to the missing entries.
In general, given no presumption about the nature of the matrix entries, it is virtually impossible to
restore L0, as the missing entries can take arbitrary values. That is, some assumptions are necessary
for solving Problem 1.1. Given the high-dimensional and massive nature of today’s data, it is
arguable that the target matrix L0 we wish to recover is often low rank [23]. Hence,
one may perform matrix completion by seeking a matrix with the lowest rank that also satisfies the
constraints given by the observed entries:
min_L rank(L), s.t. [L]ij = [L0]ij, ∀(i, j) ∈ Ω. (1)
Unfortunately, this idea is of little practical use because the problem above is NP-hard and cannot be
solved in polynomial time [15]. To achieve practical matrix completion, Candès and Recht [4]
suggested considering an alternative that instead minimizes the nuclear norm, which is a convex
envelope of the rank function [12]. Namely,
min_L ‖L‖∗, s.t. [L]ij = [L0]ij, ∀(i, j) ∈ Ω, (2)
where k · k∗ denotes the nuclear norm, i.e., the sum of the singular values of a matrix. Rather
surprisingly, it is proved in [4] that the missing entries, with high probability, can be exactly restored
by the convex program (2), as long as the target matrix L0 is low rank and incoherent and the set Ω of
locations corresponding to the observed entries is sampled uniformly at random. This pioneering
work provides several useful tools to investigate matrix completion and many other related
problems. Those assumptions, including low-rankness, incoherence and uniform sampling, are now
standard and widely used in the literature, e.g., [14, 17, 22, 24, 28, 33, 34, 36]. In particular, the
analyses in [17, 33, 36] show that, in terms of theoretical completeness, many nonconvex optimization
based methods are as powerful as the convex program (2). Unfortunately, these theories still depend
on the assumption of uniform sampling, and thus they cannot explain why there are many nonconvex
methods which often do better than the convex program (2) in practice.
The missing data in practice, however, often occurs in a nonuniform and deterministic fashion instead
of randomly. This is because the reason for an observation being missing usually depends on the
unseen observations themselves. For example, in structure from motion and magnetic resonance
imaging, the locations of the observed entries are typically concentrated around the main diagonal of
a matrix⁴, as shown in Figure 1. Moreover, as pointed out by [19, 21, 23], the incoherence condition
is indeed not so consistent with the mixture structure of multiple subspaces, which is also a ubiquitous
phenomenon in practice. There has been sparse research in the direction of nonuniform sampling,
e.g., [18, 25–27, 31]. In particular, Negahban and Wainwright [26] studied the case of weighted
entrywise sampling, which is more general than the setup of uniform sampling but still a special
form of random sampling. Király et al. [18] considered deterministic sampling; their work is the most
closely related to ours. However, they only established conditions to decide whether a particular entry of the
matrix can be restored. In other words, the setup of [18] may not handle well the dependence among
the missing entries. In summary, although matrix completion has seen considerable progress in recent
years, it still calls for more practical theories and methods.
To break through the limits of the setup of random sampling, in this paper we introduce a new
hypothesis called isomeric condition, which is a mixed concept that combines together the rank and
coherence of L0 with the locations and amount of the observed entries.
(⁴ This statement means that the observed entries are concentrated around the main diagonal after a permutation
of the sampling pattern Ω.)
In general, isomerism (the noun form of isomeric) is a very mild hypothesis and only a little bit more strict than the well-known oracle
assumption; that is, the number of observed entries in each row and column of L0 is not smaller than
the rank of L0 . It is arguable that the isomeric condition can hold even when the missing entries have
irregular locations. In particular, it is provable that the widely used assumption of uniform sampling
is sufficient to ensure isomerism, not necessary. Equipped with this new tool, isomerism, we prove a
set of theorems pertaining to missing data recovery [35] and matrix completion. For example, we
prove that, under the condition of isomerism, the exact solutions that identify the target matrix are
included as critical points by the commonly used bilinear programs. This result helps to explain the
widely observed phenomenon that there are many nonconvex methods performing better than the
convex program (2) on real-world matrix completion tasks. In summary, the contributions of this
paper mainly include:
• We invent a new hypothesis called the isomeric condition, which provably holds given the
standard assumptions of uniform sampling, low-rankness and incoherence. In addition,
we also exemplify that the isomeric condition can hold even if the target matrix L0 is not
incoherent and the missing entries are placed irregularly. Compared to the existing studies
on nonuniform sampling, our setup is more general.
• Equipped with the isomeric condition, we prove that the exact solutions that identify L0
are included as critical points by the commonly used bilinear programs. Compared to the
existing theories for nonconvex matrix completion, our theory is built upon a much weaker
assumption and can therefore partially reveal the superiority of nonconvex programs over
the convex methods based on (2).
• We prove that the isomeric condition is sufficient and necessary for the column and row
projectors of L0 to be invertible given the sampling pattern Ω. This result implies that
the isomeric condition is necessary for ensuring that the minimal rank solution to (1) can
identify the target L0.
The rest of this paper is organized as follows. Section 2 summarizes the mathematical notations used
in the paper. Section 3 introduces the proposed isomeric condition, along with some theorems for
matrix completion. Section 4 shows some empirical results and Section 5 concludes this paper. The
detailed proofs to all the proposed theorems are presented in the Supplementary Materials.
2 Notations
Capital and lowercase letters are used to represent matrices and vectors, respectively, except that the
lowercase letters, i, j, k, m, n, l, p, q, r, s and t, are used to denote some integers, e.g., the location of
an observation, the rank of a matrix, etc. For a matrix M , [M ]ij is its (i, j)th entry, [M ]i,: is its ith row
and [M ]:,j is its jth column. Let ω1 and ω2 be two 1D index sets; namely, ω1 = {i1 , i2 , · · · , ik } and
ω2 = {j1 , j2 , · · · , js }. Then [M ]ω1 ,: denotes the submatrix of M obtained by selecting the rows with
indices i1 , i2 , · · · , ik , [M ]:,ω2 is the submatrix constructed by choosing the columns j1 , j2 , · · · , js ,
and similarly for [M ]ω1 ,ω2 . For a 2D index set Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}, we imagine it
as a sparse matrix and, accordingly, define its “rows”, “columns” and “transpose” as follows: The
ith row Ωi = {j1 |(i1 , j1 ) ∈ Ω, i1 = i}, the jth column Ωj = {i1 |(i1 , j1 ) ∈ Ω, j1 = j} and the
transpose ΩT = {(j1 , i1 )|(i1 , j1 ) ∈ Ω}.
The special symbol (·)⁺ is reserved to denote the Moore-Penrose pseudo-inverse of a matrix. More
precisely, for a matrix M with Singular Value Decomposition (SVD)⁵ M = UM ΣM VM^T, its pseudo-
inverse is given by M⁺ = VM ΣM^{−1} UM^T. For convenience, we adopt the conventions of using
span{M } to denote the linear space spanned by the columns of a matrix M , using y ∈ span{M } to
denote that a vector y belongs to the space span{M }, and using Y ∈ span{M } to denote that all the
column vectors of a matrix Y belong to span{M }.
Capital letters U , V , Ω and their variants (complements, subscripts, etc.) are reserved for left singular
vectors, right singular vectors and index set, respectively. For convenience, we shall abuse the
notation U (resp. V ) to denote the linear space spanned by the columns of U (resp. V ), i.e., the
column space (resp. row space). The orthogonal projection onto the column space U , is denoted by
PU and given by PU (M ) = U U T M , and similarly for the row space PV (M ) = M V V T . The same
⁵ In this paper, SVD always refers to the skinny SVD. For a rank-r matrix M ∈ Rm×n, its SVD is of the form
UM ΣM VM^T, where UM ∈ Rm×r, ΣM ∈ Rr×r and VM ∈ Rn×r.
notation is also used to represent a subspace of matrices (i.e., the image of an operator), e.g., we say
that M ∈ PU for any matrix M which satisfies PU (M ) = M . We shall also abuse the notation Ω
to denote the linear space of matrices supported on Ω. Then the symbol PΩ denotes the orthogonal
projection onto Ω, namely,
[PΩ(M)]ij = [M]ij if (i, j) ∈ Ω, and [PΩ(M)]ij = 0 otherwise.
Similarly, the symbol PΩ⊥ denotes the orthogonal projection onto the complement space of Ω. That
is, PΩ + PΩ⊥ = I, where I is the identity operator.
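In code, PΩ and its complement are simple masking operators; a minimal numpy sketch, representing Ω as a boolean mask:

```python
import numpy as np

def P_Omega(M, mask):
    """Orthogonal projection onto matrices supported on Omega."""
    return np.where(mask, M, 0.0)

def P_Omega_perp(M, mask):
    """Projection onto the complement support, so P_Omega + P_Omega_perp = I."""
    return np.where(mask, 0.0, M)
```

Both operators are idempotent and their sum reproduces the input matrix, matching PΩ + PΩ⊥ = I.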
Three types of matrix norms are used in this paper, and they are all functions of the singular values:
1) The operator norm or 2-norm (i.e., largest singular value) denoted by kM k, 2) the Frobenius norm
(i.e., square root of the sum of squared singular values) denoted by kM kF and 3) the nuclear norm
or trace norm (i.e., sum of singular values) denoted by kM k∗ . The only used vector norm is the `2
norm, which is denoted by k · k2 . The symbol | · | is reserved for the cardinality of an index set.
3.1.1 Definitions
For ease of understanding, we shall begin with a concept called k-isomerism (or k-isomeric in
adjective form), which can be regarded as an extension of low-rankness.
Definition 3.1 (k-isomeric). A matrix M ∈ Rm×l is called k-isomeric if and only if any k rows of
M can linearly represent all rows in M . That is,
rank ([M ]ω,: ) = rank (M ) , ∀ω ⊆ {1, 2, · · · , m}, |ω| = k,
where | · | is the cardinality of an index set.
In general, k-isomerism is somewhat similar to the Spark [37], which is defined by the smallest linearly
dependent subset of the rows of a matrix. For a matrix M to be k-isomeric, it is necessary that
rank(M) ≤ k, but not sufficient. In fact, k-isomerism is also related to the concept of
coherence [4, 21]. When the coherence of a matrix M ∈ Rm×l is not too high, the rows of M are
sufficiently spread, and thus M can be k-isomeric with a small k, e.g., k = rank(M). Whenever
the coherence of M is very high, one may need a large k to satisfy the k-isomeric property. For
example, consider an extreme case where M is a rank-1 matrix with one row being 1 and everywhere
else being 0. In this case, we need k = m to ensure that M is k-isomeric.
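Definition 3.1 can be checked directly, albeit at combinatorial cost, by enumerating all row subsets of size k; a brute-force sketch (practical only for small m):

```python
import numpy as np
from itertools import combinations

def is_k_isomeric(M, k):
    """Check Definition 3.1: every set of k rows of M spans the row space of M."""
    r = np.linalg.matrix_rank(M)
    if k < r:
        return False  # necessary condition: rank(M) <= k
    return all(np.linalg.matrix_rank(M[list(w), :]) == r
               for w in combinations(range(M.shape[0]), k))
```

On the extreme example from the text (a rank-1 matrix nonzero in a single row), this check returns False for every k < m and True only at k = m.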
While Definition 3.1 involves all 1D index sets of cardinality k, we often need the isomeric property
to be associated with a certain 2D index set Ω. To this end, we define below a concept called
Ω-isomerism (or Ω-isomeric in adjective form).
Definition 3.2 (Ω-isomeric). Let M ∈ Rm×l and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Suppose
that Ωj 6= ∅ (empty set), ∀1 ≤ j ≤ n. Then the matrix M is called Ω-isomeric if and only if
rank([M]Ωj,:) = rank(M), ∀j = 1, 2, · · · , n.
Note here that only the number of rows in M is required to coincide with the row indices included in
Ω, and thereby l 6= n is allowable.
Generally, Ω-isomerism is less strict than k-isomerism. Provided that |Ωj| ≥ k, ∀1 ≤ j ≤ n, if a matrix
M is k-isomeric then M is Ω-isomeric as well, but not vice versa. For the extreme example
where M is nonzero in only one row, interestingly, M can be Ω-isomeric as long as the locations of
the nonzero elements are included in Ω.
With the notation of ΩT = {(j1 , i1 )|(i1 , j1 ) ∈ Ω}, the isomeric property could be also defined on
the column vectors of a matrix, as shown in the following definition.
Definition 3.3 (Ω/ΩT -isomeric). Let M ∈ Rm×n and Ω ⊆ {1, 2, · · · , m}×{1, 2, · · · , n}. Suppose
Ωi 6= ∅ and Ωj 6= ∅, ∀i = 1, · · · , m, j = 1, · · · , n. Then the matrix M is called Ω/ΩT -isomeric if
and only if M is Ω-isomeric and M T is ΩT -isomeric as well.
To solve Problem 1.1 without the imperfect assumption of missing at random, as will be shown later,
we need to assume that L0 is Ω/ΩT-isomeric. This condition excludes the unidentifiable cases
where some rows or columns of L0 are wholly missing. In fact, whenever L0 is Ω/ΩT-isomeric, the
number of observed entries in each row and column of L0 has to be greater than or equal to the rank
of L0; this is consistent with the results in [20]. Moreover, Ω/ΩT-isomerism actually handles well
the cases where L0 is of high coherence. For example, consider an extreme case where L0 is 1 at only
one element and 0 everywhere else. In this case, L0 cannot be Ω/ΩT-isomeric unless the nonzero
element is observed. So, generally, it is possible to restore the missing entries of a highly coherent
matrix, as long as the Ω/ΩT-isomeric condition is obeyed.
It is easy to see that the above lemma is still valid even when the condition of Ω-isomerism is replaced
by k-isomerism. Thus, hereafter, we may say that a space is isomeric (k-isomeric, Ω-isomeric or
ΩT -isomeric) as long as its basis matrix is isomeric. In addition, the isomeric property is subspace
successive, as shown in the next lemma.
Lemma 3.2. Let Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n} and U0 ∈ Rm×r be the basis matrix of a
Euclidean subspace embedded in Rm . Suppose that U is a subspace of U0 , i.e., U = U0 U0T U . If U0
is Ω-isomeric then U is Ω-isomeric as well.
The above lemma states, in short, that any subspace of an isomeric space is isomeric.
to ensure that the global minimum of (1) can identify L0, it is essentially necessary to show that
U0 ∩ Ω⊥ = {0} (resp. V0 ∩ Ω⊥ = {0}), which is equivalent to the operator PU0 PΩ PU0 (resp.
PV0 PΩ PV0) being invertible (see Lemma 6.8 of the Supplementary Materials). Interestingly, the isomeric
condition is indeed a sufficient and necessary condition for the operators PU0 PΩ PU0 and PV0 PΩ PV0
to be invertible, as shown in the following theorem.
Theorem 3.1. Let L0 ∈ Rm×n and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Let the SVD of L0 be
U0 Σ0 V0T . Denote PU0 (·) = U0 U0T (·) and PV0 (·) = (·)V0 V0T . Then we have the following:
1. The linear operator PU0 PΩ PU0 is invertible if and only if U0 is Ω-isomeric.
2. The linear operator PV0 PΩ PV0 is invertible if and only if V0 is ΩT -isomeric.
The necessity stated above implies that the isomeric condition is actually a very mild hypothesis. In
general, there are numerous reasons for the target matrix L0 to be isomeric. Particularly, the widely
used assumptions of low-rankness, incoherence and uniform sampling are indeed sufficient (but not
necessary) to ensure isomerism, as shown in the following theorem.
Theorem 3.2. Let L0 ∈ Rm×n and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Denote n1 = max(m, n)
and n2 = min(m, n). Suppose that L0 is incoherent and Ω is a 2D index set sampled uniformly
at random, namely Pr((i, j) ∈ Ω) = ρ0 and Pr((i, j) ∉ Ω) = 1 − ρ0. For any δ > 0, if ρ0 > δ
is obeyed and rank(L0) < δn2/(c log n1) holds for some numerical constant c, then, with probability
at least 1 − n1^{−10}, L0 is Ω/ΩT-isomeric.
It is worth noting that the isomeric condition can be obeyed in numerous circumstances other than
the case of uniform sampling plus incoherence. For example, take
Ω = {(1, 1), (1, 2), (1, 3), (2, 1), (3, 1)} and L0 = [1 0 0; 0 0 0; 0 0 0],
where L0 is a 3×3 matrix with 1 at (1, 1) and 0 everywhere else. In this example, L0 is not incoherent
and the sampling is not uniform either, but it can be verified that L0 is Ω/ΩT-isomeric.
3.2 Results
In this subsection, we shall show how the isomeric condition can take effect in the context of
nonuniform sampling, establishing some theorems pertaining to missing data recovery [35] as well
as matrix completion.
As we can now see, the unseen data yu can be restored, as long as the representation x0 is retrieved
by accessing only the available observations in yb. In general, there are infinitely many
representations that satisfy y0 = Ax0, e.g., x0 = A⁺y0, where (·)⁺ is the pseudo-inverse of a matrix.
Since A⁺y0 is the representation of minimal ℓ2 norm, we revisit the traditional ℓ2 program:
min_x (1/2)‖x‖₂², s.t. yb = Ab x, (6)
where ‖ · ‖2 is the ℓ2 norm of a vector. Under some verifiable conditions, the above ℓ2 program
is indeed consistently successful, in the following sense: for any y0 ∈ S0 with an arbitrary
partition y0 = [yb; yu] (i.e., arbitrarily missing), the desired representation x0 = A⁺y0 is the unique
minimizer of the problem in (6). That is, the unseen data yu is exactly recovered by first computing
the minimizer x∗ of problem (6) and then calculating yu = Au x∗.
Theorem 3.3. Let y0 = [yb ; yu ] ∈ Rm be an authentic sample drawn from some low-dimensional
subspace S0 embedded in Rm , A ∈ Rm×p be a given dictionary and k be the number of available
observations in yb . Then the convex program (6) is consistently successful, provided that S0 ⊆
span{A} and the dictionary A is k-isomeric.
Unlike the theory in [35], whose condition is unverifiable, our k-isomeric condition can be
verified in finite time. Notice that the problem of missing data recovery is closely related to matrix
completion, which in effect restores the missing entries of multiple data vectors simultaneously.
Hence, Theorem 3.3 can be naturally generalized to the case of matrix completion, as will be shown
in the next subsection.
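A sketch of the recovery procedure behind Theorem 3.3: solve (6) via the pseudo-inverse of the observed block Ab, then reconstruct the unseen block with Au (the function and variable names below are illustrative):

```python
import numpy as np

def recover_missing(A, y_b, obs_idx):
    """Recover the unseen part of y0 = A x0 from observations y_b = A_b x0.

    A       : m x p dictionary whose span contains the sample y0
    y_b     : observed entries of y0
    obs_idx : row indices of the observed entries
    """
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[obs_idx] = True
    A_b, A_u = A[mask, :], A[~mask, :]
    # Minimal-l2-norm solution of y_b = A_b x, i.e., program (6).
    x_star = np.linalg.pinv(A_b) @ y_b
    return A_u @ x_star  # estimate of y_u
```

When A is k-isomeric with k observed entries (so A_b retains full rank), the minimal-norm solution coincides with the desired representation and the unseen entries are recovered exactly.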
Theorem 3.4 tells us that, in general, even when the locations of the missing entries are interrelated
and nonuniformly distributed, the target matrix L0 can be restored as long as we have found a proper
dictionary A. This motivates us to consider the commonly used bilinear program that seeks both A
and X simultaneously:
min_{A,X} (1/2)‖A‖_F^2 + (1/2)‖X‖_F^2, s.t. PΩ(AX − L0) = 0, (8)
where A ∈ R^{m×p} and X ∈ R^{p×n}. The problem above is bilinear and therefore nonconvex, so it would be hard to obtain performance guarantees as strong as those established for convex programs, e.g., [4, 21].
Interestingly, under a very mild condition, the problem in (8) is proved to include the exact solutions that identify the target matrix L0 among its critical points.
Theorem 3.5. Let L0 ∈ Rm×n and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Denote the rank and SVD
of L0 as r0 and U0 Σ0 V0T , respectively. If L0 is Ω/ΩT -isomeric then the exact solution, denoted by
(A0 , X0 ) and given by
A0 = U0 Σ0^{1/2} Q^T,  X0 = Q Σ0^{1/2} V0^T,  ∀Q ∈ R^{p×r0}, Q^T Q = I,
is a critical point to the problem in (8).
To exhibit the power of program (8), however, the parameter p, which indicates the number of
columns in the dictionary matrix A, must be close to the true rank of the target matrix L0 . This is
Figure 2: Comparing the bilinear program (9) (p = m) with the convex method (2). Four panels, from left to right: convex (nonuniform), nonconvex (nonuniform), convex (uniform), nonconvex (uniform); each plots the success rate over 20 random trials against rank(L0) and the observation fraction. The white and black points mean "succeed" and "fail", respectively. Here success means PSNR ≥ 40dB, where PSNR stands for peak signal-to-noise ratio.
impractical in the cases where the rank of L0 is unknown. Notice that the Ω-isomeric condition imposed on A requires
rank(A) ≤ |Ωj|, ∀j = 1, 2, · · · , n.
This, together with the condition L0 ∈ span{A}, essentially requires us to solve a low-rank matrix recovery problem [14]. Hence, we suggest combining the formulation (7) with the popular idea of nuclear norm minimization, resulting in a bilinear program that jointly estimates both the dictionary matrix A and the representation matrix X by
min_{A,X} ‖A‖_* + (1/2)‖X‖_F^2, s.t. PΩ(AX − L0) = 0, (9)
which, coincidentally, has been mentioned in a paper on optimization [32]. Similar to (8), the program in (9) enjoys the following performance guarantee.
Theorem 3.6. Let L0 ∈ Rm×n and Ω ⊆ {1, 2, · · · , m} × {1, 2, · · · , n}. Denote the rank and SVD
of L0 as r0 and U0 Σ0 V0T , respectively. If L0 is Ω/ΩT -isomeric then the exact solution, denoted by
(A0 , X0 ) and given by
A0 = U0 Σ0^{2/3} Q^T,  X0 = Q Σ0^{1/3} V0^T,  ∀Q ∈ R^{p×r0}, Q^T Q = I,
is a critical point to the problem in (9).
Unlike (8), which performs well only if p is close to rank(L0) and the initial solution is chosen carefully, the bilinear program in (9) can work well by simply choosing p = m and using A = I as the initial solution. To see why, one would essentially need to characterize the conditions under which a specific optimization procedure produces an optimal solution that coincides with an exact solution. This requires extensive justification, and we leave it as future work.
4 Simulations
To verify the advantages of the nonconvex matrix completion methods over the convex program (2), we experiment with randomly generated matrices. We generate a collection of m × n (m = n = 100) target matrices according to the model L0 = BC, where B ∈ R^{m×r0} and C ∈ R^{r0×n} are N(0, 1) matrices. The rank of L0, i.e., r0, is configured as r0 = 1, 5, 10, · · · , 90, 95. Regarding the index set Ω consisting of the locations of the observed entries, we consider two settings: one creates Ω by using a Bernoulli model to randomly sample a subset of {1, · · · , m} × {1, · · · , n} (referred to as "uniform"); the other, as in Figure 1, concentrates the locations of the observed entries around the main diagonal of the matrix (referred to as "nonuniform"). The observation fraction is set to |Ω|/(mn) = 0.01, 0.05, · · · , 0.9, 0.95. For each pair (r0, |Ω|/(mn)), we run 20 trials, resulting in 8000 simulations in total.
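The data-generation protocol can be reproduced along the following lines. This is a sketch: the band half-width used for the nonuniform mask is our own heuristic, since the paper only states that the observed entries concentrate around the main diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 100
r0 = 10
B = rng.standard_normal((m, r0))
C = rng.standard_normal((r0, n))
L0 = B @ C                                    # target matrix of rank r0

rho = 0.3                                     # observation fraction |Omega|/(mn)

# "uniform": Bernoulli sampling of the index set Omega.
mask_uniform = rng.random((m, n)) < rho

# "nonuniform": keep entries within a band around the main diagonal;
# the half-width is chosen so that roughly a rho fraction is observed.
i, j = np.meshgrid(np.arange(m), np.arange(n), indexing="ij")
band = int(rho * m / 2)
mask_nonuniform = np.abs(i - j) <= band
```

A completion method would then only be given `L0[mask]` for either mask, and success can be declared when the PSNR of the reconstruction exceeds 40dB, as in Figure 2.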
When p = m and the identity matrix is used to initialize the dictionary A, we have empirically found that program (8) has the same performance as (2). This is not surprising, because it has been proven in [16] that ‖L‖_* = min_{A,X} (1/2)(‖A‖_F^2 + ‖X‖_F^2), s.t. L = AX. Figure 2 compares the bilinear
program (9) to the convex method (2). It can be seen that (9) works distinctly better than (2): when handling nonuniformly missing data, the number of matrices successfully restored by the bilinear program (9) is 102% greater than that of the convex program (2). Even when the missing entries are chosen uniformly at random, the bilinear program (9) still outperforms the convex method (2) by 44% in terms of the number of successfully restored matrices. These results illustrate that, even when the rank of L0 is unknown, the bilinear program (9) can do much better than the convex optimization based method (2).
Acknowledgment
We would like to thank the anonymous reviewers and meta-reviewers for providing many valuable comments that helped us refine this paper.
References
[1] Emmanuel Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion.
IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
[2] Emmanuel Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
[3] William E. Bishop and Byron M. Yu. Deterministic symmetric positive semidefinite matrix completion.
In Neural Information Processing Systems, pages 2762–2770, 2014.
[4] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations
of Computational Mathematics, 9(6):717–772, 2009.
[5] Eyal Heiman, Gideon Schechtman, and Adi Shraibman. Deterministic algorithms for matrix completion.
Random Structures and Algorithms, 45(2):306–317, 2014.
[6] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries.
IEEE Transactions on Information Theory, 56(6):2980–2998, 2010.
[7] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from noisy entries.
Journal of Machine Learning Research, 11:2057–2078, 2010.
[8] Akshay Krishnamurthy and Aarti Singh. Low-rank matrix and tensor completion via adaptive sampling.
In Neural Information Processing Systems, pages 836–844, 2013.
[9] Troy Lee and Adi Shraibman. Matrix completion from any given set of observations. In Neural Information
Processing Systems, pages 1781–1787, 2013.
[10] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning
large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322, 2010.
[11] Karthik Mohan and Maryam Fazel. New restricted isometry results for noisy low-rank recovery. In IEEE
International Symposium on Information Theory, pages 1573–1577, 2010.
[12] B. Recht, W. Xu, and B. Hassibi. Necessary and sufficient conditions for success of the nuclear norm
heuristic for rank minimization. Technical report, CalTech, 2008.
[13] Markus Weimer, Alexandros Karatzoglou, Quoc V. Le, and Alex J. Smola. Cofi rank - maximum margin
matrix factorization for collaborative ranking. In Neural Information Processing Systems, 2007.
[14] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis?
Journal of the ACM, 58(3):1–37, 2011.
[15] Alexander L. Chistov and Dima Grigoriev. Complexity of quantifier elimination in the theory of alge-
braically closed fields. In Proceedings of the Mathematical Foundations of Computer Science, pages
17–31, 1984.
[16] Maryam Fazel, Haitham Hindi, and Stephen P. Boyd. A rank minimization heuristic with application to
minimum order system approximation. In American Control Conference, pages 4734–4739, 2001.
[17] Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Neural
Information Processing Systems, pages 2973–2981, 2016.
[18] Franz J. Király, Louis Theran, and Ryota Tomioka. The algebraic combinatorial approach for low-rank
matrix completion. J. Mach. Learn. Res., 16(1):1391–1436, January 2015.
[19] Guangcan Liu and Ping Li. Recovery of coherent data via low-rank dictionary pursuit. In Neural
Information Processing Systems, pages 1206–1214, 2014.
[20] Daniel L. Pimentel-Alarcón and Robert D. Nowak. The information-theoretic requirements of subspace clustering with missing data. In International Conference on Machine Learning, pages 802–810, 2016.
[21] Guangcan Liu and Ping Li. Low-rank matrix completion in the presence of high coherence. IEEE
Transactions on Signal Processing, 64(21):5623–5633, 2016.
[22] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171–184, 2013.
[23] Guangcan Liu, Qingshan Liu, and Ping Li. Blessing of dimensionality: Recovering mixture data via dictionary pursuit. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):47–60, 2017.
[24] Guangcan Liu, Huan Xu, Jinhui Tang, Qingshan Liu, and Shuicheng Yan. A deterministic analysis for LRR. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):417–430, 2016.
[25] Raghu Meka, Prateek Jain, and Inderjit S. Dhillon. Matrix completion from power-law distributed samples.
In Neural Information Processing Systems, pages 1258–1266, 2009.
[26] Sahand Negahban and Martin J. Wainwright. Restricted strong convexity and weighted matrix completion:
Optimal bounds with noise. Journal of Machine Learning Research, 13:1665–1697, 2012.
[27] Yudong Chen, Srinadh Bhojanapalli, Sujay Sanghavi, and Rachel Ward. Completing any low-rank matrix, provably. Journal of Machine Learning Research, 16:2999–3034, 2015.
[28] Praneeth Netrapalli, U. N. Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain. Non-
convex robust PCA. In Neural Information Processing Systems, pages 1107–1115, 2014.
[29] Yuzhao Ni, Ju Sun, Xiaotong Yuan, Shuicheng Yan, and Loong-Fah Cheong. Robust low-rank subspace
segmentation with semidefinite guarantees. In International Conference on Data Mining Workshops, pages
1179–1188, 2013.
[30] R. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, USA, 1970.
[31] Ruslan Salakhutdinov and Nathan Srebro. Collaborative filtering in a non-uniform world: Learning with
the weighted trace norm. In Neural Information Processing Systems, pages 2056–2064, 2010.
[32] Fanhua Shang, Yuanyuan Liu, and James Cheng. Scalable algorithms for tractable schatten quasi-norm
minimization. In AAAI Conference on Artificial Intelligence, pages 2016–2022, 2016.
[33] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE
Transactions on Information Theory, 62(11):6535–6579, 2016.
[34] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. IEEE Transactions
on Information Theory, 58(5):3047–3064, 2012.
[35] Yin Zhang. When is missing data recoverable? CAAM Technical Report TR06-15, 2006.
[36] Tuo Zhao, Zhaoran Wang, and Han Liu. A nonconvex optimization framework for low rank matrix
estimation. In Neural Information Processing Systems, pages 559–567, 2015.
[37] David L. Donoho and Michael Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003.
arXiv:1907.11705v1 [cs.DS] 27 Jul 2019
Notice: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Abstract
As a paradigm for recovering the unknown entries of a matrix from partial observations, low-rank matrix completion (LRMC) has generated a great deal of interest. Over the years, there have been many works on this topic, but it might not be easy to grasp the essential knowledge from these studies, mainly because many of them are highly theoretical or propose a new LRMC technique. In this paper, we give a contemporary survey on LRMC. To provide a better view, insight, and understanding of the potentials and limitations of LRMC, we present early scattered results in a structured and accessible way. Specifically, we classify the state-of-the-art LRMC techniques into two main categories and then explain each category in detail. We next discuss the issues to consider when using LRMC techniques, including the intrinsic properties required for matrix recovery and how to exploit a special structure in LRMC design. We also discuss convolutional neural network (CNN) based LRMC algorithms that exploit the graph structure of a low-rank matrix. Further, we present the recovery performance and the computational complexity of the state-of-the-art LRMC techniques. Our hope is that this survey article will serve as a useful guide for practitioners and non-experts to catch the gist of LRMC.
I. INTRODUCTION
In the era of big data, the low-rank matrix has become a useful and popular tool for expressing two-dimensional information. One well-known example is the rating matrix in recommendation systems, which represents users' tastes on products [1]. Since users expressing similar ratings on multiple products tend to have the same interest in a new product, the columns associated with users sharing the same interest are highly likely to be the same, resulting in the low-rank structure
of the rating matrix (see Fig. 1).
Fig. 1. Netflix rating matrix example: (a) the rating matrix, with each entry an integer from 1 to 5 and zero for unknown; (b) a submatrix M of size 50 × 50; (c) the observed matrix Mo (70% of the known entries of M); (d) the matrix M̂ reconstructed via LRMC using Mo.
Another example is the Euclidean distance matrix formed by
the pairwise distances of a large number of sensor nodes. Since the rank of a Euclidean distance
matrix in the k-dimensional Euclidean space is at most k + 2 (if k = 2, then the rank is 4), this
matrix can be readily modeled as a low-rank matrix [2], [3], [4].
One major benefit of the low-rank model is that the essential information in a matrix, expressed in terms of degrees of freedom, is much smaller than the total number of entries. Therefore, even though the number of observed entries is small, we still have a good chance to recover the whole matrix. There are a variety of scenarios where the number of observed entries of a matrix is tiny. In recommendation systems, for example, users are asked to submit feedback in the form of a rating, e.g., 1 to 5 for a purchased product. However, users often do not want to leave feedback, and thus the rating matrix will have many missing entries. Also, in an internet of things (IoT) network, sensor nodes may have a limited radio communication range or suffer a power outage, so that only a small portion of the entries of the Euclidean distance matrix is available.
When there is no restriction on the rank of a matrix, the problem of recovering its unknown entries from partially observed entries is ill-posed. This is because any value can be assigned to an unknown entry, which in turn means that an infinite number of matrices agree with the observed entries. As a simple example, consider the following 2 × 2 matrix with one unknown entry marked ?:

M = [ 1  5
      2  ? ].  (1)
If M is full rank, i.e., the rank of M is two, then any value except 10 can be assigned to ?. Whereas if M is a low-rank matrix (rank one in this trivial example), the two columns differ only by a constant factor, and hence the unknown element ? can be easily determined using the linear relationship between the two columns (? = 10). This example is obviously simple, but the fundamental principle for recovering a large-dimensional matrix is not much different, and the low-rank constraint plays a central role in recovering the unknown entries of the matrix.
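The toy example can be checked directly in code: under the rank-one assumption the second column is a scalar multiple of the first, which pins down the unknown entry. This is merely a numerical check of the text.

```python
import numpy as np

# Known entries of M = [[1, 5], [2, ?]].
m11, m12, m21 = 1.0, 5.0, 2.0

# Rank one means column 2 = alpha * column 1, so alpha = m12 / m11.
alpha = m12 / m11
m22 = alpha * m21          # the unknown entry ? = 10
assert m22 == 10.0

# The completed matrix indeed has rank one.
M = np.array([[m11, m12], [m21, m22]])
assert np.linalg.matrix_rank(M) == 1
```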
Before we proceed, we discuss a few notable applications where the underlying matrix is
modeled as a low-rank matrix.
1) Recommendation system: In 2006, the online DVD rental company Netflix announced a contest to improve the quality of its movie recommendation system. The company released a training set from half a million customers, containing ratings on more than ten thousand movies, each rated on a scale from 1 to 5 [1]. The training data can be represented as a large-dimensional matrix in which each column represents the ratings of one customer for the movies. The primary goal of the recommendation system is to estimate users' interests in products using the sparsely
Fig. 2. Localization via LRMC [4]: (a) partially observed distances of sensor nodes due to the limited radio communication range r; (b) RSSI-based observation errors of 1000 sensor nodes in a 100m × 100m area; (c) reconstruction error. The Euclidean distance matrix can be recovered with 92% of distance errors below 0.5m using 30% of the observed distances.
sampled¹ rating matrix.² Often, users sharing the same interests in key factors such as the type, the price, and the appearance of a product tend to provide the same ratings on the movies. The ratings of such users might form a low-rank column space, resulting in the low-rank model of the rating matrix (see Fig. 1).
2) Phase retrieval: The problem of recovering a signal, not necessarily sparse, from the magnitude of its observations is referred to as phase retrieval. Phase retrieval is an important problem in X-ray crystallography and quantum mechanics, since only the magnitude of
¹ The Netflix dataset consists of ratings of more than 17,000 movies by more than 2.5 million users. The number of known entries is only about 1% [1].
² Customers might not necessarily rate all of the movies.
Fig. 3. Image reconstruction via LRMC. Recovered images achieve peak SNR ≥ 32dB.
the Fourier transform is measured in these applications [5]. Suppose the unknown time-domain signal m = [m0 · · · mn−1] is acquired in the form of the measured magnitude of its Fourier transform. That is,

|zω| = (1/√n) | Σ_{t=0}^{n−1} mt e^{−j2πωt/n} |, ω ∈ Ω, (2)
where Ω is the set of sampled frequencies. Further, let

fω = (1/√n) [1, e^{−j2πω/n}, · · · , e^{−j2πω(n−1)/n}]^H, (3)
M = mm^H, where m^H is the conjugate transpose of m. Then, (2) can be rewritten as

|zω|^2 = ⟨M, Fω⟩, (4)

where Fω = fω fω^H is the rank-1 matrix of the waveform fω. Using this simple transform, we can express the quadratic magnitude |zω|^2 as a linear measurement of M. In essence, the phase retrieval problem can be converted into the problem of reconstructing the rank-1 matrix M in the positive semi-definite (PSD) cone³ [5]:

min_X rank(X)
subject to ⟨X, Fω⟩ = |zω|^2, ω ∈ Ω,
X ⪰ 0.
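The lifting identity behind this formulation is easy to verify numerically: for any frequency ω, the squared magnitude of the Fourier measurement equals the linear functional ⟨M, Fω⟩ of the rank-one matrix M = mm^H. The snippet below is only a sanity check of the identity on an arbitrary test signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
m = rng.standard_normal(n) + 1j * rng.standard_normal(n)   # arbitrary complex signal
M = np.outer(m, m.conj())                                  # M = m m^H, rank one

for omega in range(n):
    # f_w from (3); the entrywise conjugation comes from the ^H in its definition.
    f = np.exp(2j * np.pi * omega * np.arange(n) / n) / np.sqrt(n)
    F = np.outer(f, f.conj())                              # F_w = f_w f_w^H (Hermitian)
    z = np.vdot(f, m)                                      # f_w^H m, the measurement in (2)
    # <M, F_w> = tr(F_w^H M) is real and equals |z_w|^2.
    assert np.isclose(abs(z) ** 2, np.trace(F @ M).real)
```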
3) Localization in IoT networks: In recent years, the internet of things (IoT) has received much attention for its plethora of applications, such as healthcare, automatic metering, environmental monitoring (temperature, pressure, moisture), and surveillance [6], [7], [2]. Since actions in IoT networks, such as fire alarms, energy transfer, and emergency requests, are made primarily at the data center, the data center should figure out the location of all devices in the network. In this scheme, called network localization (a.k.a. cooperative localization), each sensor node measures the distance information of adjacent nodes and then sends it to the data center. The data center then constructs a map of the sensor nodes using the collected distance information [8]. For various reasons, such as the power outage of a sensor node or the limited radio communication range (see Fig. 1), only a small amount of distance information is available at the data center. Also, in vehicular networks, it is not easy to measure the distances of all adjacent vehicles when a vehicle is located in a dead zone. An example of the observed Euclidean distance
matrix is

Mo = [ 0      d12^2  d13^2  ?      ?
       d21^2  0      ?      ?      ?
       d31^2  ?      0      d34^2  d35^2
       ?      ?      d43^2  0      d45^2
       ?      ?      d53^2  d54^2  0     ],
where dij is the pairwise distance between sensor nodes i and j. Since the rank of a Euclidean distance matrix M is at most k + 2 in the k-dimensional Euclidean space (k = 2 or k = 3) [3], [4], the problem of reconstructing M can be well modeled as an LRMC problem.
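The rank bound is easy to confirm: squared pairwise distances expand as ‖pi‖² + ‖pj‖² − 2pi^T pj, a sum of two rank-1 terms and one rank-k term. The quick check below uses random planar points; the positions and node count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
k, num = 2, 50                               # dimension and number of sensor nodes
P = rng.random((num, k)) * 100               # node positions in a 100m x 100m area

# Squared Euclidean distance matrix: d_ij^2 = ||p_i - p_j||^2
#                                           = ||p_i||^2 + ||p_j||^2 - 2 p_i^T p_j.
sq = (P ** 2).sum(axis=1)
M = sq[:, None] + sq[None, :] - 2 * P @ P.T

# rank(M) <= k + 2 (= 4 for planar nodes), far below its size of 50.
assert np.linalg.matrix_rank(M, tol=1e-6) <= k + 2
```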
³ If M is recovered, then the time-domain vector m can be computed by the eigenvalue decomposition of M.
each column of Φ is the pilot signal from one antenna at the BS [10], [11]. Since the number of resolvable paths P is limited in most cases, one can readily assume that rank(H) ≤ P [12]. In massive MIMO systems, P is often much smaller than the dimension of H due to the limited number of clusters around the BS. Thus, the problem of recovering H at the BS can be solved via the rank minimization problem subject to the linear constraint Y = HΦ [11].
Other than these, there is a bewildering variety of applications of LRMC in wireless communication, such as millimeter wave (mmWave) channel estimation [13], [14], topological interference management (TIM) [15], [16], [17], [18], and mobile edge caching in fog radio access networks (Fog-RAN) [19], [20].
The paradigm of LRMC has received much attention ever since the works of Fazel [21], Candes and Recht [22], and Candes and Tao [23]. Over the years, there have been many works on this topic [5], [57], [48], [49], but it might not be easy to grasp the essentials of LRMC from these studies. One reason is that many of these works are highly theoretical, building on random matrix theory, graph theory, manifold analysis, and convex optimization. Another reason is that most of these works propose a new LRMC technique, so it is difficult to catch the general idea and big picture of LRMC from them.
The primary goal of this paper is to provide a contemporary survey on LRMC, a new paradigm
to recover unknown entries of a low-rank matrix from partial observations. To provide better
view, insight, and understanding of the potentials and limitations of LRMC to researchers and
practitioners in a friendly way, we present early scattered results in a structured and accessible
way. Firstly, we classify the state-of-the-art LRMC techniques into two main categories and
then explain each category in detail. Secondly, we present issues to be considered when using
LRMC techniques. Specifically, we discuss the intrinsic properties required for low-rank matrix
recovery and explain how to exploit a special structure, such as positive semidefinite-based
structure, Euclidean distance-based structure, and graph structure, in LRMC design. Thirdly, we
compare the recovery performance and the computational complexity of LRMC techniques via
numerical simulations. We conclude the paper by commenting on the choice of LRMC techniques
and providing future research directions.
Recently, there have been a few overview papers on LRMC. An overview of LRMC algorithms
and their performance guarantees can be found in [73]. A survey with an emphasis on first-order
LRMC techniques together with their computational efficiency is presented in [74]. Our work
is distinct from the previous studies in several aspects. Firstly, we categorize the state-of-the-
art LRMC techniques into two classes and then explain the details of each class, which can
help researchers to easily determine which technique can be used for the given problem setup.
Secondly, we provide a comprehensive survey of LRMC techniques and also provide extensive
simulation results on the recovery quality and the running time complexity from which one can
easily see the pros and cons of each LRMC technique and also gain a better insight into the
choice of LRMC algorithms. Finally, we discuss how to exploit a special structure of a low-
rank matrix in the LRMC algorithm design. In particular, we introduce the CNN-based LRMC
algorithm that exploits the graph structure of a low-rank matrix.
We briefly summarize notations used in this paper.
• For a vector a ∈ Rn , diag(a) ∈ Rn×n is the diagonal matrix formed by a.
• For a matrix A ∈ Rn1 ×n2 , ai ∈ Rn1 is the i-th column of A.
• rank(A) is the rank of A.
• AT ∈ Rn2 ×n1 is the transpose of A.
• For A, B ∈ Rn1 ×n2 , hA, Bi = tr(AT B) and A⊙B are the inner product and the Hadamard
product (or element-wise multiplication) of two matrices A and B, respectively, where
tr(·) denotes the trace operator.
• kAk, kAk∗ , and kAkF stand for the spectral norm (i.e., the largest singular value), the
nuclear norm (i.e., the sum of singular values), and the Frobenius norm of A, respectively.
• σi (A) is the i-th largest singular value of A.
• 0d1 ×d2 and 1d1 ×d2 are (d1 × d2 )-dimensional matrices with entries being zero and one,
respectively.
• Id is the d-dimensional identity matrix.
• If A is a square matrix (i.e., n1 = n2 = n), diag(A) ∈ Rn is the vector formed by the
diagonal entries of A.
• vec(X) is the vectorization of X.
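A few of these conventions translate directly into numpy, for readers following along in code (a small illustration with arbitrary matrices):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

# <A, B> = tr(A^T B) coincides with summing the Hadamard product A ⊙ B.
inner = np.trace(A.T @ B)
assert inner == (A * B).sum()

# ||A||_* is the sum of singular values; ||A|| (spectral norm) is the largest one.
s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(np.linalg.norm(A, "nuc"), s.sum())
assert np.isclose(np.linalg.norm(A, 2), s[0])
```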
In this section, we discuss the principle to recover a low-rank matrix from partial observations.
Basically, the desired low-rank matrix M can be recovered by solving the rank minimization
problem
min_X rank(X)
subject to x_ij = m_ij, (i, j) ∈ Ω, (9)
where Ω is the index set of observed entries (e.g., Ω = {(1, 1), (1, 2), (2, 1)} in the example
in (1)). One can alternatively express the problem using the sampling operator PΩ . The sampling
operation PΩ (A) of a matrix A is defined as
[PΩ(A)]_ij = { a_ij  if (i, j) ∈ Ω
             { 0     otherwise.
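In code, PΩ is simply an elementwise mask; a two-line helper suffices, with Ω represented as a boolean matrix:

```python
import numpy as np

def P_omega(A, mask):
    """Sampling operator: keep entries of A where mask is True, zero elsewhere."""
    return np.where(mask, A, 0.0)

A = np.array([[1.0, 5.0], [2.0, 10.0]])
mask = np.array([[True, True], [True, False]])   # Omega = {(1,1),(1,2),(2,1)}
assert np.array_equal(P_omega(A, mask), np.array([[1.0, 5.0], [2.0, 0.0]]))
```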
Using this operator, the problem (9) can be equivalently formulated as
min_X rank(X)
subject to PΩ(X) = PΩ(M). (10)
A naive way to solve the rank minimization problem (10) is combinatorial search. Specifically, we first assume that rank(M) = 1. Then any two columns of M are linearly dependent, and thus we have the system of equations mi = αi,j mj for some αi,j ∈ R. If this system has no solution, the rank-one assumption is rejected, and we move on to the assumption rank(M) = 2. In this case, we solve the new system of equations mi = αi,j mj + αi,k mk. This procedure is repeated until a solution is found. Clearly, the combinatorial search strategy is infeasible for most practical scenarios, since its complexity is exponential in the problem size [76]. For example, when M is an n × n matrix, it can be shown that the number of systems of equations to be solved is O(n 2^n).
As a cost-effective alternative, various low-rank matrix completion (LRMC) algorithms have been proposed over the years. Roughly speaking, depending on how the rank information is used, LRMC algorithms can be classified into two main categories: 1) those without the rank information and 2) those exploiting the rank information. In this section, we provide an in-depth discussion of the two categories (see the outline of LRMC algorithms in Fig. 3).
In this subsection, we explain the LRMC algorithms that do not require the rank information
of the original low-rank matrix.
1) Nuclear Norm Minimization (NNM): Since the rank minimization problem (10) is NP-hard [21], it is computationally intractable when the dimension of the matrix is large. One common trick to avoid this computational issue is to replace the nonconvex objective function with its convex surrogate, converting the combinatorial search problem into a convex optimization problem.
There are two clear advantages in solving the convex optimization problem: 1) a local optimum
solution is globally optimal and 2) there are many efficient polynomial-time convex optimization
solvers (e.g., interior point method [77] and semi-definite programming (SDP) solver).
In the LRMC problem, the nuclear norm ‖X‖_*, the sum of the singular values of X, has been widely used as a convex surrogate of rank(X) [22]:

min_X ‖X‖_*
subject to PΩ(X) = PΩ(M). (11)
Indeed, it has been shown that the nuclear norm is the convex envelope (the "best" convex approximation) of the rank function on the set {X ∈ R^{n1×n2} : ‖X‖ ≤ 1} [21].⁴ Note that the relaxation from the rank function to the nuclear norm is conceptually analogous to the relaxation from the ℓ0-norm to the ℓ1-norm in compressed sensing (CS) [39], [40], [41].
Now, a natural question one might ask is whether the NNM problem in (11) offers a solution comparable to that of the rank minimization problem in (10). In [22], it has been shown that if the observed entries of a rank-r matrix M (∈ R^{n×n}) are suitably random and the number of observed entries exceeds a threshold (stated in [22]) that grows with the rank r, log n, and the largest coherence µ0 of M (see the definition in Subsection III-A2), then M is the unique solution of the NNM problem (11) with overwhelming probability (see Appendix B).
⁴ For any function f : C → R, where C is a convex set, the convex envelope of f is the largest convex function g such that f(x) ≥ g(x) for all x ∈ C. Note that the convex envelope of rank(X) on the set {X ∈ R^{n1×n2} : ‖X‖ ≤ 1} is the nuclear norm ‖X‖_* [21].
It is worth mentioning that the NNM problem in (11) can also be recast as a semidefinite program (SDP) (see Appendix A):

min_Y tr(Y)
subject to ⟨A_k, Y⟩ = b_k, k = 1, · · · , |Ω|, (13)
Y ⪰ 0,

where Y = [W1 X; X^T W2] ∈ R^{(n1+n2)×(n1+n2)}, {A_k}_{k=1}^{|Ω|} is the sequence of linear sampling matrices, and {b_k}_{k=1}^{|Ω|} are the observed entries. The problem (13) can be solved by off-the-shelf SDP solvers such as SDPT3 [24] and SeDuMi [25] using interior-point methods [26], [27], [28], [31], [30], [29]. It has been shown that the computational complexity of SDP techniques is O(n³), where n = max(n1, n2) [30]. Also, it has been shown that under suitable conditions, the output M̂ of SDP satisfies ‖M̂ − M‖_F ≤ ε in at most O(n^ω log(1/ε)) iterations, where ω is a positive constant [29]. Alternatively, one can reconstruct M by solving the equivalent nonconvex quadratic optimization form of the NNM problem [32]. Note that this approach has a computational benefit, since the number of primal variables of NNM is reduced from n1 n2 to r(n1 + n2) (r ≤ min(n1, n2)). Interested readers may refer to [32] for more details.
2) Singular Value Thresholding (SVT): While the solution of the NNM problem in (11) can
be obtained by solving (13), this procedure is computationally burdensome when the size of the
matrix is large.
As an effort to mitigate the computational burden, the singular value thresholding (SVT)
algorithm has been proposed [33]. The key idea of this approach is to put the regularization
term into the objective function of the NNM problem:

min_X τ‖X‖_* + (1/2)‖X‖_F^2
subject to PΩ(X) = PΩ(M), (14)

where τ is the regularization parameter. In [33, Theorem 3.1], it has been shown that the solution to the problem (14) converges to the solution of the NNM problem as τ → ∞.⁵
⁵ In practice, a large value of τ has been suggested (e.g., τ = 5n for an n × n low-rank matrix) for fast convergence of SVT. For example, when τ = 5000, it requires 177 iterations to reconstruct a 1000 × 1000 matrix of rank 10 [33].
max_Y min_X L(X, Y) = L(X̂, Ŷ) = min_X max_Y L(X, Y). (16)

The SVT algorithm finds X̂ and Ŷ in an iterative fashion. Specifically, starting with Y0 = 0_{n1×n2}, SVT updates Xk and Yk as
Theorem 1 ([33, Theorem 2.1]). Let Z be a matrix whose singular value decomposition (SVD) is Z = UΣV^T. Define t+ = max{t, 0} for t ∈ R. Then

D_τ(Z) = argmin_X τ‖X‖_* + (1/2)‖X − Z‖_F^2, (19)

where D_τ is the singular value thresholding operator defined as

D_τ(Z) = U diag({(σ_i(Σ) − τ)+}_i) V^T. (20)
Table I. The SVT algorithm:

  While T = false do
    k = k + 1
    [U_{k−1}, Σ_{k−1}, V_{k−1}] = svd(Y_{k−1})
    X_k = U_{k−1} diag({(σ_i(Σ_{k−1}) − τ)₊}_i) V^T_{k−1}   (using (20))
    Y_k = Y_{k−1} + δ_k (P_Ω(M) − P_Ω(X_k))
  End
  Output X_k
By Theorem 1, the right-hand side of (18) is D_τ(Y_{k−1}). To conclude, the update equations for X_k and Y_k are given by

  X_k = D_τ(Y_{k−1}),   (21a)
  Y_k = Y_{k−1} + δ_k (P_Ω(M) − P_Ω(X_k)),   (21b)

where δ_k is the stepsize.
One can notice from (21a) and (21b) that the SVT algorithm is computationally efficient since
we only need the truncated SVD and elementary matrix operations in each iteration. Indeed,
let rk be the number of singular values of Yk−1 being greater than the threshold τ . Also,
we suppose {rk } converges to the rank of the original matrix, i.e., limk→∞ rk = r. Then the
computational complexity of SVT is O(r n1 n2). Note also that the number of iterations to achieve the ε-approximation⁶ is O(1/√ε) [33]. In Table I, we summarize the SVT algorithm. For the details of the stopping criterion of SVT, see [33, Section 5].
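To make the update equations (21a) and (21b) concrete, here is a minimal NumPy sketch of the SVT iteration. The function name and stopping rule are our own choices, and the defaults (τ = 5n and δ ≈ 1.2/p, following the rules of thumb suggested in [33]) are assumptions rather than a reference implementation.

```python
import numpy as np

def svt_complete(M_obs, mask, tau=None, max_iter=500, tol=1e-4):
    """Sketch of the SVT iteration: X_k = D_tau(Y_{k-1}) followed by
    Y_k = Y_{k-1} + delta * (P_Omega(M) - P_Omega(X_k))."""
    n1, n2 = M_obs.shape
    if tau is None:
        tau = 5 * max(n1, n2)               # rule of thumb from [33]
    delta = 1.2 * mask.size / mask.sum()    # stepsize ~ 1.2 / sampling ratio
    Y = np.zeros((n1, n2))
    X = np.zeros((n1, n2))
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt   # D_tau: shrink singular values
        R = mask * (M_obs - X)                    # residual on observed entries
        if np.linalg.norm(R) <= tol * np.linalg.norm(mask * M_obs):
            break
        Y = Y + delta * R                         # dual ascent step (21b)
    return X
```

In a production setting one would replace the full SVD by a truncated SVD, which is what makes the per-iteration cost O(r n1 n2) as discussed above.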
Over the years, various SVT-based techniques have been proposed [35], [78], [79]. In [78], an iterative matrix completion algorithm using an SVT-based operator called the proximal operator has been proposed. Similar algorithms inspired by the iterative hard thresholding (IHT) algorithm in
CS have also been proposed [35], [79].
⁶ By ε-approximation, we mean ‖M̂ − M*‖_F ≤ ε, where M̂ is the reconstructed matrix and M* is the optimal solution of SVT.
3) Iteratively Reweighted Least Squares (IRLS) Minimization: Yet another simple and computationally efficient way to solve the NNM problem is the IRLS minimization technique [36], [37]. In essence, the NNM problem can be recast as the least squares minimization

  min_{X,W}  ‖W^{1/2} X‖²_F
  subject to P_Ω(X) = P_Ω(M),   (22)

where W = (XX^T)^{−1/2}. It can be shown that (22) is equivalent to the NNM problem (11) since we have [36]

  ‖X‖_* = tr((XX^T)^{1/2}) = ‖W^{1/2} X‖²_F.   (23)
The key idea of the IRLS technique is to find X and W in an iterative fashion. The update expressions are

  X_k = arg min_{P_Ω(X)=P_Ω(M)} ‖W^{1/2}_{k−1} X‖²_F,   (24a)
  W_k = (X_k X_k^T)^{−1/2}.   (24b)
Note that the weighted least squares subproblem (24a) can be easily solved by updating each column of X_k separately [36]. In order to compute W_k, we need the matrix inversion in (24b). To avoid ill-behavior when some of the singular values of X_k approach zero, an approach using a perturbation of the singular values has been proposed [36], [37]. Similar to SVT, the per-iteration computational complexity of the IRLS-based technique is O(r n1 n2). Also, IRLS requires O(log(1/ε)) iterations to achieve the ε-approximation solution. We summarize the IRLS minimization technique in Table II.
In many applications such as localization in IoT networks, recommendation systems, and image restoration, we encounter situations where the rank of the desired matrix is known in advance. As mentioned, the rank of a Euclidean distance matrix in a localization problem is at most k + 2 (k is the dimension of the Euclidean space). In this situation, the LRMC problem can be formulated as a Frobenius norm minimization (FNM) problem:
  min_X  (1/2)‖P_Ω(M) − P_Ω(X)‖²_F
  subject to rank(X) ≤ r.   (25)
Table II. The IRLS minimization technique:

  Input: a constant q ≥ r, a scaling parameter γ > 0, and a stopping criterion T
  Initialize: iteration counter k = 0, a regularizing sequence ε₀ = 1, and W₀ = I
  While T = false do
    k = k + 1
    X_k = arg min_{P_Ω(X)=P_Ω(M)} ‖W^{1/2}_{k−1} X‖²_F
    ε_k = min(ε_{k−1}, γ σ_{q+1}(X_k))
    Compute an SVD-perturbed version X̃_k of X_k [36]
    W_k = (X̃_k X̃_k^T)^{−1/2}
  End
  Output X_k
Since the rank constraint is an inequality, an approach using approximate rank information (e.g., an upper bound on the rank) has been proposed [43]. The FNM problem has two main advantages: 1) the problem is well-posed in the noisy scenario, and 2) the cost function is differentiable, so that various gradient-based optimization techniques (e.g., gradient descent, conjugate gradient, Newton methods, and manifold optimization) can be used to solve the problem.
Over the years, various techniques to solve the FNM problem in (25) have been proposed [43],
[44], [45], [46], [47], [48], [49], [50], [51], [57]. Performance guarantees for the FNM-based techniques have also been provided [59], [60], [61]. It has been shown that under suitable conditions on the sampling ratio p = |Ω|/(n1 n2) and the largest coherence µ0 of M (see the definition in Subsection III-A2), the gradient-based algorithms globally converge to M with high probability [60]. Well-known FNM-based LRMC techniques include greedy techniques
[43], alternating projection techniques [45], and optimization over Riemannian manifold [50].
In this subsection, we explain these techniques in detail.
1) Greedy Techniques: In recent years, greedy algorithms have been popularly used for LRMC due to their computational simplicity. In a nutshell, they solve the LRMC problem by making a heuristic decision at each iteration in the hope of finding the right solution in the end.
Let r be the rank of a desired low-rank matrix M ∈ R^{n×n} and M = UΣV^T be its singular value decomposition. Then M can be expressed as a linear combination of r rank-one matrices. The main task of greedy techniques is to investigate the atom set A_M = {φ_i = u_i v_i^T}^r_{i=1} of rank-one matrices representing M. Once the atom set A_M is found, the singular values σ_i(M) = σ_i can be computed easily by solving the following problem:
  (σ₁, · · · , σ_r) = arg min_{α_i} ‖P_Ω(M) − P_Ω(Σ^r_{i=1} α_i φ_i)‖_F.   (27)

To be specific, let A = [vec(P_Ω(φ₁)) · · · vec(P_Ω(φ_r))], α = [α₁ · · · α_r]^T, and b = vec(P_Ω(M)). Then we have (σ₁, · · · , σ_r) = arg min_α ‖b − Aα‖₂ = A†b.
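The pseudo-inverse computation above takes only a few lines of NumPy; the helper name and the full-observation demo below are ours.

```python
import numpy as np

def fit_singular_values(M_obs, mask, atoms):
    """Solve (27): refit the weights of given rank-one atoms by least
    squares on the observed entries, i.e., alpha = A^dagger b."""
    idx = mask.astype(bool)
    A = np.column_stack([phi[idx] for phi in atoms])   # vec(P_Omega(phi_i))
    b = M_obs[idx]                                     # vec(P_Omega(M))
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha

# demo: with all entries observed, the weights recover the singular values
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))
U, s, Vt = np.linalg.svd(M, full_matrices=False)
atoms = [np.outer(U[:, i], Vt[i]) for i in range(3)]
alpha = fit_singular_values(M, np.ones_like(M), atoms)
print(np.allclose(alpha, s[:3]))   # True
```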
One popular greedy technique is atomic decomposition for minimum rank approximation (ADMiRA) [43], which can be viewed as an extension of the compressive sampling matching pursuit (CoSaMP) algorithm in CS [38], [39], [40], [41]. ADMiRA employs a strategy of adding as well as pruning to identify the atom set A_M. In the adding stage, ADMiRA identifies the 2r rank-one matrices that best represent the residual and then adds them to the pre-chosen atom set. Specifically, if X_{i−1} is the output matrix generated in the (i−1)-th iteration and A_{i−1} is its atom set, then ADMiRA computes the residual R_i = P_Ω(M) − P_Ω(X_{i−1}) and adds the 2r leading principal components of R_i to A_{i−1}. In other words, the enlarged atom set Ψ_i is given by
  Ψ_i = A_{i−1} ∪ {u_{R_i,j} v^T_{R_i,j} : 1 ≤ j ≤ 2r},   (28)

where u_{R_i,j} and v_{R_i,j} are the j-th principal left and right singular vectors of R_i, respectively.
Note that Ψ_i contains at most 3r elements. In the pruning stage, ADMiRA refines Ψ_i into a set of r atoms. To be specific, if X̃_i is the best rank-3r approximation of M, i.e.,⁷

  X̃_i = arg min_{X∈span(Ψ_i)} ‖P_Ω(M) − P_Ω(X)‖_F,   (29)

then the pruned atom set A_i consists of the r leading rank-one atoms of X̃_i, i.e., A_i = {u_{X̃_i,j} v^T_{X̃_i,j} : 1 ≤ j ≤ r}.   (30)

⁷ Note that the solution to (29) can be computed in a similar way as in (27).
Table III. The ADMiRA algorithm:

  While T = false do
    R_k = P_Ω(M) − P_Ω(X_k)
    [U_{R_k}, Σ_{R_k}, V_{R_k}] = svds(R_k, 2r)
    (Augment) Ψ_{k+1} = A_k ∪ {u_{R_k,j} v^T_{R_k,j} : 1 ≤ j ≤ 2r}
    X̃_{k+1} = arg min_{X∈span(Ψ_{k+1})} ‖P_Ω(M) − P_Ω(X)‖_F   (using (27))
    [U_{X̃_{k+1}}, Σ_{X̃_{k+1}}, V_{X̃_{k+1}}] = svds(X̃_{k+1}, r)
    (Prune) A_{k+1} = {u_{X̃_{k+1},j} v^T_{X̃_{k+1},j} : 1 ≤ j ≤ r}
    X_{k+1} = arg min_{X∈span(A_{k+1})} ‖P_Ω(M) − P_Ω(X)‖_F   (using (27))
    k = k + 1
  End
  Output A_k, X_k
Here, u_{X̃_i,j} and v_{X̃_i,j} are the j-th principal left and right singular vectors of X̃_i, respectively.
The computational complexity of ADMiRA is mainly due to two operations: the least squares operation in (27) and the SVD-based operation to find the leading atoms of the required matrices (e.g., R_k and X̃_{k+1}). First, since (27) involves the pseudo-inverse of A (of size |Ω| × O(r)), its computational cost is O(r|Ω|). Second, the computational cost of performing a truncated SVD of O(r) atoms is O(r n1 n2). Since |Ω| < n1 n2, the computational complexity of ADMiRA per iteration is O(r n1 n2). Also, the number of iterations of ADMiRA to achieve the ε-approximation is O(log(1/ε)) [43]. In Table III, we summarize the ADMiRA algorithm.
Yet another well-known greedy method is the rank-one matrix pursuit algorithm [44], an extension of the orthogonal matching pursuit algorithm in CS [42]. In this approach, instead of choosing multiple atoms at a time, a single atom corresponding to the largest singular value of the residual matrix R_k is chosen in each iteration.
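A compact sketch of this greedy loop is given below; the function name and the choice to refit all weights via (27) after every new atom are our own, and the full SVD stands in for the truncated SVD a practical implementation would use.

```python
import numpy as np

def rank_one_pursuit(M_obs, mask, r):
    """Greedy sketch: pick the top singular pair of the residual on Omega,
    then refit all atom weights by the least squares problem (27)."""
    idx = mask.astype(bool)
    X = np.zeros_like(M_obs)
    atoms = []
    for _ in range(r):
        R = np.where(idx, M_obs - X, 0.0)                # residual on Omega
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        atoms.append(np.outer(U[:, 0], Vt[0]))           # new rank-one atom
        A = np.column_stack([a[idx] for a in atoms])     # vec(P_Omega(phi_i))
        alpha, *_ = np.linalg.lstsq(A, M_obs[idx], rcond=None)
        X = sum(w * a for w, a in zip(alpha, atoms))
    return X
```

With full observation, r iterations reproduce the best rank-r approximation of a rank-r matrix exactly, which is a useful sanity check.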
2) Alternating Minimization Techniques: Many LRMC algorithms [33], [43] require the computation of a (partial) SVD to obtain the singular values and vectors (at a cost of O(rn²)). As an effort to further reduce the computational burden of the SVD, alternating minimization techniques have been proposed [45], [46], [47]. The basic premise behind the alternating minimization techniques is that a low-rank matrix M ∈ R^{n1×n2} of rank r can be factorized into tall and fat matrices, i.e., M = XY where X ∈ R^{n1×r} and Y ∈ R^{r×n2} (r ≪ n1, n2). The key idea of this approach is to find X and Y minimizing the residual, defined as the difference between the original matrix and its estimate on the sampling space. In other words, they recover X and
Y by solving
  min_{X,Y}  (1/2)‖P_Ω(M) − P_Ω(XY)‖²_F.   (31)

Power factorization, a simple alternating minimization algorithm, finds the solution to (31) by updating X and Y alternately as [45]

  X_k = arg min_X ‖P_Ω(M) − P_Ω(X Y_{k−1})‖²_F,
  Y_k = arg min_Y ‖P_Ω(M) − P_Ω(X_k Y)‖²_F.
Alternating steepest descent (ASD) is another alternating method for finding the solution [46]. The key idea of ASD is to update X and Y by applying the steepest gradient descent method to the objective function f(X, Y) = (1/2)‖P_Ω(M) − P_Ω(XY)‖²_F in (31). Specifically, ASD first computes the gradient of f(X, Y) with respect to X and then updates X along the steepest descent direction:

  X_{i+1} = X_i − t_{x_i} ∇f_{Y_i}(X_i),

where the gradient is ∇f_{Y_i}(X_i) = −(P_Ω(M) − P_Ω(X_i Y_i)) Y_i^T and the exact line-search stepsize is t_{x_i} = ‖∇f_{Y_i}(X_i)‖²_F / ‖P_Ω(∇f_{Y_i}(X_i) Y_i)‖²_F [46]. Y is then updated in the same manner with the roles of X and Y exchanged.
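A minimal NumPy sketch of the ASD iteration follows; the random initialization and fixed iteration count are our own choices, while the stepsizes are the exact line-search values of [46].

```python
import numpy as np

def asd_complete(M_obs, mask, r, max_iter=500, seed=0):
    """Sketch of alternating steepest descent on
    f(X, Y) = 0.5 * ||P_Omega(M - X Y)||_F^2 with exact line search."""
    n1, n2 = M_obs.shape
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n1, r))
    Y = rng.standard_normal((r, n2))
    for _ in range(max_iter):
        R = mask * (M_obs - X @ Y)
        gX = -R @ Y.T                                  # gradient w.r.t. X
        t = np.sum(gX**2) / max(np.sum((mask * (gX @ Y))**2), 1e-32)
        X -= t * gX                                    # exact line-search step
        R = mask * (M_obs - X @ Y)
        gY = -X.T @ R                                  # gradient w.r.t. Y
        t = np.sum(gY**2) / max(np.sum((mask * (X @ gY))**2), 1e-32)
        Y -= t * gY
    return X @ Y
```

Each iteration costs only matrix products with the r-column factors, which is what makes ASD attractive relative to SVD-based methods.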
The low-rank matrix fitting (LMaFit) algorithm finds the solution in a different way by solving [47]

  arg min_{X,Y,Z} {‖XY − Z‖²_F : P_Ω(Z) = P_Ω(M)}.   (37)

With arbitrary inputs X₀ ∈ R^{n1×r} and Y₀ ∈ R^{r×n2} and Z₀ = P_Ω(M), the variables X, Y, and Z are updated in the i-th iteration by alternating least squares: X and Y are updated to fit the current Z, and Z is then refreshed as XY on the unobserved entries while being kept equal to M on Ω [47].
3) Optimization over Smooth Riemannian Manifold: In many applications where the rank of a matrix is known a priori (i.e., rank(M) = r), one can strengthen the constraint of (25) by defining the feasible set, denoted by F, as

  F = {X ∈ R^{n1×n2} : rank(X) = r}.   (38)

Note that F is not a vector space⁸ and thus conventional optimization techniques cannot be used to solve the problem defined over F. While this is bad news, a remedy for this is

⁸ This is because if rank(X) = r and rank(Y) = r, then rank(X + Y) = r is not necessarily true (and thus X + Y need not belong to F).
that F is a smooth Riemannian manifold [53], [48]. Roughly speaking, a smooth manifold is a generalization of R^{n1×n2} on which a notion of differentiability exists. For a more rigorous definition, see, e.g., [55], [56]. A smooth manifold equipped with an inner product, often called a Riemannian metric, forms a smooth Riemannian manifold. Since the smooth Riemannian manifold is a differentiable structure equipped with an inner product, one can use all the ingredients needed to solve an optimization problem with a quadratic cost function, such as the Riemannian gradient, Hessian matrix, exponential map, and parallel translation [55]. Therefore, optimization techniques in R^{n1×n2} (e.g., steepest descent, Newton method, conjugate gradient method) can be used to solve (25) over the smooth Riemannian manifold F.
In recent years, many efforts have been made to solve matrix completion over smooth Riemannian manifolds. These works are classified by their specific choice of Riemannian manifold
structure. One well-known approach is to solve (25) over the Grassmann manifold⁹ of orthogonal matrices [49]. In this approach, the feasible set can be expressed as F = {QR^T : Q^T Q = I, Q ∈ R^{n1×r}, R ∈ R^{n2×r}}, and thus solving (25) amounts to finding an n1 × r orthonormal matrix Q satisfying

  f(Q) = min_{R∈R^{n2×r}} ‖P_Ω(M) − P_Ω(QR^T)‖²_F = 0.   (39)

In [49], an approach to solve (39) over the Grassmann manifold has been proposed.
Recently, it has been shown that the original matrix can be reconstructed by unconstrained optimization over the smooth Riemannian manifold F [50]. Often, F is expressed using the singular value decomposition as

  F = {UΣV^T : U ∈ R^{n1×r}, V ∈ R^{n2×r}, U^T U = V^T V = I_r, Σ = diag(σ₁, · · · , σ_r), σ₁ ≥ · · · ≥ σ_r > 0}.   (40)

The FNM problem (25) can then be reformulated as an unconstrained optimization over F:

  min_{X∈F}  (1/2)‖P_Ω(M) − P_Ω(X)‖²_F.   (41)
One can easily obtain closed-form expressions of the required ingredients, such as tangent spaces, the Riemannian metric, the Riemannian gradient, and the Hessian matrix, for this unconstrained optimization [53], [55], [56]. In fact, the major benefits of the Riemannian optimization-based LRMC techniques are the simplicity of implementation and fast convergence. Similar to ASD, the computational complexity per iteration of these techniques is O(r|Ω| + r²n1 + r²n2), and they require O(log(1/ε)) iterations to achieve the ε-approximation solution [50].

⁹ The Grassmann manifold is defined as the set of linear subspaces of a vector space [55].
4) Truncated NNM: Truncated NNM is a variation of the NNM-based technique requiring the rank information r.¹⁰ While the NNM technique takes into account all the singular values of a desired matrix, truncated NNM considers only the n − r smallest singular values [57]. Specifically, truncated NNM finds a solution to

  min_X  ‖X‖_r
  subject to P_Ω(X) = P_Ω(M),   (42)

where ‖X‖_r = Σ^n_{i=r+1} σ_i(X). We recall that σ_i(X) is the i-th largest singular value of X. Using [57]

  Σ^r_{i=1} σ_i = max_{U^T U = V^T V = I_r} tr(U^T X V),   (43)

we have ‖X‖_r = ‖X‖_* − max_{U^T U = V^T V = I_r} tr(U^T X V).

¹⁰ Although truncated NNM is a variant of NNM, we put it into the second category since it exploits the rank information of a low-rank matrix.
To solve the resulting problem, optimization techniques such as the alternating direction method of multipliers (ADMM) [82] and the accelerated proximal gradient line search method (APGL) [83] can be employed. Note also that the dominant operation is the truncated SVD, whose complexity is O(r n1 n2); this is much smaller than that of the NNM technique (see Table V) as long as r ≪ min(n1, n2). Similar to SVT, the iteration complexity of truncated NNM to achieve the ε-approximation is O(1/√ε) [57]. Alternatively, the difference of two convex functions (DC) based algorithm can be used to solve (45) [58]. In Table IV, we summarize the truncated NNM algorithm.
In this section, we study the main principles that make the recovery of a low-rank matrix
possible and discuss how to exploit a special structure of a low-rank matrix in algorithm design.
A. Intrinsic Properties
There are two key properties characterizing the LRMC problem: 1) sparsity of the observed
entries and 2) incoherence of the matrix. Sparsity indicates that an accurate recovery of the
undersampled matrix is possible even when the number of observed entries is very small.
Incoherence indicates that nonzero entries of the matrix should be spread out widely for the
efficient recovery of a low-rank matrix. In this subsection, we go over these issues in detail.
1) Sparsity of Observed Entries: Sparsity expresses the idea that when a matrix has a low-rank property, it can be recovered using only a small number of observed entries. A natural question arising from this is: how many entries do we need to observe for accurate recovery of the matrix? In order to answer this question, we need the notion of the degrees of freedom (DOF). The DOF of a matrix is the number of freely chosen variables in the matrix. One can easily see that the DOF of the rank-one matrix in (1) is 3, since the remaining entry is determined after observing three. As another example, consider the following rank-one matrix:
  M = [ 1  3  5  7
        2  6 10 14
        3  9 15 21
        4 12 20 28 ].   (47)
One can easily see that if we observe all entries of one column and one row, then the rest can be determined by the simple linear relationships between them, since M is a rank-one matrix. Specifically, suppose we observe the first row and the first column. The second column is the first column scaled by three, so once we know one entry of the second column, the rest of that column is recovered; the same argument applies to the remaining columns. Thus, the DOF of M is 4 + 4 − 1 = 7. The following lemma generalizes this observation.
Lemma 2. The DOF of a square n × n matrix with rank r is 2nr − r². Also, the DOF of an n1 × n2 matrix with rank r is (n1 + n2)r − r².

Proof: Since the rank of the matrix is r, we can freely choose the values of all entries of r columns, resulting in nr degrees of freedom for the first r columns. Once r independent columns, say m₁, · · · , m_r, are constructed, each of the remaining n − r columns can be expressed as a linear combination of the first r columns (e.g., m_{r+1} = α₁m₁ + · · · + α_r m_r), so that r linear coefficients (α₁, · · · , α_r) can be freely chosen for each of these columns. By adding nr and (n − r)r, we obtain the desired result. The generalization to an n1 × n2 matrix is straightforward.
This lemma says that if n is large and r is small enough (e.g., r = O(1)), essential information
in a matrix is just in the order of n, DOF= O(n), which is clearly much smaller than the total
number of entries of the matrix. Interestingly, the DOF is the minimum number of observed
entries required for the recovery of a matrix. If this condition is violated, that is, if the number of
observed entries is less than the DOF (i.e., m < 2nr − r 2 ), no algorithm whatsoever can recover
the matrix. In Fig. 5, we illustrate how to recover the matrix when the number of observed entries equals the DOF.

Fig. 5. LRMC with colored entries being observed. The dotted boxes are used to compute: (a) the linear coefficients and (b) the unknown entries.

In this figure, we assume that the blue-colored entries are observed.¹¹ In a nutshell, the unknown entries of the matrix are found in a two-step process. First, we identify the
linear relationship between the first r columns and the rest. For example, the (r + 1)-th column
can be expressed as a linear combination of the first r columns. That is,
mr+1 = α1 m1 + · · · + αr mr . (48)
Since the first r entries of m₁, · · · , m_{r+1} are observed (see Fig. 5(a)), we have r unknowns (α₁, · · · , α_r) and r equations, so we can identify the linear coefficients α₁, · · · , α_r at the O(r³) computational cost of an r × r matrix inversion. Once these coefficients are identified, we can recover the unknown entries of m_{r+1} using the linear relationship in (48) (see Fig. 5(b)). By repeating this step for the rest of the columns, we can identify all unknown entries with O(rn²) computational complexity.¹²
¹¹ Since we observe the first r rows and columns, we have 2nr − r² observations in total.
¹² For each unknown entry, we need r multiplications and r − 1 additions. Since the number of unknown entries is (n − r)², the computational cost is (2r − 1)(n − r)². Recall that O(r³) is the cost of computing (α₁, · · · , α_r) in (48). Thus, the total cost is O(r³ + (2r − 1)(n − r)²) = O(rn²).
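The two-step procedure above can be sketched directly; the function and demo below are illustrative (names ours), and we assume the leading r × r block of observed entries is invertible.

```python
import numpy as np

def recover_from_cross(M_obs, r):
    """Two-step recovery when the first r rows and columns are fully
    observed: solve for the linear coefficients (48), then fill in the
    unknown entries of each remaining column."""
    n1, n2 = M_obs.shape
    X = M_obs.copy()
    A = M_obs[:r, :r]                 # observed r x r corner (assumed invertible)
    for j in range(r, n2):
        alpha = np.linalg.solve(A, M_obs[:r, j])   # coefficients in (48)
        X[r:, j] = M_obs[r:, :r] @ alpha           # unknown entries of column j
    return X

# demo: rank-2 matrix, only the first 2 rows/columns observed (DOF entries)
rng = np.random.default_rng(1)
M = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))
M_obs = np.zeros_like(M)
M_obs[:2, :] = M[:2, :]
M_obs[:, :2] = M[:, :2]
print(np.allclose(recover_from_cross(M_obs, 2), M))   # True
```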
Fig. 6. LRMC when the (r, l)-th entry is unobserved.
Now, an astute reader might notice that this strategy will not work if even one entry of the column (or row) is unobserved. As illustrated in Fig. 6, if only one entry in the r-th row, say the (r, l)-th entry, is unobserved, then one cannot recover the l-th column, simply because the matrix in Fig. 6 cannot be converted to the matrix form in Fig. 5(b). It is clear from this discussion that a measurement size equal to the DOF is not enough in most cases; in fact, it is just a necessary condition for the accurate recovery of a rank-r matrix. This seems like depressing news. However, the DOF is in any case important since it is a fundamental limit (lower bound) on the number of observed entries needed to ensure the exact recovery of the matrix. Recent results show that the DOF is not much different from the number of measurements ensuring the recovery of the matrix [22], [75].¹³
2) Coherence: If nonzero elements of a matrix are concentrated in a certain region, we
generally need a large number of observations to recover the matrix. On the other hand, if
the matrix is spread out widely, then the matrix can be recovered with a relatively small number
¹³ In [75], it has been shown that the required number of entries to recover the matrix using nuclear-norm minimization is on the order of n^1.2 when the rank is O(1).
Fig. 7. Coherence of the matrices in (52) and (53): (a) maximum and (b) minimum.
of entries. For example, consider the following two rank-one matrices in R^{n×n}:

  M₁ = [ 1 1 0 · · · 0
         1 1 0 · · · 0
         0 0 0 · · · 0
         · · ·
         0 0 0 · · · 0 ],

  M₂ = [ 1 1 1 · · · 1
         1 1 1 · · · 1
         · · ·
         1 1 1 · · · 1 ].
The matrix M₁ has only four nonzero entries, located in its top-left corner. Suppose n is large, say n = 1000, and all entries but the four elements in the top-left corner are observed (99.99% of the entries are known). In this case, even though the rank of the matrix is just one, there is no way to recover it, since the information-bearing entries are missing. This tells us that even when the rank of a matrix is very small, one might not be able to recover it if the nonzero entries are concentrated in a certain area.

In contrast to M₁, one can accurately recover the matrix M₂ with only 2n − 1 (= DOF) known entries; in other words, one row and one column are enough to recover M₂. One can deduce from this example that the spread of the entries is important for the identification of the unknown entries.
In order to quantify this, we need to measure the concentration of a matrix. Since a matrix has a two-dimensional structure, we need to check the concentration in both the row and column directions. This can be done by checking the concentration of the left and right singular vectors. Recall that the SVD of a matrix is

  M = UΣV^T = Σ^r_{i=1} σ_i u_i v_i^T,   (49)

where U = [u₁ · · · u_r] and V = [v₁ · · · v_r] are the matrices constructed from the left and right singular vectors, respectively, and Σ is the diagonal matrix whose diagonal entries are σ_i.
From (49), we see that the concentration in the vertical direction (concentration in the rows) is determined by u_i, and that in the horizontal direction (concentration in the columns) is determined by v_i. For example, if one of the standard basis vectors e_i, say e₁ = [1 0 · · · 0]^T, lies in the space spanned by u₁, · · · , u_r while the others (e₂, e₃, · · · ) are orthogonal to this space, then it is clear that the nonzero entries of the matrix are only in the first row. In this case, one clearly cannot infer the entries of the first row from samples of the other rows. That is, it is not possible to recover the matrix without observing the entire first row.
The coherence, a measure of the concentration of a matrix, is formally defined as [75]

  µ(U) = (n/r) max_{1≤i≤n} ‖P_U e_i‖²,   (50)

where e_i is the i-th standard basis vector and P_U is the projection onto the range space of U. Since the columns of U = [u₁ · · · u_r] are orthonormal, we have

  P_U = UU^T.   (51)

Note that both µ(U) and µ(V) should be computed to check the concentration in the vertical and horizontal directions. One can show that 1 ≤ µ(U) ≤ n/r.
Proof: The upper bound follows by noting that the ℓ₂-norm of a projection is not greater than that of the original vector (‖P_U e_i‖² ≤ ‖e_i‖² = 1). The lower bound holds because

  max_i ‖P_U e_i‖² ≥ (1/n) Σ^n_{i=1} ‖P_U e_i‖²
                   = (1/n) Σ^n_{i=1} e_i^T P_U e_i
                   = (1/n) Σ^n_{i=1} e_i^T UU^T e_i
                   = (1/n) Σ^n_{i=1} Σ^r_{j=1} |u_{ij}|²
                   = r/n,

where the first equality is due to the idempotency of P_U (i.e., P_U^T P_U = P_U) and the last equality is because Σ^n_{i=1} |u_{ij}|² = 1.
The coherence is maximized when the nonzero entries of a matrix are concentrated in one row (or column). For example, consider the matrix whose nonzero entries are concentrated in the first row:

  M = [ 3 2 1
        0 0 0
        0 0 0 ].   (52)

The SVD of M is

  M = σ₁ u₁ v₁^T = 3.7417 [1 0 0]^T [0.8018 0.5345 0.2673].

Then U = [1 0 0]^T, and thus ‖P_U e₁‖² = 1 and ‖P_U e₂‖² = ‖P_U e₃‖² = 0. As shown in Fig. 7(a), the standard basis vector e₁ lies in the space spanned by U while the others are orthogonal to this space, so the maximum coherence is achieved (max_i ‖P_U e_i‖² = 1 and µ(U) = 3).
In contrast, the coherence is minimized (µ(U) = 1) when the nonzero entries of the matrix are spread out evenly, as in the all-ones matrix M₂ above (see Fig. 7(b)).
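As a quick numerical check of (50) (the helper name is ours):

```python
import numpy as np

def coherence(U):
    """mu(U) = (n/r) * max_i ||P_U e_i||^2; with orthonormal columns,
    ||P_U e_i||^2 = ||U^T e_i||^2 is the squared norm of the i-th row of U."""
    n, r = U.shape
    return (n / r) * np.max(np.sum(U**2, axis=1))

# matrix (52): all mass in one row -> maximum coherence mu(U) = n/r = 3
U, _, _ = np.linalg.svd(np.array([[3., 2., 1.], [0., 0., 0.], [0., 0., 0.]]))
print(coherence(U[:, :1]))               # ~3.0 (maximum)

# evenly spread mass (all-ones structure) -> minimum coherence mu(U) = 1
print(coherence(np.full((4, 1), 0.5)))   # ~1.0 (minimum)
```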
In many practical situations the matrix has a certain structure, and we want to make the most of that structure to improve performance and reduce computational complexity. We go over several cases, including LRMC of the PSD matrix [54], the Euclidean distance matrix [4], and the recommendation matrix [67], and discuss how the special structure can be exploited in algorithm design.
1) Low-Rank PSD Matrix Completion: In some applications, a desired matrix M ∈ R^{n×n} not only has a low-rank structure but also is positive semidefinite (i.e., M = M^T and z^T Mz ≥ 0 for any vector z). In this case, the problem of recovering M can be formulated as

  min_X  rank(X)
  subject to P_Ω(X) = P_Ω(M),
             X = X^T, X ⪰ 0.   (54)

Similar to the rank minimization problem (10), the problem (54) can be relaxed using the nuclear norm, and the relaxed problem can be solved via SDP solvers.
The problem (54) can be simplified if the rank of the desired matrix is known in advance. Let rank(M) = k. Then, since M is positive semidefinite, there exists a matrix Z ∈ R^{n×k} such that M = ZZ^T. Using this, the problem (54) can be concisely expressed as

  min_{Z∈R^{n×k}}  (1/2)‖P_Ω(M) − P_Ω(ZZ^T)‖²_F.   (55)
Since (55) is an unconstrained optimization problem with a differentiable cost function, many
gradient-based techniques such as steepest descent, conjugate gradient, and Newton methods can
be applied. It has been shown that under suitable conditions of the coherence property of M and
the number of the observed entries |Ω|, the global convergence of gradient-based algorithms is
guaranteed [59].
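A gradient descent sketch for (55) follows; the initialization scale and the spectral-norm-based stepsize rule are our own heuristics, not taken from a specific reference.

```python
import numpy as np

def psd_complete(M_obs, mask, k, max_iter=3000, seed=0):
    """Sketch of gradient descent on f(Z) = 0.5*||P_Omega(M - Z Z^T)||_F^2;
    for a symmetric mask the gradient is -2 * (mask * (M - Z Z^T)) @ Z."""
    n = M_obs.shape[0]
    rng = np.random.default_rng(seed)
    Z = 0.1 * rng.standard_normal((n, k))     # small random initialization
    step = 0.25 / np.linalg.norm(M_obs, 2)    # heuristic stepsize ~ 1/sigma_1
    for _ in range(max_iter):
        R = mask * (M_obs - Z @ Z.T)          # residual on observed entries
        Z = Z + 2.0 * step * R @ Z            # descend along -grad f
    return Z @ Z.T
```

By construction the output ZZ^T is symmetric and positive semidefinite, so the structural constraints of (54) are satisfied automatically.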
  subject to rank(D) ≤ k + 2,
             D = D^T,   (57)
             −(I_n − (1/n)hh^T) D (I_n − (1/n)hh^T) ⪰ 0,

where h is the n-dimensional all-ones vector.
Let Y = ZZ^T, where Z = [z₁ · · · z_n]^T ∈ R^{n×k} is the matrix of sensor locations. Then one can easily check that

  D = diag(Y)h^T + h diag(Y)^T − 2Y,   (58)

where diag(Y) is the column vector of the diagonal entries of Y. Thus, by letting g(Y) = diag(Y)h^T + h diag(Y)^T − 2Y, the problem in (57) can be equivalently formulated as

  min_Y  ‖P_Ω(g(Y)) − P_Ω(M)‖²_F
  subject to Y = Y^T, Y ⪰ 0.   (59)

Since the feasible set associated with the problem in (59) is a smooth Riemannian manifold [53], [54], an extension of the Euclidean space on which a notion of differentiation exists [55], [56], various gradient-based optimization techniques such as steepest descent, the Newton method, and conjugate gradient algorithms can be applied to solve (59) [3], [4], [55].
The adjacency matrix W_r ∈ R^{n1×n1} of the row graph G_r is defined in a similar way.
CNN-based LRMC: Let U ∈ R^{n1×r} and V ∈ R^{n2×r} be matrices such that M = UV^T. The primary task of the CNN-based approach is to find functions f_r and f_c mapping the vertex sets of the row and column graphs G_r and G_c of M to U and V, respectively. Here, each vertex of G_r (resp. G_c) is mapped to a row of U (resp. V) by f_r (resp. f_c). Since it is difficult to express f_r and f_c explicitly, we learn these nonlinear mappings using CNN-based models. In the CNN-based LRMC approach, U and V are initialized at random and updated in
each iteration. Specifically, U and V are updated to minimize the following loss function [67]:
  l(U, V) = Σ_{(i,j): w^r_{ij}=1} ‖u_i − u_j‖² + Σ_{(i,j): w^c_{ij}=1} ‖v_i − v_j‖²
            + (τ/2) ‖P_Ω(Σ^r_{i=1} u_i v_i^T) − P_Ω(M)‖²_F,   (61)

where τ is a regularization parameter. In other words, we find U and V such that the Euclidean distance between connected vertices is minimized (see the terms ‖u_i − u_j‖² (w^r_{ij} = 1) and ‖v_i − v_j‖² (w^c_{ij} = 1) in (61)). The update procedures for U and V are [67]:
1) Initialize U and V at random and assign each row of U and V to each vertex of the row
graph Gr and the column graph Gc , respectively.
2) Extract the feature matrices ∆U and ∆V by performing a graph-based convolution
operation on Gr and Gc , respectively.
3) Update U and V using the feature matrices ∆U and ∆V, respectively.
4) Compute the loss function in (61) using the updated U and V and perform backpropagation to update the filter parameters.
5) Repeat the above procedures until the value of the loss function is smaller than a pre-chosen threshold.
One important issue in the CNN-based LRMC approach is how to define a graph-based convolution operation to extract the feature matrices ∆U and ∆V (see the second step). Note that the input data G_r and G_c do not lie on regular lattices like images, and thus a classical CNN cannot be directly applied to G_r and G_c. One possible option is to define the convolution operation in the Fourier domain of the graph. In recent years, CNN models based on the Fourier transform of graph-structured data have been proposed [68], [69], [70], [71], [72]. In [68], an approach using the eigendecomposition of the Laplacian has been proposed. To further reduce the model complexity, CNN models using polynomial filters have been proposed [70], [69], [71]. In essence, the
Fourier transform of a graph can be computed using the (normalized) graph Laplacian. Let R_r be the graph Laplacian of G_r (i.e., R_r = I − D_r^{−1/2} W_r D_r^{−1/2}, where D_r = diag(W_r 1_{n1×1})) [63], and let R_r = Q_r Λ_r Q_r^T be its eigendecomposition. Then the graph Fourier transform F_r(u) of a vertex assigned the vector u is defined as¹⁴ F_r(u) = Q_r^T u, with inverse F_r^{−1}(u′) = Q_r u′.
Let z be the filter used in the convolution; then the output ∆u of the graph-based convolution on a vertex assigned the vector u is defined as [63], [70]

  ∆u = F_r^{−1}(F_r(z) ⊙ F_r(u)) = Q_r diag(F_r(z)) Q_r^T u = Q_r G Q_r^T u,   (65)

where ⊙ is the elementwise product and G = diag(F_r(z)) is the matrix of filter parameters defined in the graph Fourier domain.
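The spectral filtering in (65) can be sketched in a few lines; the helper and the toy 4-cycle graph below are ours.

```python
import numpy as np

def graph_conv(W, u, g):
    """Spectral graph convolution (65): Delta_u = Q diag(g) Q^T u, where Q
    holds the eigenvectors of R = I - D^(-1/2) W D^(-1/2)."""
    d = W.sum(axis=1)
    Dm = np.diag(1.0 / np.sqrt(d))
    R = np.eye(len(d)) - Dm @ W @ Dm      # normalized graph Laplacian
    _, Q = np.linalg.eigh(R)              # graph Fourier basis Q_r
    return Q @ np.diag(g) @ Q.T @ u       # filter g acts in the Fourier domain

# an all-pass filter (g = 1) leaves the signal unchanged since Q Q^T = I
W = np.array([[0., 1., 0., 1.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [1., 0., 1., 0.]])          # adjacency of a 4-cycle
u = np.array([1., 2., 3., 4.])
print(np.allclose(graph_conv(W, u, np.ones(4)), u))   # True
```

In the CNN models discussed above, the vector g plays the role of the learnable filter parameters G.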
We next update U and V using the feature matrices ∆U and ∆V. In [67], a cascade of a multi-graph CNN followed by a long short-term memory (LSTM) recurrent neural network has been proposed. The computational cost of this approach is O(r|Ω| + r²n1 + r²n2), which is much lower than that of the SVD-based LRMC techniques (i.e., O(r n1 n2)) as long as r ≪ min(n1, n2). Finally, we compute the loss function l(U_i, V_i) in (61) and then update the filter parameters using backpropagation. Suppose {U_i}_i and {V_i}_i converge to Û and V̂, respectively; then the estimate of M obtained by the CNN-based LRMC is M̂ = ÛV̂^T.
¹⁴ One can easily check that F_r^{−1}(F_r(u)) = u and F_r(F_r^{−1}(u′)) = u′.
is the steering vector and b_i ∈ C^{n2} is the vector of normalized coefficients (i.e., ‖b_i‖₂ = 1). We denote the set of such atoms H_i by H. Using H, the atomic norm of X is defined as

  ‖X‖_H = inf{ Σ_i α_i : X = Σ_i α_i H_i, α_i > 0, H_i ∈ H }.   (67)

Note that the atomic norm ‖X‖_H is a generalization of the ℓ₁-norm and also of the nuclear norm to the space of sinusoidal signals [86], [39]. Let X_o be the observation of X; then the problem of reconstructing X can be modeled as the ANM problem:

  min_Z  (1/2)‖Z − X_o‖²_F + τ‖Z‖_H.   (68)
In this section, we study the performance of the LRMC algorithms. In our experiments, we focus on the algorithms listed in Table V. The original matrix is generated as the product of two random matrices A ∈ R^{n1×r} and B ∈ R^{n2×r}, i.e., M = AB^T. The entries a_ij and b_pq of these two matrices are independent and identically distributed random variables sampled from the normal distribution N(0, 1). Sampled elements are also chosen at random. The sampling ratio p is defined as

  p = |Ω| / (n1 n2),

where |Ω| is the cardinality (number of elements) of Ω. In the noisy scenario, we use the additive noise model in which the observed matrix M_o is expressed as M_o = M + N, where the noise matrix N is formed by i.i.d. random entries sampled from the Gaussian distribution N(0, σ²). For a given SNR, σ² = (1/(n1 n2))‖M‖²_F 10^{−SNR/10}. Note that the parameters of each LRMC algorithm are chosen from the reference paper. For each point of each algorithm, we run 1,000 independent trials and then plot the average value.
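The synthetic setup described above can be sketched as follows (the function name is ours):

```python
import numpy as np

def make_problem(n1, n2, r, p, snr_db=None, seed=0):
    """Generate M = A B^T with N(0,1) factors, a uniform random mask with
    sampling ratio p, and optionally add Gaussian noise at the given SNR."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n1, r)) @ rng.standard_normal((n2, r)).T
    mask = (rng.random((n1, n2)) < p).astype(float)
    noisy = M
    if snr_db is not None:
        # sigma^2 = ||M||_F^2 / (n1 n2) * 10^(-SNR/10)
        sigma2 = np.linalg.norm(M, 'fro')**2 / (n1 * n2) * 10 ** (-snr_db / 10)
        noisy = M + np.sqrt(sigma2) * rng.standard_normal((n1, n2))
    return M, mask * noisy, mask
```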
In the performance evaluation of the LRMC algorithms, we use the mean square error (MSE) and the exact recovery ratio, defined respectively as

  MSE = (1/(n1 n2))‖M − M̂‖²_F,
  R = (number of successful trials) / (total trials),

where M̂ is the reconstructed low-rank matrix. We say a trial is successful if the MSE is less than the threshold ε. In our experiments, we set ε = 10⁻⁶. Here, R can be used to represent the probability of successful recovery.
We first examine the exact recovery ratio of the LRMC algorithms in terms of the sampling ratio and the rank of M. In our experiments, we set n1 = n2 = 100 and compute the phase transition [90] of the LRMC algorithms. Note that the phase transition is a contour plot of the success probability P (we set P = 0.5), where the sampling ratio (x-axis) and the rank (y-axis) form a regular grid on the x-y plane. The contour separates the plane into two areas: the area above the curve is the region where P < 0.5, and the area below the curve is the region where P > 0.5 [90] (see Fig. 8). The higher the curve, therefore, the better the algorithm.
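A phase-transition experiment is simply a success-probability estimate on a (sampling ratio, rank) grid. The sketch below (function name and defaults ours) accepts any completion routine `solver(Mo, mask)` and returns the empirical grid; the P = 0.5 contour of this grid is the phase-transition curve.

```python
import numpy as np

def phase_transition(solver, n=100, ranks=(10, 20, 30),
                     ratios=(0.3, 0.5, 0.7), trials=10, eps=1e-6):
    """Empirical success probability P on a (rank, sampling-ratio) grid.
    `solver(Mo, mask)` must return a completed n x n matrix."""
    rng = np.random.default_rng(1)
    P = np.zeros((len(ranks), len(ratios)))
    for i, r in enumerate(ranks):
        for j, p in enumerate(ratios):
            wins = 0
            for _ in range(trials):
                M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
                mask = rng.random((n, n)) < p  # each entry observed w.p. p
                M_hat = solver(np.where(mask, M, 0.0), mask)
                wins += np.linalg.norm(M_hat - M, "fro") ** 2 / n**2 < eps
            P[i, j] = wins / trials
    return P
```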
In general, the LRMC algorithms perform poorly when the matrix has a small number of observed entries and the rank is large. Overall, the NNM-based algorithms perform better than the FNM-based algorithms. In particular, the NNM technique using the SDPT3 solver outperforms the rest because the convex optimization technique always finds a global optimum, while the other techniques often converge to a local optimum.

Fig. 8. Phase transition of the LRMC algorithms in the noiseless scenario (NNM using SDPT3, SVT, ADMiRA, TNNR-APGL, and TNNR-ADMM); x-axis: sampling ratio, y-axis: rank.
In order to investigate the computational efficiency of the LRMC algorithms, we measure the running time of each algorithm as a function of the rank (see Fig. 9). The running time is measured in seconds, using a 64-bit PC with an Intel i5-4670 CPU running at 3.4 GHz. We observe that the convex algorithms have relatively long running times.
We next examine the efficiency of the LRMC algorithms for different problem sizes (see Table VI). For the iterative LRMC algorithms, we set the maximum number of iterations to 300. We see that LRMC algorithms such as SVT, IRLS-M, ASD, ADMiRA, and LRGeomCG run fast. For example, it takes less than a minute for these algorithms to reconstruct a 1000 × 1000 matrix, while the running time of the SDPT3 solver is more than 5 minutes. A further reduction of the running time can be achieved using alternating projection-based algorithms such as LMaFit. For example, it takes about one second to reconstruct a (1000 × 1000)-dimensional matrix with rank 5 using LMaFit. Therefore, when exact recovery of the original matrix is unnecessary, the FNM-based technique would be a good choice.
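As a concrete illustration of the FNM-with-rank-constraint approach, the sketch below runs a plain alternating least-squares loop over the factors X and Y. It is our simplified stand-in, not the LMaFit update itself (LMaFit adds a nonlinear successive over-relaxation step on top of this basic idea).

```python
import numpy as np

def als_complete(Mo, mask, r, iters=100, seed=0):
    """Alternating least squares for min ||P_Omega(X Y^T - M)||_F^2.
    Each row of X (resp. Y) is the least-squares fit to its observed entries."""
    n1, n2 = Mo.shape
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n1, r))
    Y = rng.standard_normal((n2, r))
    ridge = 1e-10 * np.eye(r)  # tiny regularizer for numerical safety
    for _ in range(iters):
        for i in range(n1):  # update each row of X with Y fixed
            w = mask[i]
            X[i] = np.linalg.solve(Y[w].T @ Y[w] + ridge, Y[w].T @ Mo[i, w])
        for j in range(n2):  # update each row of Y with X fixed
            w = mask[:, j]
            Y[j] = np.linalg.solve(X[w].T @ X[w] + ridge, X[w].T @ Mo[w, j])
    return X @ Y.T
```

On an exactly rank-r matrix with enough observed entries per row and column, this loop typically drives the reconstruction error toward numerical zero, which is why factorization-based methods dominate the large-scale regime in Table VI.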
Fig. 9. Running times of LRMC algorithms in noiseless scenario (40% of entries are observed).
In the noisy scenario, we also observe that the FNM-based algorithms perform well (see Figs. 10 and 11). In this experiment, we compute the MSE of the LRMC algorithms against the rank of the original low-rank matrix for different SNR settings (SNR = 20 and 50 dB). We observe that in the low and mid SNR regime (e.g., SNR = 20 dB), TNNR-ADMM performs comparably to the NNM-based algorithms since the FNM-based cost function is robust to noise. In the high SNR regime (e.g., SNR = 50 dB), the convex algorithm (NNM with SDPT3) exhibits the best performance in terms of the MSE, and the performance of TNNR-ADMM is notably better than that of the remaining LRMC algorithms. For example, given rank(M) = 20, the MSE of TNNR-ADMM is around 0.04, while the MSE of the rest is higher than 1.
Finally, we apply the LRMC techniques to recover images corrupted by impulse noise. In this experiment, we use 256 × 256 standard grayscale images (boat, cameraman, lena, and pepper) and the salt-and-pepper noise model with noise densities ρ = 0.3, 0.5, and 0.7. For the FNM-based LRMC techniques, the rank is given by the number of singular values σi greater than a relative threshold ε > 0, i.e., σi > ε max_i σi. From the simulation results, we observe that the peak SNR (pSNR), defined as the ratio of the maximum pixel value of the image to the noise variance, of all the LRMC techniques is at least 52 dB when ρ = 0.3 (see Table VII). In particular, NNM using SDPT3, SVT, and IRLS-M outperform the rest, achieving pSNR ≥ 57 dB.
Fig. 10. MSE performance of LRMC algorithms in noisy scenario with SNR = 20 dB (70% of entries are observed).
Fig. 11. MSE performance of LRMC algorithms in noisy scenario with SNR = 50 dB (70% of entries are observed).
V. CONCLUDING REMARKS
• When the dimension of a low-rank matrix increases and the computational complexity thus grows significantly, we want an algorithm with a good recovery guarantee whose complexity scales linearly with the problem size. Without doubt, in real-time applications such as IoT localization and massive MIMO, low complexity and short running time are of great importance. The development of implementation-friendly algorithms and architectures would accelerate the dissemination of LRMC techniques.
• Most LRMC techniques assume that the original low-rank matrix is a random matrix whose entries are randomly generated. In many practical situations, however, the entries of the matrix are not purely random but chosen from a finite set of integers. In a recommendation system, for example, each entry (the rating for a product) is chosen from an integer scale (e.g., 1 to 5). Unfortunately, there is no well-known practical guideline or efficient algorithm for the case where the entries of a matrix are chosen from a discrete set. It would be useful to come up with a simple and effective LRMC technique suited for such applications.
• As mentioned, the CNN-based LRMC technique is a useful tool to reconstruct a low-rank matrix. In essence, the unknown entries of a low-rank matrix are recovered based on a graph model of the matrix. Since the observed entries can be considered as labeled training data, this approach can be classified as supervised learning. In many practical scenarios, however, it might not be easy to express the graph model of the matrix precisely since there are various criteria to define the graph edges. In addressing this problem, new deep learning techniques such as generative adversarial networks (GANs) [91], consisting of a generator and a discriminator, would be useful.
APPENDIX A
PROOF OF THE SDP FORM OF NNM
The standard SDP form is

min_Y ⟨C, Y⟩
subject to ⟨Ak, Y⟩ = bk, k = 1, ..., l,   (71)
Y ⪰ 0,

where C is a given matrix, and {Ak}^l_{k=1} and {bk}^l_{k=1} are given sequences of matrices and constants, respectively. To convert the NNM problem in (11) into the standard SDP form in (71), we need a few steps. First, we convert the NNM problem in (11) into the epigraph form

min_{X,t} t
subject to ‖X‖∗ ≤ t,   (72)
P_Ω(X) = P_Ω(M).
Next, we transform the constraints in (72) to generate the standard form in (71). We first consider the inequality constraint ‖X‖∗ ≤ t. Note that ‖X‖∗ ≤ t if and only if there are symmetric matrices W1 ∈ R^{n1×n1} and W2 ∈ R^{n2×n2} such that [21, Lemma 2]

tr(W1) + tr(W2) ≤ 2t   and   [ W1  X ; X^T  W2 ] ⪰ 0.   (73)
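The "if" direction of this characterization can be checked numerically at the tight point t = ‖X‖∗: taking W1 = UΣU^T and W2 = VΣV^T (our choice of feasible witnesses, built from the SVD X = UΣV^T) attains the trace bound with equality and makes the block matrix positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

W1 = U @ np.diag(s) @ U.T      # W1 = U Sigma U^T
W2 = Vt.T @ np.diag(s) @ Vt    # W2 = V Sigma V^T
Y = np.block([[W1, X], [X.T, W2]])

nuclear = s.sum()  # ||X||_* is the sum of singular values
# The trace condition in (73) holds with equality at t = ||X||_*.
assert np.isclose(np.trace(W1) + np.trace(W2), 2 * nuclear)
# Y = [U; V] Sigma [U; V]^T, hence positive semidefinite.
assert np.linalg.eigvalsh(Y).min() >= -1e-8
```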
Then, by denoting

Y = [ W1  X ; X^T  W2 ] ∈ R^{(n1+n2)×(n1+n2)}   and   M̃ = [ 0_{n1×n1}  M ; M^T  0_{n2×n2} ],

where 0_{s×t} is the (s × t)-dimensional zero matrix, and letting {e1, ..., e_{n1+n2}} be the standard ordered basis of R^{n1+n2}, Ak = e_i e^T_{j+n1}, and bk = ⟨M̃, e_i e^T_{j+n1}⟩ for the k-th observed index (i, j) ∈ Ω, the problem in (72) can be reformulated as

min_{Y,t} 2t
subject to tr(Y) ≤ 2t,
Y ⪰ 0,   (77)
⟨Y, Ak⟩ = bk, k = 1, ..., |Ω|.
For example, consider the case where the desired matrix M is given by M = [ 1  2 ; 2  4 ] and the index set of observed entries is Ω = {(2, 1), (2, 2)}. In this case, A1 = e2 e3^T with b1 = m21 = 2, and A2 = e2 e4^T with b2 = m22 = 4.
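The construction of M̃, Ak, and bk for this example can be verified mechanically (variable names ours; indices in Ω are 1-based as in the text):

```python
import numpy as np

n1 = n2 = 2
M = np.array([[1.0, 2.0], [2.0, 4.0]])
Omega = [(2, 1), (2, 2)]  # observed positions (1-indexed)

# M_tilde = [[0, M], [M^T, 0]]
Mt = np.block([[np.zeros((n1, n1)), M], [M.T, np.zeros((n2, n2))]])

E = np.eye(n1 + n2)  # columns are the standard basis e_1, ..., e_4
A, b = [], []
for (i, j) in Omega:
    Ak = np.outer(E[:, i - 1], E[:, j - 1 + n1])  # A_k = e_i e_{j+n1}^T
    A.append(Ak)
    b.append(np.sum(Mt * Ak))  # b_k = <M_tilde, A_k> = m_ij
```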
One can express (77) in a concise form as (13), which is the desired result.
APPENDIX B
PERFORMANCE GUARANTEE OF NNM
Sketch of proof: Exact recovery of the desired low-rank matrix M can be guaranteed under the uniqueness condition of the NNM problem [22], [23], [75]. To be specific, let M = UΣV^T be the SVD of M, where U ∈ R^{n1×r}, Σ ∈ R^{r×r}, and V ∈ R^{n2×r}. Also, let R^{n1×n2} = T ⊕ T^⊥ be the orthogonal decomposition in which T^⊥ is defined as the subspace of matrices whose row and column spaces are orthogonal to the row and column spaces of M, respectively. Here, T is the orthogonal complement of T^⊥. It has been shown that M is the unique solution of the NNM problem if the following conditions hold true [22, Lemma 3.1]:
1) there exists a matrix Y = UV^T + W such that P_Ω(Y) = Y, W ∈ T^⊥, and ‖W‖ < 1;
2) the restriction of the sampling operator P_Ω to T is an injective (one-to-one) mapping.
The construction of Y obeying 1) and 2) is in turn conditioned on the observation model of M and its intrinsic coherence property.
Under a uniform sampling model of M, suppose the coherence property of M satisfies

max(μ(U), μ(V)) ≤ μ0   and   max_{i,j} |eij| ≤ μ1 √(r / (n1 n2)),

where μ0 and μ1 are some constants, eij is the (i, j)-th entry of E = UV^T, and μ(U) and μ(V) are the coherences of the column and row spaces of M, respectively.
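The coherence used in this condition is μ(U) = (n/r) max_i ‖P_U e_i‖², where P_U is the orthogonal projection onto the column space. A small helper (function name ours) computes it from an orthonormal basis:

```python
import numpy as np

def coherence(U):
    """mu(U) = (n / r) * max_i ||P_U e_i||^2 for U with orthonormal columns.
    Ranges from 1 (maximally incoherent) to n / r (maximally coherent)."""
    n, r = U.shape
    leverage = np.sum(U * U, axis=1)  # ||U^T e_i||^2 = ||P_U e_i||^2
    return (n / r) * leverage.max()
```

A basis aligned with the coordinate axes attains the upper bound n/r, which is exactly the regime where uniform sampling is likely to miss the informative entries.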
Theorem 4 ([22, Theorem 1.3]). There exist constants α and β such that if the number of observed entries m = |Ω| satisfies

m ≥ α max(μ1², μ0^{1/2} μ1, μ0 n^{1/4}) γ n r log n,   (80)

where γ > 2 is some constant and n1 = n2 = n, then M is the unique solution of the NNM problem with probability at least 1 − βn^{−γ}. Further, if r ≤ μ0^{−1} n^{1/5}, (80) can be improved to m ≥ C μ0 γ n^{1.2} r log n with the same success probability.
One direct interpretation of this theorem is that the desired low-rank matrix can be recon-
structed exactly using NNM with overwhelming probability even when m is much less than
n1 n2 .
REFERENCES
[1] Netflix Prize. http://www.netflixprize.com
[2] A. Pal, “Localization algorithms in wireless sensor networks: Current approaches and future challenges,” Netw. Protocols
Algorithms, vol. 2, no. 1, pp. 45–74, 2010.
[3] L. Nguyen, S. Kim, and B. Shim, “Localization in Internet of things network: Matrix completion approach,” in Proc.
Inform. Theory Appl. Workshop, San Diego, CA, USA, 2016, pp. 1–4.
[4] L. T. Nguyen, J. Kim, S. Kim, and B. Shim, “Localization of IoT Networks Via Low-Rank Matrix Completion,” to
appear in IEEE Trans. Commun., 2019.
[5] E. J. Candes, Y. C. Eldar, and T. Strohmer, “Phase retrieval via matrix completion,” SIAM Rev., vol. 57, no. 2, pp. 225–251, May 2015.
[6] M. Delamom, S. Felici-Castell, J. J. Perez-Solano, and A. Foster, “Designing an open source maintenance-free
environmental monitoring application for wireless sensor networks,” J. Syst. Softw., vol. 103, pp. 238–247, May 2015.
[7] G. Hackmann, W. Guo, G. Yan, Z. Sun, C. Lu, and S. Dyke, “Cyber-physical codesign of distributed structural health
monitoring with wireless sensor networks,” IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 1, pp. 63–72, Jan. 2014.
[8] W. S. Torgerson, “Multidimensional scaling: I. Theory and method,” Psychometrika, vol. 17, no. 4, pp. 401–419, Dec.
1952.
[9] H. Ji, Y. Kim, J. Lee, E. Onggosanusi, Y. Nam, J. Zhang, B. Lee, and B. Shim, “Overview of full-dimension MIMO in
LTE-advanced pro,” IEEE Commun. Mag., vol. 55, no. 2, pp. 176–184, Feb. 2017.
[10] T. L. Marzetta and B. M. Hochwald, “Fast transfer of channel state information in wireless systems,” IEEE Trans. Signal
Process., vol. 54, no. 4, pp. 1268–1278, Apr. 2006.
[11] F. Rusek, D. Persson, B. K. Lau, E. G. Larsson, T. L. Marzetta, O. Edfors, and F. Tufvesson, “Scaling up MIMO:
Opportunities and challenges with very large arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 40–60, Jan. 2013.
[12] W. Shen, L. Dai, B. Shim, S. Mumtaz, and Z. Wang, “Joint CSIT acquisition based on low-rank matrix completion for
FDD massive MIMO systems,” IEEE Commun. Lett., vol. 19, no. 12, pp. 2178–2181, Dec. 2015.
[13] T. S. Rappaport et al., “Millimeter wave mobile communications for 5G cellular: It will work!,” IEEE Access, vol. 1,
no. 1, pp. 335–349, May 2013.
[14] X. Li, J. Fang, H. Li, H. Li, and P. Wang, “Millimeter wave channel estimation via exploiting joint sparse and low-rank
structures,” IEEE Trans. Wireless Commun., vol. 17, no. 2, pp. 1123–1133, Feb. 2018.
[15] Y. Shi, J. Zhang, and K. B. Letaief, “Low-rank matrix completion for topological interference management by Riemannian
pursuit,” IEEE Trans. Wireless Commun., vol. 15, no. 7, pp. 4703-4717, Jul. 2016.
[16] Y. Shi, B. Mishra, and W. Chen, “Topological interference management with user admission control via Riemannian
optimization,” IEEE Trans. Wireless Commun., vol. 16, no. 11, pp. 7362-7375, Nov. 2017.
[17] Y. Shi, J. Zhang, W. Chen, and K. B. Letaief, “Generalized sparse and low-rank optimization for ultra-dense networks,”
IEEE Commun. Mag., vol. 56, no. 6, pp. 42-48, Jun., 2018.
[18] G. Sridharan and W. Yu, “Linear Beamforming Design for Interference Alignment via Rank Minimization,” IEEE Trans.
Signal Process., vol. 63, no. 22, pp. 5910-5923, Nov. 2015.
[19] M. Peng, S. Yan, K. Zhang, and C. Wang, “Fog-computing-based radio access networks: issues and challenges,” IEEE
Network, vol. 30, pp. 46-53, July 2016.
[20] K. Yang, Y. Shi, and Z. Ding, “Low-rank matrix completion for mobile edge caching in Fog-RAN via Riemannian
optimization,” in Proc. IEEE Global Communications Conf. (GLOBECOM), Washington, DC, Dec. 2016.
[21] M. Fazel, “Matrix rank minimization with applications,” Ph.D. dissertation, Elec. Eng. Dept., Stanford Univ., Stanford, CA, 2002.
[22] E. J. Candes and B. Recht, “Exact matrix completion via convex optimization,” Found. Comput. Math., vol. 9, no. 6, pp.
717–772, Dec. 2009.
[23] E. J. Candes and T. Tao, “The power of convex relaxation: Near-optimal matrix completion,” IEEE Trans. Inform. Theory,
vol. 56, no. 5, pp. 2053–2080, May 2010.
[24] K. C. Toh, M. J. Todd, and R. H. Tutuncu, “SDPT3 — a MATLAB software package for semidefinite programming,”
Optim. Methods Softw., vol. 11, pp. 545–581, 1999.
[25] J. F. Sturm, “Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones,” Optim. Methods Softw.,
vol. 11, pp. 625–653, 1999.
[26] L. Vandenberghe and S. Boyd, “Semidefinite programming,” SIAM Rev., vol. 38, no. 1, pp. 49–95, 1996.
[27] Y. Zhang, “On extending some primal-dual interior-point algorithms from linear programming to semidefinite program-
ming,” SIAM J. Optim., vol. 8, no. 2, pp. 365–386, 1998.
[28] Y. E. Nesterov and M. Todd, “Primal-dual interior-point methods for self-scaled cones,” SIAM J. Optim., vol. 8, no. 2,
pp. 324–364, 1998.
[29] F. A. Potra and S. J. Wright, “Interior-point methods,” J. Comput. Appl. Math., vol. 124, no. 1-2, pp.281–302, 2000.
[30] L. Vandenberghe, V. R. Balakrishnan, R. Wallin, A. Hansson, and T. Roh, “Interior-point algorithms for semidefinite
programming problems derived from the KYP lemma,” In Positive polynomials in control (pp. 195-238). Berlin,
Heidelberg: Springer, 2005.
[31] F. A. Potra and R. Sheng, “A superlinearly convergent primal-dual infeasible-interior-point algorithm for semidefinite
programming,” SIAM J. Optim., vol. 8, no. 4, pp.1007–1028, 1998.
[32] B. Recht, M. Fazel, and P. A. Parillo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm
minimization,” SIAM Rev., vol. 52, no. 3, pp. 471–501, 2010.
[33] J. F. Cai, E. J. Candes, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM J. Optim.,
vol. 20, no. 4, pp. 1956–1982, Mar. 2010.
[34] T. Blumensath and M. E. Davies, “Iterative hard thresholding for compressed sensing,” Appl. Comput. Harmon. Anal.,
vol. 27, no. 3, pp. 265–274, Nov. 2009.
[35] J. Tanner and K. Wei, “Normalized iterative hard thresholding for matrix completion,” SIAM J. Sci. Comput., vol. 35,
no. 5, pp. S104–S125, Oct. 2013.
[36] M. Fornasier, H. Rauhut, and R. Ward, “Low-rank matrix recovery via iteratively reweighted least squares minimization,”
SIAM J. Optim., vol. 21, no. 4, pp. 1614–1640, Dec. 2011.
[37] K. Mohan, and M. Fazel, “Iterative reweighted algorithms for matrix rank minimization,” J. Mach. Learning Research,
no. 13, pp. 3441–3473, Nov. 2012.
[38] D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inaccurate samples,” Appl. Comput.
Harmon. Anal., vol. 26, no. 3, pp. 301–321, May 2009.
[39] J. W. Choi, B. Shim, Y. Ding, B. Rao, and D. I. Kim, “Compressed sensing for wireless communications: Useful tips
and tricks,” IEEE Commun. Surveys Tuts., vol. 19, no. 3, pp. 1527–1550, Feb. 2017.
[40] S. Kwon, J. Wang, and B. Shim, “Multipath matching pursuit,” IEEE Trans. Inform. Theory, vol. 60, no. 5, pp. 2986–3001,
Mar. 2014.
[41] J. Wang, S. Kwon, and B. Shim, “Generalized orthogonal matching pursuit,” IEEE Trans. Signal Process., vol. 60, no.
12, pp. 6202–6216, Sep. 2012.
[42] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE
Trans. Inform. Theory, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.
[43] K. Lee and Y. Bresler, “ADMiRA: Atomic decomposition for minimum rank approximation,” IEEE Trans. Inform. Theory,
vol. 56, no. 9, pp. 4402–4416, Sept. 2010.
[44] Z. Wang, M-J. Lai, Z. Lu, W. Fan, H. Davulcu, and J. Ye, “Rank-one matrix pursuit for matrix completion,” in Proc.
Int. Conf. Mach. Learn., Beijing, China, 2014, pp. 91–99.
[45] J. P. Haldar and D. Hernando, “Rank-constrained solutions to linear matrix equations using power factorization,” IEEE
Signal Process. Lett., vol. 16, no. 7, pp. 584–587, Jul. 2009.
[46] J. Tanner and K. Wei, “Low rank matrix completion by alternating steepest descent methods,” Appl. Comput. Harmon.
Anal., vol. 40, no. 2, pp. 417–429, Mar. 2016.
[47] Z. Wen, W. Yin, and Y. Zhang, “Solving a low-rank factorization model for matrix completion by a nonlinear successive
over-relaxation algorithm,” Math. Prog. Comput., vol. 4, no. 4, pp. 333–361, Dec. 2012.
[48] B. Mishra, G. Meyer, S. Bonnabel, and R. Sepulchre, “Fixed-rank matrix factorizations and Riemannian low-rank optimization,” Comput. Statist., vol. 29, no. 3–4, pp. 591–621, Jun. 2014.
[49] W. Dai and O. Milenkovic, “SET: An algorithm for consistent matrix completion,” in Proc. Int. Conf. Acoust., Speech,
Signal Process., Dallas, Texas, USA, 2010, pp. 3646–3649.
[50] B. Vandereycken, “Low-rank matrix completion by Riemannian optimization,” SIAM J. Optim., vol. 23, no. 2, pp.
1214–1236, Jun. 2013.
[51] T. Ngo and Y. Saad, “Scaled gradients on Grassmann manifolds for matrix completion,” in Proc. Adv. Neural Inform.
Process. Syst. Conf., Lake Tahoe, Nevada, USA, 2012, pp. 1412–1420.
[52] J. Dattorro, Convex optimization and Euclidean distance geometry. USA: Meboo, 2005.
[53] U. Helmke and J. B. Moore, Optimization and Dynamical Systems. New York, NY, USA: Springer, 1994.
[54] B. Vandereycken, P.-A. Absil, and S. Vandewalle, “Embedded geometry of the set of symmetric positive semidefinite
matrices of fixed rank,” in Proc. IEEE Workshop Stat. Signal Process., Cardiff, UK, 2009, pp. 389–392.
[55] P. A. Absil, R. Mahony, and R. Sepulchre, Optimization algorithms on matrix manifolds. Princeton, NJ, USA: Princeton
Univ., 2009.
[56] J. M. Lee, Smooth manifolds. New York, NY, USA: Springer, 2003.
[57] Y. Hu, D. Zhan, J. Ye, X. Li, and X. He, “Fast and accurate matrix completion via truncated nuclear norm regularization,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 9, pp. 2117–2130, Sept. 2013.
[58] J. Y. Gotoh, A. Takeda, and K. Tono, “DC formulations and algorithms for sparse optimization problems,” Math.
Programming, pp. 1-36, May 2018.
[59] R. Ge, J. D. Lee, and T. Ma, “Matrix completion has no spurious local minimum,” in Advances Neural Inform. Process.
Syst., pp. 2973-2981, 2016.
[60] R. Ge, C. Jin, and Y. Zheng, “No spurious local minima in nonconvex low rank problems: A unified geometric analysis,”
In Proc. 34th Int. Conf. on Machine Learning, JMLR. org., Aug. 2017, vol. 70, pp. 1233–1242.
[61] S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, A. Singh, and B. Poczos, “Gradient descent can take exponential time to escape
saddle points,” in Advances Neural Inform. Process. Syst., pp. 1067-1077, 2017.
[62] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Trans.
Neural Netw., vol. 20, no. 1, pp. 61–80, Jan. 2009.
[63] D. K. Hammond, P. Vandergheynst, and R. Gribonval, “Wavelets on graphs via spectral graph theory,” Appl. Comput.
Harmon. Anal., vol. 30, no. 2, pp. 129–150, Mar. 2011.
[64] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, “Autorec: Autoencoders meet collaborative filtering,” in Proc. Int. Conf.
World Wide Web, Florence, Italy, 2015, pp. 111–112.
[65] Y. Zheng, B. Tang, W. Ding, and H. Zhou, “A neural autoregressive approach to collaborative filtering,” in Proc. Int.
Conf. Mach. Learn., New York, NY, USA, 2016, pp. 764–773.
[66] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. S. Chua, “Neural collaborative filtering,” in Proc. Int. Conf. World
Wide Web, Perth, Australia, 2017, pp. 173–182.
[67] F. Monti, M. Bronstein, and X. Bresson, “Geometric matrix completion with recurrent multi-graph neural networks,” in
Proc. Adv. Neural Inform. Process. Syst., Long Beach, CA, USA, 2017, pp. 3700–3710.
[68] J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun, “Spectral networks and locally connected networks on graphs,” in Proc.
Int. Conf. Learn. Representations, Banff, Canada, 2014, pp. 1–14.
[69] M. Henaff, J. Bruna, and Y. Lecun, “Deep convolutional networks on graph-structured data,” arXiv:1506.05163, 2015.
[70] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral
filtering,” in Proc. Adv. Neural Inform. Process. Syst., Barcelona, Spain, 2016, pp. 3844–3852.
[71] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint
arXiv:1609.02907, 2016.
[72] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and
manifolds using mixture model cnns,” in Proc. IEEE Conf. Comput. Vision Pattern Recognition, pp. 5115–5124, 2017.
[73] M. A. Davenport and J. Romberg, “An overview of low-rank matrix recovery from incomplete observations,” IEEE J.
Sel. Topics Signal Process., vol. 10, no. 4, pp. 608–622, Jun. 2016.
[74] Y. Chen and Y. Chi, “Harnessing structures in big data via guaranteed low-rank matrix estimation,” IEEE Signal Process.
Mag., vol. 35, no. 4, pp. 14–31, Jul. 2018.
[75] B. Recht, “A simple approach to matrix completion,” J. Mach. Learn. Res., vol. 12, pp. 3413–3430, Dec. 2011.
[76] C. R. Berger, “Double Exponential,” IEEE Trans. Signal Process., vol. 56, no. 5, pp. 1708–1721, 2010.
[77] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[78] P. Combettes and J. C. Pesquet, Proximal splitting methods in signal processing. New York, NY, USA: Springer, 2011.
[79] P. Jain, R. Meka, and I. Dhillon, “Guaranteed rank minimization via singular value projection,” in Proc. Neural Inform.
Process. Syst. Conf., Vancouver, Canada, 2010, pp. 937–945.
[80] M. Tao and X. Yuan, “Recovering low-rank and sparse components of matrices from incomplete and noisy observations,”
SIAM J. Optim., vol. 21, no. 1, pp. 57–81, Jan. 2011.
[81] Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low-rank representation,”
in Proc. Adv. Neural Inform. Process. Syst., Montreal, Canada, 2011, pp. 612–620.
[82] B. S. He, H. Yang, and S. L. Wang, “Alternating direction method with self-adaptive penalty parameters for monotone
variational inequalities,” J. Optim. Theory Appl., vol. 106, no. 2, pp. 337–356, Aug. 2000.
[83] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J.
Imaging Sci., vol. 2, no. 1, pp. 183–202, Mar. 2009.
[84] R. Escalante and M. Raydan, Alternating projection methods. Philadelphia, PA, USA: SIAM, 2011.
[85] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The convex geometry of linear inverse problems,” Found.
Comput. Math., vol. 12, no. 6, pp. 805–849, Dec. 2012.
[86] B. N. Bhaskar, G. Tang, and B. Recht, “Atomic norm denoising with applications to line spectral estimation,” IEEE
Trans. Signal Process., vol. 61, no. 23, pp. 5987–5999, Dec. 2013.
[87] Y. Li and Y. Chi, “Off-the-grid line spectrum denoising and estimation with multiple measurement vectors,” IEEE Trans.
Signal Process., vol. 64, no. 5, pp. 1257–1269, Mar. 2016.
[88] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM Rev., vol. 43, no. 1,
pp. 129–159, Feb. 2001.
[89] N. Rao, P. Shah, and S. Wright, “Forward-backward greedy algorithms for atomic norm regularization,” IEEE Trans.
Signal Process., vol. 63, no. 21, pp. 5798–5811, Nov. 2015.
[90] D. L. Donoho, A. Maleki, and A. Montanari, “Message passing algorithms for compressed sensing,” Proc. Nat. Acad.
Sci., vol. 106, no. 45, pp. 18914–18919, Nov. 2009.
[91] I. Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.
TABLE V. SUMMARY OF THE LRMC ALGORITHMS.

NNM
- Convex Optimization: SDPT3 (CVX) [24]. A solver for conic programming problems. Computational complexity: O(n³); iteration complexity*: O(n^ω log(1/ε))**.
- NNM via Singular Value Thresholding: SVT [33]. An extension of the iterative soft thresholding technique in compressed sensing for LRMC, based on a Lagrange multiplier method. Computational complexity: O(rn²); iteration complexity: O(1/√ε).
- NNM via Singular Value Thresholding: NIHT [35]. An extension of the iterative hard thresholding technique [34] in compressed sensing for LRMC. Computational complexity: O(rn²); iteration complexity: O(log(1/ε)).
- IRLS Minimization: IRLS-M algorithm [36]. Solves the NNM problem by computing the solution of a weighted least squares subproblem in each iteration. Computational complexity: O(rn²); iteration complexity: O(log(1/ε)).

FNM with Rank Constraint
- Greedy Technique: ADMiRA [43]. An extension of the greedy algorithm CoSaMP [38], [39] in compressed sensing for LRMC; uses greedy projection to identify a set of rank-one matrices that best represents the original matrix. Computational complexity: O(rn²); iteration complexity: O(log(1/ε)).
- Alternating Minimization: LMaFit [47]. A nonlinear successive over-relaxation LRMC algorithm based on the nonlinear Gauss-Seidel method. Computational complexity: O(r|Ω| + r²n); iteration complexity: O(log(1/ε)).
- Alternating Minimization: ASD [46]. A steepest descent algorithm for the FNM-based LRMC problem (25). Computational complexity: O(r|Ω| + r²n); iteration complexity: O(log(1/ε)).
- Manifold Optimization: SET [49]. A gradient-based algorithm to solve the FNM problem on a Grassmann manifold. Computational complexity: O(r|Ω| + r²n); iteration complexity: O(log(1/ε)).
- Manifold Optimization: LRGeomCG [50]. A conjugate gradient algorithm over a Riemannian manifold of fixed-rank matrices. Computational complexity: O(r|Ω| + r²n); iteration complexity: O(log(1/ε)).

Truncated NNM
- TNNR-APGL [57]. Solves the truncated NNM problem via an accelerated proximal gradient line search method [83]. Computational complexity: O(rn²); iteration complexity: O(1/√ε).
- TNNR-ADMM [57]. Solves the truncated NNM problem via an alternating direction method of multipliers [80]. Computational complexity: O(rn²); iteration complexity: O(1/√ε).

CNN-based Technique
- CNN-based LRMC algorithm [67]. A gradient-based algorithm to express a low-rank matrix as a graph structure and then apply a CNN to the constructed graph to recover the desired matrix. Computational complexity: O(r|Ω| + r²n); iteration complexity: O(log(1/ε)).

* The number of iterations needed to obtain a reconstructed matrix M̂ satisfying ‖M̂ − M*‖_F ≤ ε, where M* is the optimal solution.
** ω is some positive constant controlling the iteration complexity.
TABLE VI. MSE RESULTS FOR DIFFERENT PROBLEM SIZES, WHERE RANK(M) = 5 AND p = 2 × DOF. Each cell lists MSE / running time (s) / number of iterations.

Algorithm       | n1 = n2 = 50       | n1 = n2 = 500      | n1 = n2 = 1000
NNM using SDPT3 | 0.0072 / 0.6 / 13  | 0.0017 / 74 / 16   | 0.0010 / 354 / 16
SVT             | 0.0154 / 0.4 / 300 | 0.4564 / 10 / 300  | 0.2110 / 32 / 300
NIHT            | 0.0008 / 0.2 / 253 | 0.0039 / 21 / 300  | 0.0019 / 93 / 300
IRLS-M          | 0.0009 / 0.2 / 60  | 0.0033 / 2 / 60    | 0.0025 / 8 / 60
ADMiRA          | 0.0075 / 0.3 / 300 | 0.0029 / 49 / 300  | 0.0016 / 52 / 300
ASD             | 0.0003 / 0.01 / 227| 0.0006 / 2 / 300   | 0.0005 / 8 / 300
LMaFit          | 0.0002 / 0.01 / 241| 0.0002 / 0.5 / 300 | 0.0500 / 1 / 300
SET             | 0.0678 / 11 / 9    | 0.0260 / 136 / 8   | 0.0108 / 270 / 8
LRGeomCG        | 0.0287 / 0.1 / 108 | 0.0333 / 12 / 300  | 0.0165 / 40 / 300
TNNR-ADMM       | 0.0221 / 0.3 / 300 | 0.0042 / 22 / 300  | 0.0021 / 94 / 300
TNNR-APGL       | 0.0055 / 0.3 / 300 | 0.0011 / 21 / 300  | 0.0009 / 95 / 300
TABLE VII. IMAGE RECOVERY VIA LRMC FOR DIFFERENT NOISE LEVELS ρ.