
OPTIMAL EMBEDDING DIMENSION FOR SPARSE

SUBSPACE EMBEDDINGS

SHABARISH CHENAKKOD, MICHAL DEREZIŃSKI, XIAOYU DONG, AND MARK RUDELSON

Abstract. A random m × n matrix S is an oblivious subspace embedding (OSE) with



parameters ϵ > 0, δ ∈ (0, 1/3) and d ≤ m ≤ n, if for any d-dimensional subspace W ⊆ R^n,

P( ∀x ∈ W : (1 + ϵ)^{-1} ∥x∥ ≤ ∥Sx∥ ≤ (1 + ϵ)∥x∥ ) ≥ 1 − δ.


It is known that the embedding dimension of an OSE must satisfy m ≥ d, and for any θ > 0,
a Gaussian embedding matrix with m ≥ (1 + θ)d is an OSE with ϵ = O_θ(1). However, such an
optimal embedding dimension is not known for other embeddings. Of particular interest are
sparse OSEs, having s ≪ m non-zeros per column (Clarkson and Woodruff, STOC 2013),
with applications to problems such as least squares regression and low-rank approximation.
We show that, given any θ > 0, an m × n random matrix S with m ≥ (1 + θ)d consisting
of randomly sparsified ±1/√s entries and having s = O(log^4(d)) non-zeros per column, is an
oblivious subspace embedding with ϵ = O_θ(1). Our result addresses the main open question
posed by Nelson and Nguyen (FOCS 2013), who conjectured that sparse OSEs can achieve
m = O(d) embedding dimension, and it improves on m = O(d log(d)) shown by Cohen
(SODA 2016). We use this to construct the first oblivious subspace embedding with O(d)
embedding dimension that can be applied faster than current matrix multiplication time, and
to obtain an optimal single-pass algorithm for least squares regression. We further extend
our results to construct even sparser non-oblivious embeddings, leading to the first subspace
embedding with low distortion ϵ = o(1) and optimal embedding dimension m = O(d/ϵ^2)
that can be applied in current matrix multiplication time, addressing a question posed by
Cherapanamjeri, Silwal, Woodruff and Zhou (SODA 2023).
Our analysis builds on recent advances in universality theory for random matrices, to
overcome the limitations of existing approaches such as those based on matrix concentration
inequalities. In the process, we establish new bounds on the extreme singular values for a
class of nearly-square random matrices that arise from applying an m×n sparse OSE matrix
to an n×d isometric embedding matrix, which may be of independent interest. To maximally
leverage these results, we introduce new non-uniformly sparsified embedding constructions
which, in particular, reduce the random bit complexity of our OSE to polylog(n).

University of Michigan, Ann Arbor, MI


E-mail addresses: shabari@umich.edu, derezin@umich.edu, xydong@umich.edu, rudelson@umich.edu.
Partially supported by DMS 2054408.

1. Introduction
Since their first introduction by Sarlós [1], subspace embeddings have been used as a key
dimensionality reduction technique in designing fast approximate randomized algorithms for
numerical linear algebra problems, including least squares regression, lp regression, low-rank
approximation, and approximating leverage scores, among others [2–8]; we refer to the surveys
[9–14] for an overview. The subspace embedding property can be viewed as an extension
of the classical Johnson-Lindenstrauss (JL) lemma [15], which provides a transformation
that reduces the dimensionality of a finite set of n-dimensional vectors while preserving
their pairwise distances (i.e., an embedding). Here, instead of a finite set, we consider a
d-dimensional subspace of Rn , where d ≪ n. One way to define such a transformation is via
an m × n random matrix S, where m ≪ n is the embedding dimension. Remarkably, when
the embedding dimension m and the distribution of S are chosen correctly, then matrix S
can provide an embedding for any d-dimensional subspace W with high probability. Such a
distribution of S is oblivious to the choice of the subspace. Since any d-dimensional subspace
of Rn can be represented as the range of an n × d matrix U with orthonormal columns, we
arrive at the following definition.
Definition 1.1. A random m × n matrix S is an (ϵ, δ)-subspace embedding (SE) for an n × d
matrix U with orthonormal columns, where ϵ > 0, δ ∈ (0, 1/3) and d ≤ m ≤ n, if

P( ∀x ∈ R^d : (1 + ϵ)^{-1} ∥x∥ ≤ ∥SU x∥ ≤ (1 + ϵ)∥x∥ ) ≥ 1 − δ.




If S is an (ϵ, δ)-SE for all such U , then it is an (ϵ, δ, d)-oblivious subspace embedding (OSE).
The embedding dimension of any OSE must satisfy m ≥ d, so ideally we would like an
embedding with m as close to d as possible, while making sure that the distortion ϵ is not
too large. For many applications, a constant distortion, i.e., ϵ = O(1), is sufficient, while for
some, one aims for low-distortion embeddings with ϵ ≪ 1.
One of the most classical families of OSE distributions are Gaussian matrices S with
i.i.d. entries si,j ∼ N (0, 1/m). Thanks to classical results on the spectral distribution of
Gaussian matrices [16] combined with their rotational invariance, there is a sharp character-
ization of the distortion factor ϵ for Gaussian subspace embeddings, which implies that the
embedding dimension can be arbitrarily close to d, i.e., m = (1 + θ)d for any constant θ > 0,
while ensuring the OSE property with ϵ = O(1) and δ = exp(−Ω(d)), where the big-O nota-
tion hides the dependence on θ. These guarantees have been partly extended, although only
with a sub-optimal constant θ = O(1), to a broader class of random matrices that satisfy the
Johnson-Lindenstrauss property, including dense subgaussian embeddings such as matrices
with i.i.d. random sign entries scaled by 1/√m.
Dense Gaussian and subgaussian embeddings are too expensive for many applications,
due to the high cost of dense matrix multiplication. One of the ways of addressing this, as
proposed by Clarkson and Woodruff [5], is to use very sparse random sign matrices, where the
sparsity is distributed uniformly so that there are s ≪ m non-zero entries per column of S (we
refer to s as the column-sparsity of S). Remarkably, choosing column-sparsity s = 1 (which
is the minimum necessary sparsity for any OSE [4]) is already sufficient to obtain a constant
distortion OSE, but only if we increase the embedding dimension to m = O(d^2). On the
other hand, Nelson and Nguyen [17], along with follow-up works [18], showed that if we allow
s = polylog(d), then we can get an OSE with m = d polylog(d). This was later improved
by Cohen [19] to m = O(d log(d)) with column-sparsity s = O(log(d)). Nevertheless, the
embedding dimension of sparse OSEs remains sub-optimal, not just by a constant, but by an
O(log(d)) factor, due to a fundamental limitation of the matrix Chernoff analysis employed
by [19]. Thus, we arrive at the central question of this work:

What is the optimal embedding dimension for sparse oblivious subspace embeddings?

This question is essentially the main open question posed by Nelson and Nguyen [17,
Conjecture 14]: They conjectured that a sparse random sign matrix with s = O(log(d)) non-
zeros per column achieves a constant distortion OSE with embedding dimension m = O(d),
i.e., within a constant factor of the optimum. We go one step further and ask whether a
sparse OSE can recover the optimal embedding dimension, i.e., m = (1 + θ)d for any θ > 0,
achieved by Gaussian embeddings. Note that this version of the question is open for any
sparsity s and any distortion ϵ, even including dense random sign matrices (i.e., s = m).
1.1. Main results. In our main result, we show that embedding matrices with column-
sparsity s polylogarithmic in d can recover the optimal Gaussian embedding dimension m =
(1 + θ)d, while achieving constant distortion ϵ. The below result applies to several standard
sparse embedding constructions, including a construction considered by Nelson and Nguyen
(among others), where we split each column of S into s sub-columns and sample a single
non-zero entry in each sub-column, assigning a random ±1/√s value to each of those entries.
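To make this concrete, the following is a minimal NumPy/SciPy sketch of the sub-column construction and of applying it to a data matrix (the code and the name subcolumn_sparse_embedding are our own illustration, not an implementation from the paper; it assumes s divides m):

```python
import numpy as np
from scipy.sparse import csc_matrix

def subcolumn_sparse_embedding(m, n, s, rng):
    """Sample an m x n sparse embedding with s non-zeros per column.

    Each column is split into s blocks of m // s rows; one position per
    block is chosen uniformly and set to +/- 1/sqrt(s).  Assumes s | m.
    """
    assert m % s == 0
    block = m // s
    cols = np.repeat(np.arange(n), s)                        # column index of each non-zero
    offsets = rng.integers(0, block, size=(n, s))            # position inside each block
    rows = (np.arange(s) * block + offsets).reshape(-1)      # absolute row indices
    vals = rng.choice([-1.0, 1.0], size=n * s) / np.sqrt(s)  # random signs, scaled
    return csc_matrix((vals, (rows, cols)), shape=(m, n))

rng = np.random.default_rng(0)
n, d = 100_000, 50
A = rng.standard_normal((n, d))
S = subcolumn_sparse_embedding(m=4 * d, n=n, s=8, rng=rng)
SA = S @ A  # the m x d sketch; costs O(s * nnz(A)) to form

# empirical distortion: singular values of S @ U should lie in [1 - eps, 1 + eps]
U, _ = np.linalg.qr(A)
print(np.linalg.svd(S @ U, compute_uv=False)[[0, -1]])
```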
Theorem 1.2 (Sparse OSEs; informal Theorem 3.4). Given any constant θ > 0, an m × n
sparse embedding matrix S with n ≥ m ≥ (1 + θ)d and s = O(log^4(d)) non-zeros per column
is an oblivious subspace embedding with distortion ϵ = O(1). Moreover, given any ϵ, δ, it
suffices to use s = O(log^4(d/δ)/ϵ^6) to get an (ϵ, δ)-OSE with m = O((d + log 1/δ)/ϵ^2).
The embedding dimension of m = O(d/ϵ^2) exactly matches a known lower bound of m =
Ω(d/ϵ^2), given by Nelson and Nguyen [20]. To our knowledge, this result is the first to achieve
the optimal embedding dimension for a sparse OSE with any sparsity s = o(m), or indeed,
for any OSE that can be applied faster than dense d × d matrix multiplication, including
recent efforts [21, 22] (see Theorem 1.4 for our fast OSE algorithm).
A known lower bound on the level of sparsity achievable by any oblivious subspace embed-
ding is a single non-zero entry per column (s = 1) of the embedding matrix S [4]. However,
this limit can be circumvented by non-oblivious sparse embeddings, i.e., when we have ad-
ditional information about the orthonormal matrix U ∈ Rn×d that represents the subspace.
Of particular significance is the distribution of the squared row norms of U (also known as
leverage scores), which encode the relative importance of the rows of U in constructing a
good embedding. Knowing accurate approximations of the leverage scores lies at the core
of many subspace embedding techniques, including approximate leverage score sampling [7,
23] and the Subsampled Randomized Hadamard Transform [24, 25]. These approaches rely
on the fact that simply sub-sampling m = O(d log d) rows of U proportionally to their (ap-
proximate) leverage scores is a constant distortion subspace embedding. This corresponds
to the m × n embedding matrix S having one non-zero entry per row (a.k.a. a sub-sampling
matrix), which is much sparser than any OSE since n ≫ m. However, the sub-sampling
embeddings are bound to the sub-optimal O(d log d) embedding dimension m, and it is not
known when we can achieve the optimal m = (1 + θ)d or even m = O(d). We address this in
the following result, showing that a non-oblivious sparse embedding knowing leverage score
estimates requires only polylogarithmic in d row-sparsity (non-zeros per row). To do this, we
construct embedding matrices with a non-uniform sparsity pattern that favors high-leverage
rows, inspired by recently proposed Leverage Score Sparsified embeddings [26–28].
Theorem 1.3 (Sparser Non-oblivious SE; informal Theorem 4.3). Consider α ≥ 1 and any
matrix U ∈ Rn×d such that U ⊤ U = I. Given α-approximations of all squared row norms of
U, we can construct a (1 + θ)d × n subspace embedding for U having O(α log^4 d) non-zeros
per row and ϵ = O(1). Moreover, for any ϵ, δ, we can construct an (ϵ, δ)-SE for U with
O(α log^4(d)/ϵ^4) non-zeros per row and embedding dimension m = O((d + log 1/δ)/ϵ^2).

Ref.       Oblivious   Dimension m         Optimal m(ϵ)   Runtime
[21]       ✓           O(d · pll(d))       –              O(nnz(A) + d^{2+γ})
[21]       –           O(d log(d)/ϵ^2)     –              O(nnz(A) + d^ω pll(d))
[22]       –           O(d)                ✓              O(nnz(A) + d^{2+γ})
[22]       –           O(d log(d)/ϵ^2)     –              O(nnz(A) + d^ω)
Thm. 1.4   ✓           O(d)                ✓              O(nnz(A) + d^{2+γ})
Thm. 1.6   –           O(d/ϵ^2)            ✓              O(nnz(A) + d^ω)

Table 1. Comparison of our results to recent prior works towards obtaining
fast subspace embeddings with optimal embedding dimension. For clarity of
presentation, we assume that the distortion satisfies ϵ = Ω(d^{-c}) for a small
constant c > 0, and we use pll(d) to denote poly(log log d). We use a checkmark
✓ to indicate which embeddings are oblivious, and which of them achieve
optimal dependence of dimension m relative to the distortion ϵ.

Even though the above result focuses on non-oblivious embeddings, one of its key im-
plications is a new guarantee for a classical family of oblivious embeddings known as Fast
Johnson-Lindenstrauss Transforms (FJLT), introduced by Ailon and Chazelle [24]. An FJLT
is defined as S = ΦHD, where Φ is an m × n uniformly sparsified embedding matrix, H is an
n×n orthogonal matrix with fast matrix-vector products (e.g., a Fourier or Walsh-Hadamard
matrix), and D is a diagonal matrix with random ±1 entries. This embedding is effectively
a two-step procedure: first, we use HD to randomly rotate the subspace defined by U , ob-
taining Ũ = HDU which has nearly-uniform leverage scores [25]; then we apply a uniformly
sparsified embedding Φ to Ũ , knowing that the uniform distribution is a good approximation
for the leverage scores of Ũ. Theorem 1.3 implies that an FJLT with O(log^4(d)/ϵ^4) non-zeros
per row is an (ϵ, δ, d)-OSE with the optimal embedding dimension m = O((d + log 1/δ)/ϵ^2)
(see Theorem 4.5 for details). To our knowledge, this is the first optimal dimension OSE
result for FJLT matrices.
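For illustration, here is a hedged sketch of the two-step FJLT procedure described above, with a simple butterfly implementation of the Walsh-Hadamard transform and a uniformly sparsified Φ having a fixed number r of non-zeros per row (the helper names fwht and fjlt_sketch, the per-row sparsification, and the scaling √(n/(mr)) are our own simplifications, not the paper's exact construction):

```python
import numpy as np
from scipy.sparse import csr_matrix

def fwht(x):
    """In-place fast Walsh-Hadamard transform along axis 0 (length must be a power of 2)."""
    n, h = x.shape[0], 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            x[i:i + h] = a + x[i + h:i + 2 * h]
            x[i + h:i + 2 * h] = a - x[i + h:i + 2 * h]
        h *= 2
    return x

def fjlt_sketch(A, m, r, rng):
    """Compute (Phi H D) A: random signs D, Walsh-Hadamard H, then a uniformly
    sparsified Phi with r non-zeros per row, scaled so norms are preserved in expectation."""
    n, d = A.shape
    assert n & (n - 1) == 0, "pad A with zero rows so that n is a power of 2"
    D = rng.choice([-1.0, 1.0], size=n)
    HDA = fwht(D[:, None] * A) / np.sqrt(n)   # (1/sqrt(n)) H D A: near-uniform leverage scores
    rows = np.repeat(np.arange(m), r)
    cols = rng.integers(0, n, size=m * r)     # uniform positions, since leverage scores are ~uniform
    vals = rng.choice([-1.0, 1.0], size=m * r) * np.sqrt(n / (m * r))
    Phi = csr_matrix((vals, (rows, cols)), shape=(m, n))
    return Phi @ HDA

rng = np.random.default_rng(0)
n, d = 2**14, 32
A = rng.standard_normal((n, d))
SA = fjlt_sketch(A, m=6 * d, r=16, rng=rng)   # sketch of A
U, _ = np.linalg.qr(A)
print(np.linalg.svd(fjlt_sketch(U, m=6 * d, r=16, rng=rng), compute_uv=False)[[0, -1]])
```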
Yet, the application to FJLTs does not leverage the full potential of Theorem 1.3, which is
particularly useful for efficiently constructing optimal subspace embeddings when only coarse
leverage score estimates are available, a setting that has arisen in recent works on fast subspace
embeddings [21, 22]. In the following section, we use it to construct a new fast low-distortion
SE (i.e., ϵ ≪ 1) with optimal embedding dimension, addressing a question posed by Chera-
panamjeri, Silwal, Woodruff and Zhou [22] (see Theorem 1.6 for details).
1.2. Fast subspace embeddings. Next, we illustrate how our main results can be used to
construct fast subspace embeddings with optimal embedding dimension. In most applications,
OSEs are used to perform dimensionality reduction on an n × d matrix A by constructing the
smaller m × d matrix SA. The subspace embedding condition ensures that ∥SAx∥ ≈ ∥Ax∥
for all x ∈ Rd up to a multiplicative factor 1 + ϵ, which has numerous applications, including
fast linear regression (see Section 1.3). The key computational bottleneck here is the cost
of computing SA. We aim for input sparsity time, i.e., O(nnz(A)), where nnz(A) is the
number of non-zeros in A, possibly with an additional small polynomial dependence on d.
Our results for computing a fast subspace embedding SA with optimal embedding dimension
are summarized in Table 1, alongside recent prior works.
In the following result, we build on Theorem 1.2 to provide an input sparsity time algorithm
for constructing a fast oblivious subspace embedding with constant distortion ϵ = O(1), and
optimal embedding dimension m = O(d). This is the first optimal OSE construction that is
faster than current matrix multiplication time O(d^ω).

Theorem 1.4 (Fast oblivious subspace embedding). Given d ≤ n and any γ > 0, there is
a distribution over m × n matrices S where m = O(d), such that for any A ∈ Rn×d , with
probability at least 0.9:
(1/2) ∥Ax∥ ≤ ∥SAx∥ ≤ 2∥Ax∥   ∀x ∈ R^d,

and SA can be computed in O(γ^{-1} nnz(A) + d^{2+γ} polylog(d)) time. Moreover, for γ = Ω(1),
we can generate such a random matrix S using only polylog(nd) many uniform random bits.
Remark 1.5. The problem of constructing an OSE using polylog(nd) many random bits was
also brought up by Nelson and Nguyen, who obtained this with m = d polylog(d). To achieve
it with m = O(d), we introduce a new sparse construction, likely of independent interest,
where the non-zeros are distributed along the diagonals instead of the columns of S.
The runtime of our method matches the best known non-oblivious SE with m = O(d),
recently obtained by [22], while at the same time being much simpler to implement: their
construction requires solving a semidefinite program to achieve the optimal dimension, while
we simply combine several sparse matrix multiplication steps. Moreover, thanks to its obliv-
iousness, our embedding can be easily adapted to streaming settings. For example, consider
numerical linear algebra in the turnstile model [3], where we wish to maintain a sketch of
A while receiving a sequence of updates Ai,j ← Ai,j + δ. Using the construction from The-
orem 1.4, we can maintain a constant-distortion subspace embedding of A in the turnstile
model with optimal space of O(d^2 log(nd)) bits, while reducing the update time exponentially,
from O(d) (for a dense OSE matrix) to polylog(d) time.
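As a rough illustration of this update cost, the sketch below maintains SA under turnstile updates using only the s non-zeros of the affected column of S (class and variable names are ours; the paper's construction additionally derandomizes S):

```python
import numpy as np

class TurnstileSketch:
    """Maintain the sketch SA (m x d) of A under turnstile updates A[i, j] += delta.

    Column i of the sparse OSE S is stored as (col_rows[i], col_vals[i]), its s
    non-zero positions and +/- 1/sqrt(s) values, so each update touches only s
    entries of SA.  (Illustrative sketch only.)
    """

    def __init__(self, m, d, col_rows, col_vals):
        self.col_rows = col_rows
        self.col_vals = col_vals
        self.SA = np.zeros((m, d))

    def update(self, i, j, delta):
        # (SA)[k, j] += S[k, i] * delta for the s non-zeros in column i of S
        self.SA[self.col_rows[i], j] += self.col_vals[i] * delta

rng = np.random.default_rng(0)
n, d, m, s = 1000, 20, 80, 4
col_rows = [rng.choice(m, size=s, replace=False) for _ in range(n)]
col_vals = [rng.choice([-1.0, 1.0], size=s) / np.sqrt(s) for _ in range(n)]
sk = TurnstileSketch(m, d, col_rows, col_vals)
sk.update(i=3, j=7, delta=2.5)   # O(s) = polylog(d) work per turnstile update
```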
In the next result, we build on Theorem 1.3 to provide the first subspace embedding
with low distortion ϵ = o(1) and optimal embedding dimension m = O(d/ϵ^2) that can
be applied in current matrix multiplication time. This addresses the question posed by
Cherapanamjeri, Silwal, Woodruff and Zhou [22], who gave a current matrix multiplication
time algorithm for a low-distortion subspace embedding, but with a sub-optimal embedding
dimension m = O(d log(d)/ϵ^2). We are able to improve upon this embedding dimension by
replacing leverage score sampling (used by [22]) with our leverage score sparsified embedding
construction, developed as part of the proof of Theorem 1.3.
Theorem 1.6 (Fast low-distortion subspace embedding). Given A ∈ Rn×d and ϵ > 0, we
can compute an m × d matrix SA such that m = O(d/ϵ^2), and with probability at least 0.9:

(1 + ϵ)^{-1} ∥Ax∥ ≤ ∥SAx∥ ≤ (1 + ϵ)∥Ax∥   ∀x ∈ R^d,

in time O(γ^{-1} nnz(A) + d^ω + poly(1/ϵ) d^{2+γ} polylog(d)) for any 0 < γ < 1.
1.3. Applications to linear regression. Our fast subspace embeddings can be used to
accelerate numerous approximation algorithms in randomized numerical linear algebra, in-
cluding for linear regression, low-rank approximation, rank computation and more. Here,
we illustrate this with the application to linear regression tasks. In the following result, we
use Theorem 1.4 to provide the first single pass algorithm for a relative error least squares
approximation with optimal both time and space complexity.
Theorem 1.7 (Fast least squares). Given an n × d matrix A and an n × 1 vector b, specified
with O(log nd)-bit numbers, consider the task of finding x̃ such that:
∥Ax̃ − b∥_2 ≤ (1 + ϵ) min_x ∥Ax − b∥_2.
The following statements are true for this task:
(1) For ϵ = Θ(1), we can find x̃ with a single pass over A and b in O(nnz(A) + d^ω) time,
using O(d^2 log(nd)) bits of space.

(2) For arbitrary ϵ > 0, we can compute x̃ in O(γ^{-1} nnz(A) + d^ω + d^{2+γ}/ϵ) time, using
O(d^2 log(nd)) bits of space, for any 0 < γ < 1.

For part (1) of the claim, we note that the obtained space complexity matches the lower
bound Ω(d^2 log(nd)) of Clarkson and Woodruff [3]. Moreover, it is clear that solving a least
squares problem with any worst-case relative error guarantee requires at least reading the
entire matrix A and solving a d × d linear system, which implies that the O(nnz(A) + d^ω) time is
also optimal. For part (2) of the claim, we note that a similar time complexity for a 1 + ϵ (non-
single-pass) least squares approximation was shown by [22], except they had an additional
O(ϵ^{-1} d^2 polylog(d) log(1/ϵ)) term. We avoid that extra term, thereby obtaining the correct O(1/ϵ)
dependence on the relative error, by employing a carefully tuned preconditioned mini-batch
stochastic gradient descent with approximate leverage score sampling. This approach is of
independent interest, as it is very different from that of [22], who computed a sketch-and-solve
estimate by running preconditioned gradient descent on the sketch.
Finally, we point out that our fast low-distortion subspace embeddings (Theorem 1.6) can
be used to construct reductions for a wider class of constrained/regularized least squares
problems, which includes Lasso regression among others [18]. The following result provides
the first O(d/ϵ^2) × d such reduction for ϵ = o(1) in current matrix multiplication time.

Theorem 1.8 (Fast reduction for constrained/regularized least squares). Given A ∈ Rn×d ,
b ∈ Rn and ϵ > 0, consider an n × d linear regression task T (A, b, ϵ) of finding x̃ such that:

f(x̃) ≤ (1 + ϵ) min_{x∈C} f(x),   where   f(x) = ∥Ax − b∥_2^2 + g(x),

for some g : R^d → R_{≥0} and a set C ⊆ R^d. We can reduce this task to solving an O(d/ϵ^2) × d
instance T(Ã, b̃, 0.1ϵ) in O(γ^{-1} nnz(A) + d^ω + poly(1/ϵ) d^{2+γ} polylog(d)) time.

1.4. Overview of techniques. One of the key ingredients in our analysis involves establish-
ing the universality of a class of random matrices, building on the techniques of Brailovskaya
and Van Handel [29], by characterizing when the spectrum of a sum of independent random
matrices is close to that of a Gaussian random matrix whose entries have the same mean and
covariance. We adapt these techniques to a class of nearly-square random matrices that arise
from applying an m × n sparse random matrix S to an n × d isometric embedding matrix
U , showing high probability bounds for the Hausdorff distance between the spectrum of SU
and the spectrum of a corresponding Gaussian random matrix.
A key limitation of the results of [29] is that they require full independence between
the random matrices in a sum (which correspond to sub-matrices of the matrix S), unlike,
for instance, the analysis of Nelson and Nguyen [17] which uses a moment method that
only requires O(log(d))-wise independence. We address this with the independent diagonals
construction: we propose a distribution over m × n sparse random matrices S where the non-
zeros are densely packed into a small number of diagonals (see Figure 1) so that, while the
diagonals are fully independent, the entries within a single diagonal only need to be 2-wise
independent. As a consequence, the resulting construction requires only n/m · polylog(n)
uniform random bits to generate, and we further improve that to polylog(n) by combining it
with the Nelson-Nguyen embedding.
Standard sparse embedding matrices are not very effective at producing low-distortion
subspace embeddings, i.e., with ϵ = o(1), because their density (non-zeros per column) has
to grow with 1/ϵ, so that their complexity is no longer input sparsity time. Prior work
has dealt with this problem by using a constant distortion subspace embedding as a pre-
conditioner for computing the leverage score estimates l1 , ..., ln of the input matrix A [21],
and then constructing a subspace embedding in a non-oblivious way out of a sub-sample

of m = O(d log d/ϵ^2) rows of A. This leverage score sampling scheme is effectively equivalent
to using an extremely sparse embedding matrix S which has a single non-zero entry
S_{i,I_i} ∼ ±1/√(l_{I_i}) in each row, with its index I_i sampled according to the leverage score
distribution (l_1/Z, ..., l_n/Z), where Z = Σ_i l_i. Unfortunately, due to the well-known coupon collector
m = O(d/ϵ2 ). We circumvent this issue by making the embedding matrix S slightly denser,
with α poly(1/ϵ) polylog(d) non-zeros per row, where α is the approximation factor in the
leverage score distribution (i.e., leverage score sparsification, see Figure 2). Unlike the obliv-
ious sparse embedding, here it is the row-density (instead of column-density) that grows
with 1/ϵ, which means that the overall algorithm can still run in input sparsity time. We
note that our algorithms use α = O(dγ ) approximation factor for the leverage scores, where
0 < γ < 1 is a parameter that can be chosen arbitrarily. This parameter reflects a trade-off
in the runtime complexity, between the O(γ −1 (nnz(A) + d2 )) cost of estimating the leverage
scores, and the density of the leverage score sparsified embedding.
To construct a least squares approximation with ϵ = o(1) (Theorem 1.7 part 2), we use
our constant distortion subspace embedding to compute a preconditioner for matrix A. That
preconditioner is then used first to approximate the leverage scores, as well as to compute a
constant factor least squares approximation, and then to improve the convergence rate of a
gradient descent-type algorithm. However, unlike prior works [21, 22, 30], which either use
a full gradient or a stochastic gradient based on a single row-sample, we observe that the
computationally optimal strategy is to use a stochastic gradient based on a mini-batch of
O(αd) rows, where α is the leverage score approximation factor. With the right sequence of
decaying step sizes, this strategy leads to the optimal balance between the cost of computing
the gradient estimate and the cost of preconditioning it, while retaining fast per-iteration
convergence rate, leading to the O(αd2 /ϵ) overall complexity of stochastic gradient descent.
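The following is a simplified, hedged sketch of this strategy (our own illustration: a dense Gaussian sketch stands in for the sparse OSE, and the step-size schedule and batch size are not the carefully tuned choices from our analysis):

```python
import numpy as np

def precond_minibatch_sgd(A, b, n_iter=20, batch=None, seed=0):
    """Illustrative preconditioned mini-batch SGD for least squares."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    batch = batch or 4 * d

    # 1) sketch-based preconditioner: QR of SA with m = O(d) rows
    m = 8 * d
    S = rng.standard_normal((m, n)) / np.sqrt(m)     # dense Gaussian stand-in for a sparse OSE
    _, R = np.linalg.qr(S @ A)

    # 2) leverage-score estimates from the preconditioned matrix A R^{-1}
    AR = np.linalg.solve(R.T, A.T).T
    probs = np.sum(AR**2, axis=1)
    probs /= probs.sum()

    # 3) constant-factor warm start, then preconditioned mini-batch SGD
    x = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]
    for t in range(n_iter):
        idx = rng.choice(n, size=batch, p=probs)
        w = 1.0 / (batch * probs[idx])                      # importance weights -> unbiased gradient
        g = A[idx].T @ (w * (A[idx] @ x - b[idx]))          # stochastic estimate of A^T(Ax - b)
        x -= (1.0 / (t + 2)) * np.linalg.solve(R.T @ R, g)  # preconditioned step, decaying step size
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20000, 40)); b = rng.standard_normal(20000)
x = precond_minibatch_sgd(A, b)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(A @ x - b) / np.linalg.norm(A @ x_star - b))   # relative residual suboptimality
```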

1.5. Related work. Our results follow a long line of work on matrix sketching techniques,
which have emerged as part of the broader area of randomized linear algebra; see [9–14] for
comprehensive overviews of the topic. These methods have proven pivotal in speeding up
fundamental linear algebra tasks such as least squares regression [1, 2, 5], lp regression [4, 8,
31, 32], low-rank approximation [33, 34], linear programming [35], and more [36, 37]. Many
of these results have also been studied in the streaming and turnstile computation models
[3].
Subspace embeddings are one of the key algorithmic tools in many of the above randomized
linear algebra algorithms. Using sparse random matrices for this purpose was first proposed
by Clarkson and Woodruff [5], via the CountSketch which has a single non-zero entry per col-
umn, and then further developed by several other works [4, 17, 19] to allow multiple non-zeros
per column as well as refining the embedding guarantees. Non-uniformly sparsified embed-
ding constructions have been studied recently, including leverage score sparsified embeddings
[26–28, 38], although these works use much denser matrices than we propose in this work, as
well as relying on somewhat different constructions. There have also been recent efforts on
achieving the optimal embedding dimension for subspace embeddings [21, 22], which do not
rely on sparse embedding matrices, and therefore do not address the conjecture of Nelson
and Nguyen (see Table 1 for a comparison).

2. Preliminaries
Notation. The following notation and terminology will be used in the paper. The notation
[n] is used for the set {1, 2, ..., n}. The operator norm of a matrix X is denoted by ∥X∥ and its condition
number by κ(X). For clarity, the operator norm is also denoted by ∥X∥_op in some places
where other norms appear. We shall denote the spectrum of a matrix X, which is the set

of all eigenvalues of X, by spec(X). The standard probability measure is denoted by P, and
the symbol E means taking the expectation with respect to the probability measure. The
standard L_p norm of a random variable ξ is denoted by ∥ξ∥_p, for 1 ≤ p ≤ ∞. Throughout
the paper, the symbols c_1, c_2, ..., and Const, Const′, ... denote absolute constants.

Oblivious Subspace Embeddings. We define an oblivious subspace embedding, i.e., an (ϵ, δ, d)-
OSE, following Definition 1.1, to be a random m × n matrix S such that for any n × d matrix
U with orthonormal columns (i.e., U ⊤ U = Id ),
(2.1)    P( ∀x ∈ R^d : (1 + ϵ)^{-1} ∥x∥ ≤ ∥SU x∥ ≤ (1 + ϵ)∥x∥ ) ≥ 1 − δ.
For computational efficiency, we usually consider sparse OSEs. A standard construction for
a sparse OSE involves i.i.d. Rademacher entries in each position, sparsified by multiplication
with independent Bernoulli random variables. More precisely, S has i.i.d. entries s_{i,j} =
δ_{i,j} ξ_{i,j}, where the δ_{i,j} are independent Bernoulli random variables taking value 1 with probability
p_{m,n,d} ∈ (0, 1] and the ξ_{i,j} are i.i.d. random variables with P(ξ_{i,j} = 1) = P(ξ_{i,j} = −1) = 1/2.
Note that this results in S having s = pm many non-zero entries per column and pn many
non zero entries per row on average. We shall call this the oblivious subspace embedding
with independent entries distribution.
Definition 2.1 (OSE-IID-ENT). An m × n random matrix S is called an oblivious subspace
embedding with independent entries (OSE-IID-ENT) if S has i.i.d. entries s_{i,j} = δ_{i,j} ξ_{i,j}, where the
δ_{i,j} are independent Bernoulli random variables taking value 1 with probability p_{m,n,d} ∈ (0, 1]
and the ξ_{i,j} are i.i.d. random variables with P(ξ_{i,j} = 1) = P(ξ_{i,j} = −1) = 1/2.
Another example comes from a class of sparse sketching matrices proposed by Nelson
and Nguyen [17], called OSNAPs. They define a sketching matrix S as an oblivious sparse
norm-approximating projection (OSNAP) if it satisfies the following properties:

(1) s_{ij} = δ_{ij} σ_{ij}/√s, where the σ_{ij} are i.i.d. ±1 random variables and δ_{ij} is an indicator
random variable for the event S_{ij} ≠ 0.
(2) For all j ∈ [n], Σ_{i=1}^{m} δ_{i,j} = s with probability 1, i.e. every column has exactly s non-zero
entries.
(3) For any T ⊂ [m] × [n], E[ Π_{(i,j)∈T} δ_{ij} ] ≤ (s/m)^{|T|}.
(4) The columns of S are i.i.d.
One example of an OSNAP can be constructed as follows when s divides m. In this case,
we divide each column of S into s many blocks, with each block having m/s many rows. For
each block, we randomly and uniformly select one nonzero entry and set its value to be ±1
with probability 1/2. Note that the blocks in each column are i.i.d., and the columns of S
are i.i.d. We then see that S/√s satisfies the properties of an OSNAP. For convenience, in
this work we will refer to this as the OSNAP distribution, and we will define it using the
parameter p = s/m instead of s. To define such a distribution formally, we first define the
one hot distribution.
Definition 2.2 (One Hot Distribution). Let M be an a × b random matrix. Let γ be a
random variable taking values in [a] × [b] with P(γ = (i, j)) = 1/(ab). Let ξ be a Rademacher
random variable (P(ξ = −1) = P(ξ = 1) = 1/2). M is said to have the one hot distribution if

M = ξ ( Σ_{(i,j)∈[a]×[b]} 1_{{(i,j)}}(γ) E_{i,j} ),

where E_{i,j} is an a × b matrix with 1 in the (i, j)th entry and 0 everywhere else.
Definition 2.3 (OSNAP-IND-COL). An m×n random matrix S is called an oblivious sparse
norm-approximating projection with independent subcolumns distribution (OSNAP-IND-
COL) with parameter p such that s = pm divides m, if each submatrix S[(m/s)(i−1)+1:(m/s)i]×{j}

of S for i ∈ [s], j ∈ [n] has the one hot distribution, and all these submatrices are jointly
independent.
Below we collect the existing subspace embedding results for OSNAP matrices, which are
relevant to this work.
Lemma 2.4 (Existing sparse embedding guarantees). The following are some of the known
guarantees for OSNAP embedding matrices:
• [5] showed that there is an OSNAP matrix S with m = O(ϵ^{-2} d^2) rows and 1 non-zero
entry per column (i.e., CountSketch) which is an OSE with distortion ϵ.
• [19] showed that there is an OSNAP matrix S with m = O(ϵ^{-2} d^{1+γ} log d) rows and
s = O(1/(γϵ)) non-zero entries per column which is an OSE with distortion ϵ. Note
that setting γ = 1/log(d), we get m = O(ϵ^{-2} d log d) and O(log(d)/ϵ) non-zeros per
column.
• [17] showed that there is an OSNAP matrix using O(log(d) log(nd)) uniform random
bits with m = O(ϵ^{-2} d^{1+γ} log^8(d)) and O(1/(γ^3 ϵ)) non-zero entries per column.

Non-oblivious subspace embeddings. Following Definition 1.1, we say that an m × n random


matrix S is a (non-oblivious) subspace embedding for a given n×d matrix U with orthonormal
columns if it satisfies (2.1) for that matrix U . In this case, to obtain subspace embedding
guarantees with even sparser random matrices, we can use the information about the subspace
in the form of its leverage scores. For i = 1, ..., n, the ith leverage score of a d-dimensional
subspace of Rn is the squared norm of the ith row of its orthonormal basis matrix U , i.e.,
∥e_i^⊤ U∥_2^2 (this definition is in fact independent of the choice of basis).
We note that in most applications (e.g., Theorem 1.6), subspace embedding matrices are
typically used to transform an arbitrary n × d matrix A (not necessarily with orthonormal
columns), constructing a smaller m × d matrix SA. In this case, we seek an embedding for
the subspace of vectors {z : z = Ax for x ∈ Rd }. Here, the corresponding U matrix has
columns that form an orthonormal basis for the column-span of A. Thus, in practice we do
not have access to matrix U or its leverage scores. Instead, we may compute leverage score
approximations [7].
Definition 2.5 (Approximate Leverage Scores). For β1 ≥ 1, β2 ≥ 1, a tuple (l_1, . . . , l_n) of
numbers are (β1, β2)-approximate leverage scores for U if, for 1 ≤ i ≤ n,

∥e_i^⊤ U∥^2 / β1 ≤ l_i    and    Σ_{i=1}^{n} l_i ≤ β2 ( Σ_{i=1}^{n} ∥e_i^⊤ U∥^2 ) = β2 d.
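For reference, exact leverage scores can be computed from a thin QR factorization, which is the quantity the (β1, β2)-approximations above are allowed to distort; a minimal sketch with our own helper name:

```python
import numpy as np

def leverage_scores(A):
    """Exact leverage scores: squared row norms of an orthonormal basis of col(A)."""
    Q, _ = np.linalg.qr(A)          # thin QR; Q has orthonormal columns
    return np.sum(Q**2, axis=1)     # l_i = ||e_i^T Q||^2, and sum(l) = d

A = np.random.default_rng(0).standard_normal((1000, 20))
l = leverage_scores(A)
print(l.sum())                      # ~ 20 = d
```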

Uniformizing leverage scores by preconditioning. Another way of utilizing information about


the leverage scores to get embedding guarantees with sparser matrices is to precondition the
matrix U using the randomized Hadamard transform to uniformize the row norms, resulting
in (d/n, d/n, . . . d/n) becoming approximate leverage scores for the preconditioned matrix.
To this end, we first define the Walsh-Hadamard matrix.
Definition 2.6. The Walsh-Hadamard matrix H_{2^k} of dimension 2^k × 2^k, for k ∈ N ∪ {0}, is
the matrix obtained using the recurrence relation

H_1 = [1],    H_{2n} = [ H_n    H_n
                         H_n   −H_n ].

In what follows, we drop the subscript of H_{2^k} when the dimension is clear.

Definition 2.7. The randomized Hadamard transform (RHT) of an n × d matrix U is the
product (1/√n) HDU, where D is a random n × n diagonal matrix whose entries are independent
random signs, i.e., random variables uniformly distributed on {±1}. Here, by padding U
with zero rows if necessary, we may assume that n is a power of 2.
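A small numerical illustration of the RHT (our own code, using a dense Hadamard matrix for simplicity, so it is only meant for moderate n): it applies (1/√n)HD to an orthonormal U with a few heavy rows and compares the largest squared row norm before and after, which Lemma 2.8 below quantifies.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
n, d = 1024, 10                              # n must be a power of 2 (pad with zero rows otherwise)
A = rng.standard_normal((n, d))
A[:5] *= 100.0                               # a few heavy rows -> highly non-uniform leverage scores
U, _ = np.linalg.qr(A)                       # orthonormal basis of col(A)

D = rng.choice([-1.0, 1.0], size=n)
HDU = hadamard(n) @ (D[:, None] * U) / np.sqrt(n)   # the RHT (1/sqrt(n)) H D U

# before vs. after: maximum squared row norm (leverage score); d/n is the uniform value
print(np.max(np.sum(U**2, axis=1)), np.max(np.sum(HDU**2, axis=1)), d / n)
```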
The key property of the randomized Hadamard transform that we use is that it uniformizes
the row norms of U with high probability. More precisely, we have,
Lemma 2.8 (Lemma 3.3, [25]). Let U be an n × d matrix with orthonormal columns. Then,
(1/√n) HDU is an n × d matrix with orthonormal columns, and, for δ > 0,

P( max_{j=1,...,n} ∥e_j^⊤ ((1/√n) HDU)∥ ≥ √(d/n) + √(8 log(n/δ)/n) ) ≤ δ.

Universality. In this paragraph, we describe the random matrix universality result of [29],
which is central to our analysis of sparse subspace embedding matrices. The object of study
here is a random matrix model given by
(2.2)    X := Z_0 + Σ_{i=1}^{n} Z_i,

where Z_0 is a symmetric deterministic d × d matrix and Z_1, . . . , Z_n are symmetric independent
random matrices with E[Z_i] = 0. We shall compare the spectrum of X to the spectrum of
a Gaussian model G that has the same mean and covariance structure as X. More precisely,
denoting by Cov(X) the d^2 × d^2 covariance matrix of the entries of X,

Cov(X)_{ij,kl} := E[(X − E X)_{ij} (X − E X)_{kl}],

G is the d × d symmetric random matrix such that:
(1) {G_{ij} : i, j ∈ [d]} are jointly Gaussian;
(2) E[G] = E[X] and Cov(G) = Cov(X).
The above two properties uniquely define the distribution of G. We next define the notion
of Hausdorff distance, which will be used in the universality result below.
Definition 2.9 (Hausdorff distance). Let A, B ⊂ R^n. Then the Hausdorff distance between
A and B is given by

d_H(A, B) = inf{ε ≥ 0 : A ⊆ B_ε and B ⊆ A_ε},

where A_ε (resp. B_ε) denotes the ε-neighbourhood of A (resp. B).
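Since the spectra we compare are finite sets of real numbers, the Hausdorff distance can be evaluated directly from the definition; a small sketch (our own helper):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two finite point sets A, B (1-D arrays)."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    D = np.abs(A[:, None] - B[None, :])            # pairwise distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

print(hausdorff([0.0, 1.0, 2.0], [0.1, 1.0, 2.5]))  # 0.5
```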
Lemma 2.10 (Theorem 2.4, [29]). Given the random matrix model (2.2), define the following:

σ(X) = ∥E[(X − E X)^2]∥_op^{1/2},
σ_*(X) = sup_{∥v∥=∥w∥=1} E[|⟨v, (X − E X)w⟩|^2]^{1/2},
R(X) = ∥ max_{1≤i≤n} ∥Z_i∥_op ∥_∞.

There is a universal constant C > 0 such that for any t ≥ 0,

P( d_H( spec(X), spec(G) ) > Cϵ(t) ) ≤ d e^{−t},

where ϵ(t) = σ_*(X) t^{1/2} + R(X)^{1/3} σ(X)^{2/3} t^{2/3} + R(X) t.

This result can be viewed as a sharper version of the Matrix Bernstein inequality [25] for
the concentration of sums of random matrices. To see this, note that for the random matrix
model (2.2), Matrix Bernstein implies that:
E∥X∥ ≲ σ(X) √(log d) + R(X) log d,

which can be recovered by Lemma 2.10 (see Example 2.12 in [29]). However, Lemma 2.10
together with Theorem 1.2 in [39] implies that:

E(∥X∥) ≤ C( σ(X) + v(X)^{1/2} σ(X)^{1/2} (log d)^{3/4} + R(X)^{1/3} σ(X)^{2/3} (log d)^{2/3} + R(X) log d ),

where v(X) = ∥Cov(X)∥ is the norm of the covariance matrix of the d^2 scalar entries. This
result can be sharper than the Matrix Bernstein inequality because, when v(X) and R(X)
are small enough, we will have E∥X∥ ≲ σ(X), which improves the Matrix Bernstein
inequality by removing the √(log d) factor.

Spectrum of Gaussian Matrices. To leverage the universality properties of the random matrix
model, we shall rely on the following result about the singular values of Gaussian matrices,
which in particular can be used to recover the optimal subspace embedding guarantee for
Gaussian sketches.
Lemma 2.11 ((2.3), [16]). Let G be an m × n matrix whose entries are independent standard
normal variables. Then,
P( √m − √n − t ≤ s_min(G) ≤ s_max(G) ≤ √m + √n + t ) ≥ 1 − 2e^{−t^2/2}.

3. Analysis of Oblivious Sparse Embeddings


In this section, we state and prove our main OSE result, Theorem 3.4 (given as Theorem 1.2
in Section 1). Before we get to the proof, however, we propose a new model for sparse OSEs
that is designed to exploit the strength of our proof in dealing with the number of independent
random bits required. To illustrate the issue, consider that in the OSE-IID-ENT model
(Definition 2.1), we need mn many random bits to determine the non-zero entries of the matrix
S, even though the matrix will have far fewer non-zero entries. Naturally, there are many
known strategies for improving this, including OSNAP-IND-COL (Definition 2.3) and even
more elaborate hashing constructions based on polynomials over finite fields [17], which allow
reducing the random bit complexity to polylogarithmic in the dimensions. However, these
constructions do not provide sufficient independence needed in the random matrix model
(2.2) to apply the universality result of Brailovskaya and Van Handel (Lemma 2.10). We
address this with the independent diagonals distribution family, defined shortly.
In order to use universality for establishing a subspace embedding guarantee, we must
analyze a symmetrized version of the matrix SU , for an n × d orthogonal matrix U , with
S being a sum of sparse independent random matrices, say Yi ’s. Naturally, to compare
the spectra of SU and an appropriate Gaussian model, the matrix S cannot be too sparse.
However, since the individual entries of each Yi need not be independent, we can reduce the
number of independent summands in S by making each individual summand Yi denser. At
the same time, we need to control ∥Yi U ∥, just as we would when using the standard matrix
Bernstein inequality.
Both these goals can be achieved by placing non-zero entries along a diagonal of Yi . Placing
±1 entries along a diagonal of a matrix keeps its norm bounded by 1 whereas other arrange-
ments (say, along a row or column) do not. Moreover, these ±1 entries along the diagonal
need not be independent, they can simply be uncorrelated. As a result, an instance of Yi can
be generated with just O(1) random bits.

Figure 1. Left: Structure of the random matrix F_{γ_l}(W_l). Right: Illustration
of an embedding matrix S with independent diagonals with parameters d = 8,
m = 10, and n = 30, showing how the nonzero entries of S occur along
diagonals. The number of diagonals is controlled by the parameter p.

3.1. Independent Diagonals Construction. With this motivation, we define the inde-
pendent diagonals distribution formally (Figure 1 illustrates this construction).
Definition 3.1 (OSE-IND-DIAG). An m × n random matrix S is called an oblivious subspace
embedding with independent diagonals (OSE-IND-DIAG) with parameter p if it is
constructed in the following way. Assume that np is an integer. Let W = (w_1, . . . , w_m) be
a random vector whose components are ±1 valued and uncorrelated, i.e. E[w_i] = 0, E[w_i^2] =
1, and E[w_i w_j] = 0 for i ≠ j. We define γ to be a random variable uniformly distributed in [n]. Let
γ_1, ..., γ_{np} be i.i.d. copies of γ. Let W_1, ..., W_{np} be i.i.d. copies of W. Let F_j(x) be the function
that transforms an m-dimensional vector x into the m × n matrix placing x on the j-th diagonal,
i.e. in positions (1, j) through (m, j + m mod n), with all other entries zero (see Figure 1 for an
illustration). Let

S = Σ_{l∈[np]} F_{γ_l}(W_l).
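A minimal sketch of this construction (our own code; for simplicity it draws fully independent signs within each diagonal, whereas Definition 3.1 only requires the components of W to be uncorrelated, which is how the random bit count in Lemma 3.3 below is achieved):

```python
import numpy as np
from scipy.sparse import coo_matrix

def ose_ind_diag(m, n, p, rng):
    """Sample an OSE-IND-DIAG matrix: np independent +/-1 diagonals, each
    starting at a uniformly random column (duplicate positions add up, as in
    the sum S = sum_l F_{gamma_l}(W_l))."""
    k = round(n * p)                                            # number of diagonals; assumes np is an integer
    rows = np.tile(np.arange(m), k)
    starts = rng.integers(0, n, size=k)                         # gamma_1, ..., gamma_np
    cols = ((starts[:, None] + np.arange(m)) % n).reshape(-1)   # wrap-around diagonal positions
    vals = rng.choice([-1.0, 1.0], size=k * m)                  # the sign vectors W_1, ..., W_np
    return coo_matrix((vals, (rows, cols)), shape=(m, n)).tocsr()

rng = np.random.default_rng(0)
S = ose_ind_diag(m=100, n=10_000, p=0.02, rng=rng)   # 200 diagonals, pm = 2 non-zeros per column on average
print(S.shape, S.nnz)
```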

Universality results show that the properties of a general random matrix are similar to
the properties of a Gaussian random matrix with the same covariance profile. Therefore, to
analyze the OSE models, we need to first calculate the covariances between entries.
Lemma 3.2 (Variance and uncorrelatedness). Let p = p_{m,n} ∈ (0, 1] and S = {s_{ij}}_{i∈[m],j∈[n]}
be an m × n random matrix distributed according to the OSE-IID-ENT, OSNAP-IND-COL,
or OSE-IND-DIAG distributions. Then, E(s_{ij}) = 0 and Var(s_{ij}) = p for all i ∈ [m], j ∈ [n],
and Cov(s_{i_1 j_1}, s_{i_2 j_2}) = 0 for any {i_1, i_2} ⊂ [m], {j_1, j_2} ⊂ [n] with (i_1, j_1) ≠ (i_2, j_2).
Proof. For the i.i.d. entries case, the result is straightforward.
For the independent diagonals case, since S is a sum of i.i.d. copies of Y := F_γ(W), we
analyze Y first. By definition,

Y = Σ_{i=1}^{m} w_i ( Σ_{j=1}^{n} 1_{{j−i+1}}(γ) E_{i,j} ),

where W = (w_1, . . . , w_m) is a random vector whose components are ±1 valued and uncorrelated,
i.e. E[w_i] = 0, E[w_i^2] = 1, E[w_i w_j] = 0, and j − i + 1 is taken modulo n in [n].
Thus,

y_{ij} = w_i 1_{{j−i+1}}(γ).

First, note that we have E(y_{ij}) = 0, and

E(y_{ij}^2) = E(1_{{j−i+1}}(γ)^2) E(w_i^2) = 1/n.

Moreover,

Cov(y_{i_1 j_1}, y_{i_2 j_2}) = E(y_{i_1 j_1} y_{i_2 j_2})
  = E( w_{i_1} w_{i_2} 1_{{j_1−i_1+1}}(γ) 1_{{j_2−i_2+1}}(γ) )
  = E(w_{i_1} w_{i_2}) E( 1_{{j_1−i_1+1}}(γ) 1_{{j_2−i_2+1}}(γ) ),

by independence between γ and W.
When j_1 − i_1 ≠ j_2 − i_2, we know that 1_{{j_1−i_1+1}}(γ) 1_{{j_2−i_2+1}}(γ) is always 0 because γ can
only take one value.
When j_1 − i_1 = j_2 − i_2, since (i_1, j_1) ≠ (i_2, j_2), we must have i_1 ≠ i_2. In this case, by
uncorrelatedness of the components of W,

Cov(y_{i_1 j_1}, y_{i_2 j_2}) = 0 · E( 1_{{j_1−i_1+1}}(γ) 1_{{j_2−i_2+1}}(γ) ) = 0.
Now, since

S = Σ_{l=1}^{np} Y_l,

we have, by using the independence of the Y_l, Var(s_{i_1 j_1}) = p, and, when (i_1, j_1) ≠ (i_2, j_2),

Cov(s_{i_1 j_1}, s_{i_2 j_2}) = E(s_{i_1 j_1} s_{i_2 j_2}) = Σ_{l=1}^{np} E( (Y_l)_{i_1 j_1} (Y_l)_{i_2 j_2} ) = 0.

We now consider the independent subcolumns distribution OSNAP-IND-COL. Recall that


in the OSNAP-IND-COL distribution, each subcolumn S[(m/s)(i−1)+1:(m/s)i]×{j} of S for i ∈
[s], j ∈ [n] has the one hot distribution, and all these subcolumns are jointly independent.
Therefore, by the same argument as above, we directly calculate the variances of the entries:

E(s_{ij}^2) = P(s_{ij} ≠ 0) = 1/(m/(pm)) = p.

For the covariances, we first observe that Cov(si1 j1 , si2 j2 ) = 0 if (i1 , j1 ) and (i2 , j2 ) belong
to two different subcolumns by independence. If (i1 , j1 ) and (i2 , j2 ) belong to the same
subcolumn, we have
Cov(s_{i_1 j_1}, s_{i_2 j_2}) = E(s_{i_1 j_1} s_{i_2 j_2}) = E(0) = 0
because each subcolumn has one hot distribution which means at most one of si1 j1 and si2 j2
can be nonzero.

Lemma 3.3 (Random bit complexity). For a parameter 0 < p < 1, the random bit complexity
to generate the OSE-IND-DIAG distribution is O(pn log(n)).
Proof. We can generate the vector W using log(m) random bits, using a construction involv-
ing linear functions on finite fields, to get m 2-wise independent random variables (Prop.
3.24 and Corollary 3.34. in [40]) and then transforming them to be ±1 valued. To gener-
ate the random variable γ which is uniformly distributed in [n], the number of random bits
needed is proportional to the information entropy of γ, which is O(log(n)) by [41, Theorem
5.11.2]. Taken together, the total number of random bits needed to generate each Fγl (Wl ) is
O(log(m) + log(n)).
Finally, since S is the sum of np many independent copies of Fγl (Wl ), it takes O((log(n) +
log(m)) · np) = O(pn log(n)) many random bits to generate S. □
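As a concrete (hedged) illustration of the first step, one standard way to expand O(log m) seed bits into m pairwise-independent ±1 values is a random affine map over GF(2); this is our own sketch of that idea, not necessarily the exact construction of [40]:

```python
import numpy as np

def pairwise_independent_signs(m, seed_bits):
    """Expand ~log2(m) + 1 seed bits into m pairwise-independent +/-1 values
    via w_i = (-1)^(<a, bin(i)> xor b) over GF(2)."""
    k = max(int(np.ceil(np.log2(m))), 1)
    a, b = np.asarray(seed_bits[:k]), int(seed_bits[k])   # uses k + 1 seed bits
    bits = (np.arange(m)[:, None] >> np.arange(k)) & 1    # binary expansion of each index
    parity = (bits @ a + b) % 2
    return 1 - 2 * parity                                 # 0 -> +1, 1 -> -1

rng = np.random.default_rng(0)
m = 1 << 10
seed = rng.integers(0, 2, size=16)        # O(log m) random bits rather than m
w = pairwise_independent_signs(m, seed)
print(w[:8], w.mean())
```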

3.2. Proof of Theorem 1.2. We now state the main theorem of this section, which is the
detailed version of Theorem 1.2.
Theorem 3.4 (Analysis of OSE by Universality). Let S be an m × n matrix distributed
according to the OSE-IID-ENT, OSNAP-IND-COL, or OSE-IND-DIAG distributions with
parameter p. Let U be an arbitrary n × d deterministic matrix such that U^⊤ U = I. Then,
there exist constants c_{3.4.1} and c_{3.4.2} such that for any ε, δ > 0, we have

P( 1 − ε ≤ s_min((1/√(pm)) SU) ≤ s_max((1/√(pm)) SU) ≤ 1 + ε ) ≥ 1 − δ

when m ≥ c_{3.4.1} max(d, log(4/δ))/ε^2 and pm > c_{3.4.2} (log(d/δ))^4/ε^6.
Alternatively, given a fixed θ < 3, there exist constants c_{3.4.3}, c_{3.4.4} and c_{3.4.5} such that for
m ≥ max{(1 + θ)d, c_{3.4.3} log(4/δ)/θ^2} and pm > c_{3.4.4} (log(d/δ))^4/θ^6,

P( κ((1/√(pm)) SU) ≤ c_{3.4.5}/θ ) ≥ 1 − δ.
The proof of Theorem 3.4 follows by setting X = sym(SU ), where, for a rectangular
matrix A, we define

sym(A) = [ 0    A
           A^⊤  0 ].

To use Lemma 2.10 we need to bound the parameters σ(X), σ_*(X) and R(X), which we do
before proceeding with the proof. σ(X) and σ_*(X) do not depend on the decomposition of X
as a sum of independent random matrices and can be calculated explicitly using the covariance
structure of X, which also tells us that the relevant Gaussian model here is sym(√p G), where
G is an m × d i.i.d. standard Gaussian random matrix.
We use the following lemma to bound σ(X) and σ_*(X).
Lemma 3.5 (Covariance Parameters). Let S = {s_{ij}}_{i∈[m],j∈[n]} be an m × n random matrix
such that E(s_{ij}) = 0 and Var(s_{ij}) = p for all i ∈ [m], j ∈ [n], and Cov(s_{ij}, s_{kl}) = 0 for any
{i, k} ⊂ [m], {j, l} ⊂ [n] with (i, j) ≠ (k, l). We then have

σ_*(sym(SU)) ≤ 2√p   and   σ(sym(SU)) ≤ √(pm).
Proof. Let X = sym(SU). Recall that σ_*(X) = sup_{∥v∥=∥w∥=1} E[|⟨v, (X − E X)w⟩|^2]^{1/2}. Fixing
vectors v, w, we have

⟨v, (X − E X)w⟩ = Σ_{1≤i≤d+m} Σ_{1≤j≤d+m} sym(SU)_{i,j} w_j v_i
  = Σ_{1≤i≤d} Σ_{1≤j≤m} sym(SU)_{m+i,j} w_j v_{m+i} + Σ_{1≤i≤m} Σ_{1≤j≤d} sym(SU)_{i,m+j} w_{m+j} v_i
  = Σ_{1≤i≤d} Σ_{1≤j≤m} (SU)_{j,i} w_j v_{m+i} + Σ_{1≤i≤m} Σ_{1≤j≤d} (SU)_{i,j} w_{m+j} v_i.

Let η = U v_{[m+1:m+d]} and ξ = U w_{[m+1:m+d]}; then ∥η∥_2, ∥ξ∥_2 ≤ 1, and

⟨v, (X − E X)w⟩ = ⟨w_{[1:m]}, Sη⟩ + ⟨v_{[1:m]}, Sξ⟩ = Σ_{1≤i≤m} Σ_{1≤j≤n} s_{i,j} (w_i η_j + v_i ξ_j).

Using the condition that E(s_{ij}) = 0 and Var(s_{ij}) = p for all i ∈ [m], j ∈ [n], and
Cov(s_{ij}, s_{kl}) = 0 for any (i, j) ≠ (k, l), we conclude

E[|⟨v, (X − E X)w⟩|^2] ≤ p Σ_{1≤i≤m} Σ_{1≤j≤n} (w_i η_j + v_i ξ_j)^2
  ≤ p Σ_{1≤i≤m} Σ_{1≤j≤n} ( w_i^2 η_j^2 + v_i^2 ξ_j^2 + 2 w_i η_j v_i ξ_j )
  ≤ p ( 1 + 1 + 2⟨w_{[1:m]}, v_{[1:m]}⟩⟨η, ξ⟩ ) ≤ 4p,

which means σ_*(X) ≤ 2√p.
Now, σ(sym(SU)) = ∥E[(sym(SU))^2]∥_op^{1/2}, and

(sym(SU))^2 = [ SU U^⊤ S^⊤     0
                0              U^⊤ S^⊤ SU ].

Since E[SU U^⊤ S^⊤] = pd · I and E[U^⊤ S^⊤ SU] = pm · I, we see that σ(sym(SU)) ≤ √(pm). □

Proof of Theorem 3.4. Using Lemma 3.5 and Lemma 3.2, we have

σ_*(X) ≤ 2√p   and   σ(sym(SU)) ≤ √(pm)

for all three distributions.
R(X) depends on the decomposition of X as a sum of independent random matrices,
so we write the matrix SU as a sum of independent random matrices in each of the three
distributions:
(1) (OSE-IID-ENT) Observe that S = Σ_{i,j} s_{i,j} (e_i e_j^⊤), so we have SU = Σ_{i,j} s_{i,j} (e_i e_j^⊤) U =
Σ_{i,j} s_{i,j} (e_i u_j^⊤), where u_j is the jth row vector of the matrix U. Therefore, we have
sym(SU) = Σ_{i,j} s_{i,j} sym(e_i u_j^⊤).
(2) (OSE-IND-DIAG) Here SU = Σ_{j=1}^{pn} Y_j U, so sym(SU) = Σ_{j=1}^{pn} sym(Y_j U).
(3) (OSNAP-IND-COL) In this case as well, the sum has the same form as in (1), but
the indices corresponding to each subcolumn are random, and the sum is over the
number of subcolumns.
and calculate R(X):
(1) (OSE-IID-ENT) Since s_{i,j} is bounded by 1, we have R(sym(SU)) ≤ max_{i,j} ∥sym(e_i u_j^⊤)∥_op ≤ 1.
(2) (OSE-IND-DIAG) Since ∥Y_i U∥_op ≤ ∥Y_i∥_op ∥U∥_op ≤ 1 and ∥sym(A)∥_op = ∥A∥_op, we
see that R(sym(SU)) ≤ 1.
(3) (OSNAP-IND-COL) This is the same as the case for (OSE-IID-ENT).
Using Lemma 2.10 with t = log(2d/δ), we have

P( d_H( spec(sym(SU)), spec(sym(√p G)) ) > Cϵ(log(2d/δ)) ) ≤ δ/2.

Using the SVD, we see that the m + d eigenvalues of sym(SU) are the d singular values of
SU, their negatives, and m − d zeroes.

Using Lemma 2.11, we have

P( √(pm) − √(pd) − √(2p log(4/δ)) ≤ s_min(√p G)
    ≤ s_max(√p G) ≤ √(pm) + √(pd) + √(2p log(4/δ)) ) ≥ 1 − δ/2.

So,

P( √(pm) − √(pd) − √(2p log(4/δ)) − Cϵ(log(2d/δ)) ≤ s_min(SU)
    ≤ s_max(SU) ≤ √(pm) + √(pd) + √(2p log(4/δ)) + Cϵ(log(2d/δ)) ) ≥ 1 − δ.

Thus, when m > max{ d/λ^2, 8 log(4/δ)/λ^2 }, we have √(2p log(4/δ))/√(pm) < λ/2 and
√(d/m) < λ, and therefore we conclude

P( 1 − λ − λ/2 − Cϵ(log(2d/δ))/√(pm) ≤ s_min((1/√(pm)) SU)
    ≤ s_max((1/√(pm)) SU) ≤ 1 + λ + λ/2 + Cϵ(log(2d/δ))/√(pm) ) ≥ 1 − δ.

To obtain the desired high probability bound for the largest and smallest singular values,
we need Cϵ(log(2d/δ)) ≤ (λ/2)√(pm). By the earlier calculations of σ(X), σ_*(X) and R(X), we
have that ϵ(t) = 2√p t^{1/2} + (√(pm))^{2/3} t^{2/3} + t. It is enough to require that

(√(pm))^{2/3} (log(2d/δ))^{2/3} ≤ (λ/(8C)) √(pm),

which is equivalent to

(8C)^6 (log(2d/δ))^4 / λ^6 ≤ pm.

Thus, if pm ≥ C′ (log(2d/δ))^4/λ^6 for a large enough C′, we also have log(2d/δ) ≤ (λ/(8C)) √(pm) and
2 (log(2d/δ))^{1/2} ≤ (λ/(4C)) √(pm), which gives us Cϵ(log(2d/δ)) ≤ (λ/2)√(pm). Setting λ = ε/2, we
prove the first part of the theorem.
On the other hand, when m ≥ (1 + θ)d,

P( √(pm) − √(pd) − √(2p log(4/δ)) − Cϵ(log(2d/δ)) ≤ s_min(SU)
    ≤ s_max(SU) ≤ √(pm) + √(pd) + √(2p log(4/δ)) + Cϵ(log(2d/δ)) ) ≥ 1 − δ

implies

P( 1 − 1/√(1+θ) − √(2p log(4/δ))/√(pm) − Cϵ(log(2d/δ))/√(pm) ≤ s_min((1/√(pm)) SU)
    ≤ s_max((1/√(pm)) SU) ≤ 1 + 1/√(1+θ) + √(2p log(4/δ))/√(pm) + Cϵ(log(2d/δ))/√(pm) ) ≥ 1 − δ.

Figure 2. LESS-IND-ENT with decreasing leverage scores. Since the proba-


bility of an entry being non-zero is proportional to the corresponding leverage
score, we see that the matrix becomes sparser as we move in the direction of
decreasing leverage scores. Since the scaling of entries is inversely proportional
to the square root of the corresponding leverage score, the magnitude of the
non-zero entries becomes larger as we move to the right.

Now, for θ < 3, 1/√(1+θ) < 1 − θ/6, and for m ≥ 36 · 16 · 2 · log(4/δ)/θ^2, we get

P( θ/6 − θ/24 − Cϵ(log(2d/δ))/√(pm) ≤ s_min((1/√(pm)) SU)
    ≤ s_max((1/√(pm)) SU) ≤ 3 + Cϵ(log(2d/δ))/√(pm) ) ≥ 1 − δ.

Repeating the previous analysis now shows that Cϵ(log(2d/δ)) ≤ (θ/24)√(pm) when pm ≥
C′ (log(2d/δ))^4/θ^6, so we have

P( θ/12 ≤ s_min((1/√(pm)) SU) ≤ s_max((1/√(pm)) SU) ≤ 4 ) ≥ 1 − δ. □

4. Analysis of Non-oblivious Sparse Embeddings


In the non-oblivious setting, we are looking to embed a specific d-dimensional subspace of
Rn represented by a matrix U , and we assume that we have access to (β1 , β2 )-approximate
leverage scores of U as defined in Section 2. Given access to this information, we can modify
the oblivious models of S to give more weight to certain coordinates of the ambient space.
We refer to this approach as Leverage Score Sparsification (LESS).
4.1. Leverage Score Sparsification. We propose two variants of a LESS embedding. First,
we consider an extension of the OSE-IID-ENT model (with i.i.d. entries) studied in Section 3,
with the Bernoulli sparsifier for each entry of S being non-zero with a probability proportional

to the leverage score of the corresponding component. However, we still want the variance of
each entry of S to be p, so the entries are appropriately scaled copies of ±1 (See Figure 2).
Similar to the previous section, the analysis proceeds by considering a symmetrized version
of SU where S is viewed as a sum of independent random matrices Zi . The scaling of entries
of S retains the bound on ∥Zi U ∥ as mentioned in Section 3. A formal definition follows.
Definition 4.1 (LESS-IND-ENT). An m × n random matrix S is called a leverage score
sparsified embedding with independent entries (LESS-IND-ENT) corresponding to (β1, β2)-
approximate leverage scores (l_1, ..., l_n) with parameter p, if S has entries s_{i,j} = (1/√(β1 l_j)) δ_{i,j} ξ_{i,j},
where the δ_{i,j} are independent Bernoulli random variables taking value 1 with probability p_{ij} =
β1 l_j p and the ξ_{i,j} are i.i.d. random variables with P(ξ_{i,j} = 1) = P(ξ_{i,j} = −1) = 1/2.
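A minimal sketch of this distribution (our own code; it materializes a dense Bernoulli mask purely for clarity, whereas a practical implementation would sample only the non-zero positions, cf. LESS-IND-POS below):

```python
import numpy as np
from scipy.sparse import csr_matrix

def less_ind_ent(m, lev, p, beta1, rng):
    """Sample an m x n LESS-IND-ENT matrix from approximate leverage scores `lev`."""
    lev = np.clip(np.asarray(lev, float), 1e-12, None)
    q = np.minimum(beta1 * lev * p, 1.0)            # P(delta_{ij} = 1), one value per column
    mask = rng.random((m, len(lev))) < q            # Bernoulli sparsifiers
    signs = rng.choice([-1.0, 1.0], size=mask.shape)
    return csr_matrix(mask * signs / np.sqrt(beta1 * lev))   # entries (1/sqrt(beta1 l_j)) delta xi

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 30)); A[:10] *= 50.0          # a few heavy rows
U, _ = np.linalg.qr(A)
lev = np.sum(U**2, axis=1)                                   # exact scores, so beta1 = beta2 = 1
m, p = 120, 0.2
S = less_ind_ent(m, lev, p, beta1=1.0, rng=rng)
print(np.linalg.svd((S @ U) / np.sqrt(p * m), compute_uv=False)[[0, -1]])
```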
In the next variant of a LESS embedding, we are able to reduce the computational and
random bit complexity for generating sparsity by only generating as many non-zero entries
as required. Here, S is the sum of mnp many i.i.d. random matrices Zi where each Zi is
determined by choosing one entry from the mn possible entries and setting the remaining
entries to 0. Here, instead of choosing the positions uniformly at random, we choose them
proportionally to the corresponding leverage score.
Definition 4.2 (LESS-IND-POS). Assume that β1 pm Σ_i l_i is an integer. An m × n random
matrix S is called a leverage score sparsified embedding with independent positions
(LESS-IND-POS) corresponding to the (β1, β2)-approximate leverage scores (l_1, ..., l_n) with
parameter p, if S is a sum of β1 pm Σ_i l_i i.i.d. copies Z_k of Z, i.e.,

S = Σ_{k=1}^{β1 pm Σ_i l_i} Z_k,

where Z is defined as follows. Let γ be a random variable taking values in [m] × [n] with
P(γ = (i, j)) = (1/m) · (l_j / Σ_{k=1}^{n} l_k). Let ξ be a Rademacher random variable (P(ξ = −1) =
P(ξ = 1) = 1/2). Then,

Z = ξ ( Σ_{(i,j)∈[m]×[n]} 1_{{(i,j)}}(γ) (1/√(β1 l_j)) E_{i,j} ),

where E_{i,j} is an m × n matrix with 1 in the (i, j)th entry and 0 everywhere else.
4.2. Proof of Theorem 1.3. These modifications allow us to prove subspace embedding
guarantees for sparser matrices than in the oblivious case, thereby showing Theorem 1.3. In
particular, we show that with LESS it suffices to use O(log^4(d/δ)) nonzero entries per row of
S, instead of per column of S, which is much sparser since S is a wide matrix.
Theorem 4.3. Let U be an arbitrary n × d deterministic matrix such that U^⊤ U = I, with
(β1, β2)-approximate leverage scores (l_1, ..., l_n). There exist constants c_{4.3.1}, c_{4.3.2}, c_{4.3.3}, such
that for any 0 < ε < 1, 0 < δ < 1, and any LESS-IND-ENT or LESS-IND-POS random
matrix S corresponding to (l_1, ..., l_n) with embedding dimension m ≥ c_{4.3.1} max(d, log(4/δ))/ε^2
and parameter p ≥ c_{4.3.2} (log(d/δ))^4/(mε^6), we have

P( 1 − ε ≤ s_min((1/√(pm)) SU) ≤ s_max((1/√(pm)) SU) ≤ 1 + ε ) ≥ 1 − δ,

and if we choose m = c_{4.3.1} max(d, log(4/δ))/ε^2 and p = c_{4.3.2} (log(d/δ))^4/(mε^6), we have the
following high probability bound for the maximum number of nonzero entries per row:

P( max_{i∈[m]} card({j ∈ [n] : s_{ij} ≠ 0}) ≤ c_{4.3.3} β1 β2 (log(d/δ))^4/ε^4 ) ≥ 1 − δ.

Alternatively, given a fixed θ < 3, there exist constants c_{4.3.4}, c_{4.3.5}, c_{4.3.6} and c_{4.3.7} such that
for m ≥ max{(1 + θ)d, c_{4.3.4} log(4/δ)/θ^2} and p > c_{4.3.5} (log(d/δ))^4/(mθ^6),

P( κ((1/√(pm)) SU) ≤ c_{4.3.6}/θ ) ≥ 1 − δ,

and if we choose m = max{(1 + θ)d, c_{4.3.4} log(4/δ)/θ^2} and p = c_{4.3.5} (log(d/δ))^4/(mθ^6), we
have the following high probability bound for the maximum number of nonzero entries per row:

P( max_{i∈[m]} card({j ∈ [n] : s_{ij} ≠ 0}) ≤ c_{4.3.7} β1 β2 (log(d/δ))^4/θ^6 ) ≥ 1 − δ.

To prove Theorem 4.3, we first need to calculate the covariance profile of the LESS models.
Lemma 4.4 (Variance and uncorrelatedness). Let S = {s_{ij}}_{i∈[m],j∈[n]} be an m × n random
matrix with the LESS-IND-ENT or LESS-IND-POS distribution with parameter p. Then,
E(s_{ij}) = 0 and Var(s_{ij}) = p for all i ∈ [m], j ∈ [n], and Cov(s_{i_1 j_1}, s_{i_2 j_2}) = 0 for any
{i_1, i_2} ⊂ [m], {j_1, j_2} ⊂ [n] with (i_1, j_1) ≠ (i_2, j_2).
Proof. For the independent entries case, we have Var(s_{ij}) = (1/(β1 l_j)) · β1 l_j p = p, and the entries are
uncorrelated because they are independent.
For the independent positions case, since S is a sum of i.i.d. copies of Z, we analyze Z
first. By definition, we have

Z = ξ ( Σ_{(i,j)∈[m]×[n]} 1_{{(i,j)}}(γ) (1/√(β1 l_j)) E_{i,j} ).

Let z_{ij} be the (i, j)th entry of Z. Then, we have

z_{ij} = (1/√(β1 l_j)) 1_{{(i,j)}}(γ) ξ.

Since ξ and γ are independent, we know that E(z_{ij}) = (1/√(β1 l_j)) E(1_{{(i,j)}}(γ)) E(ξ) = 0 because
E(ξ) = 0. Also, we have

E(z_{ij}^2) = (1/√(β1 l_j))^2 E(1_{{(i,j)}}(γ)^2) E(ξ^2) = 1/(β1 m Σ_{j=1}^{n} l_j).

Now, for the covariance, since the random variables z_{ij} are centered, we have

Cov(z_{i_1 j_1}, z_{i_2 j_2}) = E(z_{i_1 j_1} z_{i_2 j_2})
  = E( (1/√(β1 l_{j_1})) 1_{{(i_1,j_1)}}(γ) ξ · (1/√(β1 l_{j_2})) 1_{{(i_2,j_2)}}(γ) ξ )
  = (1/√(β1 l_{j_1})) (1/√(β1 l_{j_2})) E(ξ^2) E( 1_{{(i_1,j_1)}}(γ) 1_{{(i_2,j_2)}}(γ) ).

We now observe that 1_{{(i_1,j_1)}}(γ) 1_{{(i_2,j_2)}}(γ) = 0 on the entire sample space, because γ can
only take one value and (i_1, j_1) ≠ (i_2, j_2).
Now, since

S = Σ_{k=1}^{β1 pm Σ_i l_i} Z_k,

we have, by independence, Var(s_{ij}) = (β1 pm Σ_i l_i) · 1/(β1 m Σ_{j=1}^{n} l_j) = p and, when (i_1, j_1) ≠ (i_2, j_2),

Cov(s_{i_1 j_1}, s_{i_2 j_2}) = E(s_{i_1 j_1} s_{i_2 j_2}) = Σ_{k=1}^{β1 pm Σ_i l_i} E( (Z_k)_{i_1 j_1} (Z_k)_{i_2 j_2} ) = 0. □

Now, we can prove Theorem 4.3.

Proof of Theorem 4.3. By Lemma 4.4, we see that both distributions satisfy the condition of
√ √
Lemma 3.5, so we conclude that σ∗ (X) ≤ 2 p and σ(sym(SU )) ≤ pm.
For the IND-ENT case, since |sij| is bounded by 1/√(β1 lj), we have
R(sym(SU)) ≤ max_{i,j} (1/√(β1 lj)) ∥sym(ei uj^T)∥op ≤ 1.
And for the IND-POS case, since |(zk)ij| ≤ 1/√(β1 lj), we similarly have
R(sym(SU)) ≤ max_{i,j,k} |(zk)ij| ∥sym(ei uj^T)∥op ≤ 1.
For both cases, since their matrix parameters σ∗(X), σ(X) and R(X) are the same as in
Theorem 3.4, following the proof of Theorem 3.4, we conclude that there exist constants
c1, c2 such that for any ε, δ > 0, we have
P( 1 − ε ≤ s_min((1/√(pm)) SU) ≤ s_max((1/√(pm)) SU) ≤ 1 + ε ) ≥ 1 − δ
when m ≥ c1 max(d, log(4/δ))/ε^2 and pm > c2 (log(d/δ))^4/ε^6. The proof of the claim when
m ≥ (1 + θ)d also follows in the same manner as in Theorem 3.4.
Since p is just a parameter in the LESS models, we want to know what the requirement
pm > c2 (log(d/δ))^4/ε^6 (similarly, pm > c2 (log(d/δ))^4/θ^6) means for the average number of
nonzero entries in each row. First, the average number of nonzero entries in each row is
β1 p Σ_{j=1}^n lj ≤ β1 β2 pd for both cases, simply by the construction of these matrices. Then we
observe that the condition pm > c2 (log(d/δ))^4/ε^6 is equivalent to
pd > c2 (log(d/δ))^4/ε^6 · (d/m).
Since m ≥ d/ε^2, we have d/m ≤ ε^2. Therefore, to meet the requirement pd > c2 (log(d/δ))^4/ε^6 · (d/m),
it suffices to have pd > c2 (log(d/δ))^4/ε^4. So the optimal choice of p leads to β1 β2 c2 (log(d/δ))^4/ε^4
nonzero entries in each row on average.
In the case when m ≥ (1 + θ)d and pm > c2 (log(d/δ))^4/θ^6, since we can only claim d/m < 1
in general, we need pd > c2 (log(d/δ))^4/θ^6, so the average number of nonzero entries in each
row would be β1 β2 c2 (log(d/δ))^4/θ^6.
We now obtain the high-probability bound for the number of nonzero entries in each row.
For the independent entries distribution, the number of nonzero entries in row i is
Σ_{j=1}^n 1_{R\{0}}(sij) = Σ_{j=1}^n δij.
Therefore, the total variance of the sum is v1 = Σ_{j=1}^n β1 lj p (1 − β1 lj p) ≤ β1 β2 pd.
By Bernstein's inequality, for t ≥ 3 we have
P( Σ_{j=1}^n δij > tβ1β2 pd ) ≤ exp( −(tβ1β2 pd)^2 / (2(v1 + tβ1β2 pd/3)) ) ≤ exp( −tβ1β2 pd/4 ).
By a union bound, we have
P( max_{i∈[m]} card({j ∈ [n] : sij ≠ 0}) ≥ tβ1β2 pd ) ≤ m · exp(−tβ1β2 pd/4) = exp( −tβ1β2 pd/4 + log(c1 d/ε^2) ).
Since pd > c2 (log(d/δ))^4/ε^4, we have
exp( −tβ1β2 pd/4 + log(c1 d/ε^2) ) ≤ exp( −tβ1β2 c2 (log(d/δ))^4/(4ε^4) + log(c1 d/ε^2) ).
When t is large enough, we have (1/2) · tβ1β2 c2 (log(d/δ))^4/(4ε^4) > log(c1 d/ε^2), because 1/ε^4
dominates log(1/ε^2) and (log(d/δ))^4 dominates log(d).
In that case, we have
P( max_{i∈[m]} card({j ∈ [n] : sij ≠ 0}) ≥ tβ1β2 pd )
≤ exp( −(1/2) · tβ1β2 c2 (log(d/δ))^4/(4ε^4) )
≤ exp( −(1/2) · t c2 (log(d/δ))^4/4 ),
because β1 β2 > 1 and ε < 1. We can argue similarly for the m = (1 + θ)d case.
We observe that (log(d/δ))^4 > log(1/δ) since d > 1. Therefore, as long as we choose t
such that c2 t/8 > 1, we will have
exp( −t (c2/8) (log(d/δ))^4 ) ≤ exp( −(log(d/δ))^4 ) ≤ exp( −log(1/δ) ) = δ.
For the independent positions distribution, the number of nonzero entries in row i satisfies
Σ_{j=1}^n 1_{R\{0}}(sij) ≤ Σ_{k=1}^{β1 pm Σ_i li} ( Σ_{j=1}^n 1_{(i,j)}(γk) ).
We observe that Σ_{j=1}^n 1_{(i,j)}(γk) is a Bernoulli(1/m) random variable. Therefore, the total
variance of the sum on the right-hand side is v2 ≤ β1 pm (Σ_i li) · (1/m) ≤ β1 β2 pd, which is the same
bound as in the independent entries case. Therefore, the same result follows from Bernstein's
inequality. □
As an immediate corollary of Theorem 4.3 we give a new subspace embedding guarantee
for the Fast Johnson-Lindenstrauss Transform (FJLT). Recall that an FJLT preconditions
the matrix U with the Randomized Hadamard Transform (see Definition 2.7) to transform
it into another matrix V whose row norms can be well controlled. In this way, we obtain
approximate leverage scores for V by construction rather than by estimation. Then, we can
apply LESS-IND-ENT or LESS-IND-POS random matrices to the preconditioned matrix V
according to these approximate leverage scores.
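To illustrate the preconditioning step, here is a minimal numerical sketch; we form the Hadamard matrix explicitly, which is only sensible for small n, whereas in practice one pads to a power of two and applies a fast O(n log n) Walsh-Hadamard transform. The helper name and the test sizes are ours.

import numpy as np
from scipy.linalg import hadamard

def randomized_hadamard_precondition(A, rng=None):
    # Compute (1/sqrt(n)) * H * D * A, where D is a random +/-1 diagonal and H is
    # the n x n Walsh-Hadamard matrix (n is assumed to be a power of two here).
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    D = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n).astype(float) @ (D[:, None] * A) / np.sqrt(n)

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((1024, 16)))  # orthonormal n x d test input
V = randomized_hadamard_precondition(U, rng)
print(np.allclose(V.T @ V, np.eye(16)))               # True: HD preserves orthonormal columns
print(np.max(np.sum(V**2, axis=1)) * 1024 / 16)       # max row norm^2 relative to d/n stays O(1)

After this step, the uniform scores (d/n, ..., d/n) can be fed to a LESS-IND-ENT or LESS-IND-POS sketch, which is exactly the setting of the theorem below.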
Theorem 4.5 (Analysis of Preconditioned Sparse OSE). Let U be an arbitrary n × d
deterministic matrix such that U^T U = I. There exist constants c4.5.1, c4.5.2, c4.5.3 such
that for any 0 < ε < 1 and 2n/e^d < δ < 1, the following holds. Let S = Φ · (1/√n) HD, where
H and D are as in Definitions 2.6 and 2.7, and Φ has the LESS-IND-ENT or LESS-IND-POS
distribution corresponding to uniform leverage scores (d/n, ..., d/n), with embedding dimension
m ≥ c4.5.1 max(d, log(8/δ))/ε^2 and average number of nonzero entries per row
≥ c4.5.2 (log(2d/δ))^4/ε^4. Then we have
P( 1 − ε ≤ s_min((1/√(pm)) SU) ≤ s_max((1/√(pm)) SU) ≤ 1 + ε ) ≥ 1 − δ.
Proof of Theorem 4.5. First, note that (1/√n) HDU is an n × d matrix with orthonormal columns.
Let E denote the event that the tuple (l1 = d/n, l2 = d/n, ..., ln = d/n) of numbers gives
(16, 1)-approximate leverage scores for the matrix (1/√n) HDU. Clearly, Σ_{i=1}^n li ≤ d. Since
2n ≤ δe^d, we have log(2n/δ) ≤ d, and the claim from Lemma 2.8 reads
P( max_{j=1,...,n} ∥ej^T((1/√n) HDU)∥ ≥ √(d/n) + √(8d/n) ) ≤ δ/2.
Thus, with probability greater than 1 − δ/2, we have, for all j ∈ [n],
∥ej^T((1/√n) HDU)∥^2 < (1 + 2√2)^2 · d/n < 16d/n = 16 lj.
Thus, the conditions for (d/n, d/n, ..., d/n) to be (16, 1)-approximate leverage scores for the
matrix (1/√n) HDU are satisfied with probability greater than 1 − δ/2, i.e.,
P(E) ≥ 1 − δ/2.
Let V = (1/√n) HDU. Conditioning on the random matrix V, we can apply Theorem 4.3 and
obtain
P( 1 − ε ≤ s_min((1/√(pm)) ΦV) ≤ s_max((1/√(pm)) ΦV) ≤ 1 + ε | V ) ≥ 1 − δ/2
on the event E. Therefore, by independence between V and Φ, we have
P( 1 − ε ≤ s_min((1/√(pm)) SU) ≤ s_max((1/√(pm)) SU) ≤ 1 + ε ) ≥ (1 − δ/2)^2 ≥ 1 − δ,
as long as m ≥ c4.3.1 max(d, log(8/δ))/ε^2 and the average number of nonzero entries per row is at
least c4.3.2 (log(2d/δ))^4/ε^4. □
5. Analysis of Fast Subspace Embeddings


In this section, we give proofs for the fast subspace embedding results.
5.1. Proof of Theorem 1.4. Recall that our goal is to construct an m × n OSE matrix S
with m = O(d) that can be applied to a matrix A ∈ R^{n×d} in time
O(γ^{-1} nnz(A) + d^{2+γ} polylog(d)).
We will present two separate constructions for achieving this:
(1) Our first construction optimizes the time complexity of computing SA, including the
dependence on polylogarithmic factors.
(2) Our second construction optimizes the number of uniform random bits required to
generate S, trading off the dependence on polylogarithmic factors and on γ.
1. Optimizing time complexity. We combine several different transformations to construct the
embedding S. Namely, we compute SA by chaining four transformations, SA = S3(HD(S2(S1 A))),
defined as follows:
(1) S1 is an m1 × n OSNAP matrix with m1 = O(d^{1+γ} log d) rows and O(1/γ) non-zeros
per column. The cost of computing S1 A is O(γ^{-1} nnz(A)).
(2) S2 is an m2 × m1 OSNAP matrix with m2 = O(d log d) rows and O(log d) non-zeros
per column. The cost of computing S2(S1 A) is O(m1 d log d) = O(d^{2+γ} log^2 d).
(3) HD is an m2 × m2 randomized Hadamard transform (padded with zeros, if necessary).
The cost of computing HD(S2 S1 A) is O(m2 d log d) = O(d^2 log^2 d).
(4) S3 is our Leverage Score Sparsified embedding from Theorem 1.3 having m3 = O(d)
rows, implemented with all li = d/m2 and O(log^4 d) non-zeros per row. The cost of
computing S3(HDS2S1 A) is O(m3 d log^4 d) = O(d^2 log^4 d).
Note that, via Lemma 2.4, both S1 and S2 are constant-distortion subspace embeddings.
Moreover, the randomized Hadamard transform HD ensures that with high probability the
leverage scores of HDS2S1A are nearly uniform. This allows us to use the leverage score sparsified
embedding from Theorem 1.3 in an oblivious way (a toy code sketch of the full composition
follows the list below). Overall, the cost of computing SA becomes
O(γ^{-1} nnz(A) + d^{2+γ} log^2 d + d^2 log^4 d). We note that, using standard choices for γ,
this complexity can be simplified in the following ways:
• O(nnz(A) + d^{2+θ}) for any absolute constant θ > 0, obtained by choosing γ = θ/2;
• O(nnz(A) log d + d^2 log^4 d), obtained by choosing γ = 1/ log d.
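As noted above, here is a toy end-to-end sketch of the composition SA = S3(HD(S2(S1 A))). It uses dense matrices and an explicit Hadamard matrix, so it is only meaningful at small sizes; the constants and log powers are loose, and another OSNAP matrix stands in for the LESS step of Theorem 1.3. All function names are ours.

import numpy as np
from scipy.linalg import hadamard

def osnap(m, n, s, rng):
    # OSNAP-style sketch: each column has exactly s nonzeros, placed in s distinct
    # uniformly random rows, with values +/- 1/sqrt(s).
    S = np.zeros((m, n))
    for j in range(n):
        rows = rng.choice(m, size=s, replace=False)
        S[rows, j] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    return S

def chained_sketch(A, d, rng):
    n = A.shape[0]
    m1 = min(n, int(4 * d ** 1.1 * np.log(d)))        # S1: roughly O(d^{1+gamma} log d) rows
    m2 = int(4 * d * np.log(d))                       # S2: roughly O(d log d) rows
    B = osnap(m1, n, 2, rng) @ A                      # S1 A
    B = osnap(m2, m1, max(2, int(np.log(d))), rng) @ B            # S2 S1 A
    m2p = 1 << int(np.ceil(np.log2(m2)))              # pad so the Hadamard size is a power of two
    B = np.vstack([B, np.zeros((m2p - m2, B.shape[1]))])
    Dsign = rng.choice([-1.0, 1.0], size=m2p)
    B = hadamard(m2p).astype(float) @ (Dsign[:, None] * B) / np.sqrt(m2p)   # HD step
    m3 = 2 * d                                        # S3: O(d) rows, OSNAP stand-in for LESS
    return osnap(m3, m2p, max(2, int(np.log(d) ** 2)), rng) @ B

rng = np.random.default_rng(1)
A = rng.standard_normal((5000, 32))
print(chained_sketch(A, 32, rng).shape)               # (64, 32)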
2. Optimizing random bits. We next present a slightly modified construction that addresses
the claim discussed in Remark 1.5. Namely, we construct a constant-distortion oblivious
subspace embedding S that can be generated using only polylog(n) uniform random bits. To
do that, we combine three embeddings, S = S3 S2 S1, where:
(1) S1 is an OSNAP matrix with m1 = O(d^{1+γ} log^8(d)) rows and O(1/γ^3) non-zero entries
per column, which requires O(log d log n) random bits;
(2) S2 is an OSNAP matrix with m2 = O(d log^8(d)) rows and O(log^3(d)) non-zero entries
per column, which also requires O(log d log n) random bits;
(3) S3 is the diagonal construction from Theorem 1.2 with m3 = O(d) rows and O(log^4(d))
non-zeros per column. This requires O(m2/m3 · log^4 d · log n) = O(log^{12} d log n) many
random bits.
The overall cost of computing SA is thus O(γ^{-3} nnz(A) + d^{2+γ} polylog(d)), which for any
constant γ > 0 matches our optimal construction up to the polylog(d) term, while requiring
only O(log^{12} d log n) uniform random bits.
5.2. Proof of Theorem 1.6. Here, we provide the analysis for our fast low-distortion subspace
embedding, i.e., with ϵ = o(1). First, we assume that nnz(A) ≥ n, by ignoring the rows
of A that are all zeros. Next, note that we can also assume without loss of generality that
d ≥ n^c for c = 0.1. To see this, suppose otherwise, i.e., that d < n^c. Then, we can show that:
(5.1) O(nnz(A) + d^4/ϵ^2) = O(γ^{-1} nnz(A) + d^ω + ϵ^{-6} d^{2+γ}),
where the left-hand side is the cost of applying a CountSketch matrix S (i.e., OSNAP with
one non-zero per column) with embedding dimension m = O(d^2/ϵ^2), and then performing
the thin singular value decomposition SA = UDV^T to get a size d × d sketch DV^T that is the
desired subspace embedding. To show (5.1), we consider two cases:
(1) Suppose that ϵ ≥ n^{-1/4}. Then, we have d^4/ϵ^2 ≤ n ≤ nnz(A), which implies (5.1).
(2) Now, suppose otherwise, i.e., ϵ < n^{-1/4}. Then, we get ϵ^{-6} d^2 ≥ ϵ^{-2} n d^2 ≥ ϵ^{-2} d^4, which
also implies (5.1).
For the rest of the proof, we assume that d ≥ n^c. The basic idea of our low-distortion
embedding is to first use the constant-distortion embedding (from Theorem 1.4) to compute
rough approximations of the leverage scores, and then use those leverage score estimates as
input to our Leverage Score Sparsified embedding (from Theorem 1.3). We start by quoting
a result of [21], which shows how one can quickly approximate the row norms of the matrix AR,
given A ∈ R^{n×d} and R ∈ R^{d×d}, without computing the matrix product AR itself.
Lemma 5.1 (Modified Lemma 7.2 from [21]). Given A ∈ R^{n×d} and R ∈ R^{d×d}, we can
compute in O(γ^{-1}(nnz(A) + d^2)) time estimates l̃1, ..., l̃n such that, with probability ≥ 0.9,
n^{-γ} ∥ei^T AR∥_2^2 ≤ l̃i ≤ O(log(n)) ∥ei^T AR∥_2^2 for all i, and Σ_i l̃i ≤ O(1) · ∥AR∥_F^2.
The idea of the construction of [21] essentially follows the usual JL-type row-norm approximation
strategy, e.g., from [7]. But, while the standard approach is to compute constant-factor
approximations in O((nnz(A) + d^2) log n) time, they observe that we can obtain coarser
O(n^γ)-factor approximations in true input sparsity time. The procedure is as follows. Let G
be a d × k Gaussian matrix, scaled by 1/√k, where k = O(γ^{-1}). Then, we compute:
l̃i = ∥ei^T ARG∥_2^2.
Note that these can be computed in time O(γ^{-1} d^2) for computing RG, plus O(γ^{-1} nnz(A))
for computing A(RG) and its row norms. These estimates will then with high probability
satisfy the guarantee from Lemma 5.1. Finally, while this was not part of Lemma 7.2 in [21],
we also observe that this construction satisfies an O(1) upper bound, sharper than O(log n),
if we consider the aggregate of all estimates instead of the individual l̃i's. This holds since
Σ_i l̃i = ∥ARG∥_F^2, which is an unbiased estimate of ∥AR∥_F^2, so a constant-factor approximation
holds with probability 0.9 via Markov's inequality (and this can easily be improved to a high-
probability bound, e.g., via Bernstein's inequality). We note that in all of our applications,
an element-wise lower bound and an aggregate upper bound are sufficient.
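A minimal version of this estimator, with the function name and test matrices ours, might look as follows; G plays the role of the scaled d × k Gaussian matrix above.

import numpy as np

def approx_row_norms(A, R, k, rng=None):
    # Estimates ||e_i^T A R||_2^2 via ||e_i^T A R G||_2^2 for a d x k Gaussian G
    # scaled by 1/sqrt(k).  Computing RG costs O(d^2 k) and A(RG) costs O(nnz(A) k),
    # so the row norms of AR are estimated without ever forming AR.
    rng = np.random.default_rng(rng)
    d = R.shape[0]
    G = rng.standard_normal((d, k)) / np.sqrt(k)
    ARG = A @ (R @ G)                    # note the parenthesization: A R is never formed
    return np.sum(ARG ** 2, axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 50))
R = rng.standard_normal((50, 50))        # any d x d matrix; in Algorithm 1 it comes from a QR step
lev_est = approx_row_norms(A, R, k=8, rng=rng)
print(lev_est.sum() / np.linalg.norm(A @ R, 'fro') ** 2)   # aggregate estimate is within a constant factor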
Now, we are ready to describe our own procedure for constructing a low-distortion subspace
embedding, given in Algorithm 1.

Algorithm 1 Fast low-distortion subspace embedding.
1: Compute S1 A with S1 from Theorem 1.4
2: Compute R such that S1 A = QR^{-1} for orthogonal Q
3: Compute leverage score estimates l̃i ≈ ∥ei^T AR∥_2^2 using Lemma 5.1
4: Compute S, the Leverage Score Sparsified embedding from Theorem 1.3, using l̃1, ..., l̃n
5: return S

Note that, thanks to the subspace embedding property of S1, we have:
∥ei^T AR∥_2^2 ≈_{O(1)} ∥ei^T U∥_2^2 = ℓi(A),
where U is the orthonormal basis matrix for A's column span. Instead of computing the ∥ei^T AR∥_2^2
directly, we approximate them using Lemma 5.1, to obtain O(n^γ)-approximations l̃i. Having
computed the l̃i's, we compute the leverage score sparsified embedding S from Theorem 1.3,
initialized with the l̃i's. Setting m = O(d/ϵ^2) and using O(n^γ log^4(d)/ϵ^4) non-zeros per row, we
obtain a subspace embedding with distortion ϵ. The cost of computing SA is
O(m d n^γ log^4(d)/ϵ^4) = O(n^γ d^2 log^4(d)/ϵ^6).
Since we assumed that d ≥ n^c, by adjusting γ we can replace n^γ with d^γ. Also, note
that computing R takes O(d^ω) time, and the cost of computing S1 A and the l̃i's is bounded by
O(γ^{-1} nnz(A) + d^{2+γ} log^4 d). Thus, the overall time complexity of Algorithm 1 is:
O(γ^{-1} nnz(A) + d^ω + ϵ^{-6} d^{2+γ} log^4(d)).
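To make the pipeline concrete, here is a toy end-to-end version of Algorithm 1, with a dense Gaussian matrix standing in for S1, the Gaussian row-norm estimator of Lemma 5.1, and a LESS-IND-ENT-style final sketch built from the estimates. All names and constants are illustrative, and nothing here is organized to achieve the stated running time.

import numpy as np

def fast_low_distortion_embedding(A, eps, rng=None):
    # (1) constant-distortion sketch S1 A (Gaussian stand-in), (2) R with S1 A = Q R^{-1},
    # (3) crude leverage score estimates via a Gaussian sketch of AR, (4) a sparse
    # LESS-IND-ENT-style matrix built from those estimates, with m = O(d/eps^2) rows.
    rng = np.random.default_rng(rng)
    n, d = A.shape
    S1 = rng.standard_normal((4 * d, n)) / np.sqrt(4 * d)
    Q, Rfac = np.linalg.qr(S1 @ A)                     # S1 A = Q Rfac, i.e. R = Rfac^{-1}
    R = np.linalg.inv(Rfac)
    G = rng.standard_normal((d, 8)) / np.sqrt(8)       # k = 8 columns, k = O(1/gamma)
    lev = np.sum((A @ (R @ G)) ** 2, axis=1)           # estimates of ||e_i^T A R||_2^2
    lev = np.maximum(lev, lev.sum() / n ** 2)          # keep the estimates away from zero
    m = int(np.ceil(4 * d / eps ** 2))                 # embedding dimension m = O(d/eps^2)
    p = min(1.0, np.ceil(np.log(d) ** 4 / eps ** 4) / m)
    keep = rng.random((m, n)) < np.minimum(lev * p, 1.0)        # sparsity pattern
    S = keep * rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(lev)
    return S / np.sqrt(p * m)                          # overall 1/sqrt(pm) normalization

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 20))
S = fast_low_distortion_embedding(A, eps=0.5, rng=rng)
sv = np.linalg.svd(S @ np.linalg.qr(A)[0], compute_uv=False)
print(sv.min(), sv.max())      # both should lie roughly within 1 +/- eps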

6. Analysis of Fast Linear Regression


In this section, we give proofs for the linear regression results.

6.1. Proof of Theorem 1.8. This result follows immediately from our fast low-distortion
embedding result (Theorem 1.6). Compute [Ã | b̃] = S[A | b], where S is our low-distortion
embedding for [A | b]. Now, suppose that we find x̃ such that:
f̃(x̃) ≤ (1 + ϵ) min_{x∈C} f̃(x), where f̃(x) = ∥Ãx − b̃∥_2^2 + g(x).
Note that the subspace embedding property implies that:
(1 + ϵ)^{-2} f(x) ≤ f̃(x) ≤ (1 + ϵ)^2 f(x) for all x ∈ C.
Then, defining x∗ = argmin_{x∈C} f(x), our estimate x̃ satisfies:
f(x̃) ≤ (1 + ϵ)^2 f̃(x̃) ≤ (1 + ϵ)^4 min_{x∈C} f̃(x) ≤ (1 + ϵ)^4 f̃(x∗) ≤ (1 + ϵ)^6 f(x∗) ≤ (1 + 10ϵ) f(x∗).
Adjusting ϵ by a constant factor concludes the reduction.
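For illustration, here is a minimal unconstrained instance of this reduction (C = R^d and g ≡ 0), with a dense Gaussian matrix standing in for our fast low-distortion embedding; all names and sizes are ours.

import numpy as np

def sketch_and_solve_lstsq(A, b, m, rng=None):
    # Sketch [A | b] once with an m x n subspace embedding (a scaled Gaussian
    # stand-in here), then solve the smaller least squares problem exactly.
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    A_t, b_t = S @ A, S @ b
    x_t, *_ = np.linalg.lstsq(A_t, b_t, rcond=None)
    return x_t

rng = np.random.default_rng(0)
A = rng.standard_normal((20000, 40))
b = A @ rng.standard_normal(40) + 0.1 * rng.standard_normal(20000)
x_sketch = sketch_and_solve_lstsq(A, b, m=800, rng=rng)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
f = lambda x: np.linalg.norm(A @ x - b) ** 2
print(f(x_sketch) / f(x_exact))    # close to 1 when m is much larger than d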

6.2. Proof of Theorem 1.7. Part 1 of Theorem 1.7 follows immediately from our fast
oblivious subspace embedding result, Theorem 1.4, combined with the above reduction result,
Theorem 1.8. Crucially, the algorithm relies on the fact that our embedding is oblivious, which
allows us to construct the sketches SA and Sb in a single pass over the data, using the
optimal O(d^2 log(nd)) bits of space.
Next, we move on to part 2 of Theorem 1.7, i.e., computing a (1 + ϵ)-approximation of the
least squares solution x∗ = argmin_x ∥Ax − b∥_2 (not necessarily in a single pass). We propose
the following algorithm, which is a variant of preconditioned mini-batch stochastic gradient
descent with O(d^{1+γ}) mini-batch size, using our constant-distortion OSE (Theorem 1.4) for
initialization and preconditioning. We note that this is a different approach from the one
proposed by [22], who run preconditioned gradient descent on an O(d polylog(d)/ϵ) sketch,
resulting in an additional O(ϵ^{-1} log(1/ϵ) d^2 polylog(d)) cost.

Algorithm 2 Fast least squares via preconditioned mini-batch SGD
1: Compute SA and Sb with S from Theorem 1.4
2: Compute R such that SA = QR^{-1} for orthogonal Q
3: Compute leverage score estimates l̃i using Lemma 5.1
4: Compute x0 = argmin_x ∥SAx − Sb∥_2
5: for t = 0 to T − 1 do
6:   Construct a sub-sampling matrix St with k rows using distribution (p1, ..., pn), where pi = l̃i / Σ_i l̃i
7:   Compute gt ← 2(St A)^T(St A xt − St b)
8:   Compute xt+1 ← xt − ηt RR^T gt
9: end for
10: return xT
To establish the result, we show the following convergence guarantee for Algorithm 2. The
result is potentially of independent interest, so we state it here in a general form using any
mini-batch size k, and then later specify the right choice for our setting.
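Before stating the bound, here is a minimal, self-contained sketch of the loop in Algorithm 2. A Gaussian sketch stands in for the OSE of Theorem 1.4, exact row norms of AR stand in for the estimates of Lemma 5.1, and the step sizes follow the schedule used in Lemma 6.1 below with α treated as 1; all names are ours.

import numpy as np

def precond_minibatch_sgd(A, b, k, T, rng=None):
    rng = np.random.default_rng(rng)
    n, d = A.shape
    S = rng.standard_normal((4 * d, n)) / np.sqrt(4 * d)   # stand-in for the OSE
    Q, Rfac = np.linalg.qr(S @ A)                          # SA = Q Rfac, so R = Rfac^{-1}
    R = np.linalg.inv(Rfac)
    M = R @ R.T                                            # preconditioner R R^T
    lev = np.sum((A @ R) ** 2, axis=1)                     # proxies for the leverage scores
    p = lev / lev.sum()
    x = np.linalg.lstsq(S @ A, S @ b, rcond=None)[0]       # x_0 from the sketched problem
    alpha = 1.0
    beta = (k / 8) / (k + 4 * alpha * d)
    for t in range(T):
        idx = rng.choice(n, size=k, p=p)                   # leverage score sub-sampling S_t
        w = 1.0 / np.sqrt(k * p[idx])                      # rescaling so E[S_t^T S_t] = I
        At, bt = w[:, None] * A[idx], w * b[idx]
        g = 2 * At.T @ (At @ x - bt)                       # stochastic gradient g_t
        x = x - (beta / (1 + beta * t / 8)) * (M @ g)      # preconditioned step with eta_t
    return x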
Lemma 6.1. Given A ∈ R^{n×d} and b ∈ R^n, suppose that the sub-sampling probabilities satisfy
pi ≥ ℓi(A)/(αd) for all i. Then, there is an absolute constant c > 0 such that, letting
ηt := β/(1 + βt/8) for β = (k/8)/(k + 4αd), Algorithm 2 satisfies:
E[f(xt) − f(x∗)] ≤ f(x0)/(1 + kt/(cαd)) for all t ≥ 1.
Proof. Thanks to the subspace embedding property of S, we know that with high probability
the condition number of AR satisfies κ(AR)^2 ≤ 2. Consider the quadratic objective function
f(x) = ∥Ax − b∥_2^2. We start by showing a bound on the variance of the stochastic gradient
gt (which is an unbiased estimator of ∇f(xt)), as measured via E[∥M(gt − ∇f(xt))∥_Σ^2]. Here,
we use the notation ∥x∥_Σ = √(x^T Σ x), letting Σ = A^T A and M = RR^T.
First, we use the fact that R is a good preconditioner for A:
E[∥M(gt − ∇f(xt))∥_Σ^2] = E[∥Σ^{1/2} M Σ^{1/2} · Σ^{-1/2}(gt − ∇f(xt))∥_2^2]
≤ 4 E[∥Σ^{-1/2}(gt − ∇f(xt))∥_2^2].
Next, denoting rt = 2(Axt − b) and U = AΣ^{-1/2}, and using the approximation guarantee for the
leverage score estimates, we obtain:
E[∥Σ^{-1/2}(gt − ∇f(xt))∥_2^2] = E[∥U^T St^T St rt − U^T rt∥_2^2]
= (1/k) E[ ∥(1/pI) uI (aI^T xt − bI) − U^T rt∥_2^2 ]
≤ (1/k) E[ (1/pI^2) ∥uI∥_2^2 (aI^T xt − bI)^2 ]
≤ (1/k) E[ (αd/pI) (aI^T xt − bI)^2 ] = (αd/k) ∥Axt − b∥_2^2.
We next incorporate this bound on the stochastic gradient noise into the SGD analysis to
establish a one-step expected convergence bound. Below, we use that E[gt] = ∇f(xt) =
2A^T(Axt − b) = 2A^T A(xt − x∗):
E[∥xt+1 − x∗∥_Σ^2] = ∥xt − x∗∥_Σ^2 − ηt (xt − x∗)^T ΣM ∇f(xt) + (ηt^2/2) E[∥M gt∥_Σ^2]
= ∥xt − x∗∥_Σ^2 − ηt ∥xt − x∗∥_{ΣMΣ}^2 + (ηt^2/2) E[∥M gt∥_Σ^2]
≤ (1 − ηt/2) ∥xt − x∗∥_Σ^2 + (ηt^2/2) E[∥M gt∥_Σ^2].
Next, we decompose the last term as follows:
E[∥M gt∥_Σ^2] = ∥M ∇f(xt)∥_Σ^2 + E[∥M(gt − ∇f(xt))∥_Σ^2]
≤ ∥xt − x∗∥_{ΣMΣMΣ}^2 + (4αd/k) f(xt)
≤ 4∥xt − x∗∥_Σ^2 + (4αd/k) f(xt).
Putting this together, for ηt ≤ min{1/4, k/(8αd)} we get:
E[f(xt+1) − f(x∗)] ≤ (1 − ηt + 2ηt^2)(f(xt) − f(x∗)) + (ηt^2/2)(4αd/k) f(xt)
≤ (1 − ηt/2 + ηt^2 · 2αd/k)(f(xt) − f(x∗)) + ηt^2 (2αd/k) f(x∗)
≤ (1 − ηt/4)(f(xt) − f(x∗)) + ηt^2 (2αd/k) f(x∗).
Now, we proceed by induction. Let ηt = β/ut, where ut = 1 + βt/8 and β = (k/4)/(k + 4αd). Also, define
G = 2αd/k. Our inductive hypothesis is:
E[f(xt) − f(x∗)] ≤ f(x0)/ut = f(x0) / (1 + (k/32) t/(k + 4αd)).
The base case, t = 0, is obvious since u0 = 1. Now, suppose that the hypothesis holds for
some t. Then, using that 4βG ≤ 1/2, we have:
E[f(xt+1) − f(x∗)] ≤ (1 − β/(4ut)) E[f(xt) − f(x∗)] + (β^2/ut^2) G f(x∗)
≤ (1 − β/(4ut)) f(x0)/ut + (β^2/ut^2) G f(x0)
= (1 − (β/(4ut))(1 − 4βG)) f(x0)/ut
≤ (1 − β/(8ut)) f(x0)/ut = ((8ut − β)/(8ut)) · f(x0)/ut
≤ (8/(8ut + β)) f(x0) = f(x0)/(ut + β/8) = f(x0)/u_{t+1}.
This concludes the proof of Lemma 6.1. □
Now, to finalize the proof of Theorem 1.7 part 2, we set α = O(n^γ) and k = αd, so that
E[f(xt) − f(x∗)] ≤ f(x0)/t. Note that we initialized x0 so that f(x0) = O(1) · f(x∗). Thus,
the complexity of the mini-batch SGD (excluding the preprocessing) is O(kd/ϵ) = O(n^γ d^2/ϵ).
Now, including the O(γ^{-1} nnz(A) + d^ω) time for computing R and the leverage score estimates,
we obtain the overall complexity of:
O(γ^{-1} nnz(A) + d^ω + n^γ d^2/ϵ).
Here, again, we can assume without loss of generality that nnz(A) ≥ n and that d ≥ n^c for
c = 0.1, which gives the final claim.

References
1. Sarlos, T. Improved approximation algorithms for large matrices via random projections
in 2006 47th annual IEEE symposium on foundations of computer science (FOCS’06)
(2006), 143–152.
2. Rokhlin, V. & Tygert, M. A fast randomized algorithm for overdetermined linear least-
squares regression. Proceedings of the National Academy of Sciences 105, 13212–13217
(2008).
3. Clarkson, K. L. & Woodruff, D. P. Numerical linear algebra in the streaming model in
Proceedings of the forty-first annual ACM symposium on Theory of computing (2009),
205–214.
4. Meng, X. & Mahoney, M. W. Low-distortion subspace embeddings in input-sparsity time


and applications to robust linear regression in Proceedings of the forty-fifth annual ACM
symposium on Theory of computing (2013), 91–100.
5. Clarkson, K. L. & Woodruff, D. P. Low rank approximation and regression in input
sparsity time in Proceedings of the forty-fifth annual ACM symposium on Theory of
Computing (2013), 81–90.
6. Clarkson, K. L. et al. The fast Cauchy transform and faster robust linear regression.
SIAM Journal on Computing 45, 763–810 (2016).
7. Drineas, P., Magdon-Ismail, M., Mahoney, M. W. & Woodruff, D. P. Fast approximation
of matrix coherence and statistical leverage. The Journal of Machine Learning Research
13, 3475–3506 (2012).
8. Wang, R. & Woodruff, D. P. Tight Bounds for l1 Oblivious Subspace Embeddings. ACM
Transactions on Algorithms (TALG) 18, 1–32 (2022).
9. Halko, N., Martinsson, P.-G. & Tropp, J. A. Finding structure with randomness: Prob-
abilistic algorithms for constructing approximate matrix decompositions. SIAM review
53, 217–288 (2011).
10. Woodruff, D. P. Sketching as a tool for numerical linear algebra. Foundations and
Trends® in Theoretical Computer Science 10, 1–157 (2014).
11. Drineas, P. & Mahoney, M. W. RandNLA: randomized numerical linear algebra. Com-
munications of the ACM 59, 80–90 (2016).
12. Martinsson, P.-G. & Tropp, J. A. Randomized numerical linear algebra: Foundations
and algorithms. Acta Numerica 29, 403–572 (2020).
13. Derezinski, M. & Mahoney, M. W. Determinantal point processes in randomized numer-
ical linear algebra. Notices of the American Mathematical Society 68, 34–45 (2021).
14. Murray, R. et al. Randomized Numerical Linear Algebra – A Perspective on the Field
with an Eye to Software tech. rep. arXiv preprint arXiv:2302.11474 (2023).
15. Johnson, W. B. & Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert
space. Contemp. Math. 26, 189–206 (1984).
16. Rudelson, M. & Vershynin, R. Non-asymptotic theory of random matrices: extreme sin-
gular values in Proceedings of the International Congress of Mathematicians 2010 (ICM
2010) (In 4 Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II–IV: Invited
Lectures (2010), 1576–1602.
17. Nelson, J. & Nguyên, H. L. OSNAP: Faster numerical linear algebra algorithms via
sparser subspace embeddings in 2013 IEEE 54th annual symposium on foundations of
computer science (2013), 117–126.
18. Bourgain, J., Dirksen, S. & Nelson, J. Toward a unified theory of sparse dimensionality
reduction in euclidean space in Proceedings of the forty-seventh annual ACM symposium
on Theory of Computing (2015), 499–508.
19. Cohen, M. B. Nearly tight oblivious subspace embeddings by trace inequalities in Pro-
ceedings of the twenty-seventh annual ACM-SIAM symposium on Discrete algorithms
(2016), 278–287.
20. Nelson, J. & Nguyên, H. L. Lower bounds for oblivious subspace embeddings in Interna-
tional Colloquium on Automata, Languages, and Programming (2014), 883–894.
21. Chepurko, N., Clarkson, K. L., Kacham, P. & Woodruff, D. P. Near-optimal algorithms
for linear algebra in the current matrix multiplication time in Proceedings of the 2022
Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2022), 3043–3068.
22. Cherapanamjeri, Y., Silwal, S., Woodruff, D. P. & Zhou, S. Optimal algorithms for linear
algebra in the current matrix multiplication time in Proceedings of the 2023 Annual
ACM-SIAM Symposium on Discrete Algorithms (SODA) (2023), 4026–4049.
23. Drineas, P., Mahoney, M. W. & Muthukrishnan, S. Sampling algorithms for ℓ2 regression
and applications in Proceedings of the seventeenth annual ACM-SIAM symposium on
Discrete algorithm (2006), 1127–1136.
24. Ailon, N. & Chazelle, B. The fast Johnson–Lindenstrauss transform and approximate
nearest neighbors. SIAM Journal on computing 39, 302–322 (2009).
25. Tropp, J. A. Improved analysis of the subsampled randomized Hadamard transform.
Advances in Adaptive Data Analysis 3, 115–126 (2011).
26. Dereziński, M., Liao, Z., Dobriban, E. & Mahoney, M. Sparse sketches with small inver-
sion bias in Conference on Learning Theory (2021), 1467–1510.
27. Dereziński, M., Lacotte, J., Pilanci, M. & Mahoney, M. W. Newton-LESS: Sparsification
without Trade-offs for the Sketched Newton Update. Advances in Neural Information
Processing Systems 34, 2835–2847 (2021).
28. Dereziński, M. Algorithmic gaussianization through sketching: Converting data into sub-
gaussian random designs in The Thirty Sixth Annual Conference on Learning Theory
(2023), 3137–3172.
29. Brailovskaya, T. & van Handel, R. Universality and sharp matrix concentration inequal-
ities. arXiv preprint arXiv:2201.05142 (2022).
30. Yang, J., Chow, Y.-L., Ré, C. & Mahoney, M. W. Weighted SGD for lp regression with
randomized preconditioning. The Journal of Machine Learning Research 18, 7811–7853
(2017).
31. Cohen, M. B. & Peng, R. Lp row sampling by Lewis weights in Proceedings of the forty-
seventh annual ACM symposium on Theory of computing (2015), 183–192.
32. Chen, X. & Dereziński, M. Query complexity of least absolute deviation regression via
robust uniform convergence in Conference on Learning Theory (2021), 1144–1179.
33. Cohen, M. B., Musco, C. & Musco, C. Input sparsity time low-rank approximation via
ridge leverage score sampling in Proceedings of the Twenty-Eighth Annual ACM-SIAM
Symposium on Discrete Algorithms (2017), 1758–1777.
34. Li, Y. & Woodruff, D. Input-sparsity low rank approximation in Schatten norm in In-
ternational Conference on Machine Learning (2020), 6001–6009.
35. Cohen, M. B., Lee, Y. T. & Song, Z. Solving linear programs in the current matrix
multiplication time. Journal of the ACM (JACM) 68, 1–39 (2021).
36. Song, Z., Woodruff, D. P. & Zhong, P. Relative error tensor low rank approximation
in Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms
(2019), 2772–2789.
37. Jiang, H., Kathuria, T., Lee, Y. T., Padmanabhan, S. & Song, Z. A faster interior
point method for semidefinite programming in 2020 IEEE 61st annual symposium on
foundations of computer science (FOCS) (2020), 910–918.
38. Dereziński, M. & Rebrova, E. Sharp analysis of sketch-and-project methods via a con-
nection to randomized singular value decomposition. arXiv preprint arXiv:2208.09585
(2022).
39. Bandeira, A. S., Boedihardjo, M. T. & van Handel, R. Matrix concentration inequalities
and free probability. Inventiones mathematicae, 1–69 (2023).
40. Vadhan, S. P. et al. Pseudorandomness. Foundations and Trends® in Theoretical Com-
puter Science 7, 1–336 (2012).
41. Cover, T. M. Elements of information theory, Second Edition (John Wiley & Sons,
2006).
