
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 62, NO. 9, SEPTEMBER 2016

Unknown Sparsity in Compressed Sensing: Denoising and Inference

Miles E. Lopes, Member, IEEE

Abstract—The theory of compressed sensing (CS) asserts that an unknown signal x ∈ R^p can be accurately recovered from an underdetermined set of n linear measurements with n ≪ p, provided that x is sufficiently sparse. However, in applications, the degree of sparsity ‖x‖_0 is typically unknown, and the problem of directly estimating ‖x‖_0 has been a longstanding gap between theory and practice. A closely related issue is that ‖x‖_0 is a highly idealized measure of sparsity, and for real signals with entries not equal to 0, the value ‖x‖_0 = p is not a useful description of compressibility. In our previous conference paper [1] that examined these problems, we considered an alternative measure of soft sparsity, ‖x‖_1^2/‖x‖_2^2, and designed a procedure to estimate ‖x‖_1^2/‖x‖_2^2 that does not rely on sparsity assumptions. This paper offers a new deconvolution-based method for estimating unknown sparsity, which has wider applicability and sharper theoretical guarantees. In particular, we introduce a family of entropy-based sparsity measures s_q(x) = (‖x‖_q/‖x‖_1)^{q/(1−q)}, parameterized by q ∈ [0, ∞]. This family interpolates between ‖x‖_0 = s_0(x) and ‖x‖_1^2/‖x‖_2^2 = s_2(x) as q ranges over [0, 2]. For any q ∈ (0, 2] \ {1}, we propose an estimator ŝ_q(x) whose relative error converges at the dimension-free rate of 1/√n, even when p/n → ∞. Our main results also describe the limiting distribution of ŝ_q(x), as well as some connections to basis pursuit denoising, the Lasso, deterministic measurement matrices, and inference problems in CS.

Index Terms—Compressed sensing, deterministic measurement matrices, ℓ_1-regularization, Rényi entropy, unknown sparsity.

Manuscript received August 6, 2015; revised March 10, 2016; accepted June 13, 2016. Date of publication July 7, 2016; date of current version August 16, 2016. This work was supported in part by the NSF GRFP Fellowship under Grant DGE-1106400 and in part by the DOE CSGF Fellowship under Grant DE-FG02-97ER25308. The author is with the Department of Statistics, University of California at Davis, Davis, CA 95616 USA (e-mail: melopes@ucdavis.edu). Communicated by A. Montanari, Associate Editor for Statistical Learning. Digital Object Identifier 10.1109/TIT.2016.2587772

I. INTRODUCTION

In this paper, we consider the standard compressed sensing (CS) model, involving n measurements y = (y_1, ..., y_n), generated according to

    y = Ax + σε,   (1)

where x ∈ R^p is an unknown signal, A ∈ R^{n×p} is a measurement matrix specified by the user, σε ∈ R^n is a random noise vector, and n ≪ p. The central problem of CS is to recover the signal x using only the observations y and the matrix A. Over the course of the past decade, a large body of research has shown that this seemingly ill-posed problem can be solved reliably when x is sparse. Specifically, when the sparsity level of x is measured in terms of the ℓ_0 norm ‖x‖_0 := card{j : x_j ≠ 0}, it is well known that if n ≳ ‖x‖_0 log(p), then accurate recovery can be achieved with high probability when A is drawn from a suitable ensemble [2]–[5]. In this way, the parameter ‖x‖_0 is often treated as being known in much of the theoretical CS literature, despite the fact that it is usually unknown in practice. Because the sparsity parameter plays a fundamental role in CS, the issue of unknown sparsity has become recognized as a gap between theory and practice [6]–[9]. Accordingly, our overall focus is the problem of estimating the unknown sparsity level of x without relying on any sparsity assumptions. However, rather than concentrating solely on ‖x‖_0, we will consider a generalized family of sparsity measures, {s_q(x)}_{q∈[0,∞]}, which will be introduced in Section I-B.

A. Motivations and the Role of Sparsity

Given that many well-developed methods are available for estimating the full signal x, or its support set S := {j ∈ {1, ..., p} : x_j ≠ 0}, it might seem surprising that the problem of estimating ‖x‖_0 has remained unsettled. Indeed, given an estimate of x or S, it might seem natural to estimate ‖x‖_0 via a "plug-in rule", such as ‖x̂‖_0 or card(Ŝ). However, methods for computing x̂ and Ŝ often implicitly rely on prior knowledge of ‖x‖_0. Consequently, when using a plug-in rule to estimate ‖x‖_0, there is a danger of circular reasoning, and as a result, estimating ‖x‖_0 does not simply reduce to estimating x or S.

To give a concrete sense of the issue of unknown sparsity, the following list illustrates aspects of CS where sparsity assumptions play a role, and where an "assumption-free" estimate of ‖x‖_0 would be of use.

• The number of measurements. If the choice of n is too small compared to the "critical" number ‖x‖_0 log(p), then there are known information-theoretic barriers to the accurate reconstruction of x [10]. At the same time, if n is much larger than ‖x‖_0 log(p), then the measurement process is wasteful (since there are known algorithms that can reliably recover x with n ≲ ‖x‖_0 log(p) measurements [4]). For this reason, it is important not only to ensure that n ≳ ‖x‖_0 log(p), but also to keep n as small as possible. To deal with the selection of n, a sparsity estimate \widehat{‖x‖}_0 may be used in two different ways, depending on whether measurements are collected sequentially, or in a single batch.
In the sequential case, an estimate of ‖x‖_0 could be computed from a small set of "preliminary" measurements, and then \widehat{‖x‖}_0 could determine how many extra measurements should be collected to recover x. Alternatively, if measurements are collected in batch, the value \widehat{‖x‖}_0 could certify that enough measurements were actually taken.

• The measurement matrix. The properties of the measurement matrix A that lead to good recovery are often directly linked to ‖x‖_0. Two specific properties that have been intensively studied are the restricted isometry property of order k (RIP-k) [11], and the null-space property of order k (NSP-k) [12], [13], where k is a presumed upper bound on ‖x‖_0. Because recovery guarantees are closely tied to RIP-k and NSP-k, a growing body of work has analyzed "certificates" for whether or not a given matrix satisfies these properties [14]–[16]. When k is treated as given, this problem is already computationally difficult. Yet, when sparsity is unknown, such a "certificate" is less meaningful if we cannot check that k conforms to the true signal.

• Recovery algorithms. When recovery algorithms are implemented, the sparsity level of x is often treated as a tuning parameter. For example, if k is a conjectured bound on ‖x‖_0, then the Orthogonal Matching Pursuit (OMP) algorithm is typically initialized to run for k iterations [17]. A second example is the Lasso algorithm, which computes a solution x̂ ∈ argmin{‖y − Av‖_2^2 + λ‖v‖_1 : v ∈ R^p}, for some choice of λ ≥ 0 governing the sparsity level of x̂. In the case of either OMP or Lasso, an estimate \widehat{‖x‖}_0 could reduce computation by restricting the possible choices of λ or k, and it would also ensure that the solution's sparsity level conforms to the true signal. Section V gives further details on how our proposed methodology can adaptively tune the Lasso with certain guarantees.

B. A Numerically Stable Measure of Sparsity

Despite the important theoretical role of the parameter ‖x‖_0, it has a severe practical drawback of being sensitive to small entries of x. In particular, for real signals x ∈ R^p whose entries are not equal to 0, the value ‖x‖_0 = p is not a useful description of compressibility. In order to estimate sparsity in a way that accounts for the instability of ‖x‖_0, it is desirable to replace the ℓ_0 norm with a "soft" version. More precisely, we would like to identify a function of x that can be interpreted as counting the "effective number of coordinates of x", but remains stable under small perturbations. Below, we derive such a function by showing that ‖x‖_0 is a special case of a more general sparsity measure based on entropy.

1) A Link Between Sparsity and Entropy: Any non-zero vector x ∈ R^p induces a distribution π(x) ∈ R^p on the set of indices {1, ..., p}, assigning mass π_j(x) := |x_j|/‖x‖_1 at index j.¹ Under this correspondence, if x places most of its mass at a small number of coordinates, and J ∼ π(x) is a random index in {1, ..., p}, then J is likely to occupy a small set of effective states. This means that if x is sparse, then π(x) has low entropy. From the viewpoint of information theory, it is well known that the entropy of a distribution can be interpreted as the logarithm of the distribution's effective number of states. Likewise, it is natural to count effective coordinates of x by counting effective states of π(x) via entropy. To this end, we define the effective sparsity

    s_q(x) := exp(H_q(π(x))) if x ≠ 0, and s_q(x) := 0 if x = 0,   (2)

where H_q is the Rényi entropy of order q ∈ [0, ∞]. (Our terminology derives from the works [18], [19], where a particular case of s_q(x) is called effective sparsity.) When q ∉ {0, 1, ∞}, the Rényi entropy is given explicitly by

    H_q(π(x)) := (1/(1−q)) log( Σ_{i=1}^p π_i(x)^q ),

and the cases of q ∈ {0, 1, ∞} are defined by evaluating limits, with H_1 being the ordinary Shannon entropy. Combining the definitions of π(x) and H_q, we see that for x ≠ 0 and q ∉ {0, 1, ∞}, the effective sparsity may be written conveniently in terms of ℓ_q norms as

    s_q(x) = (‖x‖_q / ‖x‖_1)^{q/(1−q)},   (3)

where ‖x‖_q := (Σ_{j=1}^p |x_j|^q)^{1/q} for q > 0. As with H_q, the cases of q ∈ {0, 1, ∞} are evaluated as limits:

    s_0(x) = lim_{q→0} s_q(x) = ‖x‖_0,   (4)
    s_1(x) = lim_{q→1} s_q(x) = exp(H_1(π(x))),   (5)
    s_∞(x) = lim_{q→∞} s_q(x) = ‖x‖_1/‖x‖_∞.   (6)

¹Other normalizations are possible, e.g. π_j(x) = |x_j|^2/‖x‖_2^2. See also Section I-B3.
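To make the definition concrete, the following minimal Python sketch (our own illustration; the function name and the example vector are not from the paper) evaluates s_q(x) directly from (2)–(6), and previews the behavior discussed below: a vector with k dominant coordinates has s_2(x) ≈ k even though ‖x‖_0 = p.

```python
import numpy as np

def effective_sparsity(x, q):
    # s_q(x) from (2)-(6) for a non-zero vector x; q in [0, inf].
    x = np.abs(np.asarray(x, dtype=float))
    if q == 0:
        return float(np.count_nonzero(x))                # s_0(x) = ||x||_0
    if q == 1:
        pi = x / x.sum()                                 # pi_j(x) = |x_j| / ||x||_1
        pi = pi[pi > 0]
        return float(np.exp(-np.sum(pi * np.log(pi))))   # Shannon-entropy limit (5)
    if np.isinf(q):
        return x.sum() / x.max()                         # s_inf(x) = ||x||_1 / ||x||_inf
    return (np.sum(x ** q) ** (1 / q) / x.sum()) ** (q / (1 - q))

# Five large entries among p = 100: ||x||_0 = 100, but s_2(x) is close to 5.
x = np.r_[np.ones(5), 1e-3 * np.ones(95)]
print(effective_sparsity(x, 0), effective_sparsity(x, 2))  # 100.0 vs roughly 5.2
```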
2) Background on the Definition of s_q(x): To the best of our knowledge, the definition of effective sparsity (2) in terms of Rényi entropy is new in the context of CS. However, numerous special cases and related definitions have been considered elsewhere. For instance, in the early study of wavelet bases, Coifman and Wickerhauser proposed exp(−Σ_{i=1}^p (|x_i|^2/‖x‖_2^2) log(|x_i|^2/‖x‖_2^2)) as a measure of effective dimension [20]. (See also the papers [21]–[23].) The important difference between this quantity and s_q(x) is that the Rényi entropy leads to a convenient ratio of norms, which will play an essential role in our procedure for estimating s_q(x).

In recent years, there has been growing interest in ratios of norms as measures of sparsity, but such ratios have generally been introduced in an ad hoc manner, and there has not been a principled way to explain where they "come from". In this respect, our entropy-based definition of s_q(x) conceptually unifies these ratios. Examples of previously studied instances include ‖x‖_1^2/‖x‖_2^2, corresponding to q = 2 [1], [16], [18], [24], [25]; ‖x‖_1/‖x‖_∞, corresponding to q = ∞ [25]–[27]; as well as a ratio of the form (‖x‖_a/‖x‖_b)^{ab/(b−a)} with a, b > 0, which is implicit in the paper [28]. (To see how the latter quantity fits in the scope of our entropy-based definition, consider the normalization π_j(x) = |x_j|^t/‖x‖_t^t with t = b and q = a/b.)

Outside of CS, the use of Rényi entropy to count the effective number of states of a distribution is well established in the ecology literature.
If a distribution π on {1, ..., p} measures the abundance of p species in a community, then exp(H_q(π)) is a standard measure of the effective number of species. This number has been called the Hill index or diversity number [29], [30]. Thus, the conceptual link with CS is to interpret the signal x ∈ R^p as a distribution on the indices {1, ..., p}.

3) Properties of s_q(x): The following list summarizes some of the most important properties of s_q(x).

• (continuity). Unlike the ℓ_0 norm, the function s_q(·) is continuous on R^p \ {0} for all q > 0, and is hence stable under small perturbations of x.

• (range equal to [0, p]). For all x ∈ R^p and all q ∈ [0, ∞], the effective sparsity satisfies 0 ≤ s_q(x) ≤ p. This property follows from the fact that for any q, and any distribution π on {1, ..., p}, the Rényi entropy satisfies 0 ≤ H_q(π) ≤ log(p).

• (scale-invariance). The fact that ‖cx‖_0 = ‖x‖_0 for all scalars c ≠ 0 is familiar for the ℓ_0 norm, and this generalizes to s_q(x) for all q ∈ [0, ∞]. Scale-invariance encodes the idea that sparsity should be based on relative (rather than absolute) magnitudes of the entries of x.

• (monotonicity in q). For any x ∈ R^p, the function q ↦ s_q(x) is non-increasing on [0, ∞], and interpolates between s_∞(x) and s_0(x). That is, for any q′ ≥ q ≥ 0,

    ‖x‖_1/‖x‖_∞ = s_∞(x) ≤ s_{q′}(x) ≤ s_q(x) ≤ s_0(x) = ‖x‖_0.

The monotonicity follows from the fact that the Rényi entropy H_q is non-increasing in q.

a) The choice of q: The parameter q controls how much weight s_q(x) assigns to small coordinates. When q = 0, an arbitrarily small coordinate is still counted as being "effective". By contrast, when q = ∞, a coordinate is not counted as being effective unless its magnitude is close to ‖x‖_∞. The choice of q is also relevant to other considerations. For instance, we will show in Section II that q = 2 is important because recovery guarantees can be derived in terms of s_2(x) = ‖x‖_1^2/‖x‖_2^2. In addition, the choice of q affects the type of measurements we use to estimate s_q(x). In this respect, the case of q = 2 turns out to be attractive because some of our proposed measurements are Gaussian — and they can naturally be re-used for recovering x.

b) Minimization of s_q(x): Our work in this paper does not require the minimization of s_q(x), and neither do the applications we propose. Nevertheless, some readers may be curious about what can be done in this direction. It turns out that for certain values of q, or certain normalizations of π(x), the minimization of s_q(x) may be tractable (under suitable constraints). The recent paper [31] discusses minimization of s_2(x). Another example is the minimization of s_∞(x), which can be reduced to a sequence of linear programs [26], [27]. Lastly, a third example deals with the normalization π_j(x) = |x_j|^t/‖x‖_t^t with t = 2, which leads to exp(H_2(π(x))) = (‖x‖_2/‖x‖_4)^4. The paper [28] shows that minimization of ‖x‖_2/‖x‖_4 has interesting connections with sums-of-squares (SOS) optimization problems [32].

c) Numerically stable measures of rank: The framework of CS naturally extends to recovering an unknown low-rank matrix X ∈ R^{p_1×p_2} on the basis of a linear measurement model [33], [34]. In many ways, the quantity rank(X) is analogous to ‖x‖_0 in recovering a sparse vector x. If we let ς(X) denote the vector of singular values of X, the connection is suggested by the fact that rank(X) = ‖ς(X)‖_0. Likewise, we can consider

    r_q(X) := s_q(ς(X)) = (‖ς(X)‖_q/‖ς(X)‖_1)^{q/(1−q)} = (|||X|||_q/|||X|||_1)^{q/(1−q)}

as a measure of the effective rank of X, where the latter expression holds for q ∉ {0, 1, ∞}, and |||X|||_q := ‖ς(X)‖_q. Essentially all of the properties of s_q(·) described earlier carry over to r_q(·). We also note that quantities related to r_q(X), or special instances, have been considered elsewhere as measures of rank [19], [35]–[39].
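As a quick numerical illustration of r_q(·) (a sketch of our own; the function name and test matrix are hypothetical), the effective rank of a nearly low-rank matrix stays near the number of dominant singular values, while the exact rank jumps to full:

```python
import numpy as np

def effective_rank(X, q):
    # r_q(X) = s_q applied to the singular values (valid for q not in {0, 1, inf}).
    s = np.linalg.svd(X, compute_uv=False)
    return (np.sum(s ** q) ** (1 / q) / s.sum()) ** (q / (1 - q))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 50))  # rank 3
X += 0.01 * rng.standard_normal((50, 50))                        # small perturbation
print(np.linalg.matrix_rank(X), effective_rank(X, 2.0))          # 50 vs roughly 3
```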
d) Graphical interpretations: The fact that s_q(x) with q > 0 is a sensible measure of sparsity for non-idealized signals is illustrated in Figure 1. In essence, if x has k large coordinates and p − k small coordinates, then s_q(x) ≈ k, whereas ‖x‖_0 = p. In the left panel, the sorted coordinates of three vectors in R^100 are plotted, indicating that s_2(x) measures the "elbow" in the decay plot. This idea can be seen in a more geometric way in the right panel, which plots the sub-level sets S_c := {x ∈ R^p : s_2(x) ≤ c} with p = 2. When c ≈ 1, the vectors in S_c are closely aligned with the coordinate axes.

C. Contributions

1) The Family of Sparsity Measures {s_q(x)}_{q∈[0,∞]}: As mentioned in Section I-B2, our definition of s_q(x) in terms of Rényi entropy gives a conceptual foundation for several norm ratios that have appeared elsewhere in the sparsity literature. Furthermore, we clarify the meaning of s_2(x) in Section II by showing that it plays an intuitive role in the performance of the Basis Pursuit Denoising algorithm.

2) Estimation Results, Confidence Intervals, and Applications: Our central methodological contribution is a new deconvolution technique for estimating ‖x‖_q and s_q(x) from linear measurements. Conceptually, the technique is of interest in that it blends the tools of sketching with stable laws and deconvolution with characteristic functions. These tools are typically applied in different contexts, as discussed in Section I-D. Also, the computational cost of our method is small relative to recovering x by standard methods.

The most important features of our estimator ŝ_q(x) are that it does not rely on any sparsity assumptions, and that its relative error decays at the dimension-free rate of 1/√n (in probability). In particular, only O(1) measurements are needed to obtain a good estimate of s_q(x), even when p/n → ∞.

In addition to proving ratio-consistency, we derive a CLT for ŝ_q(x), which allows us to obtain confidence intervals for s_q(x) with asymptotically exact coverage probability. A notable feature of this CLT is that it is "uniform" with respect to the tuning parameter in our procedure for computing ŝ_q(x).
Fig. 1. Left panel: Three vectors (red, blue, black) in R^100 are plotted with entries in decreasing order. Right panel: Sub-level sets S_c with c = 1.1 in dark grey, and c = 1.9 in light grey.

The uniformity is important because it allows us to make an optimal data-dependent selection of the tuning parameter and still find the estimator's limiting distribution (see Theorem 2 and Corollary 1). In terms of applications, we show in Section V how this CLT can be used in inferential problems, e.g. testing the null hypothesis that s_q(x) is greater than a given level, and ensuring that the true signal lies in the constraint set of the (primal) Lasso or Elastic Net problems.

3) The Necessity of Randomized Measurements: The problem of constructing deterministic sensing matrices with good performance guarantees is one of the major unresolved theoretical issues in CS. (We refer to the book [5, Secs. 1.3 and 6.1], as well as [40] and [41].) Since our proposed method for estimating s_q(x) depends on randomized measurements, one may wonder if randomization is essential to estimating unknown sparsity. In Section VI, we show that randomization is essential from a worst-case point of view. Our main result in this direction (Theorem 4) shows that for any deterministic matrix A ∈ R^{n×p}, and any deterministic procedure for estimating s_q(x), there is always a signal for which the relative estimation error is at least of order 1 (even with noiseless measurements). This contrasts with our randomized method, whose relative error is O_P(1/√n) for any choice of x. Furthermore, the result has a negative implication for estimating unknown sparsity in linear regression, where the "design matrix" is often viewed as fixed.

D. Related Work

1) Extensions Beyond the Conference Paper [1]: Whereas our earlier work deals exclusively with ‖x‖_1^2/‖x‖_2^2, this work considers the family of parameters s_q(x). The procedure we propose for estimating s_q(x) has several improvements on the earlier approach: It tolerates noise with infinite variance, and it leads to confidence intervals with asymptotically exact coverage probability (whereas the previous approach led to conservative intervals). Also, the applications of our procedure to tuning recovery algorithms and testing the hypothesis of sparsity are new (Section V). Lastly, our theoretical results in Sections II and VI are sharpened versions of parallel results in the paper [1].

2) Sketching With Stable Laws: Our approach to estimating s_q(x) is based on the sub-problem of estimating ‖x‖_q, and we employ the technique of sketching with stable laws from the streaming computation literature [42]–[49]. Over the last few years, the exchange of ideas between sketching and CS has begun to accelerate [1], [50]–[52]. Nevertheless, to the best of our knowledge, the present paper and our earlier work [1] are the first to apply sketching ideas to unknown sparsity in CS.

3) Empirical Characteristic Functions: Given that stable laws have a simple formula for their characteristic function (ch.f.), but have no general likelihood formula, it is natural to use the empirical ch.f. Ψ̂_n(t) = (1/n) Σ_{i=1}^n exp(√−1 t y_i) as a foundation for estimating scale parameters. With attention to denoising, ch.f.'s are also attractive because they factor over convolution and allow for heavy-tailed noise. As such, the favorable properties of empirical ch.f.'s have been applied to the deconvolution of scale parameters by several authors [53]–[57]. Although our approach shares similarities with these works, our results are not directly comparable due to differences in model assumptions. Also, it seems that our work is the first to derive the limiting distribution of a scale estimate under a data-dependent choice of tuning parameter t.

4) Model Selection and Validation in CS: Some of the challenges described in Section I-A can be approached with the tools of cross-validation (CV) and empirical risk minimization (ERM). Such approaches have been used to select various parameters in CS, such as the number of measurements n [7], [9], the number of OMP iterations k [9], or the Lasso regularization parameter λ [8]. Although methods based on CV and ERM share common motivations with our work, these methods differ from our approach in several ways. In particular, the problem of estimating a soft measure of sparsity, such as s_q(x), has not been considered from that angle.
(Note that even if ‖x̂ − x‖_2 is small, ‖x̂‖_0 may not be close to ‖x‖_0.) Computationally, CV and ERM can also be costly, e.g. if many solutions x̂ must be computed with different tuning parameters. By contrast, our method is computationally inexpensive.

E. Outline

The paper is organized as follows. In Section II, we give a tight recovery guarantee for the Basis Pursuit Denoising algorithm in terms of s_2(x). Next, in Section III, we propose estimators for ‖x‖_q and s_q(x). In Section IV, we state consistency results and provide confidence intervals for ‖x‖_q and s_q(x). Applications to testing the hypothesis of sparsity and adaptive tuning of the Lasso are presented in Section V. In Section VI, we show that randomized measurements are essential for estimating s_q(x) in a minimax sense. In Section VII, we present simulations that confirm our theoretical results. Proofs are deferred to the appendices.

II. RECOVERY GUARANTEES IN TERMS OF s_2(x)

In this section, we state two propositions giving matching upper and lower bounds for the relative ℓ_2 reconstruction error of Basis Pursuit Denoising (BPDN) in terms of s_2(x). The main purpose here is to highlight the fact that s_2(x) and ‖x‖_0 play analogous roles with respect to the sample complexity of sparse recovery.

A. Setup for BPDN

In order to explain the connection between s_2(x) and recovery, we recall a fundamental result describing the BPDN algorithm, first obtained in the paper [2, Th. 2]. Our statement of the result includes some refinements from the later papers [58, Th. 3.3] and [36, Th. 5.65], and we use two basic assumptions from these works. (We also rely on the standard notions of sub-Gaussian distribution and sub-Gaussian norm, which are defined in [36].)

A1: There is a constant ε_0 such that all realizations of the noise vector ε ∈ R^n satisfy ‖ε‖_2 ≤ ε_0.

A2: The entries of A ∈ R^{n×p} are i.i.d. samples of a random variable (1/√n)ζ, where ζ is drawn from a fixed sub-Gaussian distribution G_0, with mean 0 and variance 1.

B. Remark

To clarify assumption A1, the condition ‖ε‖_2 = O(1) might seem unnatural from the viewpoint of the statistics literature on sparse regression. There, it is common to treat the entries of the design matrix A and the values ε_i as being of order 1, which makes ‖ε‖_2 of order √n. In CS, it is common to use an alternative scaling, where the entries of A and the values ε_i are of order 1/√n, which leads to A1.

When the noise distribution satisfies A1, the output of the BPDN algorithm [59] is a solution

    x̂ ∈ argmin{ ‖v‖_1 : ‖Av − y‖_2 ≤ σε_0, v ∈ R^p }.   (BPDN)

As a final piece of notation, for any T ∈ {1, ..., p}, we use x_{|T} to denote the best T-term approximation of x, which is computed by retaining the largest T entries of x in magnitude, and setting all others to 0.
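For readers who wish to experiment with (BPDN), the program can be solved with an off-the-shelf convex modeling tool. The sketch below uses CVXPY; the function name and the reliance on the default solver are our own choices, not part of the paper.

```python
import cvxpy as cp

def bpdn(A, y, noise_bound):
    # Basis Pursuit Denoising: minimize ||v||_1 subject to ||Av - y||_2 <= sigma*eps_0.
    v = cp.Variable(A.shape[1])
    problem = cp.Problem(cp.Minimize(cp.norm(v, 1)),
                         [cp.norm(A @ v - y, 2) <= noise_bound])
    problem.solve()
    return v.value
```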
Theorem 1 ([2], [36], [58]): Suppose the model (1) satisfies the conditions A1 and A2. Let x ∈ R^p \ {0} be arbitrary, and fix a number T ∈ {1, ..., p}. Then, there are absolute constants c_2, c_3 > 0, and numbers c_0, c_1 > 0 depending only on the sub-Gaussian norm of G_0, such that the following statement is true. If n ≥ c_0 T log(pe/T), then with probability at least 1 − 2 exp(−c_1 n), any solution x̂ to the problem (BPDN) satisfies

    ‖x̂ − x‖_2/‖x‖_2 ≤ c_2 σε_0/‖x‖_2 + c_3 (1/√T) ‖x − x_{|T}‖_1/‖x‖_2.   (7)

C. An Upper Bound in Terms of s_2(x)

A limitation of the previous bound is that the detailed relationship between T and the approximation error (1/√T)‖x − x_{|T}‖_1/‖x‖_2 is typically unknown. Consequently, it is not clear how large n should be chosen in order to ensure that ‖x − x̂‖_2/‖x‖_2 is small with high probability. The next proposition resolves this issue by modifying the bound (7) so that the relative ℓ_2-error is bounded by an explicit function of n and the estimable parameter s_2(x).

Proposition 1: Suppose the model (1) satisfies the conditions A1 and A2, and let x ∈ R^p \ {0} be arbitrary. Then, there is an absolute constant κ_2 > 0, and numbers κ_0, κ_1, κ_3 > 0 depending only on the sub-Gaussian norm of G_0, such that the following statement is true. If n and p satisfy κ_0 log(κ_0 pe/n) ≤ n ≤ p, then with probability at least 1 − 2 exp(−κ_1 n), any solution x̂ to the problem (BPDN) satisfies

    ‖x̂ − x‖_2/‖x‖_2 ≤ κ_2 σε_0/‖x‖_2 + κ_3 √( s_2(x) log(pe/n) / n ).   (8)

D. A Matching Lower Bound in Terms of s_2(x)

Our next result shows that the upper bound (8) for BPDN is sharp in the case of noiseless measurements. In fact, the matching lower bound is applicable beyond BPDN, and imposes a limit of performance on all algorithms that satisfy the mild condition of being homogeneous with degree-1 in the noiseless setting. To be specific, if a recovery algorithm is viewed as a map R : R^n → R^p that sends a vector of noiseless measurements Ax ∈ R^n to a solution x̂ = R(Ax) ∈ R^p, then R is said to be homogeneous with degree-1 if R(A(cx)) = c · R(Ax) for all c > 0. (It is simple to verify that BPDN is homogeneous with degree-1 in the noiseless case.) Apart from the basic condition of homogeneity with degree-1, our lower bound requires no other assumptions, and involves no randomness.

Proposition 2: There is an absolute constant c_0 > 0 for which the following statement is true. If n and p satisfy log(pe/n) ≤ n < p, then for any recovery algorithm R : R^n → R^p that is homogeneous with degree-1, and any A ∈ R^{n×p}, there is at least one point x̃ ∈ R^p \ {0} such that

    ‖x̂ − x̃‖_2/‖x̃‖_2 ≥ c_0 √( s_2(x̃) log(pe/n) / n ),   (9)

where x̂ = R(Ax̃).
III. ESTIMATION PROCEDURES FOR s_q(x) AND ‖x‖_q^q

Here we describe the model assumptions underlying our estimation procedure for s_q(x). (These are different from the assumptions in the previous section.) In scalar notation, we consider linear measurements given by

    y_i = ⟨a_i, x⟩ + σε_i,   i = 1, ..., n.   (M)

1) The Deconvolution Model: For the remainder of the paper, we assume x ≠ 0 unless stated otherwise. The noise variables ε_i are generated in an i.i.d. manner from a distribution F_0, and when the a_i are random, we assume the sets {a_1, ..., a_n} and {ε_1, ..., ε_n} are independent. The noise variables are assumed to be symmetric about 0, with median(|ε_1|) = 1 and 0 < E|ε_1| < ∞, but they may have infinite variance. (The assumption of symmetry is only for convenience, and in Section III-B2e we explain how it can be dropped.) A minor technical condition we place on F_0 is that the roots of its ch.f. ϕ_0 are isolated (i.e. have no limit points). This condition is satisfied by many families of distributions, such as Gaussian, Student's t, Laplace, uniform[a, b], and stable laws. In fact, a large portion of the deconvolution literature assumes ϕ_0 has no roots at all.² The noise scale parameter σ > 0 and the distribution F_0 are treated as being known, which is a common assumption in deconvolution problems. (Simulations in Section VII address the effects of estimating the noise distribution.)

2) Asymptotics: We allow the parameters of the deconvolution model to vary as (n, p) → ∞. This means there is an implicit index ξ = 1, 2, ..., such that n = n(ξ) and p = p(ξ), and both diverge as ξ → ∞. It will turn out that our asymptotic results do not depend on the ratio p/n, and so we allow p to be arbitrarily large with respect to n. We also allow x = x(ξ), σ = σ(ξ), and a_i = a_i(ξ), but the distribution F_0 is fixed with respect to ξ. Going forward, we mostly suppress the index ξ.

A. Sketching With Stable Laws in the Presence of Noise

For any q ∈ (0, 2], the sketching technique offers a way to estimate ‖x‖_q from randomized linear measurements. Building on this technique, we estimate s_q(x) = (‖x‖_q/‖x‖_1)^{q/(1−q)} by estimating ‖x‖_q^q and ‖x‖_1 from separate measurement vectors a_i. The core idea is to generate a_i ∈ R^p using stable laws [61].

Definition 1: A random variable V has a symmetric q-stable distribution if there are constants γ > 0 and q ∈ (0, 2] such that its ch.f. has the form E[exp(√−1 tV)] = exp(−|γt|^q) for all t ∈ R. We denote the distribution by V ∼ stable_q(γ), and γ is referred to as the scale parameter.

The most well-known examples of stable laws are the cases of q = 2 (Gaussian) and q = 1 (Cauchy). To fix notation, if a vector a_1 = (a_{11}, ..., a_{1p}) ∈ R^p has i.i.d. entries drawn from stable_q(γ), we write a_1 ∼ stable_q(γ)^⊗p. Also, since our work involves different choices of q, we will write γ_q instead of γ. The link with ℓ_q norms hinges on the following basic property of stable laws.

Lemma 1: Let x ∈ R^p be fixed, and suppose a_1 ∼ stable_q(γ_q)^⊗p with parameters q ∈ (0, 2] and γ_q > 0. Then, the random variable ⟨a_1, x⟩ is distributed according to stable_q(γ_q ‖x‖_q).

Using this fact, if we generate a set of i.i.d. measurement vectors a_1, ..., a_n from the distribution stable_q(γ_q)^⊗p and let ỹ_i = ⟨a_i, x⟩, then ỹ_1, ..., ỹ_n is an i.i.d. sample from stable_q(γ_q ‖x‖_q). Hence, in the special case of noiseless linear measurements, the task of estimating ‖x‖_q^q is equivalent to a well-studied univariate problem: estimating the scale parameter of a stable law from an i.i.d. sample.

Our work below substantially extends this idea in the noisy case. We also emphasize that the extra deconvolution step is an important aspect of our method that distinguishes it from existing work in the sketching literature — where noise is typically not considered. Indeed, in the streaming computation literature, the model giving rise to ⟨a_1, x⟩ is quite different from the one in CS. Roughly speaking, the vector x is viewed as a massive data stream whose entries can be observed sequentially, but cannot be stored entirely in memory.

Remark: The parameter γ_q governs the "energy level" of the a_i. For instance, in the case of Gaussian measurements with a_i ∼ stable_2(γ_2)^⊗p, we have E‖a_i‖_2^2 = 2pγ_2^2. In general, as the energy level is increased, the effect of noise is diminished. Hence, we view γ_q as a physical aspect of the measurement system that is known to the user. Asymptotically, we allow γ_q = γ_q(ξ) to vary as (n, p) → ∞.

²It is known that a subset S ⊂ R is the zero set of a ch.f. if and only if S is symmetric, closed, and excludes 0 [60].
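Lemma 1 is easy to verify numerically. The sketch below (our own; all names are hypothetical) uses q = 1, where ⟨a_1, x⟩ is Cauchy with scale γ_1‖x‖_1; since the median of the absolute value of a Cauchy variable equals its scale parameter, the median of |⟨a_i, x⟩| should concentrate near γ_1‖x‖_1.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, gamma_1 = 1000, 20000, 1.0
x = rng.standard_normal(p)
A = gamma_1 * rng.standard_cauchy((n, p))  # entries ~ stable_1(gamma_1)
sketches = A @ x                           # each entry ~ stable_1(gamma_1 * ||x||_1)
print(np.median(np.abs(sketches)), gamma_1 * np.sum(np.abs(x)))  # close agreement
```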
B. The Sub-Problem of Estimating ‖x‖_q^q

Our procedure for estimating s_q(x) uses two separate sets of measurements of the form (M) to compute estimators \widehat{‖x‖}_1 and \widehat{‖x‖}_q^q. (The measurements can also be collected in a single batch by concatenating the a_i into a single matrix.) The respective sizes of the two measurement sets will be denoted by n_1 and n_q. To unify the discussion, we will describe just one procedure to compute \widehat{‖x‖}_q^q for any q ∈ (0, 2], since the ℓ_1 norm is a special case. The two estimators are combined to obtain an estimator of s_q(x), defined by

    ŝ_q(x) := ( \widehat{‖x‖}_q^q )^{1/(1−q)} / ( \widehat{‖x‖}_1 )^{q/(1−q)},   (10)

which makes sense for any q ∈ (0, 2] except q = 1. The parameters ‖x‖_0 and s_1(x) can still be estimated in practice by using ŝ_q(x) for some value of q that is close to 0 or 1. (See also Section IV-C.)

a) An estimating equation based on characteristic functions: If we draw i.i.d. measurement vectors a_i ∼ stable_q(γ_q)^⊗p, with i = 1, ..., n_q, then the ch.f. of y_i is given by

    Ψ(t) := E[exp(√−1 t y_i)] = exp(−γ_q^q |t|^q ‖x‖_q^q) · ϕ_0(σt),   (11)

where t ∈ R, and we recall that ϕ_0 is the ch.f. of ε_i. Note that ϕ_0 is real-valued when ε_i is a symmetric random variable.
Using the measurements y_1, ..., y_{n_q}, we can approximate Ψ(t) with the empirical ch.f.

    Ψ̂_{n_q}(t) := (1/n_q) Σ_{i=1}^{n_q} exp(√−1 t y_i).   (12)

Next, by solving for ‖x‖_q^q in the approximate equation

    Ψ̂_{n_q}(t) ≈ exp(−γ_q^q |t|^q ‖x‖_q^q) · ϕ_0(σt),   (13)

we obtain an estimator \widehat{‖x‖}_q^q. To make the dependence on t explicit, we write ν̂_q(t) = \widehat{‖x‖}_q^q. Carrying out the arithmetic in the previous line leads us to define

    ν̂_q(t) := (−1/(γ_q^q |t|^q)) Log₊( Re Ψ̂_{n_q}(t) / ϕ_0(σt) )   (14)

when t ≠ 0 and ϕ_0(σt) ≠ 0. The symbol Re(z) denotes the real part of a complex number z. Also, we define Log₊(r) := log(|r|) for any real number r ≠ 0, and Log₊(0) := −1 (arbitrarily). When t = 0 or ϕ_0(σt) = 0, we arbitrarily define ν̂_q(t) = 1. (We only mention these details so that ν̂_q(t) is defined for all t ∈ R, and such details are irrelevant asymptotically.)
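In code, (12) and (14) amount to a few lines. The sketch below is our own transcription: it assumes the noise ch.f. ϕ_0 is available as a callable (real-valued, since the noise is symmetric), and follows the Log₊ conventions just described.

```python
import numpy as np

def nu_hat(t, y, q, gamma_q, sigma, phi0):
    # Estimator (14) of ||x||_q^q from sketched measurements y_1, ..., y_n.
    if t == 0 or phi0(sigma * t) == 0:
        return 1.0                                  # arbitrary value on the excluded set
    emp = np.mean(np.exp(1j * t * np.asarray(y)))   # empirical ch.f. (12) at t
    r = emp.real / phi0(sigma * t)
    log_plus = np.log(abs(r)) if r != 0 else -1.0   # Log_+ conventions
    return -log_plus / (gamma_q ** q * abs(t) ** q)
```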
b) Optimal selection of the tuning parameter: A crucial aspect of the estimator ν̂_q(t) is the choice of t ∈ R, which plays the role of a tuning parameter. This choice turns out to be somewhat delicate, especially in situations where ‖x‖_q → ∞ as (n, p) → ∞. To see why this matters, consider the equation

    ν̂_q(t)/‖x‖_q^q = (−1/(γ_q |t| ‖x‖_q)^q) Log₊( Re Ψ̂_{n_q}(t) / ϕ_0(σt) ).   (15)

If we happen to be in a situation where ‖x‖_q → ∞ as (n_q, p) → ∞ while the parameters γ_q, σ, and t remain of order 1, then γ_q|t|‖x‖_q → ∞ and the empirical characteristic function Ψ̂_{n_q} will tend to 0 (due to line (13)). The scaled estimate ν̂_q(t)/‖x‖_q^q may then become unstable, as it can tend to a limit of the form ∞/∞ in line (15). Consequently, it is desirable to choose t adaptively so that as (n_q, p) → ∞,

    γ_q t ‖x‖_q → c_0   (16)

for some finite constant c_0 ≠ 0, ensuring that ν̂_q(t)/‖x‖_q^q remains asymptotically stable. When this scaling can be achieved, the next step is to further refine the choice of t to minimize the limiting variance of ν̂_q(t). Our proposed method will solve both of these problems.

Of course, the ability to choose t adaptively requires some knowledge of ‖x‖_q, which is precisely the quantity we are trying to estimate. As soon as we select a data-dependent value, say t̂, we introduce a significant technical challenge: Inferences based on the adaptive estimator ν̂_q(t̂) must take the randomness of t̂ into account. Our approach is to prove a uniform CLT for the random function ν̂_q(·). As will be shown in the next result, the uniformity allows us to determine the limiting law of ν̂_q(t̂) as if the optimal choice of t were known in advance of observing any data. To make the notion of optimality precise, we first describe the limiting law of ν̂_q(t̂) for any data-dependent value t̂ that satisfies the scaling condition (16) (in probability). A method for constructing such a value t̂ will be given in the next subsection. Consistency results for the procedure are given in Section IV.

To state the uniform CLT, we need to introduce the noise-to-signal ratio

    ρ_q = ρ_q(ξ) := σ/(γ_q ‖x‖_q).   (17)

Although we allow ρ_q to vary with (n_q, p), we assume that it stabilizes to a finite limiting value:

A3: For each q ∈ (0, 2], there is a constant ρ̄_q ∈ [0, ∞) such that ρ_q = ρ̄_q + o(n_q^{−1/2}) as (n_q, p) → ∞.

This assumption merely encodes the idea that the signal is not overwhelmed by noise asymptotically. In stating the following result, we use →w to denote weak convergence (and →P denotes convergence in probability).

Theorem 2 (Uniform CLT for the ℓ_q Norm Estimator): Let q ∈ (0, 2]. Assume that the measurement model (M) and assumption A3 hold. Let t̂ be any function of y_1, ..., y_{n_q} that satisfies

    γ_q t̂ ‖x‖_q →P c_0   (18)

as (n_q, p) → ∞ for some finite constant c_0 ≠ 0 with ϕ_0(ρ̄_q c_0) ≠ 0. Then, the estimator ν̂_q(t̂) satisfies

    √n_q ( ν̂_q(t̂)/‖x‖_q^q − 1 ) →w N(0, v_q(c_0, ρ̄_q))   (19)

as (n_q, p) → ∞, where the limiting variance v_q(c_0, ρ̄_q) is strictly positive and defined according to the formula

    v_q(c_0, ρ̄_q) := (1/|c_0|^{2q}) [ exp(2|c_0|^q)/(2 ϕ_0(ρ̄_q|c_0|)^2) + (ϕ_0(2ρ̄_q|c_0|)/(2 ϕ_0(ρ̄_q|c_0|)^2)) exp((2 − 2^q)|c_0|^q) − 1 ].

Remarks: This result is proved in Appendix B. An important property of the CLT is that it is dimension-free and leaves the growth of p/n unrestricted. In essence, this occurs because when a_i ∼ stable_q(γ_q)^⊗p, the distribution of ⟨a_i, x⟩ depends on x only through the norm ‖x‖_q, and the dimension of x becomes irrelevant.

Now that the limiting distribution of ν̂_q(t̂) is available, we focus on constructing an estimate t̂ so that c_0 minimizes the variance function v_q(·, ρ̄_q). Since the formula for v_q(c_0, ρ̄_q) is ill-defined for certain c_0, the following subsection extends the domain of v_q(·, ρ̄_q). This extension will not affect the set of minimizers, and can be skipped on a first reading.

c) Extending the variance function: Based on the previous theorem, our aim is to construct t̂ so that as (n_q, p) → ∞,

    γ_q t̂ ‖x‖_q →P c_q(ρ̄_q),   (20)

where c_q(ρ̄_q) is a minimizer of v_q(·, ρ̄_q). Since ϕ_0 is symmetric about 0, it follows that v_q(·, ρ̄_q) is as well, and we may restrict our attention to c_q(ρ̄_q) ≥ 0.

An inconvenient aspect of minimizing v_q(·, ρ) is that its domain depends on ρ — since the formula for v_q is ill-defined where ϕ_0(ρc_0) = 0. Because we will be interested in minimizing v_q(·, ρ̂_q) for some estimate ρ̂_q of ρ̄_q, this leads to analyzing the minimizer of a random function whose domain is also random. To alleviate this complication, we will define an extension of v_q(·, ·) whose domain does not depend on the second argument. Specifically, whenever q ∈ (0, 2), Lemma 2 below shows that an extension ṽ_q of v_q can be found with the properties that ṽ_q(·, ·) is continuous on [0, ∞) × [0, ∞), and ṽ_q(·, ρ̄_q) has the same minimizers as v_q(·, ρ̄_q).
When q = 2, one additional detail must be handled. In this case, it may happen for certain noise distributions that v_2(c, ρ̄_2) tends to a minimum as c → 0⁺, i.e. lim_{c→0⁺} v_2(c, ρ̄_2) = inf_{c>0} v_2(c, ρ̄_2). This creates a technical nuisance in using Theorem 2, because v_2(c_0, ρ̄_2) is not defined for c_0 = 0. There are various ways of handling this "edge case", but for simplicity, we construct t̂ so that γ_2 t̂ ‖x‖_2 →P ε_2 for some (arbitrarily) small constant ε_2 > 0. For this reason, in the particular case of q = 2, we will restrict the domain of the extended function ṽ_2(·, ·) to be [ε_2, ∞) × [0, ∞). The following lemma summarizes the properties of the extended variance function that will be needed later on.

Lemma 2 (Extended Variance Function): Suppose that ϕ_0 satisfies the assumptions of the model (M), and let v_q be the variance function defined in Theorem 2. For each q ∈ (0, 2), put ε_q := 0, and let ε_2 > 0. For all values q ∈ (0, 2], define the function ṽ_q : [ε_q, ∞) × [0, ∞) → [0, ∞] according to

    ṽ_q(c, ρ) := v_q(c, ρ) if c ≠ 0 and ϕ_0(ρc) ≠ 0, and ṽ_q(c, ρ) := ∞ otherwise.   (21)

Then, the function ṽ_q(·, ·) is continuous on [ε_q, ∞) × [0, ∞), and for any ρ ≥ 0, the function ṽ_q(·, ρ) attains its minimum in the set [ε_q, ∞).

(i) Remark: A simple consequence of the definition of ṽ_q(·, ·) is that any choice of c ∈ [ε_q, ∞) that minimizes ṽ_q(·, ρ̄_q) also minimizes v_q(·, ρ̄_q). Hence there is nothing lost in working with ṽ_q.

(ii) Minimizing the extended variance function: We are now in a position to specify the limiting value c_q(ρ̄_q) from line (20). For any ρ ≥ 0, we define

    c_q(ρ) ∈ argmin_{c ≥ ε_q} ṽ_q(c, ρ),   (22)

where q ∈ (0, 2] and ε_q is as defined in Lemma 2. Note that ṽ_q is a known function, and so the value c_q(ρ) can be computed for any given ρ. However, since the limiting noise-to-signal ratio ρ̄_q is unknown, it will be necessary to use c_q(ρ̂_q) for some estimate ρ̂_q of ρ̄_q, which is described below.

d) A procedure for optimal selection of t: The first step of our procedure for selecting t is to compute a simple "pilot" value t̂_pilot. Next, we refine the pilot value to obtain an optimal value t̂_opt that will satisfy

    t̂_opt γ_q ‖x‖_q →P c_q(ρ̄_q).   (23)

To describe the pilot value, let η_0 > 0 be any number such that ϕ_0(η) > 1/2 for all η ∈ [0, η_0] (such a number exists for any characteristic function). Also, define the median absolute deviation statistic m̂_q := median(|y_1|, ..., |y_{n_q}|), which is a coarse-grained, yet robust, estimate of γ_q ‖x‖_q. Then, the pilot value is defined as t̂_pilot := min{1/m̂_q, η_0/σ}, which is finite with probability 1, even if σ = 0. The role of t̂_pilot is to provide a ratio-consistent estimator of ‖x‖_q^q through the statistic ν̂_q(t̂_pilot) — but the limiting variance may not be optimal. With ν̂_q(t̂_pilot) in hand, we can then obtain a consistent estimator of ρ̄_q,

    ρ̂_q := σ / ( γ_q |ν̂_q(t̂_pilot)|^{1/q} ).   (24)

Next, we use ρ̂_q to estimate c_q(ρ̄_q) via c_q(ρ̂_q). Finally, we obtain an optimal choice of t using

    t̂_opt := c_q(ρ̂_q) / ( γ_q |ν̂_q(t̂_pilot)|^{1/q} ).   (25)

In the exceptional case when ν̂_q(t̂_pilot) = 0, the estimates ρ̂_q and t̂_opt can be arbitrarily set to 1, but this choice does not matter asymptotically. Consistency results for ν̂_q(t̂_pilot), ρ̂_q, and t̂_opt are given in Propositions 3 and 4 of Section IV.
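The pilot-then-optimal selection of t can be sketched as follows, reusing nu_hat from above. The variance formula is the one in Theorem 2; as simplifications of our own, the argmin defining c_q(ρ̂_q) is taken over a fixed grid (with an arbitrary upper endpoint), the grid's lower endpoint stands in for ε_q, and the defaults η_0 = ε_q = 0.3 follow the simulation settings reported in Section VII.

```python
import numpy as np

def v_q(c, rho, q, phi0):
    # Limiting variance from Theorem 2 (assumes phi0 > 0 where it is evaluated).
    ph1, ph2 = phi0(rho * abs(c)), phi0(2 * rho * abs(c))
    return (np.exp(2 * abs(c) ** q) / (2 * ph1 ** 2)
            + ph2 / (2 * ph1 ** 2) * np.exp((2 - 2 ** q) * abs(c) ** q)
            - 1.0) / abs(c) ** (2 * q)

def t_opt(y, q, gamma_q, sigma, phi0, eta0=0.3, eps_q=0.3):
    m = np.median(np.abs(y))                      # robust scale statistic m_q
    t_pilot = 1.0 / m if sigma == 0 else min(1.0 / m, eta0 / sigma)
    nu_pilot = nu_hat(t_pilot, y, q, gamma_q, sigma, phi0)
    rho = sigma / (gamma_q * abs(nu_pilot) ** (1.0 / q))       # eq. (24)
    grid = np.linspace(eps_q, 5.0, 500)           # crude surrogate for the argmin (22)
    c = grid[np.argmin([v_q(ci, rho, q, phi0) for ci in grid])]
    return c / (gamma_q * abs(nu_pilot) ** (1.0 / q))          # eq. (25)
```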
e) Handling asymmetric noise distributions: The proposed method can be adjusted in a simple way to allow for asymmetric noise distributions — and the theory for the symmetric case carries over. To explain the adjustment, suppose that the deconvolution model y_i = ⟨a_i, x⟩ + σε_i holds as before, except that the noise variables ε_i may be asymmetric. Also, suppose that the measurement vectors a_i are still drawn i.i.d. from stable_q(γ_q)^⊗p for i = 1, ..., n_q.

After y_1, ..., y_{n_q} are observed, they can be "symmetrized" via multiplication with an independent sequence of i.i.d. Rademacher variables, say r_i = ±1 with equal probability. Define the symmetrized measurements as y̆_i := r_i y_i. Due to our choice of the a_i, each variable ⟨a_i, x⟩ is symmetric, and it follows that

    y̆_i =_d ⟨a_i, x⟩ + σ r_i ε_i,   (26)

where =_d denotes equality in distribution. (This can be checked by computing the ch.f. of each side.) Hence, by symmetrizing, we obtain "new" observations y̆_i such that the new noise variable r_i ε_i is symmetric. Likewise, the original method can be applied to the y̆_i — with the only change being that ϕ_0 must be replaced everywhere by the ch.f. of the new noise variable r_i ε_i, i.e. ϕ̆_0(t) = (1/2)ϕ_0(t) + (1/2)ϕ_0(−t) for all t ∈ R. Similarly, our theoretical results for the symmetric case can be applied with ϕ̆_0 replacing ϕ_0. Simulations demonstrating the efficacy of the symmetrization step are given in Section VII (including the case where ϕ̆_0 is estimated).
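In code, the symmetrization step is a one-line sign flip together with a wrapped ch.f. (a sketch with our own names):

```python
import numpy as np

def symmetrize(y, phi0, rng=None):
    # Rademacher signs give y-breve = r*y; the noise ch.f. becomes (phi0(t)+phi0(-t))/2.
    rng = np.random.default_rng() if rng is None else rng
    r = rng.choice([-1.0, 1.0], size=len(y))
    return r * np.asarray(y), (lambda t: 0.5 * (phi0(t) + phi0(-t)))
```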
IV. MAIN RESULTS FOR ESTIMATORS AND CONFIDENCE INTERVALS

In this section, we first show in Proposition 3 that t̂_pilot leads to consistent estimates of ‖x‖_q^q and ρ̄_q. Next, we show that the optimal constant c_q(ρ̄_q) and optimal variance v_q(c_q(ρ̄_q), ρ̄_q) are consistently estimated. These estimators then lead to adaptive confidence intervals for ‖x‖_q^q and s_q(x) in Theorem 3 and Corollary 1.

A. Consistency Results

Proposition 3 (Consistency of the Pilot Estimator): Let q ∈ (0, 2] and assume that the measurement model (M) and A3 hold. Then as (n_q, p) → ∞,

    ν̂_q(t̂_pilot)/‖x‖_q^q →P 1 and ρ̂_q →P ρ̄_q.   (27)

1) Remarks: We now turn our attention from the pilot value t̂_pilot to the optimal value t̂_opt. In order to construct t̂_opt, our method relies on the M-estimator c_q(ρ̂_q) ∈ argmin_{c≥ε_q} ṽ_q(c, ρ̂_q). Consistency proofs for M-estimators typically require that the objective function has a unique optimizer, and our situation is no exception. Note that we do not need to assume that a minimizer exists, since this is guaranteed by Lemma 2.

A4: For each q ∈ (0, 2], the function v_q(·, ρ̄_q) has at most one minimizer in [ε_q, ∞), where ε_q is defined in Lemma 2.

In an approximate sense, this assumption can be checked empirically by simply plotting the function ṽ_q(·, ρ̂_q). It is also possible to verify analytically that it holds in the case of stable_q noise, but we omit these details. Based on inspection of plots, the assumption appears to hold for a variety of other parametric noise distributions, such as Laplace, uniform[−1, 1], and the t distribution.

Proposition 4 (Consistency of c_q(ρ̂_q)): Let q ∈ (0, 2] and assume that the measurement model (M) holds, as well as assumptions A3 and A4. Then, as (n_q, p) → ∞,

    c_q(ρ̂_q) →P c_q(ρ̄_q) and t̂_opt γ_q ‖x‖_q →P c_q(ρ̄_q).   (28)

Furthermore, ṽ_q(c_q(ρ̂_q), ρ̂_q) →P ṽ_q(c_q(ρ̄_q), ρ̄_q).

2) Remark: This result is proved in Appendix C.

B. Confidence Intervals for ‖x‖_q^q and s_q(x)

In this subsection, we assemble the work in Proposition 4 with Theorem 2 to obtain confidence intervals for ‖x‖_q^q and s_q(x). In the next two results, we will use z_{1−α} to denote the 1−α quantile of the standard normal distribution, i.e. Φ(z_{1−α}) = 1 − α, where Φ is the standard normal distribution function. To allow our result to be applied to both one-sided and two-sided intervals, we state our result in terms of two possibly distinct quantiles z_{1−α} and z_{1−α′}.

Theorem 3 (Confidence Interval for ‖x‖_q^q): Let q ∈ (0, 2], and define the estimated variance

    ω̂_q := ṽ_q(c_q(ρ̂_q), ρ̂_q).   (29)

Assume that the measurement model (M) holds, as well as assumptions A3 and A4. Then as (n_q, p) → ∞,

    (√n_q / √ω̂_q) ( ν̂_q(t̂_opt)/‖x‖_q^q − 1 ) →w N(0, 1),   (30)

and consequently for any fixed α, α′ ∈ [0, 1/2], the probability

    P( (1 − √ω̂_q z_{1−α}/√n_q) ν̂_q(t̂_opt) ≤ ‖x‖_q^q ≤ (1 + √ω̂_q z_{1−α′}/√n_q) ν̂_q(t̂_opt) )

converges to 1 − α − α′.

1) Remarks: This result follows by combining Theorem 2 with Proposition 4. We note that if the limit (30) is used directly to obtain a confidence interval for ‖x‖_q^q, the resulting formulas are somewhat cumbersome. Instead, the simpler confidence interval given above is obtained using a CLT for the reciprocal ‖x‖_q^q/ν̂_q(t̂_opt), via the delta method.

As a corollary of Theorem 3, we obtain a CLT and a confidence interval for ŝ_q(x) by combining the estimators ν̂_q and ν̂_1 with their respective t̂_opt values. Since the estimators rely on measurement sets of sizes n_q and n_1, we make the following simple scaling assumption:

A5: For each q ∈ (0, 2] \ {1}, there is a constant π̄_q ∈ (0, 1) such that as (n_1, n_q, p) → ∞,

    n_q/(n_1 + n_q) = π̄_q + o(n_q^{−1/2}).   (31)

Corollary 1 (Confidence Interval for s_q(x)): Assume q ∈ (0, 2] \ {1}, and that the conditions of Theorem 3 hold, as well as assumption A5. Assume ŝ_q(x) is constructed from independent sets of measurement vectors drawn from stable_1(γ_1)^⊗p and stable_q(γ_q)^⊗p, of sizes n_1 and n_q. Letting ω̂_q be as in Theorem 3, define

    π_q := n_q/(n_1 + n_q) and ϑ̂_q := (ω̂_q/π_q)(1/(1−q))^2 + (ω̂_1/(1−π_q))(q/(1−q))^2.

Then as (n_1, n_q, p) → ∞,

    √((n_1 + n_q)/ϑ̂_q) ( ŝ_q(x)/s_q(x) − 1 ) →w N(0, 1),   (32)

and consequently for any fixed α, α′ ∈ [0, 1/2], the probability

    P( (1 − √ϑ̂_q z_{1−α}/√(n_1+n_q)) ŝ_q(x) ≤ s_q(x) ≤ (1 + √ϑ̂_q z_{1−α′}/√(n_1+n_q)) ŝ_q(x) )

converges to 1 − α − α′.

2) Remark: As in Theorem 3, we obtain the simpler formula for the confidence interval above by using a CLT for the reciprocal s_q(x)/ŝ_q(x).
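Assembled in code, Corollary 1 gives the following interval (a sketch of our own; it splits α evenly between the two tails and assumes the inputs ν̂_q, ν̂_1, ω̂_q, ω̂_1 have been computed as in Section III):

```python
import numpy as np
from scipy.stats import norm

def sq_interval(nu_q, nu_1, omega_q, omega_1, n_q, n_1, q, alpha=0.05):
    # Point estimate (10) and the two-sided interval from Corollary 1.
    s_hat = nu_q ** (1.0 / (1 - q)) / nu_1 ** (q / (1 - q))
    pi_q = n_q / (n_1 + n_q)
    theta = omega_q / pi_q * (1 / (1 - q)) ** 2 + omega_1 / (1 - pi_q) * (q / (1 - q)) ** 2
    half = norm.ppf(1 - alpha / 2) * np.sqrt(theta / (n_1 + n_q))
    return s_hat, ((1 - half) * s_hat, (1 + half) * s_hat)
```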
C. Estimating ‖x‖_0 With ŝ_q(x) and Small q

The following proposition quantifies the approximation ŝ_q(x) ≈ ‖x‖_0 when q is close to 0. To state the proposition, define the dynamic range of a non-zero signal x ∈ R^p as

    DNR(x) := ‖x‖_∞/|x|_min,   (33)

where |x|_min is the magnitude of the smallest non-zero entry of x, i.e. |x|_min = min{|x_j| : x_j ≠ 0, j = 1, ..., p}. The result involves no randomness and is applicable to any estimator (say s̃_q(x)).

Proposition 5: Let q ∈ (0, 1), x ∈ R^p \ {0}, and let s̃_q(x) be any real number. Then,

    | s̃_q(x)/‖x‖_0 − 1 | ≤ | s̃_q(x)/s_q(x) − 1 | + (q/(1−q)) ( log(DNR(x)) + q log(‖x‖_0) ).

Remarks: To interpret the proposition, note that when s̃_q(x) is chosen to be the proposed estimator ŝ_q(x), the first term |ŝ_q(x)/s_q(x) − 1| is already controlled by Corollary 1. The second term in the bound is the "approximation error", which improves for smaller choices of q. A favorable property of the bound is that it behaves better in the specialized case where estimating ‖x‖_0 is of interest, i.e. when the non-zero coordinates of x have roughly similar sizes. In this case, the quantity log(DNR(x)) will not be too large. On the other hand, if log(DNR(x)) is large, then x has non-zero coordinates of very different sizes, and ‖x‖_0 may not be the best measure of sparsity to estimate.
V. APPLICATIONS

In this section, we give some illustrative applications of our results for \widehat{‖x‖}_q^q and ŝ_q(x). First, we describe how the assumption of sparsity may be checked in a hypothesis testing framework. Second, we consider the problems of tuning the Lasso and Elastic Net algorithms (in primal form).

A. Testing the Hypothesis of Sparsity

A null hypothesis is often viewed as a "straw man" that we would like to reject in favor of a "more interesting" alternative. Hence, to verify the assumption of sparsity, it is natural for the null hypothesis to correspond to a non-sparse signal. If 1 < κ ≤ p is a given reference value of sparsity, then we consider the testing problem

    H_0: s_q(x) ≥ κ versus H_1: 1 ≤ s_q(x) < κ.   (34)

To construct a test statistic, we use the well-known duality between confidence intervals and hypothesis tests. Consider a one-sided confidence interval for s_q(x) of the form (−∞, û_α], with coverage probability P(s_q(x) ≤ û_α) = 1 − α + o(1). Clearly, if H_0 holds, then this one-sided interval must also contain κ with probability at least 1 − α + o(1). Said differently, under H_0, the chance that (−∞, û_α] fails to contain κ is at most α + o(1). Likewise, one may consider the test statistic T := 1{û_α < κ}, and reject H_0 if and only if T = 1, which gives an asymptotically valid level-α testing procedure. Now, under the conditions of Corollary 1, if we choose

    û_α := (1 + √ϑ̂_q z_{1−α}/√(n_1+n_q)) ŝ_q(x),

then the reasoning just given ensures that the false alarm rate satisfies

    P_{H_0}(T = 1) ≤ α + o(1)

as (n_1, n_q, p) → ∞. It is also possible to derive the asymptotic power function of the test statistic. Let ϑ̂_q be as defined in Corollary 1, and note that this variable converges in probability to a positive constant, say ϑ_q (by Proposition 4). Then, under the conditions of Corollary 1, as (n_1, n_q, p) → ∞, the asymptotic power satisfies

    P_{H_1}(T = 1) = Φ( √((n_1+n_q)/ϑ_q) (κ/s_q(x) − 1) − z_{1−α} ) + o(1).   (35)

The details of obtaining this limit are based on standard arguments, and hence omitted. Note that as s_q(x) approaches the reference value κ (i.e. the detection boundary), the power approaches that of random guessing, Φ(0 − z_{1−α}) = α, as we would expect.
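With the confidence machinery in place, the test reduces to a few lines (our own function names; the inputs are the estimate ŝ_q(x) and the variance proxy ϑ̂_q from Corollary 1):

```python
import numpy as np
from scipy.stats import norm

def test_sparsity(s_hat, theta_hat, n_total, kappa, alpha=0.05):
    # T = 1{u_alpha < kappa}: reject H0: s_q(x) >= kappa when the upper limit falls below kappa.
    u_alpha = (1 + np.sqrt(theta_hat / n_total) * norm.ppf(1 - alpha)) * s_hat
    return int(u_alpha < kappa)
```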
B. Tuning the Lasso and Elastic Net in Primal Form

In primal form, the Lasso algorithm [62] can be expressed as

    minimize_{v∈R^p} ‖y − Av‖_2^2 subject to v ∈ B_1(r),   (36)

where B_q(r) := {v ∈ R^p : ‖v‖_q ≤ r} is the ℓ_q ball of radius r ≥ 0, and r is viewed as a regularization parameter. (Note that the matrix A here may be different from the one used to estimate s_q(x).) If x denotes the true signal, then B_1(‖x‖_1) is the smallest constraint set for which the true signal is feasible. Hence, one would expect r = ‖x‖_1 to be an ideal choice of the tuning parameter, and the recent paper [63] shows that this intuition is correct in a precise sense.

When using a data-dependent tuning parameter r̂, it is of interest to have some guarantee that x lies in B_1(r̂). Our one-sided confidence interval for ‖x‖_1 precisely solves this problem. Under the assumptions of Theorem 3, if we choose α = 0 and α′ ∈ [0, 1/2], then the choice

    r̂ := (1 + √ω̂_1 z_{1−α′}/√n_1) · ν̂_1(t̂_opt)   (37)

gives P(x ∈ B_1(r̂)) → 1 − α′ as (n_1, p) → ∞. In fact, this idea can be extended further by adding extra ℓ_q norm constraints. A natural example is a primal form of the well-known Elastic Net algorithm [64], which constrains both the ℓ_1 and ℓ_2 norms, leading to

    minimize_{v∈R^p} ‖y − Av‖_2^2 subject to v ∈ B_1(r) ∩ B_2(ℓ)   (38)

for some parameters r, ℓ ≥ 0. Again, under the assumptions of Theorem 3, if we define

    ℓ̂ := ( (1 + √ω̂_2 z_{1−α′}/√n_2) · ν̂_2(t̂_opt) )^{1/2},   (39)

and if independent measurement sets of sizes n_1 and n_2 are used to construct r̂ and ℓ̂, then we obtain the limit P(x ∈ B_1(r̂) ∩ B_2(ℓ̂)) → (1 − α′)^2 as (n_1, n_2, p) → ∞. The same reasoning applies to other ℓ_q norms.
if the statistician draws A randomly, then nature has less
B. Tuning the Lasso and Elastic Net in Primal Form knowledge to choose a “bad” signal.
In primal form, the Lasso algorithm [62] can be expressed In the case of random measurements, Corollary 1 implies
as that our particular estimation rule ŝq (x) can achieve a rel-

ative error |ŝq (x)/sq (x) − 1| = O P (1/ n 1 + n q ) for any
minimize y − Av22
v∈R p non-zero x. Our aim is now to show that for any set of
subject to v ∈ B1 (r ), (36) noiseless deterministic measurements, all estimation rules

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
LOPES: UNKNOWN SPARSITY IN CS: DENOISING AND INFERENCE 5155

δ : Rn → R have a worst-case relative error |δ(Ax)/sq (x)−1| (n 1 , n 2 ) = (50, 50), (100, 100), . . . , (500, 500). For a fixed

that is much larger than 1/ n 1 + n q . Specifically, when q ∈ pair (n 1 , n 2 ), we generated 500 independent pairs of vectors
[0, 2], we give a lower bound that is of order (1 − np )2 . More y (1) ∈ Rn1 and y (2) ∈ Rn2 according to
informally, this means that there is always a choice of x that
y (1) = A(1) x + σ  (1) and y (2) = A(2) x + σ  (2) , (40)
can defeat a deterministic procedure, whereas the estimator
ŝq (x) based on randomized measurements is likely to succeed where the entries of A(1) ∈ Rn1 × p are i.i.d. stable1 (1)
for any x. In stating the following result, we note that it (Cauchy) variables, and the entries of A(2) ∈ Rn2 × p are i.i.d.
involves no randomness — since the measurements y = Ax stable2 (1) (Gaussian) variables. In all three plots, the noise
are noiseless and obtained from a deterministic matrix A. vectors  (1) and  (2) were generated with i.i.d. entries from a
Furthermore, the bounds are non-asymptotic. centered t-distribution on 2 degrees of freedom (which has
Theorem 4: Suppose n < p, and q ∈ [0, 2]. Then, the infinite variance). Applying our estimation method to each
minimax relative error for estimating sq (x) from noiseless pair (y (1), y (2) ) produced 500 realizations of ŝ2 (x) for each
deterministic measurements y = Ax satisfies (n 1 , n 2 ). We then averaged the quantity | ŝs22 (x)
(x) − 1| over the
   2
inf inf
 Ax)
sup  δ( − 1

 ≥ 2πe
1
1 − np − 21p . 500 realizations as an approximation of E| ŝs22 (x)
(x) − 1|.
δ:Rn →R sq (x)
A∈Rn× p x∈R p \{0} The simulation just described was conducted under several
choices of the parameters p, s2 (x), and ρ2 — with each
Alternatively, if q ∈ (2, ∞], then
  parameter corresponding to a separate plot in Figure 2. In all
 Ax)  √1
inf inf sup  δ(s (x) −1 ≥ 2πe √
1−(n/ p)
− 1. three plots, the signal x ∈ R p was chosen to have decaying
1+4 log(2 p) 2 p
A∈R δ:R →R x∈R p \{0} entries x i = c · i −τ (x), with c chosen so that x2 = 1. The
n× p n q

Remarks dimension p was set to 104 in all cases, except in the top plot,
where p = 10, 102 , 103 , 104 . The decay exponent τ (x) was
The proof of this result is based on the classical technique
set to 1 in all cases, except in the middle plot where τ (x) =
of a two-point prior. In essence, the idea is that for any
2, 1, 1/2, 1/10 in order to give a variety of sparsity levels.
choice of A, it is possible to find two signals x̃ and x ◦ that
With regard to the generation of measurements, the energy
are indistinguishable with respect to A, i.e. A x̃ = Ax ◦ , and
level was set to γ1 = γ2 = 1 in all cases, which gives ρ2 =
yet have very different sparsity levels, sq (x ◦ )  sq (x̃). Due
σ/γ2 x2 = σ . In turn, we set ρ2 = σ = 1/4 in all cases,
to the relation A x̃ = Ax ◦ , the statistician has no way of
except in the bottom plot where ρ2 = σ = 0, 0.1, 0.3, 0.5.
knowing whether x̃ or x ◦ has been selected by nature, and
Lastly, our algorithm for computing ŝ2 (x) involves a choice
if nature chooses x ◦ and x̃ with equal probability, then it
of two input parameters η0 and ε2 . In all cases, we set η0 = 0.3
is impossible for the statistician to improve upon the trivial
and ε2 = 0.3. The performance of ŝ2 (x) did not change much
estimator 21 sq (x̃) + 12 sq (x ◦ ) (which does not even make use
under different choices.
of the data). Furthermore, since sq (x ◦ )  sq (x̃), it follows
2) Comments: There are three high-level conclusions to
that the trivial estimator has a large relative error – implying
take away from Figure 2. The first is that the black theoretical
that the minimax relative error is also large. (A proof is given
curves agree well with the colored empirical curves. Second,
in Appendix D.)
the average relative error E| ŝs22 (x)
(x) − 1| has no observable
dependence on p or s2 (x) (when ρ2 is held fixed), as expected
VII. S IMULATIONS from Corollary 1. Third, the dependence on the noise-to-signal
In this section, we describe four sets of simulations that ratio is mild. (Note that the theoretical curves do predict that
validate our theoretical results, and demonstrate the effective- the relative error for ρ2 = 0.5 will be somewhat larger than
ness of our methodology. Section VII-A studies the expected for ρ2 = 0.)
relative error E| ŝs22 (x)
(x) − 1| and how it is affected by the
In all three plots, the theoretical curves were com-
dimension p, the sparsity level s2 (x), the noise-to-signal puted in the following
√ way. From Corollary 1 we have
ratio ρ2 . Section VII-B considers the case where the noise | ŝs22 (x) − 1| ≈ √ ϑ2 |Z |, where Z is a standard Gaussian
(x) 1n +n
2
distribution is unknown and asymmetric. (The effects of esti- random variable, and ϑ2 is the limit of ϑ̂2 that
mating ϕ0 and σ , as well as the effect of the symmetrization √ results from
Corollary 1 and Proposition 4. Since E|Z | = 2/π, the the-
step (cf. Section III-B.2e), are measured under several values 2
ϑ2
of ρ2 .) Section VII-C considers the estimation of x0 using oretical curves are simply √n π+n , as a function of (n 1 + n 2 ).
1 2
ŝq (x) for a small value of q. (This simulation also treats Note that ϑ2 depends only on ρ2 , which is why there is only
the noise distribution as unknown.) Lastly, Section VII-D one theoretical curve in the top two plots.
examines the quality of the normal approximation for ŝ2 (x)
given in Corollary 1. B. Robustness to Unknown and Asymmetric
Noise Distribution
A. Error Dependence on Dimension, Sparsity, and Noise 1) Design of Simulations: To study how the estimator ŝ2 (x)
1) Design of Simulations: When estimating s2 (x), our pro- performs when the noise distribution is unknown and asym-
posed method requires a set of n 1 Cauchy measurements, metric, we considered a modified version of the experiment
and a set of n 2 Gaussian measurements. In each of the in Section VII-A, where ρ2 was ramped up. (All parameter
three plots in Figure 2, we considered a sequence of pairs settings were the same as in the bottom plot of Figure 2,

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
5156 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 62, NO. 9, SEPTEMBER 2016

Fig. 3. Top panel: When the noise distribution is unknown, and when the
parameters ϕ0 and σ are estimated, the performance of ŝ2 (x) is gradually
affected by an increase in ρ2 . The performance is similar to that shown in
Figure 2 where the noise distribution is treated as known. Bottom panel: The
plot shows that the parameters x0 and DNR(x) have a small effect on the
averaged relative error of ŝ0.05 (x).

be distributed as t2,1 /median(|t2,1 |), where t2,1 is a non-


central t variable with 2 degrees of freedom and non-centrality
parameter equal to 1. (The rescaling merely standardizes the
noise so that median(|1 |) = 1.) For each choice of n 1 and n 2
and each value ρ2 = σ = 0, 0.1, 0.3, 0.5, we generated 1000
instances of y (1), y (2), and (e1 , . . . , e100 ), leading to 1000
instances of ŝ2 (x) in each setting (in order to approximate
ŝq (x)
E| x − 1|).
ŝ (x) 0
Fig. 2. The three plots show that E| s2 (x) − 1| has no observable dependence
2 In order to estimate σ and ϕ0 , we set σ̂ to be
on p or s2 (x). Meanwhile, the dependence on the noise-to-signal ratio ρ2 is
mild.
the sample median √of |e1 |, . . . , |e100 |, and we set
100
ϕ̂0 (t) := 1001
i=1 exp( −1tei /σ̂ ). Next, to handle
except where stated below.) As in Section VII-A, we generated the fact that F0 is asymmetric, we applied the symmetrization
y (1) and y (2) according to line (40). In addition, we generated procedure from Section III-B2e to the observations
an i.i.d. sample of noise variables e1 , . . . , e100 from the law y (1) and y (2) , which involved using the symmetrized
of σ 1 . To make the noise asymmetric, we chose 1 to version of ϕ̂0 (t), namely 12 ϕ̂0 (t) + 21 ϕ̂0 (−t).

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
LOPES: UNKNOWN SPARSITY IN CS: DENOISING AND INFERENCE 5157


(n 1 +n 2 )  ŝ2 (x) 
Fig. 4. The solid blue curve is the standard normal density in all four plots. The histograms were generated from 3000 realizations of
ϑ̂q s2 (x) − 1
under four conditions — depending on the number of measurements (n 1 + n 2 ) and the tails of the noise distribution. The dashed red curve is the default R
kernel density estimate.

4
2) Comments: The top panel of Figure 3 shows that as signals in R10 of the form x = c(1, 12 , 13 , . . . , x
1
0
, 0, . . . , 0),
the noise-to-signal ratio ρ2 is increases from 0 to 0.5, the where x0 = 10, 10 , 10 , 10 , and c was chosen so that
2 3 4
averaged relative error E| ŝs22 (x)
(x) − 1| increases in a gradual x2 = 1.
manner. Comparing this plot with the bottom plot in Figure 2, This experiment also involved an unknown and asymmetric
we see that when the noise distribution is estimated with only noise distribution — and hence included the effects of esti-
100 samples, the performance of ŝ2 (x) is not much different mating ϕ0 and σ . The noise level was set to σ = 14 , and the
than when the noise distribution is known. noise distribution was the same as in Section VII-B, as was
the procedure for estimating ϕ and σ . Lastly, we note that if
C. Simulations for Estimating x0 With ŝq (x) and Small q q is chosen much smaller than 0.05, then floating point errors
1) Design of Simulations: We now consider the estima- can arise when generating random variables from stableq (1),
tion of x0 using ŝq (x) with q = 0.05. This experiment since the tails become very heavy. (This has also been reported
followed essentially the same design as in Section VII-B, in the sketching literature [44].)
but with two differences. (All parameter settings were the
same, except where stated below.) First, instead of sampling 2) Comments: Figure 3 shows that ŝ0.05 (x) accurately esti-
A(2) ∈ Rn2 × p with i.i.d. rows from the Gaussian distribution mates x0 over a wide range of the parameters x0 and
stable2 (γ2 )⊗ p with γ2 = 1, we generated matrices A(q) with DNR(x), and that these parameters have a small effect on
(x)
i.i.d. rows drawn from stableq (γq )⊗ p with γq = 1. Second, E| ŝ0.05
x0 − 1|, which is expected based on Proposition 5.
instead of varying ρ2 , we varied x0 and the dynamic range In addition, the estimator performs well in spite of the fact
DNR (x) (cf. Proposition 5). Specifically, we considered four that the noise distribution is unknown and asymmetric.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
5158 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 62, NO. 9, SEPTEMBER 2016

D. Normal Approximation Letting c2 and c3 be as in inequality (7), and dividing through


1) Design of Simulations: Under four different conditions, by x2 leads to
we generated 3000 instances of the measurement scheme given √

x−x̂2 σ 0 c3 κ0 x21
in line (40), giving rise to 3000 samples of the standardized x2 ≤ c2 x2 +

x2
log( pe
n ),
  n 2
statistic (n1 +n2 ) ŝs22 (x)
(x) − 1 . The four conditions correspond
ϑ̂2
to setting (n 1 , n 2 ) = (500, 500), (1000, 1000), or choosing the and the proof is complete with κ2 := c2 and

entries of the noise vectors  (1) ,  (2) to be i.i.d. t-distributed κ3 := c3 κ0 . 
(centered), with either 2 or 20 degrees of freedom. (Note that
the noise with 2 degrees of freedom has infinite variance, while B. Proposition 2
the noise with 20 degrees of freedom has finite variance.) The
signal x ∈ R p was constructed as x = c(1, 12 , 13 , . . . , 1p ), with Proof: Let B1 (1) ⊂ R p be the 1 -ball of radius 1, and
c chosen so that x2 = 1 in all cases. Also, in all four let R : Rn → R p be a homogeneous with degree-1 recovery
cases, we set p = 104 , γ1 = γ2 = 1, σ = 0.1, as well as algorithm. Also define the number η := 12 c1 log( pe/n)
where
n
η0 = ε2 = 0.3.
c1 > 0 is an absolute constant to be defined below. Clearly,
2) Comments: Figure 4 illustrates that the CLT
(n 1 +n 2 ) ŝ2 (x) w for any such η, we can find a point x̃ ∈ B1 (1) satisfying
( s2 (x) − 1) −−→ N(0, 1) is able to tolerate two
ϑ̂2
challenging issues: the first being that the tuning parameter x̃ − R(A x̃)2 ≥ sup x − R(Ax)2 − η. (44)
tˆopt is selected in a data-dependent way, and the second being x∈B1 (1)
that the noise distribution is heavy-tailed. When comparing Furthermore, we may choose x̃ to satisfy x̃1 = 1. (Note that
the four plots, increasing the sample size (n 1 + n 2 ), and if x̃1 < 1, then we can use the homogeneity of R to replace
reducing the noise variance both lead to improvement of the x̃ with the 1 unit vector x̃/x̃1 ∈ B1 (1) and obtain an even
normal approximation. larger value on the left side.) The next step is to further lower
A PPENDIX A bound the right side in terms of minimax error, leading to
P ROOFS FOR S ECTION II
x̃ − R(A x̃)2 ≥ inf inf sup x − R(Ax)2 − η,
A. Proposition 1 A∈Rn× p R: Rn →R p x∈B1 (1)

Proof: Let c0 > 0 be as in Theorem 1, and let κ0 ≥ 1 be where the infima are over all sensing matrices A and all recov-
any number such that ery algorithms R (possibly non-homogenous). It is known
log(κ0 ) + ≤ from the theory of Gelfand widths that when log( pe
n ) ≤ n < p,
c0 .
2 4 1
κ0 κ0 (41)
n/κ0
there is an absolute constant c1 > 0, such that the minimax
Next, define the positive number t := log( pe , and choose 2 error over B1 (1) is lower-bounded by
n )
T = t in Theorem 1. (Note that when n ≤ p, this choice
of T is clearly at most n, and hence lies in {1, . . . , p}.) inf inf sup x − R(Ax)2 ≥ c1 log( pe/n)
.
n
We will verify that n ≥ c0 T log( pe A∈Rn× p R:Rn →R p x∈B1 (1)
T ) for every n satisfying
our assumption κ0 log(κ0 pe
n ) ≤ n ≤ p, which will then allow
(See [65, Proposition 1.2], as well as [66] and [67].) Using
us to apply Theorem 1.
our choice of η, as well as the fact that x̃1 = 1, we obtain
To be begin, observe that
T log( pe pe x̃21 ·log( pe/n)
T ) ≤ (t + 1) log( t ) x̃ − R(A x̃)2 ≥ 12 c1 n . (45)
 n/κ0  pe pe
= log( pe + 1 · log κ0 n · log( n )
) Dividing both sides by x̃2 completes the proof. 
n
 n/κ0   
≤ log( pe
+ 1 · log (κ0 pe
n )
2
n ) A PPENDIX B
2n/κ0  pe    P ROOFS FOR S ECTION III
= log( pe
log(κ0 ) + log + 2 log κ0 pe
n )
n n
  A. Theorem 2 – Uniform CLT for q Norm Estimator
≤ 2n
κ0 log(κ0 ) + 2n
κ0 + 2 log κ0 pe
n . (42)
The proof of Theorem 2 consists of two parts. First, we
Applying our assumption κ0 log(κ0 pe
n ) ≤ n gives prove a uniform CLT for a re-scaled version of  ˆ n (t), which
is given below in Lemma 3. Second, we extend this limit to the
T log( pe
T )≤
2
κ0 log(κ0 ) + 4
κ0 n.
statistic ν̂q (t) by way of the functional delta method, which
Hence, the choice of κ0 in line (41) ensures n ≥ c0 T log( pe
T ),
is described at the end of the section. The positivity of the
as desired. variance formula v q (c0 , ρ̄q ) follows from the lower bound (65)
To finish the proof, let κ1 := c1 be as in Theorem 1 so that given in the proof of Lemma 2. (See the remark following the
the bound (7) holds with probability at least 1 − 2 exp(−κ1 n). proof of Lemma 2.)
Next, observe that 1) Remark on the Subscript of n q : For ease of notation, we
√ will generally drop the subscript from n q in the remainder of
κ
√1 x − x |T 1 ≤ √1 x1
t
= √0
n
x21 log( pe
n ). (43) the appendices, as it will not cause confusion.
T

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
LOPES: UNKNOWN SPARSITY IN CS: DENOISING AND INFERENCE 5159

2) Weak Convergence of the Empirical ch.f.: To introduce To see this, let Dn (c) := |χn (c) − χ̃n (c)| for c ∈ [−b, b], and
a few pieces of notation, let c ∈ [−b, b] for some fixed b > 0, observe that
and define the re-scaled empirical ch.f., √
√ Dn (c) = n| ψ̂n (c) − ψ̃n (c)| 
n cyi
−1 γq x  √ √

ψ̂n (c) := 1
n i=1 e
q , (46) = √1n  ni=1 e −1c(Si +ρq i ) − e −1c(Si +ρ̄q i ) 
 √  √ 

yi
obtained from the re-scaled observations γ x q
. The relation = √1n  ni=1 e −1c(Si +ρq i ) 1 − e −1c(ρ̄q −ρq )i 
ˆ ˆ
between n and ψ̂n is given by n (t) = ψ̂n (γq txq ). The   √ 
≤ √1n ni=1 1 − e −1c(ρ̄q −ρq )i 
re-scaled population ch.f. is 
≤ √1n ni=1 |c(ρ̄n − ρq )i |
ψn (c) := exp(−|c|q )ϕ0 (ρq c), (47) √ 
≤ n|ρq − ρ̄q | · nb ni=1 |i |,
which converges to the function
where the last bound does not depend on c, and tends to 0
ψ(c) := exp(−|c| )ϕ0 (ρ̄q c), q
(48) with probability 1. Here we are using the assumption √ that
E|1 | < ∞ and assumption A3 that ρq = ρ̄q + o(1/ n). Now
as ρq → ρ̄q . Lastly, define the normalized process that line (53) has been verified, it remains (by the functional
version of Slutsky’s Lemma [68, p. 32]) to prove

χn (c) := n ψ̂n (c) − ψ(c) , (49) w
χ̃n (·) −−→ χ∞ (·) in C ([−b, b]; C), (54)
and let C ([−b, b]; C) be the space of continuous complex-
valued functions on [−b, b] equipped with the sup-norm. and that the limiting process χ∞ has the stated variance
Lemma 3: Fix any b > 0. Under the assumptions of formula. We first show that this limit holds, and then derive
Theorem 2, the random function χn satisfies the limit the variance formula at the end of the proof. (Note that it is
clear that the limiting process must be Gaussian due to the
w finite-dimensional CLT.)
χn (·) −→ χ∞ (·) in C ([−b, b]; C), (50)
By a result of Marcus [69, Th. 1], it is known that the
where χ∞ is a centered Gaussian process whose marginals uniform CLT for empirical ch.f.’s (54) holds as long as the
satisfy Re(χ∞ (c)) ∼ N(0, wq (c, ρ̄q )), and limiting process χ∞ has continuous sample paths (almost
surely).3 To show that the sample paths of χ∞ are continuous,
wq (c, ρ̄q ) := 1
2 + 1
2 exp(−2q |c|q )ϕ0 (2ρ̄q c) we employ as sufficient condition derived by Csörgő [71].
− exp(−2|c|q )ϕ02 (ρ̄q c). Let Fq denote the distribution function of the random variable
Proof: It is important to notice that ψ̂n is not the empirical yi◦ = Si + ρ̄q i described earlier. Also let δ > 0 and define
ch.f. associated with n samples from the distribution of ψ the function

(since ρq = ρ̄q ). Instead of working with χn , the more natural log(|u|) · log(log(|u|))2+δ if |u| ≥ exp(1)
+
process to consider is gδ (u) :=
0 if |u| < exp(1).
√  
χ̃n (·) := n ψ̃n (·) − ψ(·) , (51)
At line 1.17 of the paper
 ∞ [71], it is argued that if there is
where we define some δ > 0 such that −∞ gδ+ (|u|)d Fq (u) < ∞, then χ∞ has

continuous sample paths. Next, note that for any δ, δ > 0, we
n −1(·)yi◦
ψ̃n (·) := 1
n i=1 e have

in terms of the variables yi◦ := Si + ρ̄q i , with Si ∼ stableq (1) gδ+ (|u|) = O |u|δ as |u| → ∞. (55)
and i ∼ F0 being independent. (In other words, ψ̃n is the
empirical ch.f. associated with ψ.) Also note that the re-scaled Hence, χ∞ has continuous sample paths as long as Fq has a
yi d d fractional moment. To see that this is true, recall the basic
observations satisfy γ x = Si + ρq i , where = denotes
q fact that if Si ∼ stableq (1) then E[|Si |δ ] < ∞ for any
equality in distribution. Likewise, the process ψ̂n (·) may be δ ∈ (0, q) [61, p. 63]. Also, our deconvolution model assumes
represented in distribution as E[|i |] < ∞, and so it follows that for any q ∈ (0, 2], the
d 1 n √
−1(·)(Si +ρq i ) ,
distribution Fq has a fractional moment, which proves the
ψ̂n (·) = n i=1 e (52) limit (54).4
Finally, we compute the variance of the marginal distri-
and going forward, it will be convenient to identify ψ̂n with butions Re(χ∞ (c)). By the ordinary central limit theorem,
the right side above.
As a first main step in the proof, we show that the difference 3 The paper [69] only states the result when b = 1/2, but it holds for any
between χn and χ̃n is negligible in a uniform sense, i.e. b > 0. See the paper [70] and the first few lines of the proof of Theorem 3.1
therein.
4 We are using U + V a∧1 ≤ U a∧1 + V a∧1 for random variables U
sup |χn (c) − χ̃n (c)| → P 0. (53) a a a
c∈[−b,b] and V , where U a := E[|U |a ]1/a and a > 0.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
5160 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 62, NO. 9, SEPTEMBER 2016


we only need to calculate the variance of Re(exp( −1y1◦ ). In words, n is the difference that comes from replacing ĉ
The first moment is given by with (ĉ), and n is the difference that comes from replacing
√ √ terms of the form − |·| 1
Log+ (. . . ) with terms involving φ.
E[Re(exp( −1cy1◦ )] = ReE[exp( −1cy1◦ )]
We now argue that both of the terms n and n are
= exp(−|c|q )ϕ0 (ρ̄q c). asymptotically negligible. As a result, we will complete the
proof of Theorem 2 by showing that the first term in line (59)
The second moment is given by has the desired Gaussian limit — which is the purpose of the

E[(Re(exp( −1cy1◦ ))2 ] = E[cos2 (cy1◦ )] functional delta method.
= E[ 12 + 21 cos(2cy1◦ )] To see that n → P 0, first recall that ĉ → P c0 ∈ I
√ by assumption. Consequently, along any subsequence, there
= 12 + 21 Re E[exp( −1 · 2cy1◦ )] is a further subsequence on which ĉ and (ĉ) eventually
= 1
2 + 1
2 exp(−2q |c|q )ϕ0 (2ρ̄q c). agree with probability 1. In turn, if gn is a generic sequence
of functions, then eventually gn (ĉ) − gn ((ĉ)) = 0 along
This completes the proof of Lemma 3. 
subsequences (with probability 1). Said differently, this means
3) Applying the Functional Delta Method: We now com-
gn (ĉ) − gn ((ĉ)) → P 0, and this implies n → P 0 since n
plete the proof of Theorem 2 by applying the functional delta
can be expressed in the form gn (ĉ) − gn ((ĉ)).
method to Lemma 3 (with a suitable map φ to be defined
Next, to see that n → P 0, notice that as soon as
below). In the following, C (I) denotes the space of continuous
Re(ψ̂n (·)/ϕ0 (ρq ·)) lies in the neighborhood N ( f 0 ; ε), it fol-
real-valued functions on an interval I, equipped with the sup-
lows from the definition of φ that n = 0. Also, the function
norm.
Re(ψ̂n (·)/ϕ0 (ρq ·)) converges uniformly to f 0 on the interval
Proof of Theorem 2: Since ϕ0 (ρ̄q c0 ) = 0, and the roots
I with probability 1.6 Consequently, n = o P (1).
of ϕ0 are assumed to be isolated, there is some δ0 ∈ (0, |c0 |)
We have now taken care of most of the preparations needed
such that over the interval c ∈ I := [c0 − δ0 , c0 + δ0 ], the
to apply the functional delta method. Multiplying the limit
value ϕ0 (ρ̄q c) is bounded away from 0. Define the function
in Lemma 3 through by 1/ϕ0 (ρq ·) and using the functional
f 0 ∈ C (I) by
version of Slutsky’s Lemma [68, p. 32], it follows that
f 0 (c) = exp(−|c|q ), √   ψ̂n (·)   w
n Re ϕ0 (ρq ·) − exp(−| · |q ) −−→ z̃(·) in C (I), (60)
and let N ( f 0 ; ε) ⊂ C (I) be a fixed ε-neighborhood of f 0 in
the sup-norm, such that all functions in the neighborhood are where z̃(·) is a centered Gaussian process with continuous
bounded away from 0. Consider the map φ : C (I) → C (I) sample paths, and the marginals z̃(c) have variance equal to
defined according to ϕ (2ρ̄ c)
⎧ 1 1
+ 1
exp(−2q |c|q ) ϕ02 (ρ̄ qc) − exp(−2|c|q ). (61)
⎨− 1q log( f (c)) if f ∈ N ( f 0 ; ε) 2 ϕ 2 (ρ̄q c)
0
2
0 q
|c|
φ( f )(c) = (56) It is straightforward to verify that φ is Hadamard differen-
⎩1 (c) if f ∈ N ( f 0 ; ε),
I tiable at f 0 , and that the Hadamard derivative φ f0 is the
where 1I is the indicator function of the interval I. The ) q

q
linear map that multiplies by − exp(|·| |·|q . (See [68, Lemmas
importance of φ is that it can be related to ν̂q (tˆ)/xq in 3.9.3 and 3.9.25].) Consequently, the functional delta method
the following way. First, let ĉ := tˆγq xq and observe that [68, Th. 3.9.4] applied to line (60) with the map φ gives
the definition of ψ̂n gives √    w
√  ν̂q (tˆ)  √   n (ĉ)  n φ Re ϕ0ψ̂(ρnq ·) (·) − φ( f 0 )(·) − → φ f0 (z̃)(·)
n xq − 1 = n − |ĉ|1q Log+ Re ϕψ̂(ρ
0 q ĉ) ) q
q
  = − exp(|·|
|·|q z̃(·)
+ |ĉ|q Log+ exp(−|ĉ| ) .
1 q (57) =: z(·), (62)
(The cases of ĉ = 0 and ϕ0 (ρq ĉ) = 0 are asymptotically in the space C (I). It is clear that z(·), defined in the previous
irrelevant.) Next, let (ĉ) be the point in the interval I that line, is a centered Gaussian process on I, since z̃(·) is.
is nearest to ĉ,5 and define the quantity n according to Combining the last two displays shows that the marginals z(c)
√  ν̂q (tˆ)  √   n ((ĉ)) 
n xq − 1 = n − |(1ĉ)|q Log+ Re ϕψ̂(ρ q (ĉ))
are given by
q 0

1   z(c) ∼ N(0, v(c, ρ̄q )),


+ Log+ exp(−|(ĉ)|q ) + n .
|(ĉ)| q
where
(58)
v q (c, ρ̄q ) = 1
2 ϕ0 (ρ̄q c)2 exp(2|c| )
1 1 q
Also, define n according to |c|2q
√  ν̂q (tˆ)  √      ϕ (2ρ̄ c)
+ 12 ϕ0 (ρ̄ qc)2 exp((2 − 2q )|c|q ) − 1 .
n xq − 1 = n φ Re ϕ0ψ̂(ρnq ·) ((ĉ)) − φ f0 ((ĉ)) 0 q
q

+ n + n . (59) 6 The fact that ψ̂ (·) converges uniformly to ψ(·) on I with probability 1
n
follows essentially from a version of the Glivenko-Cantelli Theorem [71].
5 The purpose of introducing (ĉ) is that it always lies in the interval I, To see that 1/ϕ0 (ρq ·) converges uniformly to 1/ϕ0 (ρ̄q ·) on I as ρq → ρ̄q ,
and hence allows us to work entirely on I. note that since E|1 | < ∞, the ch.f. ϕ0 is Lipschitz on any compact interval.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
LOPES: UNKNOWN SPARSITY IN CS: DENOISING AND INFERENCE 5161

The final step of the proof essentially involves plug- Next, we consider the second case where (c j , ρ j ) →
ging (ĉ) into the limit (62) and using Slutsky’s Lemma. (c0 , ρ0 ) with c0 = 0 and ρ0 ≥ 0. Due
 to the fact that all ch.f.’s
Since (ĉ) → P c0 , and z(·) takes values in the separable satisfy |ϕ0 | ≤ 1, and the fact that exp(2|c|q ) − exp(κq |c|q )
space C (I) almost surely, the functional version of Slutsky’s is positive when c > 0, the previous lower bound gives
Lemma [68, p. 32] gives the following convergence of pairs
  v q (c, ρ) ≥ 1 1
exp(2|c|q ) + 1
exp(κq |c|q ) − 1 , (65)
√   ψ̂n  w   |c|2q 2 2
n φ Re ϕ0 (ρq ·) (·) − φ( f 0 )(·) , (ĉ) → z(·), c0 ,

which again holds for all (c, ρ) where v q is defined. Note
in the space C (I) × I. To finish, note that the evaluation map that this bound does not depend on ρ. A simple calculation
from C (I) × I to R, defined by ( f, c) → f (c), is continuous involving L’Hospital’s rule shows that whenever q ∈ (0, 2),
with respect to the product topology. The continuous mapping the lower bound tends to ∞ as c → 0+ .
theorem [68, Th. 1.3.6] then gives To finish the proof, we must show that for any ρ ≥ 0,
√      w
the function ṽ q (·, ρ) attains its minimum on the set [εq , ∞).
n φ Re ϕ0ψ̂(ρnq ·) ((ĉ)) − φ f 0 ((ĉ)) − → z(c0 ). This is simple because the lower bound (65) tends to ∞ as
(63) c → ∞. 
Remark on the Positivity of the Variance Formula: Note
The proof is completed by substituting this limit into that for q ∈ (0, 2], the number κq is minimized when q = 2,
line (59). 
 1 case κq2 = 1−2. Likewise,
in which the bound (65) is at least
1
|c|2q 2
exp(2|c| ) + 2 exp(−2|c| q ) − 1 , which is the same as

B. Lemma 2 – Extending the Variance Function 1


(cosh(2|c|q ) − 1), and this is clearly positive when c = 0.
|c|2q
Proof: It is simple to verify that v q (·, ·) is continuous at
any pair (c0 , ρ0 ) for which c0 > 0 and ϕ0 (ρ0 c0 ) = 0, and
hence ṽ q inherits continuity at those pairs. To show that ṽ q is A PPENDIX C
continuous elsewhere, it is necessary to handle two cases. P ROOFS FOR S ECTION IV
– First, we show that for any q ∈ (0, 2], if (c0 , ρ0 ) is a pair A. Proposition 3 – Consistency of Pilot Estimator
such that ϕ0 (ρ0 c0 ) = 0 and c0 > 0, then v q (c j , ρ j ) → ∞ Proof: The limit ρ̂q → P ρ̄q follows immediately from
for any sequence satisfying (c j , ρ j ) → (c0 , ρ0 ) with ν̂q (tˆpilot )
→ P 1 and assumption A3. To prove the latter
ϕ0 (ρ j c j ) = 0 and c j > 0 for all j > 0. xq
q

– Second, we show that for any q ∈ (0, 2), if c0 = 0, then statement, it is enough to show that tˆpilotγq xq → P c0 for
v q (c j , ρ j ) → ∞ for any sequence (c j , ρ j ) → (0, ρ0 ), some finite c0 > 0 such that ϕ0 (ρ̄q c0 ) = 0 (by Theorem 2).
where ρ0 ≥ 0 is arbitrary, and c j > 0 for all j > 0. Reducing one step further, it is enough to show that there is
To handle the first case where c0 > 0, we derive a lower bound some finite c1 > 0 such that m̂1q γq xq → P c1 . To see this,
on v q (c, ρ) for all q ∈ (0, 2] and all pairs where v q is defined. let a ∧ b = min{a, b} and note that assumption A3 implies
Recall the formula  γ x 
tˆpilotγq xq = ( m̂1 γq xq ) ∧ η0 q σ q (66)
q
v q (c, ρ) = 1 1 1
exp(2|c|q )
|c|2q 2 ϕ0 (ρc)2 = c1 ∧ ( ρ̄ηq0 ) + o P (1) (67)
1 ϕ0 (2ρ|c|)
+ 2 ϕ0 (ρ|c|)2 exp((2 − 2q )|c|q ) −1 . =: c0 + o P (1). (68)
The lower bound is obtained by manipulating the factor Here, the constant c0 = c1 ∧( ρ̄ηq0 ) satisfies the desired condition
1 ϕ0 (2ρc)
2 ϕ (ρc)2 . Consider the following instance of Jensen’s inequal-
0
ϕ0 (ρ̄q c0 ) = 0 due to the choice of η0 .
ity, followed by a trigonometric identity, Now we turn to the last step of showing m̂1q γq xq → P c1 .
yi σ
Note that for i = 1, . . . , n, we have γq x ∼ Si + γq x i ,
ϕ0 (ρc)2 = (E[cos(ρc1 )])2 q q
where the Si are i.i.d. samples from stableq (1). Let Fn denote
≤ E[cos2 (ρc1 )] σ
the distribution function of the random variable |S1 + γq x 1 |
= 1
+ 21 E[cos(2ρc1 )] q
2 and let Fn be the empirical distribution function obtained
= 1
2 + 21 ϕ0 (2ρc), from n samples from Fn . (The subscript on Fn is used to
which gives 1 ϕ0 (2ρc)
≥ 1− 1 1
Letting κq := 2 − 2q , indicate that σ/γq xq may vary with n.) Then,
2 ϕ0 (ρc)2 2 ϕ0 (ρc)2 .
we have the lower bound m̂ q
γq xq = 1
γq xq · median(|y1 |, . . . , |yn |)
 
v q (c, ρ) ≥ 1
|c|2q
1 1
2 ϕ0 (ρc)2 exp(2|c|q ) − exp(κq |c|q ) = median( γq|yx
1|
, . . . , γq|yx
n|
)
q q

+ exp(κq |c|q ) − 1 , (64) = median(Fn ).


w
which holds for all (c, ρ) where v q (c, ρ) Note also that Fn → F, where F is the distribution
 is defined. Since
the quantity exp(2|c0 |q ) − exp(κq |c0 |q ) is positive for all function of the variable |S1 + ρ̄q 1 |. Consequently, a stan-
q ∈ (0, 2] and c0 > 0, it follows that v q (c j , ρ j ) → ∞ as dard argument shows median(Fn ) → P median(F), and then
m̂ q
ϕ0 (ρ j c j ) → ϕ0 (ρ0 c0 ) = 0. γq xq −
→ P median(F) =: 1/c1 > 0. 

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
5162 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 62, NO. 9, SEPTEMBER 2016

B. Proposition 4 – Consistency of cq (ρ̂q ) calculation leads to


Proof: It is enough to show cq (ρ̂q ) → p cq (ρ̄q ), since the d 
 su (x) = − d su (x)
second two limits in the proposition are direct consequences du du

(using Lemma 2 and Proposition 3). Reducing further, it is = − du


d
exp(Hu (π(x)))
enough to show that cq (ρ j ) → cq (ρ̄q ) for any deterministic
sequence ρ j ≥ 0 satisfying ρ j → ρ̄q as j → ∞. = −su (x) du
d
Hu (π(x))
As a preliminary step, it is helpful to note that   w (x) 
−1
cq (ρ j ) must be a bounded sequence, i.e. there is a = −su (x) (1−u) 2 w j (x) log π jj (x)
fixed number cmax > εq such that cq (ρ j ) ∈ [εq , cmax ] j :x j  =0
for all j . This essentially follows from the fact
x0
  w j (x) 
that ṽ q (·, ·) is continuous, and that ṽ q (c, ρ) → ∞ ≤ (1−u)2
w j (x) log π j (x) ,
as c → ∞ for any fixed ρ (as shown in line (65) in j :x j  =0
the proof of Lemma 2), but we omit the remaining details for
checking that cq (ρ j ) is bounded, since they are elementary. where we have used su (x) ≤ x0 , and the formula for du d
Hu
Since cq (ρ j ) is a bounded sequence, we can prove that can be found in the book [72, p. 53] (where the definition
cq (ρ j ) converges to cq (ρ̄q ) by showing that all convergent
 of Rényi entropy has the opposite sign from that used here).
subsequences of cq (ρ j ) do. Likewise, let c̆ ∈ [εq , cmax ], and w (x) π (x)u−1
Next, simple algebra shows that π jj (x) = s j (x)1−u , which leads
u
suppose
to the following bounds for j with x j = 0,
cq (ρ jl ) → c̆ (69) w j (x) π j (x)u−1
π j (x) ≤ s∞ (x)1−u
since s∞ (x) ≤ su (x)
along a subsequence jl → ∞. We now argue that c̆ must π j (x)−1
≤ s∞ (x) · s∞ (x)
u since π (x) ∈ (0, 1]
be equal to cq (ρ̄q ). Due to the continuity of ṽ q (·, ·) and the
j
x1 x∞
limit (69), we have = |x j | x1 · s∞ (x)
u since s∞ (x) = x 1
x∞
≤ DNR(x) · xu0 since s∞ (x) ≤ x0 .
ṽ q (c̆, ρ̄q ) = lim ṽ q (cq (ρ jl )), ρ jl )
l→∞
Since this bound does not depend on j , and since the w j (x)
≤ lim ṽ q (cq (ρ̄q ), ρ jl )), sum to 1, we obtain for all u ∈ (0, q] and q ∈ (0, 1),
l→∞
 
= ṽ q (cq (ρ̄q ), ρ̄q ))   x0
du su (x) ≤ log(DNR (x)) + u log(x0 )
d
(1−u)2
≤ ṽ q (c̆, ρ̄q ), x0
≤ (1−u)2
log(DNR (x)) + q log(x0 ) . (70)

which forces ṽ q (c̆, ρ̄q ) = ṽ q (cq (ρ̄q ), ρ̄q ), and the uniqueness q q
Finally, the basic integral 1
0 (1−u)2 du = gives the stated
assumption A4 gives c̆ = cq (ρ̄q ), as desired.  1−q
result. 

C. Proposition 5 – Estimating x0 With ŝq (x) and Small q A PPENDIX D


P ROOFS FOR S ECTION VI
Proof: The triangle inequality gives
|s̃q (x)−x0 | |s̃q (x)−sq (x)| |sq (x)−x0 |
Since the proof of Theorem 4 is based on the technique of
x0 ≤ x0 + x0 . a two-point prior, the main technical challenge is to show that
for any choice of A ∈ Rn× p , two vectors satisfying A x̃ = Ax ◦
Next, since sq (x) ≤ x0 , we have and sq (x ◦ )  sq (x̃) actually exist. This is the content of the
following lemma.
s̃ (x) s̃ (x) |sq (x)−x0 |
| x
q
− 1| ≤ | sqq (x) − 1| + x0 . Lemma 4: Let A ∈ Rn× p be an arbitrary matrix with
n < p, and let x ◦ ∈ R p be an arbitrary signal. Then, for each
0

It remains to bound the last term on the right. Since sq (x) q ∈ [0, 2], there exists a non-zero vector x̃ ∈ R p satisfying
is a non-increasing function of q, and s0 (x) = x0 , we have A x̃ = Ax ◦ and
for any fixed non-zero x,
sq (x̃) ≥ 1
· (1 − np )2 · p. (71)
 q πe
d 
|sq (x) − x0 | =  su (x)du.
du
0
Alternatively, if q ∈ (2, ∞], then there exists a non-zero vector
x̃ ∈ R p satisfying A x̃ = Ax ◦ and
We now derive a bound on | du d
su (x)|, which will then be
πe ( p − n)
2
integrated. For any fixed uu ∈ (0, q], if we define the probability
π j (x)
vector w j (x) := π(x) sq (x̃) ≥  . (72)
u with j = 1, . . . , p, then direct
u 1 + 4 log(2 p)

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
LOPES: UNKNOWN SPARSITY IN CS: DENOISING AND INFERENCE 5163

D. Remarks and then E|x̃ i | is roughly |x i◦ |. Thus, it is natural to consider


Although it might seem intuitively obvious that every affine the set of indices I1 = {i : b̃i 2 ≥ |x i◦ |}, and its complement
subspace contains a non-sparse vector, the technical substance I2 = {i : b̃i 2 < |x i◦ |}. This leads us to the following bounds,
of the result is the fact that the bounds hold uniformly over all  
matrices A ∈ Rn× p . This uniformity is necessary when taking Ex̃1 = E|x̃ i | + E|x̃ i |
the infimum over all A ∈ Rn× p in Theorem 4. Furthermore, i∈I1 i∈I2
 
the order of magnitude of the bounds for q ∈ [0, 2] is ≥ b̃i 2 2
exp(− 12 ) + |x i◦ |(1 − 2(−1))
π
unimprovable when n  p, since sq (x) ≤ p. Similarly, the i∈I1 i∈I2
bound for q ∈ (2, ∞] is unimprovable up to a logarithmic  
factor. ≥ b̃i 2 2
πe + b̃i 2 (1 − 2(−1))
1) Proof of Lemma 4: i∈I1 i∈I2
Proof of Inequality (71): It is enough to prove the result 
p
for q = 2, since sq (·) ≥ s2 (·) for all q ∈ [0, 2]. Let d be ≥ 2
πe b̃i 2 using (1 − 2(−1)) ≥ πe ,
2

the dimension of the null space of A, and let B ∈ R p×d i=1


be a matrix whose columns are an orthonormal basis for the 
p

null space of A. If x ◦ = 0, then define the scaled matrix = πe x ∞
2
bi 2 ,
B̃ := x ◦ ∞ B. (If x ◦ = 0, the steps of the proof can be i=1
repeated using B̃ = B.) Letting z ∈ Rd be a standard Gaussian where bi is the i th row of B ∈ R p×d . Since the matrix
vector, we will study the random vector B ∈ R p×d has orthonormal columns, it may be regarded as a
x̃ := x ◦ + B̃z, submatrix of an orthogonal p × p matrix, and so the rows bi
satisfy bi 2 ≤ 1, yielding bi 2 ≥ bi 22 . Hence,
which satisfies Ax ◦ = A x̃ for all realizations of z. We begin p p
i=1 bi 2 ≥ i=1 bi 2 = B F = d ≥ p − n.
2 2
the argument by defining a function f : R p → R according
to f (x̃) := x̃1 − c(n, p)x̃2 , where we define the constant
c(n, p) := √1πe ( p−n)
√ . The essential point to notice is that the
Altogether, we obtain the bound
p
event { f (x̃) > 0} is equivalent to Ex̃1 ≥ ◦
πe x ∞ ( p − n).
2
(75)
x̃21
x̃22
> c(n, p)2 = πe (1 − p ) p,
1 n 2
Combining the bounds (75) and (73), and noting that d ≤ p,
which is the desired bound. (Note that x̃ is non-zero with we obtain
probability 1.) Hence, a vector x̃ satisfying the bound (71) Ex̃ 1 2 ◦
πe x ∞ ( p−n)
exists if the event { f (x̃) > 0} occurs with positive probability. >
Ex̃2 x ◦ 2 +x ◦ 2∞ p
2
We will prove that the probability P( f (x̃) > 0) is positive
by showing E[ f (x̃)] > 0, which in turn can be reduced to 2
upper-bounding Ex̃2 , and lower-bounding Ex̃1 . The upper =  πe ( p−n)
bound on Ex̃2 follows from Jensen’s inequality and a direct x ◦ 22
x ◦ 2∞
+p
calculation,
Ex̃ 2 = Ex ◦ + B̃z2 ≥ √1 ( p−n)

πe p
< Ex ◦ + B̃z22
= c(n, p),
= x ◦ 22 +  B̃2F x ◦ 2
where we have used the fact that x ◦ 22 ≤ p. This proves

= x ◦ 22 + x ◦ 2∞ d. (73) E[ f (x̃)] > 0 as desired. 
Proof of Inequality (72): It is enough to prove the result
The lower bound on Ex̃1 is more involved. If we let b̃i for q = ∞ since sq (·) ≥ s∞ (·) for all q ∈ [0, ∞]. We retain
denote the i th row of B̃, then the i th coordinate of x̃ can be the same notation as in the proof above. Following the same
written as x̃ i = x i◦ + b̃i , z, which is distributed according general argument, it is enough to show that
to N(x i◦ , b̃i 22 ). Taking the absolute value |x̃ i | results in
a “folded normal” distribution, whose expectation can be Ex̃1
> c̄(n, p), (76)
calculated exactly as Ex̃ ∞

 −|xi◦ |  where
2b̃i 22 −(x i◦ )2
E|x̃ i | = π exp + |x i◦ | 1 − 2 , (74)
2b̃i 2
2 b̃i 2
πe ( p
− n)
2
where  is the standard normal distribution function. Note c̄(n, p) :=  . (77)
that it is possible to have b̃i 2 = 0, in which case x̃ i = x i◦ . 1 + 4 log(2 p)
When |x i◦ |/b̃i 2 is small, the first term on the right side In particular, we will re-use the bound
of (74) dominates, and then E|x̃ i | is roughly b̃i 2 . Alterna-

tively, when |x i◦ |/b̃i 2 is large, the second term dominates, Ex̃1 ≥ πe x ∞ ( p − n),
2
(78)

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
5164 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 62, NO. 9, SEPTEMBER 2016

when x ◦ = 0. (The case x ◦ = 0 can be handled as in the and we will accomplish this using the classical technique of
previous proof.) The new item to handle is an upper bound on constructing a Bayes rule with constant risk. For any decision
Ex̃∞ . Clearly, we have x̃∞ ≤ x ◦ ∞ +  B̃z∞ , and so rule δ : Rn → R, any A ∈ Rn× p , and any point x ∈ {e1 , x̃},
it is enough to upper-bound E B̃z∞ . We will do this using a define the (deterministic) risk function
version of Slepian’s inequality. If b̃i denotes the i th row of B̃,  
R(x, δ) := δ(Ax) − sq (x).
define the random variable gi = b̃i , z, and let w1 , . . . , w p
be i.i.d. N(0, 1) variables. The idea is to √ compare the Also, for any prior π on the two-point set {e1 , x̃}, define
Gaussian process gi with the Gaussian process 2x ◦ ∞ wi . 
By [68, Proposition A.2.6], the inequality r (π, δ) := R(x, δ)dπ(x).
   

E B̃z∞ = E max |gi | ≤ 2 2x ◦ ∞ E max |wi | , By [73, Propositions 3.3.1 and 3.3.2], the inequality (82) holds
1≤i≤ p 1≤i≤ p if there exists a prior distribution π ∗ on {e1 , x̃} and a decision
holds when the condition E(gi − g j )2 ≤ 2x ◦ 2∞ E(wi − w j )2 rule δ ∗ : Rn → R with the following three properties:
is satisfied for all i, j ∈ {1, . . . , p}. This can be verified by 1) The rule δ ∗ is Bayes for π ∗ , i.e. r (π ∗ , δ ∗ ) =
first noting that gi − g j = b̃i − b̃ j , z, which is distributed inf δ r (π ∗ , δ).
according to N(0, b̃i − b̃ j 22 ). Since b̃i 2 ≤ x ◦ ∞ for all i , 2) The rule δ ∗ has constant risk over {e1 , x̃},
it follows that i.e. R(e1 , δ ∗ ) = R(x̃, δ ∗ ).
3) The constant risk of δ ∗ is at least 12 πe 1
· (1 − np )2 · p − 21 .
E(gi − g j )2 = b̃i − b̃ j 22 ∗ ∗
To exhibit π and δ with these properties, we define
≤ 4x ◦ 2∞ π ∗ to be the two-point prior that puts equal mass at e1
= 2x ◦ 2∞ E(wi − w j )2 , (79) and x̃, and we define δ ∗ to be the trivial decision rule
δ ∗ (y) = 12 (sq (x̃) + sq (e1 )) for all y ∈ Rn . It is simple to check
as needed. To finish the proof, we make use of a standard the second and third properties. To check that δ ∗ is Bayes
bound for the expectation of Gaussian maxima for π ∗ , the triangle inequality and the relation Ae1 = A x̃ give
  the following bound for any decision rule δ,

E max |wi | < 2 log(2 p).    
   
1≤i≤ p r (π ∗ , δ) = 21 δ(A x̃) − sq (x̃) + 12 δ(Ae1 ) − sq (e1 ),
 
Combining the last two steps, we obtain ≥ 21 sq (x̃) − sq (e1 )
√     
Ex̃∞ < x ◦ ∞ + 2 2x ◦ ∞ 2 log(2 p). (80) = 1 δ ∗ (A x̃) − sq (x̃) + 1 δ ∗ (Ae1 ) − sq (e1 )
2 2
= r (π ∗ , δ ∗ ), (83)
Hence, the bounds (78) and (80) lead to (76). 
2) Proof of Theorem 4: which shows that δ∗ is Bayes for π ∗. 
We only handle the case q ∈ (0, 2], since the case q ∈
(2, ∞] is essentially the same (using Lemma 4). We begin by ACKNOWLEDGMENT
making some reductions. First, it is enough to show that for The author would like to thank Peter Bickel for helpful
any fixed matrix A ∈ Rn× p , discussions, as well as the reviewers and associate editor for
  their careful reading and insightful comments, which have
 
inf sup δ(Ax) − sq (x) ≥ 12 πe1
· (1 − np )2 · p − 12 . improved the paper.
δ:Rn →R x∈R p \{0}
(81) R EFERENCES
[1] M. E. Lopes, “Estimating unknown sparsity in compressed sens-
To see this, note that the generalinequality sq (x) ≤ p implies ing,” in Proc. 30th Int. Conf. Mach. Learn. (ICML), 2013,
 δ( Ax) − 1 ≥ 1 δ(Ax) − sq (x), and we can optimize both pp. 217–225.
sq (x) p [2] E. J. Candés, J. K. Romberg, and T. Tao, “Stable signal recovery from
sides with respect x, δ, and A. To make another reduction, it incomplete and inaccurate measurements,” Commun. Pure Appl. Math.,
vol. 59, no. 8, pp. 1207–1223, 2006.
is enough to prove the same bound when R p \ {0} is replaced [3] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52,
with any subset, as this can only make the supremum smaller. no. 4, pp. 1289–1306, Apr. 2006.
In particular, we replace R p \ {0} with the two-point subset [4] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Appli-
cations. Cambridge, U.K.: Cambridge Univ. Press, 2012.
{e1 , x̃}, where e1 = (1, 0, . . . , 0) ∈ R p , and by Lemma 4, [5] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive
x̃ can be chosen to satisfy Ae1 = A x̃, with Sensing. New York, NY, USA: Springer-Verlag, 2013.
[6] P. Boufounos, M. F. Duarte, and R. G. Baraniuk, “Sparse signal
sq (x̃) ≥ 1
πe · (1 − np )2 · p. reconstruction from noisy compressive measurements using cross vali-
dation,” in Proc. IEEE/SP 14th Workshop Statist. Signal Process. (SSP),
(For q ∈ (2, ∞], the second bound in Lemma 4 can be used Aug. 2007, pp. 299–303.
[7] D. M. Malioutov, S. Sanghavi, and A. S. Willsky, “Compressed sensing
instead.) We now complete the proof by showing that the lower with sequential observations,” in Proc. IEEE Int. Conf. Acoust., Speech
bound (81) holds for the two-point problem, i.e. Signal Process. (ICASSP), Mar./Apr. 2008, pp. 3357–3360.
  [8] Y. C. Eldar, “Generalized SURE for exponential families: Applica-
 
inf sup δ(Ax) − sq (x) ≥ 12 πe
1
(1 − np )2 p − 12 , tions to regularization,” IEEE Trans. Signal Process., vol. 57, no. 2,
δ:Rn →R x∈{e1 , x̃} pp. 471–481, Feb. 2009.
[9] R. Ward, “Compressed sensing with cross validation,” IEEE Trans. Inf.
(82) Theory, vol. 55, no. 12, pp. 5773–5782, Dec. 2009.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
LOPES: UNKNOWN SPARSITY IN CS: DENOISING AND INFERENCE 5165

[10] G. Raskutti, M. J. Wainwright, and B. Yu, “Minimax rates of estimation [37] M. Lopes, L. Jacob, and M. J. Wainwright, “A more powerful two-
for high-dimensional linear regression over q -balls,” IEEE Trans. Inf. sample test in high dimensions using random projection,” in Proc. Adv.
Theory, vol. 57, no. 10, pp. 6976–6994, Oct. 2011. Neural Inf. Process. Syst. 24 (NIPS), 2011, pp. 1206–1214.
[11] E. J. Candès and T. Tao, “Decoding by linear programming,” IEEE [38] G. Tang and A. Nehorai, “The stability of low-rank matrix recon-
Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005. struction: A constrained singular value view,” IEEE Trans. Inf. Theory,
[12] D. L. Donoho and X. Huo, “Uncertainty principles and ideal atomic vol. 58, no. 9, pp. 6079–6092, Sep. 2012.
decomposition,” IEEE Trans. Inf. Theory, vol. 47, no. 7, pp. 2845–2862, [39] S. Negahban and M. J. Wainwright, “Restricted strong convexity and
Nov. 2001. weighted matrix completion: Optimal bounds with noise,” J. Mach.
[13] A. Cohen, W. Dahmen, and R. DeVore, “Compressed sensing and best Learn. Res., vol. 13, no. 1, pp. 1665–1697, 2012.
κ-term approximation,” J. Amer. Math. Soc., vol. 22, no. 1, pp. 211–231, [40] R. A. DeVore, “Deterministic constructions of compressed sensing
2009. matrices,” J. Complex., vol. 23, nos. 4–6, pp. 918–925, 2007.
[14] A. d’Aspremont and L. El Ghaoui, “Testing the nullspace property [41] R. Calderbank, S. Howard, and S. Jafarpour, “Construction of a large
using semidefinite programming,” Math. Program., vol. 127, no. 1, class of deterministic sensing matrices that satisfy a statistical isom-
pp. 123–144, 2011. etry property,” IEEE J. Sel. Topics Signal Process., vol. 4, no. 2,
[15] A. Juditsky and A. Nemirovski, “On verifiable sufficient conditions for pp. 358–374, Apr. 2010.
sparse signal recovery via 1 minimization,” Math. Program., vol. 127, [42] N. Alon, Y. Matias, and M. Szegedy, “The space complexity of approxi-
no. 1, pp. 57–88, 2011. mating the frequency moments,” in Proc. 28th Annu. ACM Symp. Theory
[16] G. Tang and A. Nehorai, “Performance analysis of sparse recovery based Comput. (STOC), 1996, pp. 20–29.
on constrained minimal singular values,” IEEE Trans. Signal Process., [43] J. Feigenbaum, S. Kannan, M. J. Strauss, and M. Viswanathan,
vol. 59, no. 12, pp. 5734–5745, Dec. 2011. “An approximate L 1 -difference algorithm for massive data streams,”
[17] J. A. Tropp and A. C. Gilbert, “Signal recovery from random mea- SIAM J. Comput., vol. 32, no. 1, pp. 131–151, 2002.
surements via orthogonal matching pursuit,” IEEE Trans. Inf. Theory, [44] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan, “Comparing
vol. 53, no. 12, pp. 4655–4666, Dec. 2007. data streams using Hamming norms (how to zero in),” IEEE Trans.
[18] Y. Plan and R. Vershynin, “One-bit compressed sensing by lin- Knowl. Data Eng., vol. 15, no. 3, pp. 529–540, May/Jun. 2003.
ear programming,” Commun. Pure Appl. Math., vol. 66, no. 8, [45] P. Indyk, “Stable distributions, pseudorandom generators, embeddings,
pp. 1275–1297, 2013. and data stream computation,” J. Assoc. Comput. Mach., vol. 53, no. 3,
[19] R. Vershynin, “Estimation in high dimensions: A geometric perspective,” pp. 307–323, 2006.
in Sampling Theory, a Renaissance, G. E. Pfander, Ed. Switzerland: [46] P. Li, T. J. Hastie, and K. W. Church, “Nonlinear estimators and tail
Springer-Verlag, 2012. bounds for dimension reduction in l1 using Cauchy random projections,”
[20] R. R. Coifman and M. V. Wickerhauser, “Entropy-based algorithms J. Mach. Learn. Res., vol. 8, pp. 2497–2532, 2007.
for best basis selection,” IEEE Trans. Inf. Theory, vol. 38, no. 2, [47] P. Li, “Estimators and tail bounds for dimension reduction in
pp. 713–718, Mar. 1992. lα (0 < α ≤ 2) using stable random projections,” in Proc. 19th Annu.
[21] D. L. Donoho, “On minimum entropy segmentation,” Wavelets: Theory, ACM-SIAM Symp. Discrete Algorithms, 2008, pp. 10–19.
Algorithms, and Applications. San Diego, CA, USA: Academic, 1994, [48] P. Clifford and I. A. Cosma, “A statistical analysis of probabilistic
pp. 233–269. counting algorithms,” Scandin. J. Statist., vol. 39, no. 1, pp. 1–14, 2012.
[22] B. D. Rao and K. Kreutz-Delgado, “An affine scaling methodology [49] G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine, “Synopses for
for best basis selection,” IEEE Trans. Signal Process., vol. 47, no. 1, massive data: Samples, histograms, wavelets, sketches,” Found. Trends
pp. 187–200, Jan. 1999. Databases, vol. 4, nos. 1–3, pp. 1–294, 2011.
[23] N. Hurley and S. Rickard, “Comparing measures of sparsity,” IEEE [50] A. Gilbert and P. Indyk, “Sparse recovery using sparse matrices,” Proc.
Trans. Inf. Theory, vol. 55, no. 10, pp. 4723–4741, Oct. 2009. IEEE, vol. 98, no. 6, pp. 937–947, Jun. 2010.
[24] P. O. Hoyer, “Non-negative matrix factorization with sparseness [51] P. Li, C.-H. Zhang, and T. Zhang, “Compressed counting meets
constraints,” J. Mach. Learn. Res., vol. 5, pp. 1457–1469, Dec. 2004. compressed sensing,” in Proc. Conf. Learn. Theory (COLT), 2014,
[25] A. G. Dimakis, R. Smarandache, and P. O. Vontobel, “LDPC codes pp. 1058–1077.
for compressed sensing,” IEEE Trans. Inf. Theory, vol. 58, no. 5, [52] P. Indyk, “Sketching via hashing: From heavy hitters to compressed
pp. 3093–3114, May 2012. sensing to sparse Fourier transform,” in Proc. 32nd Symp. Principles
[26] M. Pilanci, L. El Ghaoui, and V. Chandrasekaran, “Recovery of sparse Database Syst. (PODS), 2013, pp. 87–90.
probability measures via convex programming,” in Proc. Adv. Neural [53] M. Markatou, J. L. Horowitz, and R. V. Lenth, “Robust scale estimation
Inf. Process. Syst. 25 (NIPS), 2012, pp. 2420–2428. based on the the empirical characteristic function,” Statist. Probab. Lett.,
[27] L. Demanet and P. Hand, “Scaling law for recovering the sparsest vol. 25, no. 2, pp. 185–192, 1995.
element in a subspace,” Inf. Inference, vol. 3, no. 4, pp. 295–309, 2014. [54] M. Markatou and J. L. Horowitz, “Robust scale estimation in the error-
[28] B. Barak, J. A. Kelner, and D. Steurer, “Rounding sum-of-squares components model using the empirical characteristic function,” Can.
relaxations,” in Proc. 46th Annu. ACM Symp. Theory Comput. (STOC), J. Statist., vol. 23, no. 4, pp. 369–381, 1995.
2014, pp. 31–40. [55] C. Butucea and C. Matias, “Minimax estimation of the noise level and
[29] M. O. Hill, “Diversity and evenness: A unifying notation and its of the deconvolution density in a semiparametric convolution model,”
consequences,” Ecology, vol. 54, no. 2, pp. 427–432, 1973. Bernoulli, vol. 11, no. 2, pp. 309–340, 2005.
[30] L. Jost, “Entropy and diversity,” Oikos, vol. 113, no. 2, pp. 363–375, [56] A. Meister, “Density estimation with normal measurement error with
2006. unknown variance,” Statist. Sinica, vol. 16, no. 1, pp. 195–211, 2006.
[31] A. Repetti, M. Q. Pham, L. Duval, É. Chouzenoux, and J.-C. Pesquet, [57] C. Matias, “Semiparametric deconvolution with unknown noise vari-
“Euclid in a taxicab: Sparse blind deconvolution with smoothed 1 /2 ance,” ESAIM, Probab. Statist., vol. 6, pp. 271–292, Nov. 2002.
regularization,” IEEE Signal Process. Lett., vol. 22, no. 5, pp. 539–543, [58] T. T. Cai, L. Wang, and G. Xu, “New bounds for restricted isometry
May 2015. constants,” IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4388–4394,
[32] J. B. Lasserre, Moments, Positive Polynomials and Their Applications, Sep. 2010.
vol. 1. Singapore: World Scientific, 2009. [59] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition
[33] E. J. Candés and Y. Plan, “Tight oracle inequalities for low-rank matrix by basis pursuit,” SIAM Rev., vol. 43, no. 1, pp. 129–159, 2001.
recovery from a minimal number of noisy random measurements,” IEEE [60] T. Gneiting, “Curiosities of characteristic functions,” Expo. Math.,
Trans. Inf. Theory, vol. 57, no. 4, pp. 2342–2359, Apr. 2011. vol. 19, no. 4, pp. 359–363, 2001.
[34] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, “The [61] V. M. Zolotarev, One-Dimensional Stable Distributions, vol. 65.
convex geometry of linear inverse problems,” Found. Comput. Math., Providence, RI, USA: AMS, 1986.
vol. 12, pp. 805–849, Dec. 2012. [62] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy.
[35] M. Rudelson and R. Vershynin, “Sampling from large matrices: Statist. Soc. B, vol. 58, no. 1, pp. 267–288, 1996.
An approach through geometric functional analysis,” J. Assoc. Comput. [63] S. Chatterjee, “A new perspective on least squares under convex con-
Mach., vol. 54, no. 4, 2007, Art. no. 21. straint,” Ann. Statist., vol. 42, no. 6, pp. 2340–2381, 2014.
[36] R. Vershynin, “Introduction to the non-asymptotic analysis of random [64] H. Zou and T. Hastie, “Regularization and variable selection via the
matrices,” in Compressed Sensing: Theory and Applications, Y. C. Eldar elastic net,” J. Roy. Statist. Soc., B (Statist. Methodol.), vol. 67, no. 2,
and G. Kutyniok, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2012. pp. 301–320, 2005.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
5166 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 62, NO. 9, SEPTEMBER 2016

[65] S. Foucart, A. Pajor, H. Rauhut, and T. Ullrich, “The Gelfand widths [72] C. Beck and F. Schögl, Thermodynamics of Chaotic Systems. Cambridge,
of  p -balls for 0 < p ≤ 1,” J. Complex., vol. 26, no. 6, pp. 629–640, U.K.: Cambridge Univ. Press, 1995.
2010. [73] P. J. Bickel and K. A. Doksum, Mathematical Statistics, vol. 1.
[66] B. Kashin, “Diameters of some finite-dimensional sets and classes of Upper Saddle River, NJ, USA: Prentice-Hall, 2001.
smooth functions,” Math. USSR Izv., vol. 11, no. 2, p. 317, 1977.
[67] A. Garnaev and E. D. Gluskin, “The widths of a Euclidean ball,” Dokl.
Akad. Nauk SSSR, vol. 277, no. 5, pp. 1048–1052, 1984.
[68] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Miles E. Lopes (M’16) is currently Assistant Professor in the Department of
Processes. New York, NY, USA: Springer-Verlag, 1996. Statistics at the University of California, Davis. He received B.S. degrees in
[69] M. B. Marcus, “Weak convergence of the empirical characteristic mathematics and physics from the University of California, Los Angeles, as
function,” Ann. Probab., vol. 9, no. 2, pp. 194–201, 1981. well as an M.S. in computer science and a Ph.D. in statistics from the Univer-
[70] S. Csörgõ, “Multivariate empirical characteristic functions,” Probab. sity of California at Berkeley, where his research was supported by the DOE
Theory Rel. Fields, vol. 55, no. 2, pp. 203–229, 1981. CSGF and NSF GRFP fellowships. His research interests include compressed
[71] S. Csörgõ, “Limit behaviour of the empirical characteristic function,” sensing, statistical machine learning, and high-dimensional statistics - with a
Ann. Probab., pp. 130–144, 1981. focus on resampling methods.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.

You might also like