Research Project Paper CS754
Abstract— The theory of compressed sensing (CS) asserts that an unknown signal x ∈ R^p can be accurately recovered from an underdetermined set of n linear measurements with n ≪ p, provided that x is sufficiently sparse. However, in applications, the degree of sparsity ‖x‖₀ is typically unknown, and the problem of directly estimating ‖x‖₀ has been a longstanding gap between theory and practice. A closely related issue is that ‖x‖₀ is a highly idealized measure of sparsity, and for real signals with entries not equal to 0, the value ‖x‖₀ = p is not a useful description of compressibility. In our previous conference paper [1] that examined these problems, we considered an alternative measure of soft sparsity, ‖x‖₁²/‖x‖₂², and designed a procedure to estimate ‖x‖₁²/‖x‖₂² that does not rely on sparsity assumptions. This paper offers a new deconvolution-based method for estimating unknown sparsity, which has wider applicability and sharper theoretical guarantees. In particular, we introduce a family of entropy-based sparsity measures s_q(x) = (‖x‖_q/‖x‖₁)^(q/(1−q)) parameterized by q ∈ [0, ∞]. This family interpolates between ‖x‖₀ = s₀(x) and ‖x‖₁²/‖x‖₂² = s₂(x) as q ranges over [0, 2]. For any q ∈ (0, 2] \ {1}, we propose an estimator ŝ_q(x) whose relative error converges at the dimension-free rate of 1/√n, even when p/n → ∞. Our main results also describe the limiting distribution of ŝ_q(x), as well as some connections to basis pursuit denoising, the Lasso, deterministic measurement matrices, and inference problems in CS.

Index Terms— Compressed sensing, deterministic measurement matrices, ℓ₁-regularization, Rényi entropy, unknown sparsity.

I. INTRODUCTION

problem can be solved reliably when x is sparse. Specifically, when the sparsity level of x is measured in terms of the ℓ₀ norm ‖x‖₀ := card{j : x_j ≠ 0}, it is well known that if n ≳ ‖x‖₀ log(p), then accurate recovery can be achieved with high probability when A is drawn from a suitable ensemble [2]–[5]. In this way, the parameter ‖x‖₀ is often treated as being known in much of the theoretical CS literature — despite the fact that ‖x‖₀ is usually unknown in practice. Due to the fact that the sparsity parameter plays a fundamental role in CS, the issue of unknown sparsity has become recognized as a gap between theory and practice [6]–[9]. Likewise, our overall focus is the problem of estimating the unknown sparsity level of x without relying on any sparsity assumptions. However, rather than concentrating solely on ‖x‖₀, we will consider a generalized family of sparsity measures, {s_q(x)}_{q∈[0,∞]}, which will be introduced in Section I-B.

A. Motivations and the Role of Sparsity

Given that many well-developed methods are available for estimating the full signal x, or its support set S := {j ∈ {1, . . . , p} : x_j ≠ 0}, it might seem surprising that the problem of estimating ‖x‖₀ has remained unsettled. Indeed, given an estimate of x or S, it might seem natural to estimate ‖x‖₀ via a "plug-in rule", such as ‖x̂‖₀ or card(Ŝ). However, it is important to note that methods for computing x̂ and Ŝ often implicitly rely on prior knowledge of ‖x‖₀. Consequently, when using a plug-in rule to estimate ‖x‖₀,
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
5146 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 62, NO. 9, SEPTEMBER 2016
measurements, and then ‖x‖₀ could determine how many extra measurements should be collected to recover x. Alternatively, if measurements are collected in batch, the value ‖x‖₀ could certify that enough measurements were actually taken.

• The measurement matrix. The properties of the measurement matrix A that lead to good recovery are often directly linked to ‖x‖₀. Two specific properties that have been intensively studied are the restricted isometry property of order k (RIP-k) [11], and the null-space property of order k (NSP-k) [12], [13], where k is a presumed upper bound on ‖x‖₀. Because recovery guarantees are closely tied to RIP-k and NSP-k, a growing body of work has analyzed "certificates" for whether or not a given matrix satisfies these properties [14]–[16]. When k is treated as given, this problem is already computationally difficult. Yet, when sparsity is unknown, such a "certificate" is less meaningful if we cannot check that k conforms to the true signal.

• Recovery algorithms. When recovery algorithms are implemented, the sparsity level of x is often treated as a tuning parameter. For example, if k is a conjectured bound on ‖x‖₀, then the Orthogonal Matching Pursuit algorithm (OMP) is typically initialized to run for k iterations [17]. A second example is the Lasso algorithm, which computes a solution x̂ ∈ argmin{‖y − Av‖₂² + λ‖v‖₁ : v ∈ R^p}, for some choice of λ ≥ 0 governing the sparsity level of x̂. In the case of either OMP or Lasso, an estimate of ‖x‖₀ could reduce computation by restricting the possible choices of λ or k, and it would also ensure that the solution's sparsity level conforms to the true signal. Section V gives further details on how our proposed methodology can adaptively tune the Lasso with certain guarantees.

B. A Numerically Stable Measure of Sparsity

Despite the important theoretical role of the parameter ‖x‖₀, it has a severe practical drawback of being sensitive to small entries of x. In particular, for real signals x ∈ R^p whose entries are not equal to 0, the value ‖x‖₀ = p is not a useful description of compressibility. In order to estimate sparsity in a way that accounts for the instability of ‖x‖₀, it is desirable to replace the ℓ₀ norm with a "soft" version. More precisely, we would like to identify a function of x that can be interpreted as counting the "effective number of coordinates of x", but remains stable under small perturbations. Below, we derive such a function by showing that ‖x‖₀ is a special case of a more general sparsity measure based on entropy.

1) A Link Between Sparsity and Entropy: Any non-zero vector x ∈ R^p induces a distribution π(x) ∈ R^p on the set of indices {1, . . . , p}, assigning mass π_j(x) := |x_j|/‖x‖₁ at index j.¹ Under this correspondence, if x places most of its mass at a small number of coordinates, and J ∼ π(x) is a random index in {1, . . . , p}, then J is likely to occupy a small set of effective states. This means that if x is sparse, then π(x) has low entropy. From the viewpoint of information theory, it is well known that the entropy of a distribution can be interpreted as the logarithm of the distribution's effective number of states. Likewise, it is natural to count effective coordinates of x by counting effective states of π(x) via entropy. To this end, we define the effective sparsity,

    s_q(x) := exp(H_q(π(x))) if x ≠ 0, and s_q(x) := 0 if x = 0,   (2)

where H_q is the Rényi entropy of order q ∈ [0, ∞]. (Our terminology derives from the works [18], [19], where a particular case of s_q(x) is called effective sparsity.) When q ∉ {0, 1, ∞}, the Rényi entropy is given explicitly by

    H_q(π(x)) := (1/(1−q)) log( Σ_{i=1}^p π_i(x)^q ),

and the cases of q ∈ {0, 1, ∞} are defined by evaluating limits, with H₁ being the ordinary Shannon entropy. Combining the definitions of π(x) and H_q, we see that for x ≠ 0 and q ∉ {0, 1, ∞}, the effective sparsity may be written conveniently in terms of ℓ_q norms as

    s_q(x) = ( ‖x‖_q / ‖x‖₁ )^(q/(1−q)),   (3)

where ‖x‖_q := ( Σ_{j=1}^p |x_j|^q )^(1/q) for q > 0. As with H_q, the cases of q ∈ {0, 1, ∞} are evaluated as limits:

    s₀(x) = lim_{q→0} s_q(x) = ‖x‖₀   (4)
    s₁(x) = lim_{q→1} s_q(x) = exp(H₁(π(x)))   (5)
    s_∞(x) = lim_{q→∞} s_q(x) = ‖x‖₁/‖x‖_∞.   (6)

2) Background on the Definition of s_q(x): To the best of our knowledge, the definition of effective sparsity (2) in terms of Rényi entropy is new in the context of CS. However, numerous special cases and related definitions have been considered elsewhere. For instance, in the early study of wavelet bases, Coifman and Wickerhauser proposed exp( − Σ_{i=1}^p (|x_i|²/‖x‖₂²) log(|x_i|²/‖x‖₂²) ) as a measure of effective dimension [20]. (See also the papers [21]–[23].) The important difference between this quantity and s_q(x) is that the Rényi entropy leads to a convenient ratio of norms, which will play an essential role in our procedure for estimating s_q(x).

In recent years, there has been growing interest in ratios of norms as measures of sparsity, but such ratios have generally been introduced in an ad-hoc manner, and there has not been a principled way to explain where they "come from". To this extent, our entropy-based definition of s_q(x) conceptually unifies these ratios. Examples of previously studied instances include ‖x‖₁²/‖x‖₂² corresponding to q = 2 [1], [16], [18], [24], [25], ‖x‖₁/‖x‖_∞ corresponding to q = ∞ [25]–[27], as well as a ratio of the form (‖x‖_a/‖x‖_b)^(ab/(b−a)) with a, b > 0, which is implicit in the paper [28]. (To see how the latter quantity fits in the scope of our entropy-based definition, consider the normalization π_j(x) = |x_j|^t/‖x‖_t^t with t = b and q = a/b.)

Outside of CS, the use of Rényi entropy to count the effective number of states of a distribution has been well-established in the ecology literature. If a distribution π

¹ Other normalizations are possible, e.g. π_j(x) = |x_j|²/‖x‖₂². See also Section I-B3.
LOPES: UNKNOWN SPARSITY IN CS: DENOISING AND INFERENCE 5147
on {1, . . . , p} measures the abundance of p species in a community, then exp(H_q(π)) is a standard measure of the effective number of species. This number has been called the Hill index or diversity number [29], [30]. Thus, the conceptual link with CS is to interpret the signal x ∈ R^p as a distribution on the indices {1, . . . , p}.

3) Properties of s_q(x): The following list summarizes some of the most important properties of s_q(x).

• (continuity). Unlike the ℓ₀ norm, the function s_q(·) is continuous on R^p \ {0} for all q > 0, and is hence stable under small perturbations of x.

• (range equal to [0, p]). For all x ∈ R^p and all q ∈ [0, ∞], the effective sparsity satisfies 0 ≤ s_q(x) ≤ p. This property follows from the fact that for any q, and any distribution π on {1, . . . , p}, the Rényi entropy satisfies 0 ≤ H_q(π) ≤ log(p).

• (scale-invariance). The fact that ‖cx‖₀ = ‖x‖₀ for all scalars c ≠ 0 is familiar for the ℓ₀ norm, and this generalizes to s_q(x) for all q ∈ [0, ∞]. Scale-invariance encodes the idea that sparsity should be based on relative (rather than absolute) magnitudes of the entries of x.

• (monotonicity in q). For any x ∈ R^p, the function q ↦ s_q(x) is non-increasing on [0, ∞], and interpolates between s_∞(x) and s₀(x). That is, for any q′ ≥ q ≥ 0,

    ‖x‖₁/‖x‖_∞ = s_∞(x) ≤ s_{q′}(x) ≤ s_q(x) ≤ s₀(x) = ‖x‖₀.

The monotonicity follows from the fact that the Rényi entropy H_q is non-increasing in q.

a) The choice of q: The parameter q controls how much weight s_q(x) assigns to small coordinates. When q = 0, an arbitrarily small coordinate is still counted as being "effective". By contrast, when q = ∞, a coordinate is not counted as being effective unless its magnitude is close to ‖x‖_∞. The choice of q is also relevant to other considerations. For instance, we will show in Section II that q = 2 is important because recovery guarantees can be derived in terms of s₂(x) = ‖x‖₁²/‖x‖₂². In addition, the choice of q affects the type of measurements we use to estimate s_q(x). In this respect, the case of q = 2 turns out to be attractive because some of our proposed measurements are Gaussian — and they can naturally be re-used for recovering x.

b) Minimization of s_q(x): Our work in this paper does not require the minimization of s_q(x), and neither do the applications we propose. Nevertheless, some readers may be curious about what can be done in this direction. It turns out that for certain values of q, or certain normalizations of π(x), the minimization of s_q(x) may be tractable (under suitable constraints). The recent paper [31] discusses minimization of s₂(x). Another example is the minimization of s_∞(x), which can be reduced to a sequence of linear programs [26], [27]. Lastly, a third example deals with the normalization π_j(x) = |x_j|^t/‖x‖_t^t with t = 2, which leads to exp(H₂(π(x))) = (‖x‖₂/‖x‖₄)⁴. The paper [28] shows that minimization of ‖x‖₂/‖x‖₄ has interesting connections with sums-of-squares (SOS) optimization problems [32].

c) Numerically stable measures of rank: The framework of CS naturally extends to recovering an unknown low-rank matrix X ∈ R^{p₁×p₂} on the basis of a linear measurement model [33], [34]. In many ways, the quantity rank(X) is analogous to ‖x‖₀ in recovering a sparse vector x. If we let ς(X) denote the vector of singular values of X, the connection is suggested by the fact that rank(X) = ‖ς(X)‖₀. Likewise, we can consider

    r_q(X) := s_q(ς(X)) = ( ‖ς(X)‖_q / ‖ς(X)‖₁ )^(q/(1−q)) = ( |||X|||_q / |||X|||₁ )^(q/(1−q))

as a measure of the effective rank of X, where the latter expression holds for q ∉ {0, 1, ∞}, and |||X|||_q := ‖ς(X)‖_q. Essentially all of the properties of s_q(·) described earlier carry over to r_q(·). We also note that quantities related to r_q(X), or special instances, have been considered elsewhere as a measure of rank [19], [35]–[39].

d) Graphical interpretations: The fact that s_q(x) with q > 0 is a sensible measure of sparsity for non-idealized signals is illustrated in Figure 1. In essence, if x has k large coordinates and p − k small coordinates, then s_q(x) ≈ k, whereas ‖x‖₀ = p. In the left panel, the sorted coordinates of three vectors in R¹⁰⁰ are plotted, indicating that s₂(x) measures the "elbow" in the decay plot. This idea can be seen in a more geometric way in the right panel, which plots the sub-level sets S_c := {x ∈ R^p : s₂(x) ≤ c} with p = 2. When c ≈ 1, the vectors in S_c are closely aligned with the coordinate axes.

C. Contributions

1) The Family of Sparsity Measures {s_q(x)}_{q∈[0,∞]}: As mentioned in Section I-B2, our definition of s_q(x) in terms of Rényi entropy gives a conceptual foundation for several norm ratios that have appeared elsewhere in the sparsity literature. Furthermore, we clarify the meaning of s₂(x) in Section II by showing that it plays an intuitive role in the performance of the Basis Pursuit Denoising algorithm.

2) Estimation Results, Confidence Intervals, and Applications: Our central methodological contribution is a new deconvolution technique for estimating ‖x‖_q^q and s_q(x) from linear measurements. Conceptually, the technique is of interest in that it blends the tools of sketching with stable laws and deconvolution with characteristic functions. These tools are typically applied in different contexts, as discussed in Section I-D. Also, the computational cost of our method is small relative to recovering x by standard methods.

The most important features of our estimator ŝ_q(x) are that it does not rely on any sparsity assumptions, and that its relative error decays at the dimension-free rate of 1/√n (in probability). In particular, only O(1) measurements are needed to obtain a good estimate of s_q(x), even when p/n → ∞.

In addition to proving ratio-consistency, we derive a CLT for ŝ_q(x), which allows us to obtain confidence intervals for s_q(x) with asymptotically exact coverage probability. A notable feature of this CLT is that it is "uniform" with respect to the tuning parameter in our procedure for computing ŝ_q(x). The uniformity is important because it allows us to make
Fig. 1. Left panel: Three vectors (red, blue, black) in R¹⁰⁰ are plotted with entries in decreasing order. Right panel: Sub-level sets S_c with c = 1.1 in dark grey, and c = 1.9 in light grey.
an optimal data-dependent selection of the tuning parameter and still find the estimator's limiting distribution (see Theorem 2 and Corollary 1). In terms of applications, we show in Section V how this CLT can be used in inferential problems, e.g. testing the null hypothesis that s_q(x) is greater than a given level, and ensuring that the true signal lies in the constraint set of the (primal) Lasso or Elastic net problems.

3) The Necessity of Randomized Measurements: The problem of constructing deterministic sensing matrices with good performance guarantees is one of the major unresolved theoretical issues in CS. (We refer to the book [5, Secs. 1.3 and 6.1], as well as [40] and [41].) Since our proposed method for estimating s_q(x) depends on randomized measurements, one may wonder if randomization is essential to estimating unknown sparsity. In Section VI, we show that randomization is essential from a worst-case point of view. Our main result in this direction (Theorem 4) shows that for any deterministic matrix A ∈ R^{n×p}, and any deterministic procedure for estimating s_q(x), there is always a signal for which the relative estimation error is at least of order 1 (even with noiseless measurements). This contrasts with our randomized method, whose relative error is O_P(1/√n) for any choice of x. Furthermore, the result has a negative implication for estimating unknown sparsity in linear regression, where the "design matrix" is often viewed as fixed.

D. Related Work

1) Extensions Beyond the Conference Paper [1]: Whereas our earlier work deals exclusively with ‖x‖₁²/‖x‖₂², this work considers the family of parameters s_q(x). The procedure we propose for estimating s_q(x) has several improvements on the earlier approach: It tolerates noise with infinite variance and leads to confidence intervals with asymptotically exact coverage probability (whereas the previous approach led to conservative intervals). Also, the applications of our procedure to tuning recovery algorithms and testing the hypothesis of sparsity are new (Section V). Lastly, our theoretical results in Sections II and VI are sharpened versions of parallel results in the paper [1].

2) Sketching With Stable Laws: Our approach to estimating s_q(x) is based on the sub-problem of estimating ‖x‖_q, and we employ the technique of sketching with stable laws from the streaming computation literature [42]–[49]. Over the last few years, the exchange of ideas between sketching and CS has begun to accelerate [1], [50]–[52]. Nevertheless, to the best of our knowledge, the present paper and our earlier work [1] are the first to apply sketching ideas to unknown sparsity in CS.

3) Empirical Characteristic Functions: Given that stable laws have a simple formula for their characteristic function (ch.f.), but have no general likelihood formula, it is natural to use the empirical ch.f. Ψ̂_n(t) = (1/n) Σ_{i=1}^n exp(√−1 t y_i) as a foundation for estimating scale parameters. With attention to denoising, ch.f.'s are also attractive as they factor over convolution and allow for heavy-tailed noise. As such, the favorable properties of empirical ch.f.'s have been applied to the deconvolution of scale parameters by several authors [53]–[57]. Although our approach shares similarities with these works, our results are not directly comparable due to differences in model assumptions. Also, it seems that our work is the first to derive the limiting distribution of a scale estimate under a data-dependent choice of tuning parameter t.

4) Model Selection and Validation in CS: Some of the challenges described in Section I-A can be approached with the tools of cross-validation (CV) and empirical risk minimization (ERM). Such approaches have been used to select various parameters in CS, such as the number of measurements n [7], [9], the number of OMP iterations k [9], or the Lasso regularization parameter λ [8]. Although methods based on CV and ERM share common motivations with our work, these methods differ from our approach in several ways. In particular, the problem of estimating a soft measure of sparsity, such as s_q(x), has not been considered from that angle.
(Note that even if ‖x̂ − x‖₂ is small, ‖x̂‖₀ may not be close to ‖x‖₀.) Computationally, CV and ERM can also be costly, e.g. if many solutions x̂ must be computed with different tuning parameters. By contrast, our method is computationally inexpensive.

E. Outline

The paper is organized as follows. In Section II, we give a tight recovery guarantee for the Basis Pursuit Denoising algorithm in terms of s₂(x). Next, in Section III, we propose estimators for ‖x‖_q and s_q(x). In Section IV, we state consistency results and provide confidence intervals for ‖x‖_q and s_q(x). Applications to testing the hypothesis of sparsity and adaptive tuning of the Lasso are presented in Section V. In Section VI, we show that randomized measurements are essential for estimating s_q(x) in a minimax sense. In Section VII, we present simulations that confirm our theoretical results. Proofs are deferred to the appendices.

II. RECOVERY GUARANTEES IN TERMS OF s₂(x)

In this section, we state two propositions giving matching upper and lower bounds for the relative ℓ₂ reconstruction error of Basis Pursuit Denoising (BPDN) in terms of s₂(x). The main purpose here is to highlight the fact that s₂(x) and ‖x‖₀ play analogous roles with respect to the sample complexity of sparse recovery.

A. Setup for BPDN

In order to explain the connection between s₂(x) and recovery, we recall a fundamental result describing the BPDN algorithm, first obtained in the paper [2, Th. 2]. Our statement involves the T-term approximation x|_T, computed by retaining the largest T entries of x in magnitude, and setting all others to 0.

Theorem 1 [2], [36], [58]: Suppose the model (1) satisfies the conditions A1 and A2. Let x ∈ R^p \ {0} be arbitrary, and fix a number T ∈ {1, . . . , p}. Then, there are absolute constants c₂, c₃ > 0, and numbers c₀, c₁ > 0 depending only on the sub-Gaussian norm of G₀, such that the following statement is true. If n ≥ c₀ T log(pe/T), then with probability at least 1 − 2 exp(−c₁ n), any solution x̂ to the problem (BPDN) satisfies

    ‖x̂ − x‖₂/‖x‖₂ ≤ c₂ σε₀/‖x‖₂ + c₃ (1/√T) · ‖x − x|_T‖₁/‖x‖₂.   (7)

C. An Upper Bound in Terms of s₂(x)

A limitation of the previous bound is that the detailed relationship between T and the approximation error (1/√T) ‖x − x|_T‖₁/‖x‖₂ is typically unknown. Consequently, it is not clear how large n should be chosen in order to ensure that ‖x − x̂‖₂/‖x‖₂ is small with high probability. The next proposition resolves this issue by modifying the bound (7) so that the relative ℓ₂-error is bounded by an explicit function of n and the estimable parameter s₂(x).

Proposition 1: Suppose the model (1) satisfies the conditions A1 and A2, and let x ∈ R^p \ {0} be arbitrary. Then, there is an absolute constant κ₂ > 0, and numbers κ₀, κ₁, κ₃ > 0 depending only on the sub-Gaussian norm of G₀, such that the following statement is true. If n and p satisfy κ₀ log(κ₀ pe/n) ≤ n ≤ p, then with probability at least 1 − 2 exp(−κ₁ n), any solution x̂ to the problem (BPDN) satisfies

    ‖x̂ − x‖₂/‖x‖₂ ≤ κ₂ σε₀/‖x‖₂ + κ₃ √( s₂(x) log(pe/n) / n ).   (8)
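One way to see why s₂(x) can stand in for the T-term approximation error in (7) is the elementary bound ‖x − x|_T‖₁ ≤ ‖x‖₁, which gives (1/√T) · ‖x − x|_T‖₁/‖x‖₂ ≤ √(s₂(x)/T) for every T; taking T proportional to s₂(x) then controls this term. The following sketch (our own illustration, not code from the paper) verifies both the Rényi-entropy identity for s₂(x) and this inequality numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 200
# A compressible (but fully dense) signal: geometrically decaying magnitudes.
x = rng.standard_normal(p) * (0.95 ** np.arange(p))

# s_2(x) two ways: the Rényi-entropy definition (2) and the norm ratio (3).
pi = np.abs(x) / np.sum(np.abs(x))
s2_entropy = np.exp(-np.log(np.sum(pi ** 2)))        # exp(H_2(pi(x)))
s2_ratio = np.sum(np.abs(x)) ** 2 / np.sum(x ** 2)   # ||x||_1^2 / ||x||_2^2
assert np.isclose(s2_entropy, s2_ratio)

# Since ||x - x|_T||_1 <= ||x||_1, the approximation term in (7) obeys
# (1/sqrt(T)) * ||x - x|_T||_1 / ||x||_2 <= sqrt(s_2(x) / T) for every T.
mags = np.sort(np.abs(x))                 # magnitudes in ascending order
l2 = np.linalg.norm(x)
for T in range(1, p + 1):
    tail_l1 = mags[: p - T].sum()         # ||x - x|_T||_1: sum of p - T smallest
    assert tail_l1 / (np.sqrt(T) * l2) <= np.sqrt(s2_ratio / T) + 1e-12
```

Note that ‖x‖₀ = p = 200 for this vector, while s₂(x) is much smaller, which is exactly the distinction that motivates the proposition.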
III. ESTIMATION PROCEDURES FOR s_q(x) AND ‖x‖_q^q

Here we describe the model assumptions underlying our estimation procedure for s_q(x). (These are different from the assumptions in the previous section.) In scalar notation, we consider linear measurements given by

    y_i = ⟨a_i, x⟩ + σ ε_i,  i = 1, . . . , n.   (M)

1) The Deconvolution Model: For the remainder of the paper, we assume x ≠ 0 unless stated otherwise. The noise variables ε_i are generated in an i.i.d. manner from a distribution F₀, and when the a_i are random, we assume the sets {a₁, . . . , a_n} and {ε₁, . . . , ε_n} are independent. The noise variables are assumed to be symmetric about 0, with median(|ε₁|) = 1, and 0 < E|ε₁| < ∞, but they may have infinite variance. (The assumption of symmetry is only for convenience, and in Section III-B2.e we explain how it can be dropped.) A minor technical condition we place on F₀ is that the roots of its ch.f. ϕ₀ are isolated (i.e. no limit points). This condition is satisfied by many families of distributions, such as Gaussian, Student's t, Laplace, uniform[a, b], and stable laws. In fact, a large portion of the deconvolution literature assumes ϕ₀ has no roots at all.² The noise scale parameter σ > 0 and the distribution F₀ are treated as being known, which is a common assumption in deconvolution problems. (Simulations in Section VII address the effects of estimating the noise distribution.)

2) Asymptotics: We allow the parameters of the deconvolution model to vary as (n, p) → ∞. This means there is an implicit index ξ = 1, 2, . . . , such that n = n(ξ), p = p(ξ) and both diverge as ξ → ∞. It will turn out that our asymptotic results do not depend on the ratio p/n, and so we allow p to be arbitrarily large with respect to n. We also allow x = x(ξ), σ = σ(ξ) and a_i = a_i(ξ), but the distribution F₀ is fixed with respect to ξ. Going forward, we mostly suppress the index ξ.

A. Sketching With Stable Laws in the Presence of Noise

For any q ∈ (0, 2], the sketching technique offers a way to estimate ‖x‖_q from randomized linear measurements. Building on this technique, we estimate s_q(x) = (‖x‖_q/‖x‖₁)^(q/(1−q)) by estimating ‖x‖_q^q and ‖x‖₁ from separate measurement vectors a_i. The core idea is to generate a_i ∈ R^p using stable laws [61].

Definition 1: A random variable V has a symmetric q-stable distribution if there are constants γ > 0 and q ∈ (0, 2] such that its ch.f. has the form E[exp(√−1 t V)] = exp(−|γ t|^q) for all t ∈ R. We denote the distribution by V ∼ stable_q(γ), and γ is referred to as the scale parameter.

The most well-known examples of stable laws are the cases of q = 2 (Gaussian) and q = 1 (Cauchy). To fix notation, if a vector a₁ = (a₁₁, . . . , a₁p) ∈ R^p has i.i.d. entries drawn from stable_q(γ), we write a₁ ∼ stable_q(γ)^⊗p. Also, since our work involves different choices of q, we will write γ_q instead of γ. The link with ℓ_q norms hinges on the following basic property of stable laws.

Lemma 1: Let x ∈ R^p be fixed, and suppose a₁ ∼ stable_q(γ_q)^⊗p with parameters q ∈ (0, 2] and γ_q > 0. Then, the random variable ⟨a₁, x⟩ is distributed according to stable_q(γ_q ‖x‖_q).

Using this fact, if we generate a set of i.i.d. measurement vectors a₁, . . . , a_n from the distribution stable_q(γ_q)^⊗p and let ỹ_i = ⟨a_i, x⟩, then ỹ₁, . . . , ỹ_n is an i.i.d. sample from stable_q(γ_q ‖x‖_q). Hence, in the special case of noiseless linear measurements, the task of estimating ‖x‖_q^q is equivalent to a well-studied univariate problem: estimating the scale parameter of a stable law from an i.i.d. sample.

Our work below substantially extends this idea in the noisy case. We also emphasize that the extra deconvolution step is an important aspect of our method that distinguishes it from existing work in the sketching literature — where noise is typically not considered. Indeed, in the streaming computation literature, the model giving rise to ⟨a₁, x⟩ is quite different than in CS. Roughly speaking, the vector x is viewed as a massive data stream whose entries can be observed sequentially, but cannot be stored entirely in memory.

Remark: The parameter γ_q governs the "energy level" of the a_i. For instance, in the case of Gaussian measurements with a_i ∼ stable₂(γ₂)^⊗p, we have E‖a_i‖₂² = 2pγ₂². In general, as the energy level is increased, the effect of noise is diminished. Hence, we view γ_q as a physical aspect of the measurement system that is known to the user. Asymptotically, we allow γ_q = γ_q(ξ) to vary as (n, p) → ∞.

B. The Sub-Problem of Estimating ‖x‖_q^q

Our procedure for estimating s_q(x) uses two separate sets of measurements of the form (M) to compute estimators \widehat{‖x‖₁} and \widehat{‖x‖_q^q}. (The measurements can also be collected in a single batch by concatenating the a_i into a single matrix.) The respective sizes of each measurement set will be denoted by n₁ and n_q. To unify the discussion, we will describe just one procedure to compute \widehat{‖x‖_q^q} for any q ∈ (0, 2], since the ℓ₁ norm is a special case. The two estimators are combined to obtain an estimator of s_q(x), defined by

    ŝ_q(x) := ( \widehat{‖x‖_q^q} )^{1/(1−q)} / ( \widehat{‖x‖₁} )^{q/(1−q)},   (10)

which makes sense for any q ∈ (0, 2] except q = 1. The parameters ‖x‖₀ and s₁(x) can still be estimated in practice by using ŝ_q(x) for some value of q that is close to 0 or 1. (See also Section IV-C.)

a) An estimating equation based on characteristic functions: If we draw i.i.d. measurement vectors a_i ∼ stable_q(γ_q)^⊗p, with i = 1, . . . , n_q, then the ch.f. of y_i is given by

    Ψ(t) := E[exp(√−1 t y_i)] = exp(−γ_q^q |t|^q ‖x‖_q^q) · ϕ₀(σ t),   (11)

where t ∈ R, and we recall that ϕ₀ is the ch.f. of ε_i. Note that ϕ₀ is real-valued when ε_i is a symmetric random variable.

² It is known that a subset S ⊂ R is the zero set of a ch.f. if and only if S is symmetric, closed, and excludes 0 [60].
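Lemma 1 is easy to confirm empirically in the Cauchy case q = 1, where stable₁(γ) is the Cauchy law with scale γ, and the median of |Y| equals the scale for a centered Cauchy variable Y. The sketch below (our own illustration; the median-based scale estimate is a standard stand-in for the noiseless setting, not the paper's deconvolution estimator) recovers ‖x‖₁ from the sketches ỹ_i = ⟨a_i, x⟩:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, gamma1 = 300, 20000, 1.0

x = np.zeros(p)
x[:10] = rng.uniform(1.0, 2.0, size=10)     # a 10-sparse signal
l1 = np.sum(np.abs(x))                      # ||x||_1

# Entries i.i.d. standard Cauchy, i.e. a_i ~ stable_1(gamma1)^{tensor p}.
A = rng.standard_cauchy((n, p)) * gamma1
y_tilde = A @ x                             # noiseless sketches <a_i, x>

# By Lemma 1, y_tilde ~ Cauchy with scale gamma1 * ||x||_1, and for a
# centered Cauchy law the population median of |Y| equals its scale.
l1_hat = np.median(np.abs(y_tilde)) / gamma1
assert abs(l1_hat - l1) / l1 < 0.05
```

With n = 20000 sketches the relative error of this scale estimate is on the order of 1%, even though each ỹ_i has infinite mean — the behavior the stable-law machinery is designed to exploit.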
Using the measurements y₁, . . . , y_{n_q}, we can approximate Ψ(t) with the empirical ch.f.,

    Ψ̂_{n_q}(t) := (1/n_q) Σ_{i=1}^{n_q} exp(√−1 t y_i).   (12)

Next, by solving for ‖x‖_q^q in the approximate equation

    Ψ̂_{n_q}(t) ≈ exp(−γ_q^q |t|^q ‖x‖_q^q) · ϕ₀(σ t),   (13)

we obtain an estimator \widehat{‖x‖_q^q}. To make the dependence on t explicit, we write ν̂_q(t) = \widehat{‖x‖_q^q}. Proceeding with the arithmetic in the previous line leads us to define

    ν̂_q(t) := (−1/(γ_q^q |t|^q)) · Log₊( Re[ Ψ̂_{n_q}(t) / ϕ₀(σ t) ] ),   (14)

when t ≠ 0 and ϕ₀(σ t) ≠ 0. The symbol Re(z) denotes the real part of a complex number z. Also, we define Log₊(r) := log(|r|) for any real number r ≠ 0, and Log₊(0) := −1 (arbitrarily). When t = 0 or ϕ₀(σ t) = 0, we arbitrarily define ν̂_q(t) = 1. (We only mention these details so that ν̂_q(t) is defined for all t ∈ R, and such details are irrelevant asymptotically.)

b) Optimal selection of the tuning parameter: A crucial aspect of the estimator ν̂_q(t) is the choice of t ∈ R, which plays the role of a tuning parameter. This choice turns out to be somewhat delicate, especially in situations where ‖x‖_q → ∞ as (n, p) → ∞. To see why this matters, consider the equation

    ν̂_q(t)/‖x‖_q^q = (−1/(γ_q |t| ‖x‖_q)^q) · Log₊( Re[ Ψ̂_{n_q}(t) / ϕ₀(σ t) ] ).   (15)

To state the uniform CLT, we need to introduce the noise-to-signal ratio

    ρ_q = ρ_q(ξ) := σ/(γ_q ‖x‖_q).   (17)

Although we allow ρ_q to vary with (n_q, p), we assume that it stabilizes to a finite limiting value:

A3: For each q ∈ (0, 2], there is a constant ρ̄_q ∈ [0, ∞) such that ρ_q = ρ̄_q + o(n_q^{−1/2}) as (n_q, p) → ∞.

This assumption merely encodes the idea that the signal is not overwhelmed by noise asymptotically. In stating the following result, we use →_w to denote weak convergence.

Theorem 2 (Uniform CLT for ℓ_q Norm Estimator): Let q ∈ (0, 2]. Assume that the measurement model (M) and assumption A3 hold. Let t̂ be any function of y₁, . . . , y_{n_q} that satisfies

    γ_q t̂ ‖x‖_q →_P c₀   (18)

as (n_q, p) → ∞ for some finite constant c₀ ≠ 0 with ϕ₀(ρ̄_q c₀) ≠ 0. Then, the estimator ν̂_q(t̂) satisfies

    √n_q ( ν̂_q(t̂)/‖x‖_q^q − 1 ) →_w N(0, v_q(c₀, ρ̄_q))   (19)

as (n_q, p) → ∞, where the limiting variance v_q(c₀, ρ̄_q) is strictly positive and defined according to the formula

    v_q(c₀, ρ̄_q) := (1/|c₀|^{2q}) [ (1/(2 ϕ₀(ρ̄_q|c₀|)²)) exp(2|c₀|^q) + (ϕ₀(2ρ̄_q|c₀|)/(2 ϕ₀(ρ̄_q|c₀|)²)) exp((2 − 2^q)|c₀|^q) − 1 ].   (20)
When $q = 2$, one additional detail must be handled. In this case, it may happen for certain noise distributions that $v_2(c, \bar\rho_2)$ tends to a minimum as $c \to 0^+$, i.e. $\lim_{c \to 0^+} v_2(c, \bar\rho_2) = \inf_{c > 0} v_2(c, \bar\rho_2)$. This creates a technical nuisance in using Theorem 2 because $v_2(c_0, \bar\rho_2)$ is not defined for $c_0 = 0$. There are various ways of handling this "edge case", but for simplicity, we construct $\hat t$ so that $\gamma_2\, \hat t\, \|x\|_2 \xrightarrow{P} \varepsilon_2$ for some (arbitrarily) small constant $\varepsilon_2 > 0$. For this reason, in the particular case of $q = 2$, we will restrict the domain of the extended function $\tilde v_2(\cdot, \cdot)$ to be $[\varepsilon_2, \infty) \times [0, \infty)$. The following lemma summarizes the properties of the extended variance function that will be needed later on.

Lemma 2 (Extended Variance Function): Suppose that $\varphi_0$ satisfies the assumptions of the model (M), and let $v_q$ be as in formula (20). For each $q \in (0, 2)$, put $\varepsilon_q := 0$, and let $\varepsilon_2 > 0$. For all values $q \in (0, 2]$, define the function $\tilde v_q : [\varepsilon_q, \infty) \times [0, \infty) \to [0, \infty]$ according to

$$\tilde v_q(c, \rho) := \begin{cases} v_q(c, \rho) & \text{if } c \neq 0 \text{ and } \varphi_0(\rho c) \neq 0, \\ \infty & \text{otherwise.} \end{cases} \tag{21}$$

Then, the function $\tilde v_q(\cdot, \cdot)$ is continuous on $[\varepsilon_q, \infty) \times [0, \infty)$, and for any $\rho \geq 0$, the function $\tilde v_q(\cdot, \rho)$ attains its minimum in the set $[\varepsilon_q, \infty)$.

(i) Remark: A simple consequence of the definition of $\tilde v_q(\cdot, \cdot)$ is that any choice of $c \in [\varepsilon_q, \infty)$ that minimizes $\tilde v_q(\cdot, \bar\rho_q)$ also minimizes $v_q(\cdot, \bar\rho_q)$. Hence there is nothing lost in working with $\tilde v_q$.

(ii) Minimizing the extended variance function: We are now in position to specify the limiting value $c_q(\bar\rho_q)$ from line (20). For any $\rho \geq 0$, we define

$$c_q(\rho) \in \operatorname*{argmin}_{c \geq \varepsilon_q}\ \tilde v_q(c, \rho), \tag{22}$$

where $q \in (0, 2]$ and $\varepsilon_q$ is as defined in Lemma 2. Note that $\tilde v_q$ is a known function, and so the value $c_q(\rho)$ can be computed for any given $\rho$. However, since the limiting noise-to-signal ratio $\bar\rho_q$ is unknown, it will be necessary to use $c_q(\hat\rho_q)$ for some estimate $\hat\rho_q$ of $\bar\rho_q$, which is described below.

d) A procedure for optimal selection of t: The first step of our procedure for selecting $t$ is to compute a simple "pilot" value $\hat t_{\mathrm{pilot}}$. Next, we refine the pilot value to obtain an optimal value $\hat t_{\mathrm{opt}}$ that will satisfy

$$\hat t_{\mathrm{opt}}\, \gamma_q \|x\|_q \xrightarrow{P} c_q(\bar\rho_q). \tag{23}$$

To describe the pilot value, let $\eta_0 > 0$ be any number such that $\varphi_0(\eta) > \tfrac{1}{2}$ for all $\eta \in [0, \eta_0]$ (which exists for any characteristic function). Also, define the median absolute deviation statistic $\hat m_q := \mathrm{median}(|y_1|, \ldots, |y_{n_q}|)$, which is a coarse-grained, yet robust, estimate of $\gamma_q \|x\|_q$. Then, the pilot value is defined as $\hat t_{\mathrm{pilot}} := \min\big(\tfrac{1}{\hat m_q}, \tfrac{\eta_0}{\sigma}\big)$, which is finite with probability 1, even if $\sigma = 0$. The role of $\hat t_{\mathrm{pilot}}$ is to provide a ratio-consistent estimator of $\|x\|_q^q$ through the statistic $\hat\nu_q(\hat t_{\mathrm{pilot}})$, but the limiting variance may not be optimal. With $\hat\nu_q(\hat t_{\mathrm{pilot}})$ in hand, we can then obtain a consistent estimator of $\bar\rho_q$,

$$\hat\rho_q := \frac{\sigma}{\gamma_q\, \big(|\hat\nu_q(\hat t_{\mathrm{pilot}})|\big)^{1/q}}. \tag{24}$$

Next, we use $\hat\rho_q$ to estimate $c_q(\bar\rho_q)$ via $c_q(\hat\rho_q)$. Finally, we obtain an optimal choice of $t$ using

$$\hat t_{\mathrm{opt}} := \frac{c_q(\hat\rho_q)}{\gamma_q\, \big(|\hat\nu_q(\hat t_{\mathrm{pilot}})|\big)^{1/q}}. \tag{25}$$

In the exceptional case when $\hat\nu_q(\hat t_{\mathrm{pilot}}) = 0$, the estimates $\hat\rho_q$ and $\hat t_{\mathrm{opt}}$ can be arbitrarily set to 1, but this choice does not matter asymptotically. Consistency results for $\hat\nu_q(\hat t_{\mathrm{pilot}})$, $\hat\rho_q$, and $\hat t_{\mathrm{opt}}$ are given in Propositions 3 and 4 of Section IV.

e) Handling asymmetric noise distributions: The proposed method can be adjusted in a simple way to allow for asymmetric noise distributions, and the theory for the symmetric case carries over. To explain the adjustment, suppose that the deconvolution model $y_i = \langle a_i, x\rangle + \sigma \epsilon_i$ holds as before, except that the noise variables $\epsilon_i$ may be asymmetric. Also, suppose that the measurement vectors $a_i$ are still drawn i.i.d. from $\mathrm{stable}_q(\gamma_q)^{\otimes p}$ for $i = 1, \ldots, n_q$.

After $y_1, \ldots, y_{n_q}$ are observed, they can be "symmetrized" via multiplication with an independent sequence of i.i.d. Rademacher variables, say $r_i = \pm 1$ with equal probability. Define the symmetrized measurements as $\breve y_i := r_i y_i$. Due to our choice of the $a_i$, each variable $\langle a_i, x\rangle$ is symmetric, and it follows that

$$\breve y_i \overset{d}{=} \langle a_i, x\rangle + \sigma\, r_i \epsilon_i, \tag{26}$$

where $\overset{d}{=}$ denotes equality in distribution. (This can be checked by computing the ch.f. of each side.) Hence, by symmetrizing, we obtain "new" observations $\breve y_i$ such that the new noise variable $r_i \epsilon_i$ is symmetric. Likewise, the original method can be applied to the $\breve y_i$, with the only change being that $\varphi_0$ must be replaced everywhere by the ch.f. of the new noise variable $r_i \epsilon_i$, i.e. $\breve\varphi_0(t) = \tfrac{1}{2}\varphi_0(t) + \tfrac{1}{2}\varphi_0(-t)$ for all $t \in \mathbb{R}$. Similarly, our theoretical results for the symmetric case can be applied with $\breve\varphi_0$ replacing $\varphi_0$. Simulations demonstrating the efficacy of the symmetrization step are given in Section VII (including the case where $\breve\varphi_0$ is estimated).

IV. MAIN RESULTS FOR ESTIMATORS AND CONFIDENCE INTERVALS

In this section, we first show in Proposition 3 that $\hat t_{\mathrm{pilot}}$ leads to consistent estimates of $\|x\|_q^q$ and $\bar\rho_q$. Next, we show that the optimal constant $c_q(\bar\rho_q)$ and optimal variance $v_q(c_q(\bar\rho_q), \bar\rho_q)$ are consistently estimated. These estimators then lead to adaptive confidence intervals for $\|x\|_q^q$ and $s_q(x)$, in Theorem 3 and Corollary 1.

A. Consistency Results

Proposition 3 (Consistency of the Pilot Estimator): Let $q \in (0, 2]$ and assume that the measurement model (M) and A3 hold. Then as $(n_q, p) \to \infty$,

$$\frac{\hat\nu_q(\hat t_{\mathrm{pilot}})}{\|x\|_q^q} \xrightarrow{P} 1 \quad\text{and}\quad \hat\rho_q \xrightarrow{P} \bar\rho_q. \tag{27}$$
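The pilot value and the symmetrization adjustment are both one-liners in code. The sketch below uses our own helper names (`t_pilot`, `symmetrize`, `phi0_breve`), which are not from the paper; the refinement to $\hat t_{\mathrm{opt}}$ is omitted, since it requires minimizing $\tilde v_q$ numerically.

```python
import numpy as np

def t_pilot(y, eta0, sigma):
    """Pilot tuning value t_hat_pilot = min(1/m_hat, eta0/sigma), where
    m_hat = median(|y_1|, ..., |y_n|) is a robust scale estimate."""
    m_hat = np.median(np.abs(y))
    if sigma == 0:
        return 1.0 / m_hat           # finite even when sigma = 0
    return min(1.0 / m_hat, eta0 / sigma)

def symmetrize(y, rng):
    """Multiply observations by i.i.d. Rademacher signs, as in Eq. (26)."""
    r = rng.choice([-1.0, 1.0], size=len(y))
    return r * y

def phi0_breve(phi0):
    """Ch.f. of the symmetrized noise r_i * eps_i: the even part of phi0."""
    return lambda t: 0.5 * (phi0(t) + phi0(-t))

rng = np.random.default_rng(0)
y = rng.standard_cauchy(1000) * 3.0              # scale-3 Cauchy observations
tp = t_pilot(y, eta0=0.3, sigma=0.25)
y_sym = symmetrize(y, rng)
phi_asym = lambda t: np.exp(1j * 0.7 * t - t ** 2 / 2)  # a shifted-normal ch.f.
phi_sym = phi0_breve(phi_asym)
```

By construction `phi_sym` is an even, real-valued function, which is what allows the symmetric-noise theory to apply after symmetrization.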
LOPES: UNKNOWN SPARSITY IN CS: DENOISING AND INFERENCE 5153
1) Remarks: We now turn our attention from the pilot value $\hat t_{\mathrm{pilot}}$ to the optimal value $\hat t_{\mathrm{opt}}$. In order to construct $\hat t_{\mathrm{opt}}$, our method relies on the M-estimator $c_q(\hat\rho_q) \in \operatorname*{argmin}_{c \geq \varepsilon_q} \tilde v_q(c, \hat\rho_q)$. Consistency proofs for M-estimators typically require that the objective function has a unique optimizer, and our situation is no exception. Note that we do not need to assume that a minimizer exists, since this is guaranteed by Lemma 2.

A4: For each $q \in (0, 2]$, the function $v_q(\cdot, \bar\rho_q)$ has at most one minimizer in $[\varepsilon_q, \infty)$, where $\varepsilon_q$ is defined in Lemma 2.

In an approximate sense, this assumption can be checked empirically by simply plotting the function $v_q(\cdot, \hat\rho_q)$. It is also possible to analytically verify that it holds in the case of $\mathrm{stable}_q$ noise, but we omit these details. Based on inspection of plots, the assumption appears to hold for a variety of other parametric noise distributions, such as Laplace, uniform$[-1, 1]$, and the $t$ distribution.

Proposition 4 (Consistency of $c_q(\hat\rho_q)$): Let $q \in (0, 2]$ and assume that the measurement model (M) holds, as well as assumptions A3 and A4. Then, as $(n_q, p) \to \infty$,

$$c_q(\hat\rho_q) \xrightarrow{P} c_q(\bar\rho_q) \quad\text{and}\quad \hat t_{\mathrm{opt}}\, \gamma_q \|x\|_q \xrightarrow{P} c_q(\bar\rho_q). \tag{28}$$

Furthermore, $\tilde v_q(c_q(\hat\rho_q), \hat\rho_q) \xrightarrow{P} \tilde v_q(c_q(\bar\rho_q), \bar\rho_q)$.

2) Remark: This result is proved in Appendix C.

B. Confidence Intervals for $\|x\|_q^q$ and $s_q(x)$

In this subsection, we assemble the work in Proposition 4 with Theorem 2 to obtain confidence intervals for $\|x\|_q^q$ and $s_q(x)$. In the next two results, we will use $z_{1-\alpha}$ to denote the $1-\alpha$ quantile of the standard normal distribution, i.e. $\Phi(z_{1-\alpha}) = 1 - \alpha$. To allow our result to be applied to both one-sided and two-sided intervals, we state our result in terms of two possibly distinct quantiles $z_{1-\alpha}$ and $z_{1-\alpha'}$.

Theorem 3 (Confidence Interval for $\|x\|_q^q$): Let $q \in (0, 2]$, and define the estimated variance

$$\hat\omega_q := \tilde v_q(c_q(\hat\rho_q), \hat\rho_q). \tag{29}$$

Assume that the measurement model (M) holds, as well as assumptions A3 and A4. Then as $(n_q, p) \to \infty$,

$$\frac{\sqrt{n_q}}{\sqrt{\hat\omega_q}}\Big(\frac{\hat\nu_q(\hat t_{\mathrm{opt}})}{\|x\|_q^q} - 1\Big) \xrightarrow{w} N(0, 1), \tag{30}$$

and consequently for any fixed $\alpha, \alpha' \in [0, \tfrac12]$, the probability

$$P\Big(\Big(1 - \tfrac{\sqrt{\hat\omega_q}\, z_{1-\alpha}}{\sqrt{n_q}}\Big)\hat\nu_q(\hat t_{\mathrm{opt}}) \leq \|x\|_q^q \leq \Big(1 + \tfrac{\sqrt{\hat\omega_q}\, z_{1-\alpha'}}{\sqrt{n_q}}\Big)\hat\nu_q(\hat t_{\mathrm{opt}})\Big)$$

converges to $1 - \alpha - \alpha'$.

1) Remarks: This result follows by combining Theorem 2 with Proposition 4. We note that if the limit (30) is used directly to obtain a confidence interval for $\|x\|_q^q$, the resulting formulas are somewhat cumbersome. Instead, a simpler confidence interval (given above) is obtained using a CLT for the reciprocal $\|x\|_q^q / \hat\nu_q(\hat t_{\mathrm{opt}})$, via the delta method.

As a corollary of Theorem 3, we obtain a CLT and a confidence interval for $\hat s_q(x)$ by combining the estimators $\hat\nu_q$ and $\hat\nu_1$ with their respective $\hat t_{\mathrm{opt}}$ values. Since the norm estimators rely on measurement sets of sizes $n_q$ and $n_1$, we make the following simple scaling assumption:

A5: For each $q \in (0, 2] \setminus \{1\}$, there is a constant $\bar\pi_q \in (0, 1)$ such that, as $(n_1, n_q, p) \to \infty$,

$$\frac{n_q}{n_1 + n_q} = \bar\pi_q + o(n_q^{-1/2}). \tag{31}$$

Corollary 1 (Confidence Interval for $s_q(x)$): Assume $q \in (0, 2] \setminus \{1\}$, and that the conditions of Theorem 3 hold, as well as assumption A5. Assume $\hat s_q(x)$ is constructed from independent sets of measurement vectors drawn from $\mathrm{stable}_1(\gamma_1)^{\otimes p}$ and $\mathrm{stable}_q(\gamma_q)^{\otimes p}$, of sizes $n_1$ and $n_q$. Letting $\hat\omega_q$ be as in Theorem 3, define

$$\pi_q := \frac{n_q}{n_1 + n_q} \quad\text{and}\quad \hat\vartheta_q := \frac{\hat\omega_q}{\pi_q}\Big(\frac{q}{1-q}\Big)^2 + \frac{\hat\omega_1}{1 - \pi_q}\Big(\frac{1}{1-q}\Big)^2.$$

Then as $(n_1, n_q, p) \to \infty$,

$$\frac{\sqrt{n_1 + n_q}}{\sqrt{\hat\vartheta_q}}\Big(\frac{\hat s_q(x)}{s_q(x)} - 1\Big) \xrightarrow{w} N(0, 1), \tag{32}$$

and consequently for any fixed $\alpha, \alpha' \in [0, \tfrac12]$, the probability

$$P\Big(\Big(1 - \tfrac{\sqrt{\hat\vartheta_q}\, z_{1-\alpha}}{\sqrt{n_1+n_q}}\Big)\hat s_q(x) \leq s_q(x) \leq \Big(1 + \tfrac{\sqrt{\hat\vartheta_q}\, z_{1-\alpha'}}{\sqrt{n_1+n_q}}\Big)\hat s_q(x)\Big)$$

converges to $1 - \alpha - \alpha'$.

2) Remark: As in Theorem 3, we give a simpler formula for the confidence interval above by using a CLT for the reciprocal $s_q(x)/\hat s_q(x)$.

C. Estimating $\|x\|_0$ With $\hat s_q(x)$ and Small $q$

The following proposition quantifies the approximation $\hat s_q(x) \approx \|x\|_0$ when $q$ is close to 0. To state the proposition, define the dynamic range of a non-zero signal $x \in \mathbb{R}^p$ as

$$\mathrm{DNR}(x) := \frac{\|x\|_\infty}{|x|_{\min}}, \tag{33}$$

where $|x|_{\min}$ is the magnitude of the smallest non-zero entry of $x$, i.e. $|x|_{\min} = \min\{|x_j| : x_j \neq 0,\ j = 1, \ldots, p\}$. The result involves no randomness and is applicable to any estimator (say $\tilde s_q(x)$).

Proposition 5: Let $q \in (0, 1)$, $x \in \mathbb{R}^p \setminus \{0\}$, and let $\tilde s_q(x)$ be any real number. Then,

$$\Big|\frac{\tilde s_q(x)}{\|x\|_0} - 1\Big| \ \leq\ \Big|\frac{\tilde s_q(x)}{s_q(x)} - 1\Big| + \frac{q}{1-q}\Big(\log(\mathrm{DNR}(x)) + q \log(\|x\|_0)\Big).$$

Remarks: To interpret the proposition, note that when $\tilde s_q(x)$ is chosen to be the proposed estimator $\hat s_q(x)$, the first term $\big|\frac{\hat s_q(x)}{s_q(x)} - 1\big|$ is already controlled by Corollary 1. The second term in the bound is the "approximation error" that improves for smaller choices of $q$. A favorable property of the bound is that it behaves better in the specialized case where estimating $\|x\|_0$ is of interest, i.e. when the non-zero coordinates of $x$ have roughly similar sizes. In this case, the quantity $\log(\mathrm{DNR}(x))$ will not be too large. On the other hand, if $\log(\mathrm{DNR}(x))$ is large, then $x$ has non-zero coordinates of very different sizes, and $\|x\|_0$ may not be the best measure of sparsity to estimate.
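Once $\hat\omega_1$ and $\hat\omega_q$ are available, the interval of Corollary 1 is a short computation. The sketch below uses hypothetical input values for the $\hat\omega$'s (in the actual method they come from the minimization of $\tilde v_q$) and the standard-library normal quantile; the helper name `sq_interval` is ours.

```python
import math
from statistics import NormalDist

def sq_interval(s_hat, omega_q, omega_1, n_q, n_1, q, alpha, alpha_p):
    """Two-sided interval for s_q(x) per Corollary 1 (miss prob. alpha+alpha')."""
    pi_q = n_q / (n_1 + n_q)
    theta = (omega_q / pi_q) * (q / (1 - q)) ** 2 \
          + (omega_1 / (1 - pi_q)) * (1 / (1 - q)) ** 2
    half = math.sqrt(theta) / math.sqrt(n_1 + n_q)
    z_lo = NormalDist().inv_cdf(1 - alpha)      # z_{1-alpha}
    z_hi = NormalDist().inv_cdf(1 - alpha_p)    # z_{1-alpha'}
    return ((1 - half * z_lo) * s_hat, (1 + half * z_hi) * s_hat)

# Hypothetical inputs, purely illustrative: theta = 16 + 3 = 19 here.
lo, hi = sq_interval(s_hat=12.0, omega_q=2.0, omega_1=1.5,
                     n_q=300, n_1=300, q=2.0, alpha=0.025, alpha_p=0.025)
```

With these inputs the 95% interval is roughly $(7.81,\ 16.19)$, reflecting the $1/\sqrt{n_1+n_q}$ relative-error scale of the method.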
δ : $\mathbb{R}^n \to \mathbb{R}$ have a worst-case relative error $|\delta(Ax)/s_q(x) - 1|$ that is much larger than $1/\sqrt{n_1 + n_q}$. Specifically, when $q \in [0, 2]$, we give a lower bound that is of order $(1 - \frac{n}{p})^2$. More informally, this means that there is always a choice of $x$ that can defeat a deterministic procedure, whereas the estimator $\hat s_q(x)$ based on randomized measurements is likely to succeed for any $x$. In stating the following result, we note that it involves no randomness, since the measurements $y = Ax$ are noiseless and obtained from a deterministic matrix $A$. Furthermore, the bounds are non-asymptotic.

Theorem 4: Suppose $n < p$, and $q \in [0, 2]$. Then, the minimax relative error for estimating $s_q(x)$ from noiseless deterministic measurements $y = Ax$ satisfies

$$\inf_{A \in \mathbb{R}^{n \times p}}\ \inf_{\delta : \mathbb{R}^n \to \mathbb{R}}\ \sup_{x \in \mathbb{R}^p \setminus \{0\}} \Big|\frac{\delta(Ax)}{s_q(x)} - 1\Big| \ \geq\ \frac{1}{2\pi e}\Big(1 - \frac{n}{p}\Big)^2 - \frac{1}{2p}.$$

Alternatively, if $q \in (2, \infty]$, then

$$\inf_{A \in \mathbb{R}^{n \times p}}\ \inf_{\delta : \mathbb{R}^n \to \mathbb{R}}\ \sup_{x \in \mathbb{R}^p \setminus \{0\}} \Big|\frac{\delta(Ax)}{s_q(x)} - 1\Big| \ \geq\ \frac{1}{\sqrt{2\pi e}}\,\frac{\sqrt{1 - (n/p)}}{1 + 4\log(2p)} - \frac{1}{2p}.$$

Remarks: The proof of this result is based on the classical technique of a two-point prior. In essence, the idea is that for any choice of $A$, it is possible to find two signals $\tilde x$ and $x^\circ$ that are indistinguishable with respect to $A$, i.e. $A\tilde x = Ax^\circ$, and yet have very different sparsity levels, $s_q(x^\circ) \ll s_q(\tilde x)$. Due to the relation $A\tilde x = Ax^\circ$, the statistician has no way of knowing whether $\tilde x$ or $x^\circ$ has been selected by nature, and if nature chooses $x^\circ$ and $\tilde x$ with equal probability, then it is impossible for the statistician to improve upon the trivial estimator $\frac12 s_q(\tilde x) + \frac12 s_q(x^\circ)$ (which does not even make use of the data). Furthermore, since $s_q(x^\circ) \ll s_q(\tilde x)$, it follows that the trivial estimator has a large relative error, implying that the minimax relative error is also large. (A proof is given in Appendix D.)

VII. SIMULATIONS

In this section, we describe four sets of simulations that validate our theoretical results and demonstrate the effectiveness of our methodology. Section VII-A studies the expected relative error $E\big|\frac{\hat s_2(x)}{s_2(x)} - 1\big|$ and how it is affected by the dimension $p$, the sparsity level $s_2(x)$, and the noise-to-signal ratio $\rho_2$. Section VII-B considers the case where the noise distribution is unknown and asymmetric. (The effects of estimating $\varphi_0$ and $\sigma$, as well as the effect of the symmetrization step (cf. Section III-B.2e), are measured under several values of $\rho_2$.) Section VII-C considers the estimation of $\|x\|_0$ using $\hat s_q(x)$ for a small value of $q$. (This simulation also treats the noise distribution as unknown.) Lastly, Section VII-D examines the quality of the normal approximation for $\hat s_2(x)$ given in Corollary 1.

A. Error Dependence on Dimension, Sparsity, and Noise

1) Design of Simulations: When estimating $s_2(x)$, our proposed method requires a set of $n_1$ Cauchy measurements and a set of $n_2$ Gaussian measurements. In each of the three plots in Figure 2, we considered a sequence of pairs $(n_1, n_2) = (50, 50), (100, 100), \ldots, (500, 500)$. For a fixed pair $(n_1, n_2)$, we generated 500 independent pairs of vectors $y^{(1)} \in \mathbb{R}^{n_1}$ and $y^{(2)} \in \mathbb{R}^{n_2}$ according to

$$y^{(1)} = A^{(1)} x + \sigma \epsilon^{(1)} \quad\text{and}\quad y^{(2)} = A^{(2)} x + \sigma \epsilon^{(2)}, \tag{40}$$

where the entries of $A^{(1)} \in \mathbb{R}^{n_1 \times p}$ are i.i.d. $\mathrm{stable}_1(1)$ (Cauchy) variables, and the entries of $A^{(2)} \in \mathbb{R}^{n_2 \times p}$ are i.i.d. $\mathrm{stable}_2(1)$ (Gaussian) variables. In all three plots, the noise vectors $\epsilon^{(1)}$ and $\epsilon^{(2)}$ were generated with i.i.d. entries from a centered $t$-distribution on 2 degrees of freedom (which has infinite variance). Applying our estimation method to each pair $(y^{(1)}, y^{(2)})$ produced 500 realizations of $\hat s_2(x)$ for each $(n_1, n_2)$. We then averaged the quantity $\big|\frac{\hat s_2(x)}{s_2(x)} - 1\big|$ over the 500 realizations as an approximation of $E\big|\frac{\hat s_2(x)}{s_2(x)} - 1\big|$.

The simulation just described was conducted under several choices of the parameters $p$, $s_2(x)$, and $\rho_2$, with each parameter corresponding to a separate plot in Figure 2. In all three plots, the signal $x \in \mathbb{R}^p$ was chosen to have decaying entries $x_i = c \cdot i^{-\tau(x)}$, with $c$ chosen so that $\|x\|_2 = 1$. The dimension $p$ was set to $10^4$ in all cases, except in the top plot, where $p = 10, 10^2, 10^3, 10^4$. The decay exponent $\tau(x)$ was set to 1 in all cases, except in the middle plot where $\tau(x) = 2, 1, 1/2, 1/10$ in order to give a variety of sparsity levels. With regard to the generation of measurements, the energy level was set to $\gamma_1 = \gamma_2 = 1$ in all cases, which gives $\rho_2 = \frac{\sigma}{\gamma_2 \|x\|_2} = \sigma$. In turn, we set $\rho_2 = \sigma = 1/4$ in all cases, except in the bottom plot where $\rho_2 = \sigma = 0, 0.1, 0.3, 0.5$. Lastly, our algorithm for computing $\hat s_2(x)$ involves a choice of two input parameters $\eta_0$ and $\varepsilon_2$. In all cases, we set $\eta_0 = 0.3$ and $\varepsilon_2 = 0.3$. The performance of $\hat s_2(x)$ did not change much under different choices.

2) Comments: There are three high-level conclusions to take away from Figure 2. The first is that the black theoretical curves agree well with the colored empirical curves. Second, the average relative error $E\big|\frac{\hat s_2(x)}{s_2(x)} - 1\big|$ has no observable dependence on $p$ or $s_2(x)$ (when $\rho_2$ is held fixed), as expected from Corollary 1. Third, the dependence on the noise-to-signal ratio is mild. (Note that the theoretical curves do predict that the relative error for $\rho_2 = 0.5$ will be somewhat larger than for $\rho_2 = 0$.)

In all three plots, the theoretical curves were computed in the following way. From Corollary 1 we have $\big|\frac{\hat s_2(x)}{s_2(x)} - 1\big| \approx \frac{\sqrt{\vartheta_2}}{\sqrt{n_1 + n_2}}\,|Z|$, where $Z$ is a standard Gaussian random variable, and $\vartheta_2$ is the limit of $\hat\vartheta_2$ that results from Corollary 1 and Proposition 4. Since $E|Z| = \sqrt{2/\pi}$, the theoretical curves are simply $\sqrt{\frac{2\vartheta_2}{\pi}}\big/\sqrt{n_1 + n_2}$, as a function of $(n_1 + n_2)$. Note that $\vartheta_2$ depends only on $\rho_2$, which is why there is only one theoretical curve in the top two plots.

B. Robustness to Unknown and Asymmetric Noise Distribution

1) Design of Simulations: To study how the estimator $\hat s_2(x)$ performs when the noise distribution is unknown and asymmetric, we considered a modified version of the experiment in Section VII-A, where $\rho_2$ was ramped up. (All parameter settings were the same as in the bottom plot of Figure 2,
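To reproduce the flavor of this design at small scale, the sketch below estimates $s_2(x) = \|x\|_1^2/\|x\|_2^2$ from noiseless Cauchy and Gaussian measurements, using a crude median-based tuning value rather than the paper's $\hat t_{\mathrm{opt}}$ (so the variance is not optimal); all helper names are ours, and the noiseless setting means $\varphi_0 \equiv 1$.

```python
import numpy as np

def nu_hat(y, t, q, gamma_q):
    """Noiseless version of the estimator (14): phi_0 = 1 when sigma = 0."""
    psi = np.mean(np.exp(1j * t * y))
    return -np.log(abs(np.real(psi))) / (gamma_q ** q * abs(t) ** q)

rng = np.random.default_rng(3)
p, n1, n2 = 200, 20000, 20000
x = 1.0 / np.arange(1, p + 1)                     # decay exponent tau = 1
x /= np.linalg.norm(x)                            # normalize so ||x||_2 = 1
s2_true = np.sum(np.abs(x)) ** 2 / np.sum(x ** 2)

y1 = rng.standard_cauchy((n1, p)) @ x             # stable_1(1) measurements
y2 = rng.normal(scale=np.sqrt(2.0), size=(n2, p)) @ x  # stable_2(1) is N(0, 2)
t1 = 1.0 / np.median(np.abs(y1))                  # crude pilot-style tuning
t2 = 1.0 / np.median(np.abs(y2))
l1_est = nu_hat(y1, t1, q=1, gamma_q=1.0)         # estimates ||x||_1
l2sq_est = nu_hat(y2, t2, q=2, gamma_q=1.0)       # estimates ||x||_2^2
s2_est = l1_est ** 2 / l2sq_est
```

The key fact being exploited is that $\langle a_i, x\rangle$ is Cauchy with scale $\|x\|_1$ under Cauchy entries, and $N(0, 2\|x\|_2^2)$ under $\mathrm{stable}_2(1)$ entries, so each batch of measurements encodes one of the two norms.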
Fig. 3. Top panel: When the noise distribution is unknown, and when the parameters $\varphi_0$ and $\sigma$ are estimated, the performance of $\hat s_2(x)$ is gradually affected by an increase in $\rho_2$. The performance is similar to that shown in Figure 2, where the noise distribution is treated as known. Bottom panel: The plot shows that the parameters $\|x\|_0$ and $\mathrm{DNR}(x)$ have a small effect on the averaged relative error of $\hat s_{0.05}(x)$.
Fig. 4. The solid blue curve is the standard normal density in all four plots. The histograms were generated from 3000 realizations of $\sqrt{\frac{n_1+n_2}{\hat\vartheta_q}}\big(\frac{\hat s_2(x)}{s_2(x)} - 1\big)$ under four conditions, depending on the number of measurements $(n_1+n_2)$ and the tails of the noise distribution. The dashed red curve is the default R kernel density estimate.
2) Comments: The top panel of Figure 3 shows that as the noise-to-signal ratio $\rho_2$ increases from 0 to 0.5, the averaged relative error $E\big|\frac{\hat s_2(x)}{s_2(x)} - 1\big|$ increases in a gradual manner. Comparing this plot with the bottom plot in Figure 2, we see that when the noise distribution is estimated with only 100 samples, the performance of $\hat s_2(x)$ is not much different than when the noise distribution is known.

C. Simulations for Estimating $\|x\|_0$ With $\hat s_q(x)$ and Small $q$

1) Design of Simulations: We now consider the estimation of $\|x\|_0$ using $\hat s_q(x)$ with $q = 0.05$. This experiment followed essentially the same design as in Section VII-B, but with two differences. (All parameter settings were the same, except where stated below.) First, instead of sampling $A^{(2)} \in \mathbb{R}^{n_2 \times p}$ with i.i.d. rows from the Gaussian distribution $\mathrm{stable}_2(\gamma_2)^{\otimes p}$ with $\gamma_2 = 1$, we generated matrices $A^{(q)}$ with i.i.d. rows drawn from $\mathrm{stable}_q(\gamma_q)^{\otimes p}$ with $\gamma_q = 1$. Second, instead of varying $\rho_2$, we varied $\|x\|_0$ and the dynamic range $\mathrm{DNR}(x)$ (cf. Proposition 5). Specifically, we considered four signals in $\mathbb{R}^{10^4}$ of the form $x = c\,\big(1, \tfrac12, \tfrac13, \ldots, \tfrac{1}{\|x\|_0}, 0, \ldots, 0\big)$, where $\|x\|_0 = 10, 10^2, 10^3, 10^4$, and $c$ was chosen so that $\|x\|_2 = 1$.

This experiment also involved an unknown and asymmetric noise distribution, and hence included the effects of estimating $\varphi_0$ and $\sigma$. The noise level was set to $\sigma = \tfrac14$, and the noise distribution was the same as in Section VII-B, as was the procedure for estimating $\varphi_0$ and $\sigma$. Lastly, we note that if $q$ is chosen much smaller than 0.05, then floating point errors can arise when generating random variables from $\mathrm{stable}_q(1)$, since the tails become very heavy. (This has also been reported in the sketching literature [44].)

2) Comments: Figure 3 shows that $\hat s_{0.05}(x)$ accurately estimates $\|x\|_0$ over a wide range of the parameters $\|x\|_0$ and $\mathrm{DNR}(x)$, and that these parameters have a small effect on $E\big|\frac{\hat s_{0.05}(x)}{\|x\|_0} - 1\big|$, which is expected based on Proposition 5. In addition, the estimator performs well in spite of the fact that the noise distribution is unknown and asymmetric.
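Since $s_q(x)$ itself is a simple closed-form functional, the approximation $s_q(x) \approx \|x\|_0$ for small $q$ is easy to check directly; a minimal sketch (the helper name `s_q` is ours):

```python
import numpy as np

def s_q(x, q):
    """Soft sparsity s_q(x) = (||x||_q / ||x||_1)^(q/(1-q)) for q in (0, 1)."""
    x = np.abs(np.asarray(x, dtype=float))
    ratio = np.sum(x ** q) / np.sum(x) ** q      # equals (||x||_q/||x||_1)^q
    return ratio ** (1.0 / (1.0 - q))

flat = np.array([1.0] * 10 + [0.0] * 90)                        # DNR = 1
decay = np.array([1.0 / i for i in range(1, 11)] + [0.0] * 90)  # DNR = 10
```

When the non-zero entries are equal (DNR = 1), $s_q$ recovers $\|x\|_0$ exactly for every $q \in (0,1)$; for the decaying signal it slightly undercounts, and the undercount grows with $q$, matching the role of $\log(\mathrm{DNR}(x))$ in Proposition 5.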
Proof: Let $c_0 > 0$ be as in Theorem 1, and let $\kappa_0 \geq 1$ be any number such that

$$\frac{2}{\kappa_0}\log(\kappa_0) + \frac{4}{\kappa_0} \ \leq\ \frac{1}{c_0}. \tag{41}$$

Next, define the positive number $t := \frac{n/\kappa_0}{\log(pe/n)}$, and choose $T = \lceil t \rceil$ in Theorem 1. (Note that when $n \leq p$, this choice of $T$ is clearly at most $n$, and hence lies in $\{1, \ldots, p\}$.) We will verify that $n \geq c_0\, T \log(\frac{pe}{T})$ for every $n$ satisfying our assumption $\kappa_0 \log(\kappa_0 \frac{pe}{n}) \leq n \leq p$, which will then allow us to apply Theorem 1.

To begin, observe that

$$\begin{aligned} T \log(\tfrac{pe}{T}) &\leq (t + 1)\log(\tfrac{pe}{t})\\ &= \Big(\tfrac{n/\kappa_0}{\log(pe/n)} + 1\Big) \cdot \log\big(\kappa_0 \tfrac{pe}{n} \log(\tfrac{pe}{n})\big)\\ &\leq \Big(\tfrac{n/\kappa_0}{\log(pe/n)} + 1\Big) \cdot \log\big((\kappa_0 \tfrac{pe}{n})^2\big)\\ &= \tfrac{2n/\kappa_0}{\log(pe/n)}\big(\log(\kappa_0) + \log(\tfrac{pe}{n})\big) + 2\log\big(\kappa_0 \tfrac{pe}{n}\big)\\ &\leq \tfrac{2n}{\kappa_0}\log(\kappa_0) + \tfrac{2n}{\kappa_0} + 2\log\big(\kappa_0 \tfrac{pe}{n}\big). \end{aligned} \tag{42}$$

Applying our assumption $\kappa_0 \log(\kappa_0 \frac{pe}{n}) \leq n$ gives

$$T \log(\tfrac{pe}{T}) \ \leq\ \Big(\tfrac{2}{\kappa_0}\log(\kappa_0) + \tfrac{4}{\kappa_0}\Big)\, n.$$

Hence, the choice of $\kappa_0$ in line (41) ensures $n \geq c_0\, T\log(\frac{pe}{T})$, as desired.

To finish the proof, let $\kappa_1 := c_1$ be as in Theorem 1 so that the bound (7) holds with probability at least $1 - 2\exp(-\kappa_1 n)$. Next, observe that

$$\frac{1}{\sqrt{T}}\,\|x - x|_T\|_1 \ \leq\ \frac{1}{\sqrt{t}}\,\|x\|_1 \ =\ \sqrt{\frac{\kappa_0\, \|x\|_1^2 \log(\frac{pe}{n})}{n}}. \tag{43}$$

where the infima are over all sensing matrices $A$ and all recovery algorithms $R$ (possibly non-homogeneous). It is known from the theory of Gelfand widths that when $\log(\frac{pe}{n}) \leq n < p$, there is an absolute constant $c_1 > 0$, such that the minimax error over $B_1(1)$ is lower-bounded by

$$\inf_{A \in \mathbb{R}^{n \times p}}\ \inf_{R : \mathbb{R}^n \to \mathbb{R}^p}\ \sup_{x \in B_1(1)} \|x - R(Ax)\|_2 \ \geq\ c_1 \sqrt{\frac{\log(pe/n)}{n}}.$$

(See [65, Proposition 1.2], as well as [66] and [67].) Using our choice of $\eta$, as well as the fact that $\|\tilde x\|_1 = 1$, we obtain

$$\|\tilde x - R(A \tilde x)\|_2 \ \geq\ \frac12\, c_1 \sqrt{\frac{\|\tilde x\|_1^2 \log(pe/n)}{n}}. \tag{45}$$

Dividing both sides by $\|\tilde x\|_2$ completes the proof.

APPENDIX B
PROOFS FOR SECTION III

A. Theorem 2 – Uniform CLT for $\ell_q$ Norm Estimator

The proof of Theorem 2 consists of two parts. First, we prove a uniform CLT for a re-scaled version of $\hat\Psi_n(t)$, which is given below in Lemma 3. Second, we extend this limit to the statistic $\hat\nu_q(t)$ by way of the functional delta method, which is described at the end of the section. The positivity of the variance formula $v_q(c_0, \bar\rho_q)$ follows from the lower bound (65) given in the proof of Lemma 2. (See the remark following the proof of Lemma 2.)

1) Remark on the Subscript of $n_q$: For ease of notation, we will generally drop the subscript from $n_q$ in the remainder of the appendices, as it will not cause confusion.
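The chain of inequalities leading to (42) can be sanity-checked numerically. The check below is our own, with $T = \lceil t \rceil$ as in the proof, and verifies $T\log(\frac{pe}{T}) \leq \big(\frac{2}{\kappa_0}\log\kappa_0 + \frac{4}{\kappa_0}\big)\, n$ over all $n$ satisfying the assumption $\kappa_0 \log(\kappa_0 \frac{pe}{n}) \leq n \leq p$.

```python
import math

def bound_holds(p, kappa0):
    """Check T*log(pe/T) <= ((2/k0)*log(k0) + 4/k0) * n for every admissible n,
    where t = (n/kappa0)/log(pe/n) and T = ceil(t)."""
    for n in range(1, p + 1):
        if kappa0 * math.log(kappa0 * p * math.e / n) > n:
            continue                       # n outside the assumed range
        t = (n / kappa0) / math.log(p * math.e / n)
        T = math.ceil(t)
        lhs = T * math.log(p * math.e / T)
        rhs = (2.0 / kappa0 * math.log(kappa0) + 4.0 / kappa0) * n
        if lhs > rhs + 1e-9:
            return False
    return True
```

The check passes with a comfortable margin, which is consistent with the slack introduced at the step $\log(\kappa_0 \frac{pe}{n}\log(\frac{pe}{n})) \leq \log((\kappa_0 \frac{pe}{n})^2)$.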
2) Weak Convergence of the Empirical ch.f.: To introduce a few pieces of notation, let $c \in [-b, b]$ for some fixed $b > 0$, and define the re-scaled empirical ch.f.,

$$\hat\psi_n(c) := \frac{1}{n}\sum_{i=1}^n e^{\sqrt{-1}\, c\, y_i / (\gamma_q \|x\|_q)}, \tag{46}$$

obtained from the re-scaled observations $\frac{y_i}{\gamma_q \|x\|_q}$. The relation between $\hat\Psi_n$ and $\hat\psi_n$ is given by $\hat\Psi_n(t) = \hat\psi_n(\gamma_q t \|x\|_q)$. The re-scaled population ch.f. is

$$\psi_n(c) := \exp(-|c|^q)\,\varphi_0(\rho_q c), \tag{47}$$

which converges to the function

$$\psi(c) := \exp(-|c|^q)\,\varphi_0(\bar\rho_q c), \tag{48}$$

as $\rho_q \to \bar\rho_q$. Lastly, define the normalized process

$$\chi_n(c) := \sqrt{n}\,\big(\hat\psi_n(c) - \psi(c)\big), \tag{49}$$

and let $C([-b, b]; \mathbb{C})$ be the space of continuous complex-valued functions on $[-b, b]$ equipped with the sup-norm.

Lemma 3: Fix any $b > 0$. Under the assumptions of Theorem 2, the random function $\chi_n$ satisfies the limit

$$\chi_n(\cdot) \xrightarrow{w} \chi_\infty(\cdot) \quad\text{in } C([-b, b]; \mathbb{C}), \tag{50}$$

where $\chi_\infty$ is a centered Gaussian process whose marginals satisfy $\mathrm{Re}(\chi_\infty(c)) \sim N(0, w_q(c, \bar\rho_q))$, and

$$w_q(c, \bar\rho_q) := \frac12 + \frac12 \exp(-2^q |c|^q)\,\varphi_0(2\bar\rho_q c) - \exp(-2|c|^q)\,\varphi_0^2(\bar\rho_q c).$$

Proof: It is important to notice that $\hat\psi_n$ is not the empirical ch.f. associated with $n$ samples from the distribution of $\psi$ (since $\rho_q \neq \bar\rho_q$). Instead of working with $\chi_n$, the more natural process to consider is

$$\tilde\chi_n(\cdot) := \sqrt{n}\,\big(\tilde\psi_n(\cdot) - \psi(\cdot)\big), \tag{51}$$

where we define

$$\tilde\psi_n(\cdot) := \frac{1}{n}\sum_{i=1}^n e^{\sqrt{-1}\,(\cdot)\, y_i^\circ}$$

in terms of the variables $y_i^\circ := S_i + \bar\rho_q \epsilon_i$, with $S_i \sim \mathrm{stable}_q(1)$ and $\epsilon_i \sim F_0$ being independent. (In other words, $\tilde\psi_n$ is the empirical ch.f. associated with $\psi$.) Also note that the re-scaled observations satisfy $\frac{y_i}{\gamma_q \|x\|_q} \overset{d}{=} S_i + \rho_q \epsilon_i$, where $\overset{d}{=}$ denotes equality in distribution. Likewise, the process $\hat\psi_n(\cdot)$ may be represented in distribution as

$$\hat\psi_n(\cdot) \overset{d}{=} \frac{1}{n}\sum_{i=1}^n e^{\sqrt{-1}\,(\cdot)\,(S_i + \rho_q \epsilon_i)}, \tag{52}$$

and going forward, it will be convenient to identify $\hat\psi_n$ with the right side above.

As a first main step in the proof, we show that the difference between $\chi_n$ and $\tilde\chi_n$ is negligible in a uniform sense, i.e.

$$\sup_{c \in [-b, b]} |\chi_n(c) - \tilde\chi_n(c)| \xrightarrow{P} 0. \tag{53}$$

To see this, let $D_n(c) := |\chi_n(c) - \tilde\chi_n(c)|$ for $c \in [-b, b]$, and observe that

$$\begin{aligned} D_n(c) &= \sqrt{n}\,|\hat\psi_n(c) - \tilde\psi_n(c)|\\ &= \frac{1}{\sqrt{n}}\Big|\sum_{i=1}^n e^{\sqrt{-1}\,c(S_i + \rho_q \epsilon_i)} - e^{\sqrt{-1}\,c(S_i + \bar\rho_q \epsilon_i)}\Big|\\ &= \frac{1}{\sqrt{n}}\Big|\sum_{i=1}^n e^{\sqrt{-1}\,c(S_i + \rho_q \epsilon_i)}\big(1 - e^{\sqrt{-1}\,c(\bar\rho_q - \rho_q)\epsilon_i}\big)\Big|\\ &\leq \frac{1}{\sqrt{n}}\sum_{i=1}^n \big|1 - e^{\sqrt{-1}\,c(\bar\rho_q - \rho_q)\epsilon_i}\big|\\ &\leq \frac{1}{\sqrt{n}}\sum_{i=1}^n |c(\bar\rho_q - \rho_q)\epsilon_i|\\ &\leq \sqrt{n}\,|\rho_q - \bar\rho_q| \cdot \frac{b}{n}\sum_{i=1}^n |\epsilon_i|, \end{aligned}$$

where the last bound does not depend on $c$, and tends to 0 with probability 1. Here we are using the assumption that $E|\epsilon_1| < \infty$ and assumption A3 that $\rho_q = \bar\rho_q + o(1/\sqrt{n})$. Now that line (53) has been verified, it remains (by the functional version of Slutsky's Lemma [68, p. 32]) to prove

$$\tilde\chi_n(\cdot) \xrightarrow{w} \chi_\infty(\cdot) \quad\text{in } C([-b, b]; \mathbb{C}), \tag{54}$$

and that the limiting process $\chi_\infty$ has the stated variance formula. We first show that this limit holds, and then derive the variance formula at the end of the proof. (Note that it is clear that the limiting process must be Gaussian due to the finite-dimensional CLT.)

By a result of Marcus [69, Th. 1], it is known that the uniform CLT for empirical ch.f.'s (54) holds as long as the limiting process $\chi_\infty$ has continuous sample paths (almost surely). (The paper [69] only states the result when $b = 1/2$, but it holds for any $b > 0$; see the paper [70] and the first few lines of the proof of Theorem 3.1 therein.) To show that the sample paths of $\chi_\infty$ are continuous, we employ a sufficient condition derived by Csörgő [71]. Let $F_q$ denote the distribution function of the random variable $y_i^\circ = S_i + \bar\rho_q \epsilon_i$ described earlier. Also let $\delta > 0$ and define the function

$$g_\delta^+(u) := \begin{cases} \log(|u|) \cdot \log(\log(|u|))^{2+\delta} & \text{if } |u| \geq \exp(1)\\ 0 & \text{if } |u| < \exp(1). \end{cases}$$

At line 1.17 of the paper [71], it is argued that if there is some $\delta > 0$ such that $\int_{-\infty}^{\infty} g_\delta^+(|u|)\, dF_q(u) < \infty$, then $\chi_\infty$ has continuous sample paths. Next, note that for any $\delta, \delta' > 0$, we have

$$g_\delta^+(|u|) = O\big(|u|^{\delta'}\big) \quad\text{as } |u| \to \infty. \tag{55}$$

Hence, $\chi_\infty$ has continuous sample paths as long as $F_q$ has a fractional moment. To see that this is true, recall the basic fact that if $S_i \sim \mathrm{stable}_q(1)$ then $E[|S_i|^\delta] < \infty$ for any $\delta \in (0, q)$ [61, p. 63]. Also, our deconvolution model assumes $E[|\epsilon_i|] < \infty$, and so it follows that for any $q \in (0, 2]$, the distribution $F_q$ has a fractional moment, which proves the limit (54). (Here we are using $\|U + V\|_a^{a \wedge 1} \leq \|U\|_a^{a \wedge 1} + \|V\|_a^{a \wedge 1}$ for random variables $U$ and $V$, where $\|U\|_a := E[|U|^a]^{1/a}$ and $a > 0$.)

Finally, we compute the variance of the marginal distributions $\mathrm{Re}(\chi_\infty(c))$. By the ordinary central limit theorem,
√
we only need to calculate the variance of Re(exp( −1y1◦ ). In words, n is the difference that comes from replacing ĉ
The first moment is given by with (ĉ), and n is the difference that comes from replacing
√ √ terms of the form − |·| 1
Log+ (. . . ) with terms involving φ.
E[Re(exp( −1cy1◦ )] = ReE[exp( −1cy1◦ )]
We now argue that both of the terms n and n are
= exp(−|c|q )ϕ0 (ρ̄q c). asymptotically negligible. As a result, we will complete the
proof of Theorem 2 by showing that the first term in line (59)
The second moment is given by has the desired Gaussian limit — which is the purpose of the
√
E[(Re(exp( −1cy1◦ ))2 ] = E[cos2 (cy1◦ )] functional delta method.
= E[ 12 + 21 cos(2cy1◦ )] To see that n → P 0, first recall that ĉ → P c0 ∈ I
√ by assumption. Consequently, along any subsequence, there
= 12 + 21 Re E[exp( −1 · 2cy1◦ )] is a further subsequence on which ĉ and (ĉ) eventually
= 1
2 + 1
2 exp(−2q |c|q )ϕ0 (2ρ̄q c). agree with probability 1. In turn, if gn is a generic sequence
of functions, then eventually gn (ĉ) − gn ((ĉ)) = 0 along
This completes the proof of Lemma 3.
subsequences (with probability 1). Said differently, this means
3) Applying the Functional Delta Method: We now com-
gn (ĉ) − gn ((ĉ)) → P 0, and this implies n → P 0 since n
plete the proof of Theorem 2 by applying the functional delta
can be expressed in the form gn (ĉ) − gn ((ĉ)).
method to Lemma 3 (with a suitable map φ to be defined
Next, to see that n → P 0, notice that as soon as
below). In the following, C (I) denotes the space of continuous
Re(ψ̂n (·)/ϕ0 (ρq ·)) lies in the neighborhood N ( f 0 ; ε), it fol-
real-valued functions on an interval I, equipped with the sup-
lows from the definition of φ that n = 0. Also, the function
norm.
Re(ψ̂n (·)/ϕ0 (ρq ·)) converges uniformly to f 0 on the interval
Proof of Theorem 2: Since ϕ0 (ρ̄q c0 ) = 0, and the roots
I with probability 1.6 Consequently, n = o P (1).
of ϕ0 are assumed to be isolated, there is some δ0 ∈ (0, |c0 |)
We have now taken care of most of the preparations needed
such that over the interval c ∈ I := [c0 − δ0, c0 + δ0], the value ϕ0(ρ̄q c) is bounded away from 0. Define the function f0 ∈ C(I) by

    f0(c) = exp(−|c|^q),

and let N(f0; ε) ⊂ C(I) be a fixed ε-neighborhood of f0 in the sup-norm, such that all functions in the neighborhood are bounded away from 0. Consider the map φ : C(I) → C(I) defined according to

    φ(f)(c) = −(1/|c|^q)·log(f(c))   if f ∈ N(f0; ε),
    φ(f)(c) = 1_I(c)                 if f ∉ N(f0; ε),      (56)

where 1_I is the indicator function of the interval I. The importance of φ is that it can be related to ν̂q(t̂)/‖x‖q in the following way. First, let ĉ := t̂ γq ‖x‖q and observe that the definition of ψ̂n gives

    √n ( ν̂q(t̂)/‖x‖q − 1 )
      = √n [ −(1/|ĉ|^q)·Log₊ Re( ψ̂n(ĉ)/ϕ0(ρq ĉ) ) + (1/|ĉ|^q)·Log₊ exp(−|ĉ|^q) ].      (57)

(The cases of ĉ = 0 and ϕ0(ρq ĉ) = 0 are asymptotically irrelevant.) Next, let ℓ(ĉ) be the point in the interval I that is nearest to ĉ,⁵ and define the quantity Δn according to

    √n ( ν̂q(t̂)/‖x‖q − 1 )
      = √n [ −(1/|ℓ(ĉ)|^q)·Log₊ Re( ψ̂n(ℓ(ĉ))/ϕ0(ρq ℓ(ĉ)) ) − 1 ] + Δn.      (59)

The next step is to apply the functional delta method.⁶ Multiplying the limit in Lemma 3 through by 1/ϕ0(ρq ·) and using the functional version of Slutsky's Lemma [68, p. 32], it follows that

    √n ( Re[ ψ̂n(·)/ϕ0(ρq ·) ] − exp(−|·|^q) ) →w z̃(·)   in C(I),      (60)

where z̃(·) is a centered Gaussian process with continuous sample paths, and the marginals z̃(c) have variance equal to

    (1/2)·(ϕ0(2ρ̄q c)·exp(−2^q|c|^q))/ϕ0²(ρ̄q c) + (1/2)·1/ϕ0²(ρ̄q c) − exp(−2|c|^q).      (61)

It is straightforward to verify that φ is Hadamard differentiable at f0, and that the Hadamard derivative φ′_f0 is the linear map that multiplies by −exp(|·|^q)/|·|^q. (See [68, Lemmas 3.9.3 and 3.9.25].) Consequently, the functional delta method [68, Th. 3.9.4] applied to line (60) with the map φ gives

    √n ( φ(Re[ψ̂n/ϕ0(ρq ·)])(·) − φ(f0)(·) ) →w φ′_f0(z̃)(·)
                                              = −(exp(|·|^q)/|·|^q)·z̃(·)
                                              =: z(·),      (62)

in the space C(I). It is clear that z(·), defined in the previous line, is a centered Gaussian process on I, since z̃(·) is. Combining the last two displays shows that the marginals z(c) are given by

⁵ The purpose of introducing ℓ(ĉ) is that it always lies in the interval I, and hence allows us to work entirely on I.
⁶ The fact that ψ̂n(·) converges uniformly to ψ(·) on I with probability 1 follows essentially from a version of the Glivenko–Cantelli Theorem [71]. To see that 1/ϕ0(ρq ·) converges uniformly to 1/ϕ0(ρ̄q ·) on I as ρq → ρ̄q, note that since E|ε1| < ∞, the ch.f. ϕ0 is Lipschitz on any compact interval.
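As an aside (not part of the paper), the pointwise form of the Hadamard derivative of φ at f0 can be sanity-checked with a finite difference; the values of q, c, and the perturbation h below are arbitrary illustrative choices.

```python
import math

# Check that phi(f)(c) = -log(f(c)) / |c|^q, evaluated at
# f0(c) = exp(-|c|^q), has directional derivative
#   h |-> -(exp(|c|^q) / |c|^q) * h(c),
# which is the multiplier appearing in the limit process z(.).
q = 1.5
c = 0.8
f0 = math.exp(-abs(c) ** q)

def phi(fc):
    # phi evaluated pointwise at a function value fc = f(c)
    return -math.log(fc) / abs(c) ** q

h = 0.37   # value h(c) of an arbitrary perturbation direction
t = 1e-6   # finite-difference step size
fd = (phi(f0 + t * h) - phi(f0)) / t
exact = -(math.exp(abs(c) ** q) / abs(c) ** q) * h
assert abs(fd - exact) < 1e-4
# Note phi(f0)(c) = 1 identically, consistent with centering in (62).
assert abs(phi(f0) - 1.0) < 1e-9
```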
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on May 04,2024 at 15:11:13 UTC from IEEE Xplore. Restrictions apply.
LOPES: UNKNOWN SPARSITY IN CS: DENOISING AND INFERENCE 5161
The final step of the proof essentially involves plugging ℓ(ĉ) into the limit (62) and using Slutsky's Lemma. Since ℓ(ĉ) →P c0, and z(·) takes values in the separable space C(I) almost surely, the functional version of Slutsky's Lemma [68, p. 32] gives the following convergence of pairs

    ( √n [ φ(Re[ψ̂n/ϕ0(ρq ·)])(·) − φ(f0)(·) ], ℓ(ĉ) ) →w ( z(·), c0 ),

in the space C(I) × I. To finish, note that the evaluation map from C(I) × I to R, defined by (f, c) ↦ f(c), is continuous with respect to the product topology. The continuous mapping theorem [68, Th. 1.3.6] then gives

    √n ( φ(Re[ψ̂n/ϕ0(ρq ·)])(ℓ(ĉ)) − φ(f0)(ℓ(ĉ)) ) →w z(c0).      (63)

The proof is completed by substituting this limit into line (59).

– Second, we show that for any q ∈ (0, 2), if c0 = 0, then vq(cj, ρj) → ∞ for any sequence (cj, ρj) → (0, ρ0), where ρ0 ≥ 0 is arbitrary, and cj > 0 for all j.

To handle the first case where c0 > 0, we derive a lower bound on vq(c, ρ) for all q ∈ (0, 2] and all pairs where vq is defined. Recall the formula

    vq(c, ρ) = (1/|c|^{2q}) [ (1/2)·exp(2|c|^q)/ϕ0(ρc)² + (1/2)·(ϕ0(2ρ|c|)/ϕ0(ρ|c|)²)·exp((2 − 2^q)|c|^q) − 1 ].

The lower bound is obtained by manipulating the factor (1/2)·ϕ0(2ρc)/ϕ0(ρc)². Consider the following instance of Jensen's inequality, followed by a trigonometric identity,

    ϕ0(ρc)² = (E[cos(ρc ε1)])²
            ≤ E[cos²(ρc ε1)]
            = 1/2 + (1/2)·E[cos(2ρc ε1)]
            = 1/2 + (1/2)·ϕ0(2ρc),

which gives (1/2)·ϕ0(2ρc)/ϕ0(ρc)² ≥ 1 − (1/2)·1/ϕ0(ρc)². Letting κq := 2 − 2^q, we have the lower bound

    vq(c, ρ) ≥ (1/|c|^{2q}) [ (1/2)·(1/ϕ0(ρc)²)·( exp(2|c|^q) − exp(κq|c|^q) ) + exp(κq|c|^q) − 1 ].

Next, we consider the second case where (cj, ρj) → (c0, ρ0) with c0 = 0 and ρ0 ≥ 0. Due to the fact that all ch.f.'s satisfy |ϕ0| ≤ 1, and the fact that exp(2|c|^q) − exp(κq|c|^q) is positive when c > 0, the previous lower bound gives

    vq(c, ρ) ≥ (1/|c|^{2q}) [ (1/2)·exp(2|c|^q) + (1/2)·exp(κq|c|^q) − 1 ],      (65)

which again holds for all (c, ρ) where vq is defined. Note that this bound does not depend on ρ. A simple calculation involving L'Hospital's rule shows that whenever q ∈ (0, 2), the lower bound tends to ∞ as c → 0+.

To finish the proof, we must show that for any ρ ≥ 0, the function ṽq(·, ρ) attains its minimum on the set [εq, ∞). This is simple because the lower bound (65) tends to ∞ as c → ∞.

Remark on the Positivity of the Variance Formula: Note that for q ∈ (0, 2], the number κq is minimized when q = 2, in which case κ2 = −2. Likewise, the bound (65) is at least (1/|c|^{2q}) [ (1/2)·exp(2|c|^q) + (1/2)·exp(−2|c|^q) − 1 ], which is the same as

statement, it is enough to show that t̂pilot γq‖x‖q →P c0 for some finite c0 > 0 such that ϕ0(ρ̄q c0) ≠ 0 (by Theorem 2). Reducing one step further, it is enough to show that there is some finite c1 > 0 such that (1/m̂q)·γq‖x‖q →P c1. To see this, let a ∧ b = min{a, b} and note that assumption A3 implies

    t̂pilot γq‖x‖q = ( (1/m̂q)·γq‖x‖q ) ∧ ( η0 γq‖x‖q/σ )      (66)
                   = c1 ∧ (η0/ρ̄q) + oP(1)                     (67)
                   =: c0 + oP(1).                              (68)

Here, the constant c0 = c1 ∧ (η0/ρ̄q) satisfies the desired condition ϕ0(ρ̄q c0) ≠ 0 due to the choice of η0.

Now we turn to the last step of showing (1/m̂q)·γq‖x‖q →P c1. Note that for i = 1, …, n, we have yi/(γq‖x‖q) ∼ Si + (σ/(γq‖x‖q))·εi, where the Si are i.i.d. samples from stableq(1). Let Fn denote the distribution function of the random variable |S1 + (σ/(γq‖x‖q))·ε1|, and let F̂n be the empirical distribution function obtained from n samples from Fn. (The subscript on Fn is used to indicate that σ/(γq‖x‖q) may vary with n.) Then,

    m̂q/(γq‖x‖q) = (1/(γq‖x‖q)) · median(|y1|, …, |yn|)
                 = median( |y1|/(γq‖x‖q), …, |yn|/(γq‖x‖q) )
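Since no numerical code accompanies the paper, the following standalone snippet (an illustration only, taking the standard Gaussian ch.f. ϕ0(t) = exp(−t²/2) as one admissible noise distribution) checks the Jensen inequality step and the ρ-free lower bound (65) over a small grid.

```python
import math

def phi0(t):
    # Characteristic function of N(0,1) noise (illustrative choice)
    return math.exp(-t * t / 2.0)

def v(q, c, rho):
    # The variance formula v_q(c, rho) recalled in the proof
    kq = 2.0 - 2.0 ** q
    cq = abs(c) ** q
    return (1.0 / abs(c) ** (2 * q)) * (
        0.5 * math.exp(2 * cq) / phi0(rho * c) ** 2
        + 0.5 * (phi0(2 * rho * c) / phi0(rho * c) ** 2) * math.exp(kq * cq)
        - 1.0
    )

def lower_bound(q, c):
    # The rho-free lower bound (65)
    kq = 2.0 - 2.0 ** q
    cq = abs(c) ** q
    return (0.5 * math.exp(2 * cq) + 0.5 * math.exp(kq * cq) - 1.0) / abs(c) ** (2 * q)

for q in (0.5, 1.0, 1.5, 2.0):
    for c in (0.1, 0.5, 1.0, 2.0):
        for rho in (0.0, 0.3, 1.0):
            # Jensen step: phi0(rho*c)^2 <= 1/2 + 1/2 * phi0(2*rho*c)
            assert phi0(rho * c) ** 2 <= 0.5 + 0.5 * phi0(2 * rho * c) + 1e-12
            # Lower bound (65), which does not depend on rho
            assert v(q, c, rho) >= lower_bound(q, c) - 1e-9
```

At ρ = 0 the bound holds with equality, since ϕ0(0) = 1, which matches the fact that (65) is obtained by discarding only nonnegative terms.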
5162 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 62, NO. 9, SEPTEMBER 2016
which forces ṽq(c̆, ρ̄q) = ṽq(cq(ρ̄q), ρ̄q), and the uniqueness assumption A4 gives c̆ = cq(ρ̄q), as desired.

It remains to bound the last term on the right. Since sq(x) is a non-increasing function of q, and s0(x) = ‖x‖0, we have for any fixed non-zero x,

    |sq(x) − ‖x‖0| = | ∫0^q (d/du) su(x) du |.

We now derive a bound on |(d/du) su(x)|, which will then be integrated. For any fixed u ∈ (0, q], if we define the probability vector wj(x) := πj(x)^u/‖π(x)‖u^u with j = 1, …, p, then direct

Finally, the basic integral ∫0^q 1/(1 − u)² du = q/(1 − q) gives the stated result.

q ∈ [0, 2], there exists a non-zero vector x̃ ∈ Rp satisfying Ax̃ = Ax◦ and

    sq(x̃) ≥ (1/(πe)) · (1 − n/p)² · p.      (71)

Alternatively, if q ∈ (2, ∞], then there exists a non-zero vector x̃ ∈ Rp satisfying Ax̃ = Ax◦ and

    sq(x̃) ≥ (2/(πe)) · (p − n) / (1 + 4√(log(2p))).      (72)
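The basic integral invoked above can be verified numerically; this short check (illustrative only) compares a midpoint-rule approximation with the closed form q/(1 − q) for q ∈ (0, 1).

```python
# Numeric check of the elementary identity
#     \int_0^q (1 - u)^{-2} du = 1/(1 - q) - 1 = q/(1 - q),
# valid for q in (0, 1).
def integral(q, steps=200000):
    # midpoint rule on [0, q]
    h = q / steps
    return sum(h / (1.0 - (i + 0.5) * h) ** 2 for i in range(steps))

for q in (0.2, 0.5, 0.9):
    assert abs(integral(q) - q / (1.0 - q)) < 1e-6
```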
when x◦ ≠ 0. (The case x◦ = 0 can be handled as in the previous proof.) The new item to handle is an upper bound on E‖x̃‖∞. Clearly, we have ‖x̃‖∞ ≤ ‖x◦‖∞ + ‖B̃z‖∞, and so it is enough to upper-bound E‖B̃z‖∞. We will do this using a version of Slepian's inequality. If b̃i denotes the i-th row of B̃, define the random variable gi = ⟨b̃i, z⟩, and let w1, …, wp be i.i.d. N(0, 1) variables. The idea is to compare the Gaussian process gi with the Gaussian process √2 ‖x◦‖∞ wi. By [68, Proposition A.2.6], the inequality

    E‖B̃z‖∞ = E[ max_{1≤i≤p} |gi| ] ≤ 2√2 ‖x◦‖∞ E[ max_{1≤i≤p} |wi| ],

holds when the condition E(gi − gj)² ≤ 2‖x◦‖∞² E(wi − wj)² is satisfied for all i, j ∈ {1, …, p}. This can be verified by first noting that gi − gj = ⟨b̃i − b̃j, z⟩, which is distributed according to N(0, ‖b̃i − b̃j‖2²). Since ‖b̃i‖2 ≤ ‖x◦‖∞ for all i, it follows that

    E(gi − gj)² = ‖b̃i − b̃j‖2²
                ≤ 4‖x◦‖∞²
                = 2‖x◦‖∞² E(wi − wj)²,      (79)

as needed. To finish the proof, we make use of a standard bound for the expectation of Gaussian maxima,

    E[ max_{1≤i≤p} |wi| ] < √(2 log(2p)).

Combining the last two steps, we obtain

    E‖x̃‖∞ < ‖x◦‖∞ + 2√2 ‖x◦‖∞ √(2 log(2p)).      (80)

Hence, the bounds (78) and (80) lead to (76).

2) Proof of Theorem 4: We only handle the case q ∈ (0, 2], since the case q ∈ (2, ∞] is essentially the same (using Lemma 4). We begin by making some reductions. First, it is enough to show that for any fixed matrix A ∈ Rn×p,

    inf_{δ:Rn→R} sup_{x∈Rp\{0}} |δ(Ax) − sq(x)| ≥ (1/2)·(1/(πe))·(1 − n/p)²·p − 1/2.      (81)

To see this, note that the general inequality sq(x) ≤ p implies |δ(Ax)/sq(x) − 1| ≥ (1/p)·|δ(Ax) − sq(x)|, and we can optimize both sides with respect to x, δ, and A. To make another reduction, it is enough to prove the same bound when Rp \ {0} is replaced with any subset, as this can only make the supremum smaller. In particular, we replace Rp \ {0} with the two-point subset {e1, x̃}, where e1 = (1, 0, …, 0) ∈ Rp, and by Lemma 4, x̃ can be chosen to satisfy Ae1 = Ax̃, with

    sq(x̃) ≥ (1/(πe)) · (1 − n/p)² · p.

(For q ∈ (2, ∞], the second bound in Lemma 4 can be used instead.) We now complete the proof by showing that the lower bound (81) holds for the two-point problem, i.e.

    inf_{δ:Rn→R} sup_{x∈{e1, x̃}} |δ(Ax) − sq(x)| ≥ (1/2)·(1/(πe))·(1 − n/p)²·p − 1/2,      (82)

and we will accomplish this using the classical technique of constructing a Bayes rule with constant risk. For any decision rule δ : Rn → R, any A ∈ Rn×p, and any point x ∈ {e1, x̃}, define the (deterministic) risk function

    R(x, δ) := |δ(Ax) − sq(x)|.

Also, for any prior π on the two-point set {e1, x̃}, define

    r(π, δ) := ∫ R(x, δ) dπ(x).

By [73, Propositions 3.3.1 and 3.3.2], the inequality (82) holds if there exists a prior distribution π∗ on {e1, x̃} and a decision rule δ∗ : Rn → R with the following three properties:
1) The rule δ∗ is Bayes for π∗, i.e. r(π∗, δ∗) = inf_δ r(π∗, δ).
2) The rule δ∗ has constant risk over {e1, x̃}, i.e. R(e1, δ∗) = R(x̃, δ∗).
3) The constant risk of δ∗ is at least (1/2)·(1/(πe))·(1 − n/p)²·p − 1/2.

To exhibit π∗ and δ∗ with these properties, we define π∗ to be the two-point prior that puts equal mass at e1 and x̃, and we define δ∗ to be the trivial decision rule δ∗(y) = (1/2)(sq(x̃) + sq(e1)) for all y ∈ Rn. It is simple to check the second and third properties. To check that δ∗ is Bayes for π∗, the triangle inequality and the relation Ae1 = Ax̃ give the following bound for any decision rule δ,

    r(π∗, δ) = (1/2)·|δ(Ax̃) − sq(x̃)| + (1/2)·|δ(Ae1) − sq(e1)|
             ≥ (1/2)·|sq(x̃) − sq(e1)|
             = (1/2)·|δ∗(Ax̃) − sq(x̃)| + (1/2)·|δ∗(Ae1) − sq(e1)|
             = r(π∗, δ∗),      (83)

which shows that δ∗ is Bayes for π∗.

ACKNOWLEDGMENT
The author would like to thank Peter Bickel for helpful discussions, as well as the reviewers and associate editor for their careful reading and insightful comments, which have improved the paper.

REFERENCES
[1] M. E. Lopes, "Estimating unknown sparsity in compressed sensing," in Proc. 30th Int. Conf. Mach. Learn. (ICML), 2013, pp. 217–225.
[2] E. J. Candès, J. K. Romberg, and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Commun. Pure Appl. Math., vol. 59, no. 8, pp. 1207–1223, 2006.
[3] D. L. Donoho, "Compressed sensing," IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[4] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications. Cambridge, U.K.: Cambridge Univ. Press, 2012.
[5] S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing. New York, NY, USA: Springer-Verlag, 2013.
[6] P. Boufounos, M. F. Duarte, and R. G. Baraniuk, "Sparse signal reconstruction from noisy compressive measurements using cross validation," in Proc. IEEE/SP 14th Workshop Statist. Signal Process. (SSP), Aug. 2007, pp. 299–303.
[7] D. M. Malioutov, S. Sanghavi, and A. S. Willsky, "Compressed sensing with sequential observations," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar./Apr. 2008, pp. 3357–3360.
[8] Y. C. Eldar, "Generalized SURE for exponential families: Applications to regularization," IEEE Trans. Signal Process., vol. 57, no. 2, pp. 471–481, Feb. 2009.
[9] R. Ward, "Compressed sensing with cross validation," IEEE Trans. Inf. Theory, vol. 55, no. 12, pp. 5773–5782, Dec. 2009.
[10] G. Raskutti, M. J. Wainwright, and B. Yu, "Minimax rates of estimation for high-dimensional linear regression over ℓq-balls," IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 6976–6994, Oct. 2011.
[11] E. J. Candès and T. Tao, "Decoding by linear programming," IEEE Trans. Inf. Theory, vol. 51, no. 12, pp. 4203–4215, Dec. 2005.
[12] D. L. Donoho and X. Huo, "Uncertainty principles and ideal atomic decomposition," IEEE Trans. Inf. Theory, vol. 47, no. 7, pp. 2845–2862, Nov. 2001.
[13] A. Cohen, W. Dahmen, and R. DeVore, "Compressed sensing and best k-term approximation," J. Amer. Math. Soc., vol. 22, no. 1, pp. 211–231, 2009.
[14] A. d'Aspremont and L. El Ghaoui, "Testing the nullspace property using semidefinite programming," Math. Program., vol. 127, no. 1, pp. 123–144, 2011.
[15] A. Juditsky and A. Nemirovski, "On verifiable sufficient conditions for sparse signal recovery via ℓ1 minimization," Math. Program., vol. 127, no. 1, pp. 57–88, 2011.
[16] G. Tang and A. Nehorai, "Performance analysis of sparse recovery based on constrained minimal singular values," IEEE Trans. Signal Process., vol. 59, no. 12, pp. 5734–5745, Dec. 2011.
[17] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Trans. Inf. Theory, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.
[18] Y. Plan and R. Vershynin, "One-bit compressed sensing by linear programming," Commun. Pure Appl. Math., vol. 66, no. 8, pp. 1275–1297, 2013.
[19] R. Vershynin, "Estimation in high dimensions: A geometric perspective," in Sampling Theory, a Renaissance, G. E. Pfander, Ed. Switzerland: Springer-Verlag, 2012.
[20] R. R. Coifman and M. V. Wickerhauser, "Entropy-based algorithms for best basis selection," IEEE Trans. Inf. Theory, vol. 38, no. 2, pp. 713–718, Mar. 1992.
[21] D. L. Donoho, "On minimum entropy segmentation," in Wavelets: Theory, Algorithms, and Applications. San Diego, CA, USA: Academic, 1994, pp. 233–269.
[22] B. D. Rao and K. Kreutz-Delgado, "An affine scaling methodology for best basis selection," IEEE Trans. Signal Process., vol. 47, no. 1, pp. 187–200, Jan. 1999.
[23] N. Hurley and S. Rickard, "Comparing measures of sparsity," IEEE Trans. Inf. Theory, vol. 55, no. 10, pp. 4723–4741, Oct. 2009.
[24] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," J. Mach. Learn. Res., vol. 5, pp. 1457–1469, Dec. 2004.
[25] A. G. Dimakis, R. Smarandache, and P. O. Vontobel, "LDPC codes for compressed sensing," IEEE Trans. Inf. Theory, vol. 58, no. 5, pp. 3093–3114, May 2012.
[26] M. Pilanci, L. El Ghaoui, and V. Chandrasekaran, "Recovery of sparse probability measures via convex programming," in Proc. Adv. Neural Inf. Process. Syst. 25 (NIPS), 2012, pp. 2420–2428.
[27] L. Demanet and P. Hand, "Scaling law for recovering the sparsest element in a subspace," Inf. Inference, vol. 3, no. 4, pp. 295–309, 2014.
[28] B. Barak, J. A. Kelner, and D. Steurer, "Rounding sum-of-squares relaxations," in Proc. 46th Annu. ACM Symp. Theory Comput. (STOC), 2014, pp. 31–40.
[29] M. O. Hill, "Diversity and evenness: A unifying notation and its consequences," Ecology, vol. 54, no. 2, pp. 427–432, 1973.
[30] L. Jost, "Entropy and diversity," Oikos, vol. 113, no. 2, pp. 363–375, 2006.
[31] A. Repetti, M. Q. Pham, L. Duval, É. Chouzenoux, and J.-C. Pesquet, "Euclid in a taxicab: Sparse blind deconvolution with smoothed ℓ1/ℓ2 regularization," IEEE Signal Process. Lett., vol. 22, no. 5, pp. 539–543, May 2015.
[32] J. B. Lasserre, Moments, Positive Polynomials and Their Applications, vol. 1. Singapore: World Scientific, 2009.
[33] E. J. Candès and Y. Plan, "Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements," IEEE Trans. Inf. Theory, vol. 57, no. 4, pp. 2342–2359, Apr. 2011.
[34] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, "The convex geometry of linear inverse problems," Found. Comput. Math., vol. 12, pp. 805–849, Dec. 2012.
[35] M. Rudelson and R. Vershynin, "Sampling from large matrices: An approach through geometric functional analysis," J. Assoc. Comput. Mach., vol. 54, no. 4, 2007, Art. no. 21.
[36] R. Vershynin, "Introduction to the non-asymptotic analysis of random matrices," in Compressed Sensing: Theory and Applications, Y. C. Eldar and G. Kutyniok, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2012.
[37] M. Lopes, L. Jacob, and M. J. Wainwright, "A more powerful two-sample test in high dimensions using random projection," in Proc. Adv. Neural Inf. Process. Syst. 24 (NIPS), 2011, pp. 1206–1214.
[38] G. Tang and A. Nehorai, "The stability of low-rank matrix reconstruction: A constrained singular value view," IEEE Trans. Inf. Theory, vol. 58, no. 9, pp. 6079–6092, Sep. 2012.
[39] S. Negahban and M. J. Wainwright, "Restricted strong convexity and weighted matrix completion: Optimal bounds with noise," J. Mach. Learn. Res., vol. 13, no. 1, pp. 1665–1697, 2012.
[40] R. A. DeVore, "Deterministic constructions of compressed sensing matrices," J. Complex., vol. 23, nos. 4–6, pp. 918–925, 2007.
[41] R. Calderbank, S. Howard, and S. Jafarpour, "Construction of a large class of deterministic sensing matrices that satisfy a statistical isometry property," IEEE J. Sel. Topics Signal Process., vol. 4, no. 2, pp. 358–374, Apr. 2010.
[42] N. Alon, Y. Matias, and M. Szegedy, "The space complexity of approximating the frequency moments," in Proc. 28th Annu. ACM Symp. Theory Comput. (STOC), 1996, pp. 20–29.
[43] J. Feigenbaum, S. Kannan, M. J. Strauss, and M. Viswanathan, "An approximate L1-difference algorithm for massive data streams," SIAM J. Comput., vol. 32, no. 1, pp. 131–151, 2002.
[44] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan, "Comparing data streams using Hamming norms (how to zero in)," IEEE Trans. Knowl. Data Eng., vol. 15, no. 3, pp. 529–540, May/Jun. 2003.
[45] P. Indyk, "Stable distributions, pseudorandom generators, embeddings, and data stream computation," J. Assoc. Comput. Mach., vol. 53, no. 3, pp. 307–323, 2006.
[46] P. Li, T. J. Hastie, and K. W. Church, "Nonlinear estimators and tail bounds for dimension reduction in ℓ1 using Cauchy random projections," J. Mach. Learn. Res., vol. 8, pp. 2497–2532, 2007.
[47] P. Li, "Estimators and tail bounds for dimension reduction in ℓα (0 < α ≤ 2) using stable random projections," in Proc. 19th Annu. ACM-SIAM Symp. Discrete Algorithms, 2008, pp. 10–19.
[48] P. Clifford and I. A. Cosma, "A statistical analysis of probabilistic counting algorithms," Scandin. J. Statist., vol. 39, no. 1, pp. 1–14, 2012.
[49] G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine, "Synopses for massive data: Samples, histograms, wavelets, sketches," Found. Trends Databases, vol. 4, nos. 1–3, pp. 1–294, 2011.
[50] A. Gilbert and P. Indyk, "Sparse recovery using sparse matrices," Proc. IEEE, vol. 98, no. 6, pp. 937–947, Jun. 2010.
[51] P. Li, C.-H. Zhang, and T. Zhang, "Compressed counting meets compressed sensing," in Proc. Conf. Learn. Theory (COLT), 2014, pp. 1058–1077.
[52] P. Indyk, "Sketching via hashing: From heavy hitters to compressed sensing to sparse Fourier transform," in Proc. 32nd Symp. Principles Database Syst. (PODS), 2013, pp. 87–90.
[53] M. Markatou, J. L. Horowitz, and R. V. Lenth, "Robust scale estimation based on the empirical characteristic function," Statist. Probab. Lett., vol. 25, no. 2, pp. 185–192, 1995.
[54] M. Markatou and J. L. Horowitz, "Robust scale estimation in the error-components model using the empirical characteristic function," Can. J. Statist., vol. 23, no. 4, pp. 369–381, 1995.
[55] C. Butucea and C. Matias, "Minimax estimation of the noise level and of the deconvolution density in a semiparametric convolution model," Bernoulli, vol. 11, no. 2, pp. 309–340, 2005.
[56] A. Meister, "Density estimation with normal measurement error with unknown variance," Statist. Sinica, vol. 16, no. 1, pp. 195–211, 2006.
[57] C. Matias, "Semiparametric deconvolution with unknown noise variance," ESAIM, Probab. Statist., vol. 6, pp. 271–292, Nov. 2002.
[58] T. T. Cai, L. Wang, and G. Xu, "New bounds for restricted isometry constants," IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4388–4394, Sep. 2010.
[59] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Rev., vol. 43, no. 1, pp. 129–159, 2001.
[60] T. Gneiting, "Curiosities of characteristic functions," Expo. Math., vol. 19, no. 4, pp. 359–363, 2001.
[61] V. M. Zolotarev, One-Dimensional Stable Distributions, vol. 65. Providence, RI, USA: AMS, 1986.
[62] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Statist. Soc. B, vol. 58, no. 1, pp. 267–288, 1996.
[63] S. Chatterjee, "A new perspective on least squares under convex constraint," Ann. Statist., vol. 42, no. 6, pp. 2340–2381, 2014.
[64] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Roy. Statist. Soc. B (Statist. Methodol.), vol. 67, no. 2, pp. 301–320, 2005.
[65] S. Foucart, A. Pajor, H. Rauhut, and T. Ullrich, "The Gelfand widths of ℓp-balls for 0 < p ≤ 1," J. Complex., vol. 26, no. 6, pp. 629–640, 2010.
[66] B. Kashin, "Diameters of some finite-dimensional sets and classes of smooth functions," Math. USSR Izv., vol. 11, no. 2, p. 317, 1977.
[67] A. Garnaev and E. D. Gluskin, "The widths of a Euclidean ball," Dokl. Akad. Nauk SSSR, vol. 277, no. 5, pp. 1048–1052, 1984.
[68] A. W. van der Vaart and J. A. Wellner, Weak Convergence and Empirical Processes. New York, NY, USA: Springer-Verlag, 1996.
[69] M. B. Marcus, "Weak convergence of the empirical characteristic function," Ann. Probab., vol. 9, no. 2, pp. 194–201, 1981.
[70] S. Csörgő, "Multivariate empirical characteristic functions," Probab. Theory Rel. Fields, vol. 55, no. 2, pp. 203–229, 1981.
[71] S. Csörgő, "Limit behaviour of the empirical characteristic function," Ann. Probab., pp. 130–144, 1981.
[72] C. Beck and F. Schögl, Thermodynamics of Chaotic Systems. Cambridge, U.K.: Cambridge Univ. Press, 1995.
[73] P. J. Bickel and K. A. Doksum, Mathematical Statistics, vol. 1. Upper Saddle River, NJ, USA: Prentice-Hall, 2001.

Miles E. Lopes (M'16) is currently Assistant Professor in the Department of Statistics at the University of California, Davis. He received B.S. degrees in mathematics and physics from the University of California, Los Angeles, as well as an M.S. in computer science and a Ph.D. in statistics from the University of California at Berkeley, where his research was supported by the DOE CSGF and NSF GRFP fellowships. His research interests include compressed sensing, statistical machine learning, and high-dimensional statistics, with a focus on resampling methods.