
Pattern Recognition 48 (2015) 231–243

Contents lists available at ScienceDirect

Pattern Recognition
journal homepage: www.elsevier.com/locate/pr

Relaxed sparse eigenvalue conditions for sparse estimation via non-convex regularized regression

Zheng Pan a,b,c, Changshui Zhang a,b,c,⁎

a Department of Automation, Tsinghua University
b State Key Lab of Intelligent Technologies and Systems
c Tsinghua National Laboratory for Information Science and Technology (TNList), Beijing 100084, P.R. China

article info

abstract

Article history:
Received 30 June 2013
Received in revised form
23 January 2014
Accepted 19 June 2014
Available online 30 June 2014

Non-convex regularizers usually improve the performance of sparse estimation in practice. To prove this fact, we study the conditions of sparse estimation for sharp concave regularizers, a general family of non-convex regularizers that includes many existing regularizers. For the global solutions of the regularized regression, our sparse eigenvalue based conditions are weaker than those of ℓ1-regularization for parameter estimation and sparseness estimation. For the approximate global and approximate stationary (AGAS) solutions, almost the same conditions suffice. We show that the desired AGAS solutions can be obtained by coordinate descent (CD) based methods. Finally, we perform experiments to show the performance of CD methods in giving AGAS solutions and the degree of weakness of the estimation conditions required by the sharp concave regularizers.
© 2014 Elsevier Ltd. All rights reserved.

Keywords:
Sparse estimation
Non-convex regularization
Sparse eigenvalue
Coordinate descent

1. Introduction
High-dimensional estimation concerns parameter estimation problems in which the dimension of the parameters is comparable to or larger than the sample size. In general, high-dimensional estimation is ill-posed. Additional prior knowledge about the structure of the parameters is usually needed to obtain consistent estimations. In recent years, a large body of research has demonstrated that a prior on the sparsity of the true parameters can lead to good estimators, e.g., the well-known work on compressed sensing [6] and its extensions to general high-dimensional inference [24].
For high-dimensional sparse estimation, sparsity is usually imposed through sparsity-encouraging [8] regularizers for linear regression methods. Many regularizers have been proposed to describe the prior of sparsity, e.g., the ℓ0-norm, the ℓ1-norm, the ℓq-norm with 0 < q < 1, the smoothly clipped absolute deviation (SCAD) penalty [14], the log-sum penalty (LSP) [8], the minimax concave penalty (MCP) [37] and the Geman penalty (GP) [17,32]. Except for the ℓ1-norm, all of these sparsity-encouraging regularizers are non-convex. Non-convex regularizers were proposed to improve the performance of sparse estimation in many applications, e.g., image inpainting and denoising [29], biological feature selection [3,27], MRI [8,9,33–35] and CT [26,30].

⁎ Corresponding author at: Department of Automation, Tsinghua University, Beijing 100084, P.R. China. Tel.: +86 10 62796872.
E-mail addresses: panz09@mails.tsinghua.edu.cn (Z. Pan), zcs@tsinghua.edu.cn (C. Zhang).

http://dx.doi.org/10.1016/j.patcog.2014.06.018
0031-3203/© 2014 Elsevier Ltd. All rights reserved.

However, a theoretical explanation for the improvement that non-convex regularizers bring to sparse estimation is still lacking. This paper aims to establish such a theoretical analysis.

In the field of sparse estimation, the following three problems are typically studied. In this paper, we mainly study the first two problems.
1. Sparseness estimation: whether the estimation is as sparse as the true parameters.
2. Parameter estimation: whether the estimation is accurate, in the sense that the error between the estimation and the true parameter is small under some metric.
3. Feature selection: whether the estimation correctly identifies the non-zero components of the true parameters.

For sparseness estimation, non-convex regularizers give better approximations to the ℓ0-norm than convex ones. They are more likely to encourage the regularized regression to yield sparser estimations than convex regularizers. For example, ℓq-regularization can give the sparsest consistent estimations even when ℓ1-regularization fails [15]. However, the ℓq-norm has an infinite derivative at zero, so the zero vector is always a trivial local minimizer of the regularized regression. Non-convex regularizers with finite derivatives, e.g., LSP, SCAD and MCP, can remedy this numerical problem of the ℓq-norm. These regularizers can also give sparser solutions in more general situations than ℓ1-regularization, both in experiments [8] and in theory [14,32,37,38].


For parameter estimation, many applications and experiments have demonstrated that non-convex regularizers give good estimations with far smaller sample sizes than the ℓ1-norm [8,9,26,30,33–35]. In theory, the requirements on the sample size are essentially requirements on the design matrix or, rather, estimation conditions. A weaker estimation condition means a smaller required sample size or weaker requirements on the design matrix. Weaker estimation conditions are important for applications in which the data dimension is very high while sampling is expensive or restricted. Theoretically, all of the non-convex regularizers mentioned above admit accurate parameter estimation under appropriate conditions, e.g., the ℓq-norm [15], MCP [37], SCAD [37] and general non-convex regularizers [38].
There are mainly two types of estimation conditions. The first is sparse eigenvalue (SE) conditions, e.g., the restricted isometry property (RIP) [6,7] and the SE used by Foucart and Lai [15] and Zhang [40]. The second is restricted eigenvalue (RE) conditions, e.g., the ℓ2-restricted eigenvalue (ℓ2-RE) [2,21] and the restricted invertibility factor (RIF) [36,38]. Based on SE, Foucart and Lai [15] gave a weaker estimation condition for the ℓq-norm than for the ℓ1-norm. Trzasko and Manduca [32] established a universal RIP condition for general non-convex regularizers including the ℓ1-norm. Since the conditions proposed by Trzasko and Manduca [32] are regularizer-independent, they unfortunately cannot be weakened for non-convex regularizers. The definition of SE is regularizer-independent while RE depends on the regularizers. RE can give a regularizer-dependent estimation condition for general regularizers, e.g., the ℓ2-RE based work by Negahban et al. [24] and the RIF based work by Zhang and Zhang [38].
The optimization for non-convex regularizers is difficult. For general non-convex regularizers, reaching a global optimum usually cannot be guaranteed. Nevertheless, some optimization methods can lead to local optima, e.g., coordinate descent [3,23], iterative reweighting (or majorization–minimization) methods [8,20,41,39], homotopy [37], difference of convex (DC) methods [27,28] and proximal methods [18,25]. Hence, it is meaningful to analyze the performance of sparse estimation for these non-optimal optimization methods. For example, the multi-stage relaxation methods [41,39] and their one-stage version, the adaptive LASSO [19,42], replace the regularizers with convex relaxations using majorization–minimization. Compared with LASSO, the multi-stage relaxation methods improve the performance of parameter estimation [39]. Zhang and Zhang [38] use the solutions of LASSO as the initialization and continue to optimize by gradient descent. It is stated that LASSO followed by gradient descent can output an approximate global and stationary solution which is identical to the unique sparse local solution and the global solution. The multi-stage relaxation methods, LASSO followed by gradient descent, and the homotopy methods need the same SE or RE conditions as LASSO. The DC methods [28] and the proximal methods [25] need to know the sparseness of their solutions in advance to ensure the performance of parameter estimation, but these two methods cannot control the sparseness of their solutions explicitly.
Based on the related work, we make the following contributions:

• For a general family of non-convex regularizers, we propose new SE based estimation conditions which are weaker than those of the ℓ1-norm. As far as we know, our estimation conditions are the weakest ones for general non-convex regularizers. The proposed conditions approach the SE conditions of ℓ0-regularized regression as the regularizers become closer and closer to the ℓ0-norm. We also compare our SE conditions with RE conditions. For ℓ1-regularized regression, RE based estimation conditions are less severe than those based on SE [2]. However, for non-convex regularizers, their relationship changes. For proper non-convex regularizers, SE conditions become weaker than RE conditions, because SE conditions can be greatly weakened when moving from the ℓ1-norm to non-convex regularizers while RE conditions remain the same.

• Under the proposed SE conditions, we establish upper bounds for the estimation error in the ℓ2-norm. The error bounds are of the same order as those of ℓ1-regularized regression. This means that although the proposed SE conditions are weakened, the parameter estimation performance is not. With appropriate additional conditions, we further give results on sparseness estimation, which show that non-convex regularized regression gives estimations whose sparseness is of the same order as that of the true parameters.

• Like the global solutions of non-convex regularized regression, the approximate global and approximate stationary (AGAS) solutions [38] are shown to theoretically guarantee accurate parameter estimation and sparseness estimation. The error bounds of parameter estimation are of the order of the noise level and of the degrees of approximating the stationary solutions and the global optima. If the degrees of these two approximations are comparable to the noise level, the theoretical performance of parameter estimation and sparseness estimation is also comparable to that of global solutions. Furthermore, the required estimation conditions are almost the same as those for global solutions, which means that the estimation conditions for AGAS solutions are also weaker than those required by the ℓ1-norm. The estimation result on AGAS solutions is useful in applications, since it shows the robustness of the non-convex regularized regression to inaccuracy of the solutions and gives a theoretical guarantee for the numerical solutions.

• Under a mild SE condition, approximate global (AG) solutions are obtainable and the approximation error is bounded by the prediction error. If the prediction error is small, the solution will be a good approximate global solution. Algorithms which control the sparseness of their solutions explicitly are suitable for giving good AG solutions, e.g., OMP [31] and GraDeS [16]. For an AG solution, the coordinate descent (CD) methods update it to be approximate stationary (AS) without destroying its AG property. CD has been applied to regularized regression with non-convex regularizers [3,23]. However, the previous works did not allow the non-convex regularizers to approximate the ℓ0-norm arbitrarily. Our analysis does not have such a restriction on the non-convex regularizers.

Notation: We use T̄ to denote the complement of a set T and |T| to denote the number of elements in T. For an index set T ⊆ {1, 2, …, p}, β_T denotes the restriction of β = (β_1, β_2, …, β_p) to T, i.e., (β_T)_i = β_i for i ∈ T. The support supp(β) of a vector β is defined as the index set of the non-zero components of β, i.e., supp(β) = {i : β_i ≠ 0}. The ℓ0-norm of the vector β is the number of its non-zero components, i.e., ||β||_0 = |supp(β)|.

2. Preliminaries
We first formulate the sparse estimation problem. Suppose that we have n samples (y_1, z_1), (y_2, z_2), …, (y_n, z_n), where y_i ∈ R and z_i ∈ R^p for i = 1, …, n. Let X = (z_1, …, z_n)^T ∈ R^{n×p} and y = (y_1, …, y_n)^T ∈ R^n. We assume that there exists an s-sparse true parameter β* which is supported on S and satisfies y = Xβ* + e with a small noise e ∈ R^n. In this paper, we assume that the energy of the noise is limited by a known level ε, i.e., ||e||_2 ≤ ε. For Gaussian noise e ~ N(0, σ²I_n), this assumption is satisfied for ε = σ√(2n log n) with probability at least 1 − 1/n [4].


Table 1
Examples of popular regularizers. The second column gives the basis functions of the regularizers, the third column the zero gaps of the global solutions when the regularizers are ν-sharp concave, and the fourth column the values of λ*. Section 8.2 gives the proof of the result on λ* for LSP. (The regularizers listed are the ℓ1-norm, the ℓq-norm, LSP, SCAD, MCP and GP.)

We focus on using the following regularized regression to recover β* from y, i.e., we use the solutions of the regularized regression as the estimations of the true parameters:

β̂ = argmin_{β ∈ R^p} F(β),    (1)

where F(β) = L(β) + R(β), L(β) = ||y − Xβ||²_2/(2n) is the prediction error and R(β) is a non-convex regularizer. In this paper, we only study component-decomposable regularizers, i.e., R(β) = Σ_{i=1}^p r(|β_i|). We call r(u) the basis function of R(β). Table 1 lists the basis functions of some popular regularizers. For the basis functions in Table 1, r(u) has the formulation r(u) = λ² r_0(u/λ; θ), where r_0(u; θ) is a non-decreasing concave function over [0, ∞) and θ is a parameter describing the degree of concavity, i.e., r(u) changes from a linear function of u to the indicator function I{u ≠ 0} as θ varies from ∞ to 0 (except for the ℓ1-norm).
Throughout this paper, we assume that the basis function r(u) satisfies the following properties. All of them hold for the basis functions in Table 1.

1. r(0) = 0.
2. r(u) is non-decreasing.
3. r(u) is concave over [0, ∞).
4. r(u) is continuous and piecewise differentiable. We use ṙ(u+) and ṙ(u−) to denote the right and left derivatives.
5. r(u) has the formulation r(u) = λ² r_0(u/λ; θ), where r_0(u; θ) is parameterized by θ and is independent of λ.
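As a concrete illustration of the component-decomposable objective F(β) = L(β) + R(β) and of basis functions of the form r(u) = λ² r_0(u/λ; θ), the following Python sketch implements the objective together with LSP- and MCP-style basis functions. The exact scalings in lsp and mcp are assumptions based on the standard forms of these penalties, not necessarily the precise parameterization of Table 1.

```python
import numpy as np

def lsp(u, lam, theta):
    """Log-sum penalty basis function (assumed form r(u) = lam^2 log(1 + |u|/(lam*theta)))."""
    return lam**2 * np.log1p(np.abs(u) / (lam * theta))

def mcp(u, lam, theta):
    """Minimax concave penalty basis function (assumed standard form)."""
    u = np.abs(u)
    quad = lam * u - 0.5 * theta * u**2        # region u <= lam/theta
    flat = lam**2 / (2.0 * theta)              # region u >  lam/theta
    return np.where(u <= lam / theta, quad, flat)

def objective(beta, X, y, basis, lam, theta):
    """F(beta) = ||y - X beta||_2^2 / (2n) + sum_i r(|beta_i|)."""
    n = X.shape[0]
    loss = np.sum((y - X @ beta) ** 2) / (2.0 * n)
    return loss + np.sum(basis(beta, lam, theta))
```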

The weaker SE based estimation conditions in this paper rely on two important properties: zero gap and null consistency [38]. Zero gap means that the true parameters and the estimations are strong, in the sense that the minimal magnitude of their non-zero components cannot be too close to zero. Null consistency requires that the regularized regression in Eq. (1) is able to identify the true parameter exactly when β* = 0 and the error e is inflated by a factor of 1/η > 1.

Definition 1 (Zero gap). We say that β ∈ R^p has a zero gap u_0 for some u_0 ≥ 0 if min{|β_i| : i ∈ supp(β)} ≥ u_0.

Definition 2 (Null consistency). Let η ∈ (0, 1). We say that the regularized regression in Eq. (1) is η-null consistent if min_β {||Xβ − e/η||²_2/(2n) + R(β)} = ||e/η||²_2/(2n).

In order to guarantee the above two properties, we propose the following assumption, named sharp concavity. Sharp concavity is important for our analysis because zero gap and null consistency can be derived from it.


Definition 3 (Sharp concavity). We say that a basis function r(u) satisfies the C-sharp concavity condition over an interval I if r(u) > uṙ(u−) + Cu²/2 holds for any u ∈ I, where C is a positive constant. We also say that r(u) is C-sharp concave over I, and a regularizer R(β) is C-sharp concave if its basis function is C-sharp concave.

Strictly concave functions only satisfy r(u) > uṙ(u−). However, if the left derivative ṙ(u−) decreases so fast that it admits a margin proportional to u² on some interval I, the concave function satisfies sharp concavity.

C-sharp concavity is satisfied over (0, u_0] if r(u) is strongly concave (i.e., −r(u) is strongly convex) over [0, u_0], i.e., for any t_1, t_2 ∈ [0, u_0] and α ∈ (0, 1),

r(αt_1 + (1 − α)t_2) ≥ αr(t_1) + (1 − α)r(t_2) + (1/2)Cα(1 − α)(t_1 − t_2)².    (2)

Section 8.1 shows that sharp concavity only needs Eq. (2) to hold for t_1 = 0 and any t_2 ∈ (0, u_0], which means that sharp concavity is weaker than strong concavity. For example, MCP is sharp concave on an interval near the origin that extends beyond the range on which it is strongly concave. Moreover, the ℓq-norm is q(1 − q)u_0^{q−2}-sharp concave over (0, u_0], and LSP and GP are also sharp concave over (0, u_0] with constants depending on λ, θ and u_0.


Let x_i be the i-th column of X and define

ν = max_{1≤i≤p} ||x_i||²_2 / n.

We observe that ν-sharp concavity yields non-trivial zero gaps and null consistency.

Theorem 1. If r(u) is ν-sharp concave over (0, u_0], any global solution of Problem (1) has a zero gap no less than u_0, i.e., |β̂_i| ≥ u_0 for any i ∈ supp(β̂).

Table 1 lists the zero gaps of β̂ when the basis functions are ν-sharp concave.

Theorem 2. Let r(u) be ν-sharp concave over (0, u_0]. The η-null consistency condition is satisfied if r(u_0) ≥ ||e||²_2/(2nη²).

Zhang and Zhang [38] give a probabilistic condition for null consistency when X is drawn from Gaussian distributions. In contrast, our condition is deterministic in X, and it is easy to check whether it holds. For the case r(u) = λ² r_0(u/λ; θ), the condition of Theorem 2 becomes λ ≥ η^{-1} b_0 ||e||_2/√n, where b_0 = 1/√(2 r_0(u_0/λ; θ)) is a constant if u_0 = O(λ) (all the regularizers in Table 1 satisfy u_0 = O(λ)). Hence, we assume

λ = η^{-1}(1 + η) b_0 ε/√n    (3)


in this paper, so that the η-null consistency holds. In addition, we define

λ* = inf_{u>0} {u/2 + r(u)/u}.    (4)

λ* provides a natural normalization of λ [38]. Table 1 lists the values of λ* for the corresponding regularizers. We observe λ* = O(λ) from Table 1. In general, for r(u) = λ² r_0(u/λ; θ), we can define a constant a (independent of λ),

a = inf_{u>0} {u/2 + r_0(u; θ)/u},    (5)

so that λ* = aλ. Thus, we have

λ* = η^{-1}(1 + η) a b_0 ε/√n.    (6)

If the basis function r(u) is linear over [0, u] for some u > 0, it is not sharp concave, e.g., SCAD and the truncated ℓ1-norm [39]. We call such regularizers, which are linear near the origin, weak non-convex regularizers. The zero gaps of the global solutions with such regularizers cannot be guaranteed to be strictly positive.

Let û_0 and u*_0 be the zero gaps of the global solution β̂ and the true parameter β*, respectively, and denote

θ_0 = min{û_0, u*_0}.    (8)

Theorem 3 (Parameter estimation of global solutions). Suppose that the following conditions hold.

1. r(u) is invertible for u ≥ 0 and r^{-1}(u/s_1)/r^{-1}(u/s_2) is a non-decreasing function of u for any s_2 ≥ s_1 ≥ 1.
2. The regularized regression satisfies η-null consistency.
3. The following SE condition holds for some integer t ≥ s:

ρ₊(2t)/ρ₋(2t) < 4(√2 − 1)H_r(θ_0, κ; s, t) + 1,    (9)

where s = ||β*||_0, κ = (1 + η)/(1 − η), H_r(θ_0, κ; s, t) = √(s/t) · r^{-1}(κr(θ_0)/s)/r^{-1}(κr(θ_0)/t) for θ_0 > 0, and H_r(0, κ; s, t) = lim_{θ→0+} H_r(θ, κ; s, t).

Then,

||β̂ − β*||_2 ≤ C_1 λ*,    (10)

where
3. Sparse estimation of global solutions

In this section, we show our results on the SE based sparse estimation.

Definition 4 (Sparse eigenvalue). For an integer t ≥ 1, we say that ρ₋(t) and ρ₊(t) are the minimum and maximum sparse eigenvalues (SE) of a matrix X if

ρ₋(t) ≤ ||Xβ||²_2 / (n||β||²_2) ≤ ρ₊(t)   for any β with ||β||_0 ≤ t.

The SE is related to the restricted isometry constant (RIC) δ(t) [6,7], which satisfies 1 − δ(t) ≤ ||Xβ||²_2/(n||β||²_2) ≤ 1 + δ(t) for all β with ||β||_0 ≤ t. Thus, it follows that δ(t) = (ρ₊(t) − ρ₋(t))/(ρ₊(t) + ρ₋(t)), where δ(t) is actually the RIC of the scaled matrix √2 X/√(ρ₊(t) + ρ₋(t)). We employ SE since it allows ρ₊(t) ≥ 2 and avoids the scaling problem of RIC [15].
In order to show typical values of ρ₊(t) and ρ₋(t), we compute them and their ratio ρ₊(t)/ρ₋(t) for standard Gaussian n × p matrices,¹ where we fix p = 10 000, n = 500, 1000, 1500, 2000 and t varies from 1 to n. It should be noted that ρ₊(t) and ρ₋(t) cannot be computed efficiently. We use the following approximation method: for a matrix X ∈ R^{n×p}, we randomly sample 100 submatrices X_1, X_2, …, X_100 ∈ R^{n×t} composed of t columns of X and regard ρ̃₊(t) = max_i λ_max(X_i^T X_i/n) and ρ̃₋(t) = min_i λ_min(X_i^T X_i/n) as the approximations for ρ₊(t) and ρ₋(t), where λ_max(A) and λ_min(A) denote the maximal and minimal eigenvalues of A, respectively. Note that ρ̃₊(t) ≤ ρ₊(t) and ρ̃₋(t) ≥ ρ₋(t). For each n and t, we generate 100 standard Gaussian matrices and compute the maxima, the minima and the means of ρ̃₊(t), ρ̃₋(t) and ρ̃₊(t)/ρ̃₋(t) over the 100 trials. Fig. 1 illustrates the results. The variances of ρ̃₊(t), ρ̃₋(t) and ρ̃₊(t)/ρ̃₋(t) for the same n and t are small, since the corresponding lines for the maximum, minimum and mean values are close to each other. However, ρ̃₊(t)/ρ̃₋(t) grows very fast as t grows or n decreases.
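The submatrix-sampling approximation of ρ₊(t) and ρ₋(t) described above can be sketched as follows; the normalization by n matches the definition of SE, and the number of sampled submatrices (100 by default) follows the text.

```python
import numpy as np

def approx_sparse_eigenvalues(X, t, n_samples=100, rng=None):
    """Approximate rho_+(t) and rho_-(t) of X by sampling random
    n-by-t submatrices, as in the experiment described above."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    rho_plus, rho_minus = -np.inf, np.inf
    for _ in range(n_samples):
        cols = rng.choice(p, size=t, replace=False)   # random column subset
        G = X[:, cols].T @ X[:, cols] / n             # t-by-t scaled Gram matrix
        eigvals = np.linalg.eigvalsh(G)               # eigenvalues in ascending order
        rho_plus = max(rho_plus, eigvals[-1])
        rho_minus = min(rho_minus, eigvals[0])
    # rho~_+(t) <= rho_+(t) and rho~_-(t) >= rho_-(t)
    return rho_plus, rho_minus

# Example: a 500 x 10000 standard Gaussian design with t = 50
X = np.random.default_rng(0).standard_normal((500, 10000))
print(approx_sparse_eigenvalues(X, t=50))
```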
Based on SE, we establish the parameter estimation result of Theorem 3 above for global solutions of non-convex regularized regression, using the zero gaps û_0, u*_0 and θ_0 defined in Eq. (8).

¹ The elements are i.i.d. drawn from the standard Gaussian distribution N(0, 1).

C_1 = (1 + √2)(1 + η)√(ρ₊(t)) (H_r(θ_0, κ; s, t) + 1/2) / [√(ρ₋(2t)) (H_r(θ_0, κ; s, t) − (1 + √2)(ρ₊(2t)/ρ₋(2t) − 1)/4)].

Since λ* is on the order of the noise level (Eq. (6)), the estimation error ||β̂ − β*||_2 is at most on the order of the noise level. We give a detailed discussion of Theorem 3 in Section 4. Before that discussion, we first state a corollary (justified in Section 4), which shows that our SE condition only needs ρ₋(t) > 0 with t = O(s). This SE condition is much weaker than that of the ℓ1-norm. In fact, it is almost optimal, since it is the same as the estimation condition of ℓ0-regularization [15,38].

Corollary 1. Let conditions 1 and 2 of Theorem 3 hold and H_r(θ_0, κ; s, s + 1) → ∞ as θ → 0. If ρ₋(2s + 2) > 0, there exists θ > 0 such that ||β̂ − β*||_2 ≤ O(λ*).
In addition to the error bound in Theorem 3, we hope that the regularized regression yields sufficiently sparse solutions. We extend the results of Zhang and Zhang [38] and show that the global solutions are sparse under appropriate conditions.

Theorem 4 (Sparseness estimation of global solutions). Suppose that the conditions of Theorem 3 hold. Consider l_0 > 0 and an integer m_0 > 0 such that

√(2ρ₊(t + m_0)) C_2(1 + η)λ*/√m_0 + ||X^T e/n||_∞ < ṙ(l_0−),

where C_2 is defined in Eq. (31). Then, |supp(β̂)\S| ≤ m_0 + t·r(C_2(1 + η)λ*)/r(l_0).

Corollary 2. Suppose that the basis function ru r 0 u= and the


conditions of Theorem 4 hold with t 1s, m0 0 s and l0 1
for some 0 ; 1 4 0. Let C 3 C 2 1 a , where C2 is the same as
Theorem 4 and a is dened in Eq. (5). If
2

2 1 0 s

r_ 0 1   a 2
;
r 0 C 3

11

then
jsupp^ \Sjr 0 1r 0 C 3 =r 0 1 s:

12

Example for Corollary 2. Consider the example of LSP with


p
r 0 u log 1 u= and 1 . Suppose that the columns of X
are normalized so that 1. Section 8.2 shows that


a r
than

p
2 log 1 2= 2 . Thus, the right hand side of Eq. (11) is larger

p
p
1=1  2 log 1 2= 2 2
p
log 1  1 C 2 1 2 log 1 2= 2
Thus, as goes to 0, the right hand side of Eq. (11) is arbitrarily
large. Eq. (11) holds for enough small . The right hand side of
Eq. (12) is 0 s Os as -0. Hence, we can freely select 0
satisfying Eq. (11) with enough small . For example, if 0 1=s,
Eq. (11) holds for enough small and Eq. (12) becomes
p
log 1  1 C 2 1 2 log 1 2= 2
:
jsupp^ =Sj r 1 s 1
p
log 1 1=
13
The right hand side of Eq. (13) is at most on the order of s when
is close to zero.

4. Discussion on Theorem 3

This section gives some detailed discussion on Theorem 3.

4.1. Invertible approximate regularizers

If r(u) is not invertible, e.g., MCP, we can design an invertible basis function to approximate it. For example, we can use the following invertible function, named approximate MCP: it coincides with MCP, r(u) = λu − θu²/2, over [0, (1 − α)λ/θ], and is continued over ((1 − α)λ/θ, ∞) by a power function proportional to u^q with q = 2α/(1 + α), where α ∈ (0, 1) and the constant of the power part is chosen so that r(u) is continuous and concave (Eq. (14)). When α → 0, r(u) becomes the basis function of MCP. We address the method used to obtain Eq. (14) in Section 8.7. Any other non-invertible regularizer in Table 1 can be approximated in the same way.

4.2. Non-decreasing property of r^{-1}(u/s_1)/r^{-1}(u/s_2)

It can be verified that all the regularizers in Table 1, or their invertible approximations (in the sense of Eq. (14)), satisfy the non-decreasing property of r^{-1}(u/s_1)/r^{-1}(u/s_2) for any s_2 ≥ s_1 > 0. In fact, for differentiable basis functions, this non-decreasing property is equivalent to uṙ(u)/r(u) being a non-increasing function of u.

4.3. Non-sharp concave regularizers

If r(u) is not ν-sharp concave, e.g., SCAD, or LSP with θ² > 1/ν, we cannot guarantee that β̂ has a positive zero gap. In this case, condition 2 (null consistency) of Theorem 3 can be guaranteed by the ℓ2-regularity conditions [38], and condition 3 becomes ρ₊(2s)/ρ₋(2s) < 1.65/κ + 1 with t = s, which also belongs to the ℓ2-regularity conditions. Hence, without ν-sharp concavity, Theorem 3 still holds. Intuitively, non-sharp concave regularizers need the same estimation conditions as ℓ1-regularization since they cannot approximate the ℓ0-norm arbitrarily.

4.4. Relaxed SE based estimation conditions

Much weaker estimation conditions are sufficient for ν-sharp concave regularizers. Suppose that r(u) is ν-sharp concave over (0, θ_0'] with 0 < θ_0' ≤ min_{i∈S}|β*_i|. In this case, H_r(θ_0, κ; s, t) can become arbitrarily large for proper regularizers, so that the SE condition in Eq. (9) is much weaker than the SE conditions of ℓ1-regularized regression. We have shown in Fig. 1 that ρ̃₊(t)/ρ̃₋(t)


Fig. 1. ρ̃₊(t), ρ̃₋(t) and ρ̃₊(t)/ρ̃₋(t) for Gaussian random matrices with p = 10 000, n = 500, 1000, 1500, 2000 and t ranging from 1 to n. The solid lines are the average values over the 100 trials and the two dashed lines around each solid line are the maximum and the minimum of the 100 trials.

Fig. 2. The upper bounds of the SE conditions for LSP, approximate MCP (AMCP) and the ℓq-norm. We set κ = 1.01 and t = 2s. In each subfigure, we also plot the upper bound of the SE conditions for the ℓ1-norm, i.e., the right hand side of Eq. (16) with q = 1. (a) Approximate MCP vs. L1. (b) LSP vs. L1. (c) ℓq-norm vs. L1.


(≥ ρ₊(t)/ρ₋(t)) increases very fast as t increases or n decreases. Thus, a weaker constraint on ρ₊(2t)/ρ₋(2t) in Eq. (9) is very important for sparse estimation problems.

Here, we give the examples of approximate MCP, the ℓq-norm and LSP. For approximate MCP, Eq. (15) gives its H_r(θ_0, κ; s, t) (see Section 8.7):

H_r(θ_0, κ; s, t) = (t/s)^{1/q − 1/2} with q = 2α/(1 + α),    (15)

where α is set as a function of θ, t and κ. For the ℓq-norm, the SE conditions can be written as

ρ₊(2t)/ρ₋(2t) < 1 + 4(√2 − 1)(t/s)^{1/q − 1/2}.    (16)

When κ = 1, Eq. (16) is identical to the estimation condition of Foucart and Lai [15]. Hence, Foucart and Lai [15] can be regarded as a special case of our theory. For LSP, we have

H_r(θ_0, κ; s, t) = √(s/t) · [(1 + θ_0/(λθ))^{κ/s} − 1] / [(1 + θ_0/(λθ))^{κ/t} − 1].    (17)

It should be noted that H_r(θ_0, κ; s, t) → ∞ as θ → 0 for approximate MCP, the ℓq-norm and LSP. Fig. 2 shows some special cases of H_r(θ_0, κ; s, t) for these three regularizers and the ℓ1-norm. In Fig. 2, the SE conditions in Eq. (9) are much weaker than those of the ℓ1-norm.

Theorem 3 reveals that the upper bound constraint on ρ₊(2t)/ρ₋(2t) tends to infinity as θ → 0 for proper non-convex regularizers. It implies that if

ρ₋(2t) = inf_{||β||_0 ≤ 2t} ||Xβ||²_2 / (n||β||²_2) > 0,    (18)

there exists θ > 0 so that the SE condition (Eq. (9)) is satisfied. Based on this observation, we have Corollary 1. In Corollary 1, ρ₋(2s + 2) > 0 holds if the columns of X are in general position² and 2s + 2 ≤ n, which is almost optimal in the sense that it is the same as the SE condition of ℓ0-regularized regression [38].
4.5. Comparison between SE and RE

Like SE, RE is also popular for constructing estimation conditions. There are several variants of RE, e.g., the ℓ2-RE [2,21] and RIF [36,38]. RE yields a simple expression for the parameter estimation error and the corresponding estimation condition.

Definition 5 (ℓ2-RE). For ξ ≥ 1, a regularizer R(·), an index set S ⊆ {1, …, p} and its complement S̄, the ℓ2-RE is defined as

RE_R(ξ, S) = inf_β { ||Xβ||²_2 / (n||β||²_2) : R(β_S̄) ≤ ξ R(β_S) }.    (19)

Definition 6 (Restricted invertibility factor). For ξ ≥ 1, q ≥ 1, a regularizer R(·) and an index set S ⊆ {1, …, p}, the RIF is defined as

RIF_q^R(ξ, S) = inf_β { |S|^{1/q} ||X^T Xβ||_∞ / (n||β||_q) : R(β_S̄) ≤ ξ R(β_S) }.

Theorem 5. Suppose that the η-null consistency condition holds and ξ = (1 + η)/(1 − η). Then, ||β̂ − β*||_2 ≤ 2ξ√s ṙ(0+)/RE_R(ξ, S). For any q ≥ 1, ||β̂ − β*||_q ≤ (1 + η)λ* s^{1/q}/RIF_q^R(ξ, S).
The estimation conditions based on RE require that RE_R(ξ, S) > 0 or RIF_q^R(ξ, S) > 0. The same conclusion can also be obtained for ℓ1-regularized regression [24,38]. What we are interested in is whether non-convex regularizers allow a larger value of RE_R(ξ, S) than the ℓ1-norm, i.e., whether the condition RE_R(ξ, S) > 0 becomes weaker by employing non-convex regularizers.

Define Ω_γ = {β ∈ R^p : R(γβ_S̄) ≤ ξR(γβ_S), ||β||_2 = 1} for γ > 0. The concavity of r(u) gives ṙ(0+)u ≥ r(u) ≥ uṙ(u−), which implies

Ω_γ ⊇ {β ∈ R^p : ṙ(0+)||γβ_S̄||_1 ≤ ξ⟨|γβ_S|, ṙ(|γβ_S|−)⟩, ||β||_2 = 1},

where |β_S| is the vector composed of the absolute values of the components of β_S, i.e., |β_S| = (|β_i| : i ∈ S), and ṙ(|β_S|−) = (ṙ(|β_i|−) : i ∈ S). Thus, we obtain an upper bound on RE_R(ξ, S):

RE_R(ξ, S) = inf_{γ>0, β∈Ω_γ} ||Xβ||²_2/(n||β||²_2)
  ≤ inf_{γ>0, β} { ||Xβ||²_2/(n||β||²_2) : ṙ(0+)||γβ_S̄||_1 ≤ ξ⟨|γβ_S|, ṙ(|γβ_S|−)⟩, ||β||_2 = 1 }
  ≤ inf_β { ||Xβ||²_2/(n||β||²_2) : ||β_S̄||_1 ≤ ξ||β_S||_1 }   (letting γ → 0)
  = RE_{ℓ1}(ξ, S).

RE_R(ξ, S) ≤ RE_{ℓ1}(ξ, S) means that the RE based condition RE_R(ξ, S) > 0 of non-convex regularized regression is not relaxed. Negahban et al. [24] put an additional constraint U = {β : ||β|| ≥ γ} on the definition of RE. This constraint avoids the bad case γ → 0. However, it still cannot guarantee a larger RE for non-convex regularizers than for the ℓ1-norm. For example, let t_1, t_2 and t_3 satisfy |t_1| + |t_2| ≤ 2|t_3| and take ξ = 2, S = {3} and S̄ = {1, 2}. Then the concavity of r(u) implies r(|t_1|) + r(|t_2|) ≤ 2r((|t_1| + |t_2|)/2) ≤ 2r(|t_3|). For this case, {β : ||β_S̄||_1 ≤ ξ||β_S||_1} ⊆ {β : R(β_S̄) ≤ ξR(β_S)}. Thus, RE_R(ξ, S) ≤ RE_{ℓ1}(ξ, S). For RIF, we have the same result. Although non-convex regularizers give better approximations to the ℓ0-norm, the RE of non-convex regularizers cannot be guaranteed to be larger than that of the ℓ1-norm. The framework of RE does not leave space to relax the estimation condition for non-convex regularizers.
The only difference between the definitions of SE and RE lies in the constraints on β. The two constraints {β : ||β||_0 ≤ 2t} and {β : R(β_S̄) ≤ ξR(β_S)} do not contain each other. However, we observe that ρ₋(2t) ≥ min_{|T|≤s} RE_R((2t − s)/s, T) ≥ min_{|T|≤s} RE_R(2ξ − 1 + 2/s, T) for t ≥ ξs + 1. When η is small and s ≫ 2, 2ξ − 1 + 2/s is close to ξ and min_{|T|≤s} RE_R(2ξ − 1 + 2/s, T) ≈ min_{|T|≤s} RE_R(ξ, T). Hence, with proper regularizers, the SE condition in Eq. (18) is a weaker condition than min_{|T|≤s} RE_R(ξ, T) > 0.
We can also compare RE and our SE conditions with the help of the failure bound δ(2s) ≥ 1/√2 of the RIC for ℓ1-minimization recovery [12], where ℓ1-minimization recovery includes basis pursuit [10] and the Dantzig selector [5]. The failure bound means that for any ε > 0 there exists X ∈ R^{(p−1)×p} with δ(2s) < 1/√2 + ε for which ℓ1-minimization recovery fails. On the other hand, ℓ1-minimization recovery succeeds when RE_{ℓ1}(ξ, S) > 0 [2], like ℓ1-regularized regression (Theorem 5). Thus, min_{|T|≤s} RE_{ℓ1}(ξ, T) = 0 if δ(2s) ≥ 1/√2, i.e., ρ₊(2s)/ρ₋(2s) ≥ 3 + 2√2. Since non-convex regularizers cannot weaken RE conditions, ρ₊(2s)/ρ₋(2s) ≥ 3 + 2√2 also forces min_{|T|≤s} RE_R(ξ, T) = 0 for non-convex regularizers. On the contrary, our SE conditions, e.g., ρ₋(2s + 2) > 0, can still hold with proper non-convex regularizers even when ρ₊(2s)/ρ₋(2s) ≥ 3 + 2√2.
4.6. Comparison with the conditions for feature selection

² General position means that any n columns of X are linearly independent. The columns of X are in general position with probability 1 if the elements of X are i.i.d. drawn from some continuous distribution, e.g., Gaussian.

Shen et al. [28] gave a necessary condition for consistent feature selection, which can be relaxed further to ρ₋(s) > C log p/n with a constant C > 0 independent of p, s, n. This necessary condition needs ρ₊(s)/ρ₋(s) to be upper bounded by a constant which is independent of the regularizers. For their DC algorithm based methods, they tightened the conditions so that ρ₊(2s̃)/ρ₋(2s̃) is upper bounded, where s̃ is the number of non-zero components of the solutions given by their methods. This condition cannot be verified until the solutions are given. In contrast, our SE conditions do not depend on the sparseness of the practical solutions (see Section 5).


5. Sparse estimation of AGAS solutions

For Problem (1), it is practical to obtain a solution which is approximate global (AG) (Definition 7) and approximate stationary (AS) (Definition 8). We show in this section that this kind of solution also gives a good estimation of the true parameters.

Definition 7. Given ζ ≥ 0, we say that β̃ is a (ζ, β*)-approximate global solution of min_β F(β) if F(β̃) ≤ F(β*) + ζ.

Definition 8. Given τ ≥ 0, we say that β̃ is a τ-approximate stationary solution of min_β F(β) if the directional derivative of F at β̃ in any direction d ∈ R^p with ||d||_2 = 1 is no less than −τ, i.e., F'(β̃; d) ≥ −τ.

The directional derivative is defined as F'(β; d) = lim inf_{δ↓0} (F(β + δd) − F(β))/δ for any β ∈ R^p and d ∈ R^p. For Problem (1), F'(β; d) = d^T ∇L(β) + Σ_{i=1}^p R'(β_i; d_i).

The following theorem gives the parameter estimation result for AGAS solutions. Let ũ_0 ≥ 0 be the zero gap of β̃ and θ̃_0 = min{ũ_0, min_{i∈supp(β*)} |β*_i|}.

Theorem 6 (Parameter estimation of AGAS solutions). Suppose that the following conditions hold for the regularized regression.

1. β̃ is a (ζ, β*)-AG solution and a τ-AS solution.
2. r(u) is invertible for u ≥ 0 and r^{-1}(u/s_1)/r^{-1}(u/s_2) is a non-decreasing function w.r.t. u for any s_2 ≥ s_1 > 0.
3. The regularized regression satisfies η-null consistency.
4. The following SE condition holds for some integer t ≥ s + 1:

ρ₊(2t)/ρ₋(2t) < 4(√2 − 1)G_r(θ̃_0, κ; s, t) + 1,    (20)

where κ = (1 + η)/(1 − η), G_r(θ̃_0, κ; s, t) = (√(st)/(t − 1)) r^{-1}(κr(θ̃_0)/s)/r^{-1}(κr(θ̃_0)/(t − 1)) for θ̃_0 > 0, and G_r(0, κ; s, t) = lim_{θ→0+} G_r(θ, κ; s, t).

Then, ||β̃ − β*||_2 ≤ C_4 τ̃ + C_5 r^{-1}(ζ/(1 − η)), where τ̃ is determined by τ and ṙ(0+), and C_4, C_5 are positive constants defined in Eqs. (39) and (40).

Conditions 2–4 are almost the same as the three conditions of Theorem 3, except for the slightly different requirement on t and the definition of G_r(θ̃_0, κ; s, t). Consequently, the discussion in Section 4 also applies to this theorem:

1. The non-invertible basis functions can be approximated by invertible basis functions.
2. Without ν-sharp concavity, condition 4 of Theorem 6 is almost the same as the RIP conditions of Foucart and Lai [15].
3. With ν-sharp concavity and a positive zero gap (we show in Theorem 9 that our CD methods guarantee positive zero gaps), the SE based estimation conditions can be relaxed considerably.

Theorem 6 shows that the error bounds of parameter estimation are mainly determined by four parts: the slope of r(u) at zero, ṙ(0+); the parameter λ = O(ε/√n); the degree τ of approximating the stationary solutions; and the degree r^{-1}(ζ/(1 − η)) of approximating the global optima. If r(u) = λ²r_0(u/λ; θ) and r_0(u; θ) has a finite derivative at zero, we know that ṙ(0+) = λṙ_0(0+; θ), e.g., ṙ(0+) = λ for MCP. Since λ = O(ε/√n) by Eq. (3) in this paper, the estimation error bound is actually

||β̃ − β*||_2 ≤ O(ε/√n) + O(τ) + O(r^{-1}(ζ/(1 − η))).

According to Theorem 6, we do not need to solve Problem (1) exactly. A good suboptimal solution is enough to give a good parameter estimation. We do not even need a strictly stationary solution, since Theorem 6 allows a margin τ. So, the non-convex regularized regression is robust to inaccuracy of the solutions, which is important for numerical computation.

It should be noted that ṙ(0+) is required to be finite in Theorem 6, which forbids regularizers with infinite ṙ(0+), e.g., the ℓ0-norm and the ℓq-norm (0 < q < 1). This may be due to the strong NP-hardness of ℓ0-norm and ℓq-norm regularized regression [11].

Similar to Theorem 4, we give the following sparseness estimation result for AGAS solutions. The proof is the same as that of Theorem 4.

Theorem 7 (Sparseness estimation of AGAS solutions). Suppose that the conditions of Theorem 6 hold. Let b = ρ₊(t + 1) r(c_4 τ̃ + c_5 r^{-1}(ζ/(1 − η))), where c_4 and c_5 are defined in Eqs. (37) and (38). Consider l_0 > 0 and an integer m_0 > 0 such that

√(2ρ₊(m_0) b / m_0) + ||X^T e/n||_∞ ≤ ṙ(l_0−).

Then, |supp(β̃)\S| < m_0 + b/r(l_0).

The sparseness of AGAS solutions is also affected by τ̃, ṙ(0+), λ* and ζ. Theorem 7 can also lead to a conclusion similar to Corollary 2. For an AGAS solution with small τ and ζ, the sparseness of the solution is of the order of s, just like for the global solutions.
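As a practical check of the τ-AS property, the directional derivative F'(β; d) defined earlier in this section can be evaluated for any chosen direction d (for example the coordinate directions). The sketch below assumes a user-supplied one-sided derivative r_dot of the basis function; it is a diagnostic probe, not the authors' algorithm.

```python
import numpy as np

def directional_derivative(beta, d, X, y, r_dot):
    """Evaluate F'(beta; d) = d^T grad L(beta) + sum_i R'(beta_i; d_i)
    for F(beta) = ||y - X beta||^2/(2n) + sum_i r(|beta_i|).

    r_dot(u) must return the (right) derivative of the basis function r at u >= 0;
    it is a user-supplied assumption of this sketch."""
    n = X.shape[0]
    grad_L = X.T @ (X @ beta - y) / n            # gradient of the smooth part
    val = float(d @ grad_L)
    for b, di in zip(beta, d):
        if b != 0.0:
            # away from zero r(|.|) is differentiable with derivative sign(b)*r_dot(|b|)
            val += di * np.sign(b) * r_dot(abs(b))
        else:
            # at zero the one-sided directional derivative is |d_i| * r_dot(0+)
            val += abs(di) * r_dot(0.0)
    return val

# beta is tau-AS only if F'(beta; d) >= -tau for every unit direction d,
# in particular for the coordinate directions +/- e_i.
```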
5.1. Approximate global solutions

We need AG solutions in Theorems 6 and 7. The methods to


obtain such solutions are crucial consequently. Instead of restricting to the solutions given by a specic algorithm, we use the
0
prediction error X  y22 =2n to give a quality guarantee for any
solution 0 that is regarded as an AG solution.
Theorem 8. Suppose that 0 is an s0-sparse vector with the predic0
n
tion error 20 X  y22 =2n. If  s s0 4 0, then 0 is a ( , )AG solution where
p
p !
20 = n
20 s s0 r p
s s0  s s0
Corollary 3. Suppose that β^0 is an s_0-sparse vector with prediction error ε_0 = γε/√n for some γ ≥ 0 and that the basis function has the formulation r(u) = λ²r_0(u/λ; θ) with λ = η^{-1}(1 + η)b_0 ε/√n. Then, β^0 is a (C_6 ε²/n, β*)-AG solution, where

C_6 = γ² + (s + s_0)(1 + η)² b_0²/η² · r_0( (√2γ + 1)η / ((1 + η)b_0 √((s + s_0)ρ₋(s + s_0))); θ ).

The methods that explicitly control the sparseness of their solutions are suitable for giving AG solutions, e.g., OMP [31] and GraDeS [16]. However, we do not need the strong conditions required by these methods for consistent parameter estimation, e.g., δ(2s) < 1/3 for GraDeS [16] or the requirement on the growth of ρ₊(t)/ρ₋(t) with t for OMP [40]. In fact, Theorem 8 only requires ρ₋(s + s_0) > 0. Hence, s_0 can be large enough to make ε_0 small. The relationship between ε_0 and s_0 depends on the employed method and the design matrix X. Even with a bad value of ζ in the initialization, we can decrease it further by the CD methods, as stated in Section 5.2.
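The initialization strategy discussed above can be as simple as running OMP for a fixed number of steps. A minimal sketch (standard OMP, not the authors' exact setup) that returns a sparse initial point together with its prediction error is:

```python
import numpy as np

def omp_init(X, y, n_nonzero):
    """Minimal orthogonal matching pursuit, used only to produce a sparse
    initial point beta0 whose prediction error can then be checked."""
    n, p = X.shape
    residual = y.copy()
    support = []
    for _ in range(n_nonzero):
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(X.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares refit on the current support
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    beta0 = np.zeros(p)
    beta0[support] = coef
    return beta0, np.sum(residual**2) / (2 * n)   # beta0 and its prediction error
```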
5.2. Approximate stationary solutions with zero gap

Theorem 6 also requires the solution to be τ-AS and to have a positive zero gap. General gradient descent algorithms can provide stationary solutions, but they cannot ensure a positive zero gap. However, we observe that the coordinate descent (CD) methods can yield AS solutions, and all of these solutions have positive zero gaps under proper sharp concavity conditions.

In every step, CD optimizes over only one dimension, i.e.,

β_i^(k) = argmin_{u∈R} F(β_1^(k), …, β_{i−1}^(k), u, β_{i+1}^(k−1), …, β_p^(k−1)) + (μ/2)(u − β_i^(k−1))²,    (21)

where k is the iteration number, i = 1, …, p and μ > 0 is a positive constant. The constant μ balances decreasing F(β) against not moving far from the previous step. The above CD method is also called proximal coordinate descent. For Problem (1), the CD methods iterate as follows:


β_i^(k) = argmin_{u∈R} (1/2)(||x_i||²_2/n + μ)(u − (μβ_i^(k−1) + x_i^T v_i^(k)/n)/(||x_i||²_2/n + μ))² + r(|u|),    (22)

where x_i is the i-th column of the design matrix X and v_i^(k) = y − Σ_{j<i} x_j β_j^(k) − Σ_{j>i} x_j β_j^(k−1). Problem (22) is a non-convex but only one-dimensional problem. All of its solutions lie between 0 and (μβ_i^(k−1) + x_i^T v_i^(k)/n)/(||x_i||²_2/n + μ). We assume that Problem (22) can be solved exactly. If Problem (22) has more than one minimizer, any one of them can be selected as β_i^(k). In this paper, the CD methods stop iterating when

||β^(k) − β^(k−1)||_2 ≤ ω,    (23)

where ω > 0 is a small tolerance proportional to the value of τ (see Theorem 10).

Theorem 9. If r(u) is (ν + μ)-sharp concave over (0, u_0], then |β_i^(k)| ≥ u_0 or β_i^(k) = 0 for any k = 1, 2, … and any i = 1, …, p.

The above zero gap property of CD is a corollary of Theorem 1. The sharp concavity condition of Theorem 9 is a little stronger than the requirement of Theorem 3. Nonetheless, we can set μ to be small to narrow the difference between the sharp concavity conditions of Theorems 3 and 9.

Besides the zero gap, we show in the following theorem that the CD methods simultaneously give AS solutions and keep them AG solutions.

Theorem 10. {F(β^(k))} is a non-increasing sequence and converges. For any τ > 0 and ω = τ/(√p + μp), CD stops within k = 1 + 2(√p + μp)²F(β^(0))/τ² iterations and outputs a τ-AS solution, where p is the number of columns of the design matrix X.

Theorem 10 shows that the CD methods further decrease the value ζ of the AG property and guarantee the τ-AS property, which is necessary for sparse estimation in Theorems 6 and 7. This theorem also gives an upper bound for τ, i.e.,

τ ≤ (√p + μp) ||β^(k) − β^(k−1)||_2,    (24)

where k is the number of iterations. Usually, we hope that τ is of the order of λ*, so that τ̃ = O(λ*) = O(ε/√n) in Theorem 6.

CD has been applied to non-convex regularized regression by Breheny and Huang [3] and Mazumder et al. [23]. However, their non-convex regularizers are restrictive because they need Eq. (22) to be strictly convex for μ = 0; they could not deal with MCP, SCAD or LSP when the concavity parameter θ is large enough that the one-dimensional subproblem becomes non-convex. Compared with them, the conclusions of Theorem 10 are weaker, but they are enough to obtain τ-AS solutions and the regularizers can approximate the ℓ0-norm arbitrarily.

6. Experiment

In this section, we experimentally show the performance of the CD methods in giving AGAS solutions and the degree of weakness of the estimation conditions required by the sharp concave regularizers.
6.1. AGAS solutions

In Section 5, we prove that the objective value decreases monotonically, that τ tends to 0, and that the zero gap ũ_0 is maintained in each iteration of the CD algorithms. We illustrate this experimentally in this part.

We set the dimension of the parameter to p = 1000 and the number of non-zero components of β* (the true parameter) to s = log p. We randomly choose s indices as the non-zero components. The non-zero components are i.i.d. drawn from N(0, 1), and those belonging to (−0.1, 0.1) are promoted to ±0.1 according to their signs. The elements of the design matrix X ∈ R^{n×p} are i.i.d. drawn from N(0, 1), where n = 10 s log p. The noise e is drawn from N(0, I_n) and is normalized such that ||e||_2 = 0.01. We fix θ = 0.1 and η = 0.01 for all the non-convex regularizers (LSP, MCP and GP) and use Eq. (3) to choose λ.

For the CD algorithm, we set μ = 0.1. The CD algorithm is initialized with the zero vector and terminated when ||β^(k) − β^(k−1)||_2 falls below ω = 10⁻³ (so that τ ≤ 10⁻³(√p + μp) by Theorem 10) or the number of iterations exceeds 500. For each regularizer, we run CD for 100 trials with independent true parameters and design matrices.
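The synthetic data generation just described can be reproduced roughly as follows (a sketch; the rounding of s and n to integers is an assumption).

```python
import numpy as np

def make_problem(p=1000, rng=None):
    """Generate (X, y, beta_star) following the experimental setup of Section 6.1."""
    rng = np.random.default_rng(rng)
    s = int(np.ceil(np.log(p)))                  # number of non-zeros (rounded up)
    n = int(np.ceil(10 * s * np.log(p)))         # sample size
    beta_star = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    coef = rng.standard_normal(s)
    small = np.abs(coef) < 0.1                   # promote tiny entries to +/- 0.1
    coef[small] = 0.1 * np.sign(coef[small])
    coef[coef == 0.0] = 0.1                      # guard against an exact zero draw
    beta_star[support] = coef
    X = rng.standard_normal((n, p))
    e = rng.standard_normal(n)
    e *= 0.01 / np.linalg.norm(e)                # normalize noise so ||e||_2 = 0.01
    y = X @ beta_star + e
    return X, y, beta_star
```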
We show the boxplots of ũ_0, ζ and τ for each iteration in Fig. 3. The left column shows that the CD methods maintain the zero gaps in each iteration, as stated in Theorem 9. The middle column shows that F(β^(k)) − F(β*) decreases to zero for most trials within 100 iterations. The right column shows that most of the solutions are very close to stationary solutions within 100 iterations.
6.2. Weaker conditions for sparse estimation

We show the performance of non-convex regularizers for sparse estimation in this part. For an estimation β̃, three criteria are used to describe the performance of sparse estimation: (1) the sparseness ||β̃||_0; (2) the relative recovery error (RRE) ||β̃ − β*||_2/||β*||_2; and (3) the support recovery rate (SRR) |supp(β̃) ∩ supp(β*)|/|supp(β̃) ∪ supp(β*)|. A weaker estimation condition than that of convex regularizers can be verified by achieving a more accurate sparseness, a lower RRE or a higher SRR with a smaller sample size.
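The three criteria translate directly into code; a small helper such as the following (a sketch) computes them for one trial.

```python
import numpy as np

def sparse_estimation_metrics(beta_hat, beta_star):
    """Sparseness, relative recovery error (RRE) and support recovery rate (SRR)."""
    supp_hat = set(np.flatnonzero(beta_hat))
    supp_star = set(np.flatnonzero(beta_star))
    sparseness = len(supp_hat)                                   # ||beta_hat||_0
    rre = np.linalg.norm(beta_hat - beta_star) / np.linalg.norm(beta_star)
    srr = len(supp_hat & supp_star) / max(len(supp_hat | supp_star), 1)
    return sparseness, rre, srr
```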
We fix the dimension of the parameters and the sparseness of the true parameters, and we vary the sample size n to compare the three criteria between the convex regularizer (ℓ1-norm, implemented by FISTA [1]) and the non-convex regularizers (LSP, MCP


Fig. 3. The zero gap ũ_0 (left) and the AGAS parameters ζ (middle) and τ (right) in each iteration of the CD algorithms. The three rows correspond to LSP, MCP and GP, respectively. The figures are boxplots over the 100 trials of the CD algorithms. The right column actually shows boxplots of the upper bound for τ in Eq. (24).


Fig. 4. The sparseness (left), RRE (middle) and SRR (right) corresponding to the regularizers (LSP, MCP, GP and ℓ1-norm). The true parameters, design matrices and noises are generated in the same way as in Section 6.1 except that p = 10 000, s = 100 and n varies from s to 15s. The parameter θ of the regularizers is set to 10⁻⁷. We use OMP [31] to generate an initial solution for CD with at most n − s non-zero components. The parameter of CD (μ = 0.1) and the stopping criterion of CD are the same as in Section 6.1. Every data point is the average of 100 trials of the CD methods. For each regularizer and each n, we select λ from {10⁻⁶, 10⁻⁵, …, 10} such that it gives the smallest average RRE over the 100 trials.

and GP). As Fig. 4 shows, the non-convex regularizers give much more accurate sparseness estimations, lower RREs and higher SRRs than ℓ1-regularization. Among the three non-convex regularizers, the performance of sparse estimation is similar.

6.3. Single-pixel camera

We compare non-convex regularizers and the ℓ1-norm in the application of the single-pixel camera [13]. In this application, we need to recover an image from a small fraction of its pixels, which is a task similar to image inpainting [22]. Since most natural images have sparse discrete cosine transforms (DCT), we can recover the image by solving the problem min_Θ ||y − M vec(Θ)||²_2/(2n) + R(vec(D(Θ))), where the components of y are the known pixels, Θ is the estimated image, M is a mask matrix indicating the positions of the known pixels, D(Θ) is the 2D-DCT of Θ and vec(Θ) is the vectorization of Θ. Denoting β = D(Θ), we rewrite the problem in the form of Problem (1): min_β ||y − M vec(D⁻¹(β))||²_2/(2n) + R(vec(β)), where D⁻¹(β) is the inverse 2D-DCT of β. Fig. 5(a) shows the test image (size 256 × 256). We randomly choose 25% of its pixels as y. The PSNRs of LSP (θ = 10⁻⁷) and the ℓ1-norm are compared in Fig. 5(d), where LSP has higher PSNRs than the ℓ1-norm for all λ's in the figure. The PSNRs of LSP are also more robust to λ than those of the ℓ1-norm.


Fig. 5. Comparison of image recovery. (a) The original image. (b), (c) The images estimated by LSP and the ℓ1-norm with the highest PSNRs in (d). (d) The PSNRs of LSP and the ℓ1-norm for different values of λ. The results of LSP and the ℓ1-norm are obtained by CD (μ = 0.001, β^(0) = 0) and FISTA, respectively.

Fig. 5(b) and (c) illustrates the images recovered by LSP and the ℓ1-norm with the best PSNRs. The image produced by LSP is of better quality than the one produced by the ℓ1-norm.

Hence, ru  ru  u=u Z d Z ru u  ru=u. Let u-0


and then the lemma follows.
8.1. Sharp concavity and strong concavity
Invoking Eq. (2) with 40, t 1 0 and t 2 t 40, we have
r1  t Z 1  rt C 1  t 2 =2, which implies

7. Conclusion

This paper establishes a theory for sparse estimation with non-convex regularized regression. The framework of non-convex regularizers in this paper is general and especially suitable for sharp concave regularizers. For proper sharp concave regularizers, both global solutions and AGAS solutions give good parameter estimation and sparseness estimation. The proposed SE based estimation conditions are weaker than those of the ℓ1-norm. To obtain AGAS solutions, we give a prediction error based guarantee for the AG property and prove that CD methods yield the desired AGAS solutions.

Our theory explains the improvements in sparse estimation from ℓ1-regularization to non-convex regularization. Our work can serve as a guideline for further study on designing regularizers and developing algorithms for non-convex regularization.

8. Technical proofs
We rst provide two lemmas. The rst is Lemma 1 of Zhang
and Zhang [38].

25

Under the -null consistency condition, we further have


X T e=n1 r :
n

rt  rt  t
C1  t 2 =2:
t

Let -0. Sharp concavity follows.


8.2. The upper bound of λ* for LSP
n

Dene U 40 such that U 2 =log 1 U 2= 2 . Let u U


p
n
and we have r U=2 log 1 U= U 2 log 1 U.

Hence,
Note
that
U r U 2 =log 1 U 2= 2 .
p
p
2 log 1 2= 2 . Also, a r 2 log 1 2= 2 .

n r

8.3. Proof of Theorem 1

^ minimizes y X 22 =2n R, therefore the subgradient


at ^ contains zero, i.e., jxTi X ^  y=njr r_ j^ i j for any
iA supp^ . Dene ^ 1 ; ; ^ i  1 ; 0; ^ i 1 ; ; ^ n . We have
y X ^ 22 =2n R^ r y  X 22 =2n R ,
which
implies
2
2
2nrj^ i j r ^ i xi 22 2^ i xTi y  X ^ r ^ i xi 22 2j^ i jjxTi y  X ^ j r

Lemma 1. Let ^ be a global optima of Problem (1). We have


X T X ^  y=n1 r :

rt Zt

26

Lemma 2.
1. r(u) is subadditive, i.e., ru1 u2 r ru1 ru2 , 8 u1 ; u2 Z 0.
2. For any 8 u 4 0 and any d A ru, r_ 0 Z r_ u  Z
d Z r_ u Z0.
Proof. 1. Since r(u) is concave, it follows that 8 u1 ; u2 Z 0,
u1 =u1 u2 ru1 u2 u2 =u1 u2 r0 rru1
and
u2 =u1 u2 ru1 u2 u1 =u2 u2 r0 rru2 . Summing up the
two inequalities gives ru1 u2 r ru1 ru2 .
2. Invoking the subadditivity, we have ru  u  ru=u r r
u=u for u 4 0 and u Z u. Let u-0. Then r_ 0 Z r_ u  .
The
concavity
of
r(u)
yields
that
ru  r
u  u=u Z ru u  ru=u for u 4 0. From the denition of subgradient of concave function, we have u
d Z ru u  ru and  u d Zru  u  ru for any u 4 0.

2
n^ i 2nj^ i jr_ j^ i j . If ^ i A 0; u0 , this inequality contradicts with
-sharp concavity condition.

8.4. Proof of Theorem 2


We assume that 0 is not a minimizer of
min X e=22 =2n R while ^ a 0 is a minimizer.
Therefore, X ^  e=2 =2n R^ oe2 =2n2 . Since r(u) is

-sharp concave over 0; u0 , the non-zero components


of ^ has magnitudes larger than u0. Thus, X ^ 
e=22 =2n R^ Zru0 Z e22 =2n2 . It contradicts with the
assumption.

8.5. Proof of Theorem 3


Let ^  , S supp , s jSj and T be any index set with
jT j r s. Let i1 ; i2 ; be a sequence of indices such that ik A T for
k Z 1 and ji1 j Z ji2 j Z ji3 j Z . Given an integer t Zs, we partin

tion T as T [ i Z 1 T i such that T 1 fi1 ; ; it g, T 2 fit 1 ; ; i2t g,


. Dene i Z 2 T i 2 , 1 =1  . Before the proof, we


introduce the following three lemmas. Lemma 3 is a special case of


Lemma 6 with 0.

8.6. Proof of Theorem 4

Lemma 3. Under -null consistency,


r RS .
p
Lemma 4. r = t r RT =t.

The proof is similar to Theorem 2 in Zhang and Zhang [38]


except that we bound RS and X 22 =2n as follows.
n
By Eq. (30), we have RS rt=rC 2 1 and
n
2
X 2 =2n r RS r trC 2 1 .

X 22 =2n RS

Proof. For any iA T k and jA T k  1 (k Z 2), we have ji j r jj j. Thus,

rji j r RT k  1 =t, i.e., ji j2 r r  1 RT k  1 =t2 . It follows that


p
Thus,
rT k 2 = t r RT k  1 =t.
p
p
RT =t Z k Z 2 RT k  1 =t Z k Z 2 rT k 2 = t Z r = t .

Lemma 5. Under -null consistency,


p


p


1 2 2t   2t
max T 2 ; T 1 2 r
t 1 n :
2  2t
2
27
Proof. By
Lemma
1,
we
have
X T X =n1 r
n
n
X T X ^  y=n1 X T e=n1 r . We modify Eq. (12) in
Foucart and Lai [15] to the following inequality:
1
1
X ; XT T 1 r T 1 T 1 1 J X T X J 1
n
n
p
n
r t 1 T 2 T 1 2 :

Next, we turn to the proof of Theorem 3. Let   2t,


p
2t, Hr H r 0 ; ; s; t and 1 2 =   1=4.
There are two cases according to the difference of supports of ^
n
and .
n
Case 1: supp^ supp . For this case, we have i 0 for
i A S and 0, with which and Lemma 5, we obtain that
p
p
n
2 S 2 r c1 ; where c1 1 21 t =2  .
n
Case 2: supp^ a supp . Let T be the indices of the rst s
largest components of in the sense of magnitudes. From
p
the concavity of r(u), RT r srT 1 =s r srT 2 = s. By
Lemma 5, we have
!
p



p
T 2
1 2  
p
rsr p
RT r sr
t 1 n :
2
s
2 s 

and Cq 1 

r . Thus, it is feasible that q 2=1

and C 0:5 1  = 1  q . Eq. (14) follows. For this set2

-sharp concave over 0; 0 with


2
0 1  = 1
. We observe that r0 =s Z
2
1  =2 r0 =t holds under the condition that =t
=1 1,
i.e.,
=1 =t1= .
Thus,
r  1 r0 =t 1  and r  1 r0 =s 1  t=s1=q
with q 2=1 . Then, Eq. (15) follows.
ting for C and q, r(u) is

1 =2

n
Let ^  . By Lemma 3 in Section 8.5, we have

T
1=
RE ; S22 r X 22 =n and RIFR
; S r s X X 1 =n.
Invoking
null
consistency,
we
have
eT X =n r X 22 =2n R. Then,
R

0 Z L  L R  R
n

Z X 22 =2n  eT X =n RS  RS
Z 1  X 22 =2n  1 R

p
Z 1  22 RER ; S=2  1 sr_ 0 2 :
p
Hence, we obtain 2 r 2 s=RER ; Sr_ 0 . By Lemma 1,
n
X T X =n rX T X ^  y=n X T e=n r1 . By the de1

nition of RIF, we have r 1 s1= =RIFR


; S.
n

The proof needs the following two lemmas, which are extensions of Lemmas 3 and 5. The notations are the same as Section 8.5
n
except that ~  .

29

n
Lemma 6. Suppose that ~ is a ; approximate global solution
and the regularized regression satises the -null consistency condition. Then, X 22 =2n RS r RS =1  .

30

where

p
1 2
C2
:
2H r  

q1

28

for 0 4 0. If 0 0, the left hand of the above inequality still holds


since H r 0; ; s; t lim-0 H r ; ; s; t. Under the condition
H r  4 0, we have
n

and the concavity of r(u) require that C 1  q 0:5 1 

8.9. Proof of Theorem 6

n
By the denition of 0 in Eq. (8) and supp^ a supp , there
exists j satisfying jj jZ 0 , which implies RT Z r0 . Since
r  1 u=s=r  1 u=t is a non-decreasing function of u, we have that
r
r  1 RT =s
r  1 r0 =s
t
H r 0 ; ; s; t:
Z

s
r  1 RT =t r  1 r0 =t

r  1 RS =t rr  1 RT =t r C 2 1 ;

Suppose ru Cuq 0 o q r1 for u Z 1  . The continuity

8.8. Proof of Theorem 5

Then, following the proof of Theorem 3.1 in Foucart and Lai [15],
Eq. (27) follows.

Combining Lemmas 3 and 4, it follows that


r
r
p




RT
t
RT
1 21 t n
 r1
r
:
r1
s
s
t
2 
s

8.7. The method to obtain Eqs. (14) and (15)

31

p
n
Hence, we have r t C 2 1 by Lemmas 3 and 4. Invoking
Lemma 5 and 2 r T 2 T 1 2 , the conclusion follows
with some algebra.

Proof. Invoking -null consistency condition, we have


eT X =n r X 22 =2n R.
Since
~ n
is
a
n
; approximate global solution, we have

ZLn  Ln Rn Rn
Z X 22 =2n  eT X =n RS  RS
Z 1  X 22 =2n  R RS  RS
Hence, the conclusion follows.

Lemma 7. Under -null consistency,


p


p


1 2 2t   2t
t ~ :
max T 2 ; T 1 2 r
2  2t
2

32

Proof. Since ~ is a -AS solution, we have X T X ~  y=n1 r


r_ 0 . From the triangle inequality and Eq. (26), we have


n
X T X =n1 r X T X ~  y=n1 X T e=n1 r r_ 0 ~ .
Eq. (32) follows with the same analysis as the proof of Lemma 5.

Next, we turn to the proof of Theorem 6. The proof is similar to


that of Theorem 3. Here we only provide some important steps.pLet

  2t, 2t, Gr Gr ~ 0 ; ; s; t and 1 2


=   1=4.
n
Case 1: supp~ supp . Similar to p
Case
p1 of Theorem 3, we
have 2 S 2 rc3 ~ where c3 1 2 t =2  .
n
Case 2: supp~ a supp . Similar to Eq. (29), we have
r
pr




RT
t
RT

1 2 t
 r 1

r
~
33
r1
s
s
t
2 
s
1  t
Since r(u) is non-decreasing and concave, r  1 u is convex. Therefore,






RT

t  1  1 RT
1

r
r1
r1
r
:
t
t
t 1
t
1  t
1
34
r RT =s
t 1
Z Gr p
r  1 RT =t  1
st
1

35

RS =t  1 r c4 ~ c5 r

1

=1  =t  1;

36

1rkrK

k
K

k  1 2
2 r k 1

p
t
1 2
c4
t  1 2Gr  

37

38
c5 =Gr  :
p
p
Hence, we have r t c4 ~ c5 1= t r  1 =1  . With this
and Lemma 7, it follows that 2 r C 4 ~ C 5 r  1 =1  , where
p
p
t 1 2 Gr =t 1 0:5t=t  1
C4
Z c3 ;
39
Gr 

2 1Gr
:
C 5 p
t Gr 

40

8.10. Proof of Theorem 8


p
0
n
0
0
0
Let  . We have X 2 r X  e2 r 0 2n .
So,

L  L R  R
0
0
r L R
p
0
r 20 s s0 r 2 = s s0
p
0
r 20 s s0 rX 2 = n  s s0 s s0
p
p
p

r 20 s s0 r= n 20 = s s0  s s0 :
n

k
di xTi Xzk;i  y=n R0 i ; di

k  1

k  1

For any i 1; ; p  1, let zk;i 1 ; ; i ; i 1 ; ; p


T
k  1
k
k
and zk;0
, zk;p . By the denition of i in Eq. (21),
we have
k

k  1 2

F zk;i r F zk;i i  i
F

2 r F zk;0 F

k
i 

2F
K

42

is non-negative, i.e.,

k  1
di Z 0
i

43

for any di A R. Summing up Eq. (43) from i 1 to p, we have for any


d A Rp
p

k  1

0 r i  i
i1

di R0 ; d

di xTi Xzk;i  y=n


r d1

k  1

d L

1 R0 ; d

k  1

di j

i 1 j i1

r F 0 ; d d1

F zk;p r

k  1

 j xTi xj =n

k  1

1 d1

i 1 j i1

k  1

j j

 j j

k  1

r F 0 ; d pd1 
1
p
k
k  1
0 k
r F ; d p pd1 
2

44

p
k
k  1
2 . When CD
Hence, F ; d Z  p pd1 
p
k
k  1
stops
iteration,

2 r = p p
and
j

j  1

2 Z for jr k  1, which implies F 0 ; d Z  for


0

any d2 1. Invoking Eq. (42), we have r 2F = k  1.


2

Thus, k r 2p p2 F =2 1.

Acknowledgements
This work is supported by 973 Program (2013CB329503),
NSFC (Grant No. 91120301) and Tsinghua National Laboratory for
Information Science and Technology (TNList) Cross-discipline
Foundation.
References

8.11. Proof of Theorem 10

k  1

The directional derivative of Eq. (21) at i

and

k  1 22

where

Thus,

min

Combining Eqs. (33)(35), we know that under the condition of


Eq. (20),
r

i1

We observe that

1

fF g, as well as fF zk;i g and fF zk;i i  i


2 =2g, are
non-increasing sequences and converge to the same nonnegative value.
Summing up the right inequality of Eq. (41) from i 1 to p, we
k
k  1 2
k  1
k
have 
2 r2F
 F = . Summing up from
k 1 to K, we have

=2 r F zk;i  1 :
F zk;i rF zk;i
k

41
k
i 

k  1 2
=
i

. Note that F Z 0 for any k. Thus,

[1] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. 2 (1) (2009) 183–202.
[2] P.J. Bickel, Y. Ritov, A.B. Tsybakov, Simultaneous analysis of Lasso and Dantzig selector, Ann. Stat. 37 (4) (2009) 1705–1732.
[3] P. Breheny, J. Huang, Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection, Ann. Appl. Stat. 5 (1) (2011) 232.
[4] T.T. Cai, Guangwu Xu, Jun Zhang, On recovery of sparse signals via ℓ1 minimization, IEEE Trans. Inf. Theory 55 (7) (2009) 3388–3397.
[5] E. Candès, T. Tao, The Dantzig selector: statistical estimation when p is much larger than n, Ann. Stat. 35 (6) (2007) 2313–2351.
[6] E.J. Candès, Y. Plan, A probabilistic and RIPless theory of compressed sensing, IEEE Trans. Inf. Theory 57 (November (11)) (2011) 7235–7254.
[7] E.J. Candès, T. Tao, Decoding by linear programming, IEEE Trans. Inf. Theory 51 (12) (2005) 4203–4215.
[8] E.J. Candès, M.B. Wakin, S.P. Boyd, Enhancing sparsity by reweighted ℓ1 minimization, J. Fourier Anal. Appl. 14 (5) (2008) 877–905.
[9] R. Chartrand, Fast algorithms for nonconvex compressive sensing: MRI reconstruction from very few data, in: IEEE International Symposium on Biomedical Imaging: From Nano to Macro, ISBI'09, IEEE, 2009, pp. 262–265.
[10] S.S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput. 20 (1) (1999) 33–61.
[11] Xiaojun Chen, Dongdong Ge, Zizhuo Wang, Yinyu Ye, Complexity of unconstrained L2-Lp minimization, Math. Programm. (2011) 1–13.



[12] M.E. Davies, R. Gribonval, Restricted isometry constants where ℓp sparse recovery can fail for 0 < p ≤ 1, IEEE Trans. Inf. Theory 55 (May (5)) (2009) 2203–2214.
[13] Marco F. Duarte, Mark A. Davenport, Dharmpal Takhar, Jason N. Laska, Ting Sun, Kevin F. Kelly, Richard G. Baraniuk, Single-pixel imaging via compressive sampling, IEEE Signal Process. Mag. 25 (2) (2008) 83–91.
[14] J. Fan, R. Li, Variable selection via nonconcave penalized likelihood and its oracle properties, J. Am. Stat. Assoc. 96 (456) (2001) 1348–1360.
[15] S. Foucart, M.J. Lai, Sparsest solutions of underdetermined linear systems via ℓq-minimization for 0 < q ≤ 1, Appl. Comput. Harmon. Anal. 26 (3) (2009) 395–407.
[16] R. Garg, R. Khandekar, Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property, in: ICML 2009, 2009.
[17] D. Geman, C. Yang, Nonlinear image recovery with half-quadratic regularization, IEEE Trans. Image Process. 4 (7) (1995) 932–946.
[18] Pinghua Gong, Changshui Zhang, Zhaosong Lu, Jianhua Z. Huang, Jieping Ye, A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems, in: ICML 2013, 2013.
[19] J. Huang, S. Ma, C.H. Zhang, Adaptive Lasso for sparse high-dimensional regression models, Stat. Sin. 18 (4) (2008) 1603.
[20] D.R. Hunter, R. Li, Variable selection using MM algorithms, Ann. Stat. 33 (4) (2005) 1617.
[21] V. Koltchinskii, The Dantzig selector and sparsity oracle inequalities, Bernoulli 15 (3) (2009) 799–828.
[22] J. Mairal, M. Elad, G. Sapiro, Sparse representation for color image restoration, IEEE Trans. Image Process. 17 (1) (2008) 53–69.
[23] R. Mazumder, J.H. Friedman, T. Hastie, SparseNet: coordinate descent with nonconvex penalties, J. Am. Stat. Assoc. 106 (495) (2011) 1125–1138.
[24] Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, Bin Yu, A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers, Stat. Sci. 27 (4) (2012) 538–557.
[25] Zheng Pan, Changshui Zhang, High-dimensional inference via Lipschitz sparsity-yielding regularizers, in: AISTATS 2013, 2013.
[26] J.C. Ramirez-Giraldo, J. Trzasko, S. Leng, L. Yu, A. Manduca, C.H. McCollough, Nonconvex prior image constrained compressed sensing (NCPICCS): theory and simulations on perfusion CT, Med. Phys. 38 (4) (2011) 2157.


[27] Xiaotong Shen, Wei Pan, Yunzhang Zhu, Likelihood-based selection and sharp parameter estimation, J. Am. Stat. Assoc. 107 (497) (2012) 223–232.
[28] Xiaotong Shen, Wei Pan, Yunzhang Zhu, Hui Zhou, On constrained and regularized high-dimensional regression, Ann. Inst. Stat. Math. (2013) 1–26.
[29] J. Shi, X. Ren, G. Dai, J. Wang, Z. Zhang, A non-convex relaxation approach to sparse dictionary learning, in: CVPR 2011, 2011.
[30] E.Y. Sidky, R. Chartrand, X. Pan, Image reconstruction from few views by non-convex optimization, in: Nuclear Science Symposium Conference Record, 2007 (NSS'07), vol. 5, IEEE, 2007, pp. 3526–3530.
[31] J.A. Tropp, A.C. Gilbert, Signal recovery from random measurements via orthogonal matching pursuit, IEEE Trans. Inf. Theory 53 (12) (2007) 4655–4666.
[32] J. Trzasko, A. Manduca, Relaxed conditions for sparse signal recovery with general concave priors, IEEE Trans. Signal Process. 57 (11) (2009) 4347–4354.
[33] J. Trzasko, A. Manduca, E. Borisch, Sparse MRI reconstruction via multiscale ℓ0-continuation, in: IEEE/SP 14th Workshop on SSP'07, 2007.
[34] J. Trzasko, A. Manduca, E. Borisch, Highly undersampled magnetic resonance image reconstruction via homotopic ℓ0-minimization, IEEE Trans. Med. Imaging 28 (1) (2009) 106–121.
[35] J.D. Trzasko, A. Manduca, A fixed point method for homotopic ℓ0-minimization with application to MR image recovery, in: Medical Imaging, International Society for Optics and Photonics, 2008.
[36] F. Ye, C.H. Zhang, Rate minimaxity of the Lasso and Dantzig selector for the ℓq loss in ℓr balls, J. Mach. Learn. Res. (2010) 3519–3540.
[37] C.H. Zhang, Nearly unbiased variable selection under minimax concave penalty, Ann. Stat. 38 (2) (2010) 894–942.
[38] C.H. Zhang, T. Zhang, A general theory of concave regularization for high dimensional sparse estimation problems, Stat. Sci., 2012.
[39] Tong Zhang, Analysis of multi-stage convex relaxation for sparse regularization, J. Mach. Learn. Res. 11 (2010) 1081–1107.
[40] Tong Zhang, Sparse recovery with orthogonal matching pursuit under RIP, IEEE Trans. Inf. Theory 57 (September (9)) (2011) 6215–6221.
[41] Tong Zhang, Multi-stage convex relaxation for feature selection, Bernoulli 19 (5B) (2013) 2277–2293.
[42] H. Zou, The adaptive Lasso and its oracle properties, J. Am. Stat. Assoc. 101 (476) (2006) 1418–1429.

Zheng Pan received his B.E. degree in Automation from Tsinghua University, Beijing, China, in 2009. He is currently a Ph.D. student at the State Key Laboratory of Intelligent
Technology and Systems, Department of Automation, Tsinghua University, Beijing, China. His research interests include machine learning and data mining.

Changshui Zhang received his B.S. degree from Peking University, Beijing, China, in 1986, and his Ph.D. degree from Tsinghua University, Beijing, China, in 1992. He is currently a Professor in the Department of Automation, Tsinghua University. He is an Editorial Board Member of Pattern Recognition. His interests include artificial intelligence, image processing, pattern recognition, machine learning, and evolutionary computation.
