Zheng Pan, Changshui Zhang
Pattern Recognition
journal homepage: www.elsevier.com/locate/pr
Article info

Article history: Received 30 June 2013; received in revised form 23 January 2014; accepted 19 June 2014; available online 30 June 2014.

Keywords: Sparse estimation; Non-convex regularization; Sparse eigenvalue; Coordinate descent

Abstract

Non-convex regularizers usually improve the performance of sparse estimation in practice. To substantiate this observation, we study sparse-estimation conditions for sharp concave regularizers, a general family of non-convex regularizers that includes many existing regularizers. For the global solutions of the regularized regression, our sparse-eigenvalue-based conditions are weaker than those of ℓ1-regularization for both parameter estimation and sparseness estimation. For approximate global and approximate stationary (AGAS) solutions, almost the same conditions suffice. We show that the desired AGAS solutions can be obtained by coordinate descent (CD) based methods. Finally, we perform experiments to show how well CD methods produce AGAS solutions and how much weaker the estimation conditions required by sharp concave regularizers are.

© 2014 Elsevier Ltd. All rights reserved.
1. Introduction
High-dimensional estimation concerns parameter estimation problems in which the dimension of the parameter is comparable to or larger than the sample size. In general, high-dimensional estimation is ill-posed, and additional prior knowledge about the structure of the parameters is usually needed to obtain consistent estimates. In recent years, a tremendous body of research has demonstrated that a prior on the sparsity of the true parameters can lead to good estimators, e.g., the well-known work on compressed sensing [6] and its extensions to general high-dimensional inference [24].
For high-dimensional sparse estimation, sparsity is usually imposed through sparsity-encouraging [8] regularizers for linear regression. Many regularizers have been proposed to describe the prior of sparsity, e.g., the ℓ0-norm, the ℓ1-norm, the ℓq-norm with 0 < q < 1, the smoothly clipped absolute deviation (SCAD) penalty [14], the log-sum penalty (LSP) [8], the minimax concave penalty (MCP) [37] and the Geman penalty (GP) [17,32]. Except for the ℓ1-norm, all of these sparsity-encouraging regularizers are non-convex. Non-convex regularizers have been shown to improve the performance of sparse estimation in many applications, e.g., image inpainting and denoising [29], biological feature selection [3,27], MRI [8,9,33–35] and CT [26,30].
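For reference, the following Python sketch (ours, not part of the original paper) collects standard closed forms of these basis functions as they appear in the cited literature; the parameter names lam, gamma, theta and q, and their default values, are our own illustrative choices.

```python
import numpy as np

# Standard scalar basis functions r(u), u >= 0, from the cited literature.
# theta > 0, gamma > 1 (SCAD conventionally uses gamma > 2), lam > 0 and
# 0 < q < 1 are regularizer parameters; names and defaults are ours.

def r_l1(u):
    return u                                    # l1-norm: the convex reference

def r_lq(u, q=0.5):
    return u ** q                               # lq-norm, 0 < q < 1

def r_lsp(u, theta=0.1):
    return np.log1p(u / theta)                  # log-sum penalty (LSP)

def r_mcp(u, lam=1.0, gamma=3.0):
    # closed form of int_0^u max(1 - x/(gamma*lam), 0) dx
    return np.where(u <= gamma * lam,
                    u - u ** 2 / (2 * gamma * lam),
                    gamma * lam / 2)

def r_scad(u, lam=1.0, gamma=3.7):
    # closed form of int_0^u min(1, max(gamma - x/lam, 0)/(gamma - 1)) dx
    mid = (2 * gamma * lam * u - u ** 2 - lam ** 2) / (2 * (gamma - 1) * lam)
    return np.where(u <= lam, u,
                    np.where(u <= gamma * lam, mid, lam * (gamma + 1) / 2))

def r_gp(u, theta=0.1):
    return u / (u + theta)                      # Geman penalty (GP)
```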
* Corresponding author: Department of Automation, Tsinghua University, Beijing 100084, P.R. China. Tel.: +86 10 62796872.
E-mail addresses: panz09@mails.tsinghua.edu.cn (Z. Pan), zcs@tsinghua.edu.cn (C. Zhang).
http://dx.doi.org/10.1016/j.patcog.2014.06.018
0031-3203/© 2014 Elsevier Ltd. All rights reserved.
For parameter estimation, many applications and experiments have demonstrated that non-convex regularizers give good estimates with far smaller sample sizes than the ℓ1-norm [8,9,26,30,33–35]. In theory, the requirements on the sample size are essentially requirements on the design matrix or, rather, estimation conditions. A weaker estimation condition means that a smaller sample size suffices, or that weaker requirements are placed on the design matrix. Weaker estimation conditions are important for applications in which the data dimension is very high while sampling is expensive or restricted. Theoretically, all of the non-convex regularizers mentioned above admit accurate parameter estimation under appropriate conditions, e.g., the ℓq-norm [15], MCP [37], SCAD [37] and general non-convex regularizers [38].
There are mainly two types of estimation conditions. The first is sparse eigenvalue (SE) conditions, e.g., the restricted isometry property (RIP) [6,7] and the SE used by Foucart and Lai [15] and Zhang [40]. The second is restricted eigenvalue (RE) conditions, e.g., the ℓ2-restricted eigenvalue (ℓ2-RE) [2,21] and the restricted invertibility factor (RIF) [36,38]. Based on SE, Foucart and Lai [15] gave a weaker estimation condition for the ℓq-norm than for the ℓ1-norm. Trzasko and Manduca [32] established a universal RIP condition for general non-convex regularizers including the ℓ1-norm. Since the conditions proposed by Trzasko and Manduca [32] are regularizer-independent, they unfortunately cannot be weakened for non-convex regularizers. The definition of SE is regularizer-independent while RE depends on the regularizer. RE can therefore give regularizer-dependent estimation conditions for general regularizers, e.g., the ℓ2-RE based work by Negahban et al. [24] and the RIF based work by Zhang and Zhang [38].
The optimization of non-convex regularized problems is difficult: a global optimum usually cannot be guaranteed for general non-convex regularizers. Nevertheless, some optimization methods can reach local optima, e.g., coordinate descent [3,23], iteratively reweighted (or majorization–minimization) methods [8,20,41,39], homotopy [37], difference-of-convex (DC) methods [27,28] and proximal methods [18,25]. Hence, it is meaningful to analyze the sparse-estimation performance of these non-optimal optimization methods. For example, the multi-stage relaxation methods [41,39] and their one-stage version, the adaptive LASSO [19,42], replace the regularizers with convex relaxations via majorization–minimization (a sketch of this scheme follows below); compared with LASSO, the multi-stage relaxation methods improve the performance of parameter estimation [39]. Zhang and Zhang [38] use the LASSO solution as the initialization and continue to optimize by gradient descent; they show that LASSO followed by gradient descent outputs an approximate global and stationary solution which is identical to the unique sparse local solution and to the global solution. The multi-stage relaxation methods, LASSO followed by gradient descent and the homotopy methods need the same SE or RE conditions as LASSO. The DC methods [28] and the proximal methods [25] need to know the sparseness of their solutions in advance to guarantee the performance of parameter estimation, but these two methods cannot control the sparseness of their solutions explicitly.
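To make the multi-stage relaxation idea concrete, here is a minimal Python sketch (ours): each stage majorizes the concave penalty by its tangent line at the current iterate, which turns the subproblem into a weighted LASSO. The function weighted_lasso is a hypothetical placeholder for any weighted-ℓ1 solver.

```python
import numpy as np

def multistage_relaxation(X, y, r_prime, lam, weighted_lasso, stages=5):
    """Multi-stage convex relaxation via majorization-minimization (sketch).

    r_prime(u): derivative of the concave basis function, e.g.
        r_prime = lambda u: 1.0 / (theta + u) for LSP.
    weighted_lasso(X, y, w): hypothetical solver for
        min_b ||X b - y||_2^2 / (2n) + sum_i w_i |b_i|.
    """
    p = X.shape[1]
    beta = np.zeros(p)
    for _ in range(stages):
        # Tangent-line majorization of the concave penalty at the
        # current iterate: weights are the local slopes r'(|b_i|).
        w = lam * r_prime(np.abs(beta))
        beta = weighted_lasso(X, y, w)
    return beta
```

With a zero initialization the first stage is an ordinary LASSO, and later stages down-weight large coefficients, in the spirit of the adaptive LASSO mentioned above.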
Based on the related work, we make the following contributions: we formulate the family of sharp concave regularizers; for their global solutions we establish SE-based conditions for parameter and sparseness estimation that are weaker than those of ℓ1-regularization; we show that almost the same conditions suffice for approximate global and approximate stationary (AGAS) solutions; and we prove that coordinate descent methods yield the desired AGAS solutions.
2. Preliminaries
We first formulate the sparse estimation problem. Suppose that we have n samples (y₁, z₁), (y₂, z₂), …, (yₙ, zₙ), where yᵢ ∈ ℝ and zᵢ ∈ ℝᵖ for i = 1, …, n. Let X = (z₁, …, zₙ)ᵀ ∈ ℝⁿˣᵖ and y = (y₁, …, yₙ)ᵀ ∈ ℝⁿ.
Table 1
Examples of popular regularizers and their basis functions r(u), with parameters θ > 0, γ > 1 and 0 < q < 1. Section 8.2 gives the proof for the result on μ* of LSP.

Name    | Basis function r(u)
ℓ1-norm | r(u) = u
ℓq-norm | r(u) = u^q
SCAD    | r(u) = ∫₀ᵘ min{1, max(γ − x/λ, 0)/(γ − 1)} dx
LSP     | r(u) = log(1 + u/θ)
MCP     | r(u) = ∫₀ᵘ max(1 − x/(γλ), 0) dx
GP      | r(u) = u/(u + θ)
The regularized regression is

β̂ ∈ argmin_{β∈ℝᵖ} F(β) := ‖Xβ − y‖₂²/(2n) + R(β),  (1)

where R(β) = λ Σᵢ₌₁ᵖ r(|βᵢ|) and r(u) is a basis function as in Table 1.
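Under the reconstruction of Eq. (1) above, the objective can be written in a few lines of Python (ours; a sketch, not the paper's code):

```python
import numpy as np

def objective(beta, X, y, r, lam):
    """F(beta) = ||X beta - y||_2^2 / (2n) + lam * sum_i r(|beta_i|),
    i.e. Eq. (1) with a basis function r from Table 1."""
    n = X.shape[0]
    resid = X @ beta - y
    return resid @ resid / (2.0 * n) + lam * np.sum(r(np.abs(beta)))
```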
Section 8.1 shows that sharp concavity only needs Eq. (2) to hold for t₁ = 0 and any t₂ ∈ (0, u₀], which means that sharp concavity is weaker than strong concavity. For example, MCP is sharp concave on intervals over which it is not strongly concave, and the ℓq-norm is q(1 − q)u₀^{q−2}-sharp concave over (0, u₀]. Throughout, we write ρ̄ = max_{1≤i≤p} ‖xᵢ‖₂²/n.
3. Sparse estimation of global solutions

Based on SE, we establish the following parameter estimation result for global solutions of non-convex regularized regression. Let û₀ and u₀* be the zero gaps of the global solution β̂ and the true parameter β*, and let θ₀ = min{û₀, u₀*}.

Theorem 3. Suppose that

1. r(u) is invertible for u ≥ 0 and r⁻¹(u/s₁)/r⁻¹(u/s₂) is a non-decreasing function of u for any s₂ ≥ s₁ ≥ 1;
2. the regularized regression satisfies ε-null consistency;
3. the following SE condition holds for some integer t ≥ s:

ρ₊(2t)/ρ₋(2t) < 4(√2 − 1) H_r(θ₀; ε, s, t) + 1,  (9)

where s = ‖β*‖₀, η = (1+ε)/(1−ε),

H_r(θ₀; ε, s, t) = √(s/t) · r⁻¹(r(θ₀)/s) / r⁻¹(η r(θ₀)/t)  for θ₀ > 0,

and H_r(0; ε, s, t) = lim_{θ₀→0} H_r(θ₀; ε, s, t). Then

‖β̂ − β*‖₂ ≤ C₁ λ,  (10)

where

C₁ = (1+√2)(1+η)√t (H_r(θ₀; ε, s, t) + 1/2) / [ρ₋(2t) (H_r(θ₀; ε, s, t) + 1 − √2(ρ₊(2t)/ρ₋(2t) − 1)/4)].

The same analysis controls the sparseness of global solutions: with C₂ and C₃ the constants defined in Eqs. (28) and (31), if θ₀ is bounded by the quantity in Eq. (11), then

|supp(β̂)\S| ≤ θ₀ s + (1 + r(θ₀C₃)/r(θ₀)) s.  (12)

For LSP, a ≤ √(2 log(1 + 2/(θλ²))), so the right hand side of Eq. (11) is larger than

log(1 + (1+C₂)(1 + √(2 log(1 + 2/(θλ²))))) / (1 + √(2 log(1 + 2/(θλ²))))².

Thus, as θ goes to 0, the right hand side of Eq. (11) becomes arbitrarily large, and Eq. (11) holds for small enough θ. The right hand side of Eq. (12) is θ₀s + O(s) as θ → 0. Hence, we can freely select θ₀ satisfying Eq. (11) with small enough θ. For example, if θ₀ = 1/s, Eq. (11) holds for small enough θ and Eq. (12) becomes

|supp(β̂)\S| ≤ 1 + s (1 + log(1 + (1+C₂)(1 + √(2 log(1 + 2/(θλ²))))) / log(1 + 1/θ)).  (13)

The right hand side of Eq. (13) is at most of the order of s when θ is close to zero.

To see how restrictive the SE condition is for random designs, we generate 100 Gaussian random matrices¹ and compute the maxima, the minima and the means of ρ̃₊(t), ρ̃₋(t) and ρ̃₊(t)/ρ̃₋(t) over the 100 trials. Fig. 1 illustrates the results. The variances of ρ̃₊(t), ρ̃₋(t) and ρ̃₊(t)/ρ̃₋(t) for the same n and t are small, since the corresponding lines for the maximum, minimum and mean values are close to each other. However, ρ̃₊(t)/ρ̃₋(t) grows very fast as t grows or n decreases.

¹ The elements are i.i.d. drawn from the standard Gaussian distribution N(0, 1).
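Computing ρ±(t) exactly requires a search over all size-t supports, which is intractable; the quantities ρ̃±(t) in Fig. 1 are Monte Carlo surrogates. The sketch below (ours; the random-support protocol is our assumption about how such surrogates are computed) estimates them:

```python
import numpy as np

def se_surrogate(X, t, trials=100, seed=0):
    """Monte Carlo surrogates (rho-_tilde, rho+_tilde) for the sparse
    eigenvalues: extreme eigenvalues of X_T^T X_T / n over random
    supports T with |T| = t.  This only approximates rho+-(t), whose
    exact computation is combinatorial."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    lo, hi = np.inf, -np.inf
    for _ in range(trials):
        T = rng.choice(p, size=t, replace=False)
        G = X[:, T].T @ X[:, T] / n        # t-by-t Gram matrix
        w = np.linalg.eigvalsh(G)          # eigenvalues, ascending
        lo, hi = min(lo, w[0]), max(hi, w[-1])
    return lo, hi
```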
4. Discussion on Theorem 3

This section discusses Theorem 3 in more detail.
4.1. Invertible approximate regularizers
If r(u) is not invertible, e.g., MCP, we can design invertible basis
function to approximate it. For example, we can use the following
invertible function, named Approximate MCP, to approximate MCP:
8
2
0 r u r 1 ;
>
< u u =2 ;
2=1
ru 1 2
u
2
>
; u 4 1 ;
: 1
2
1
14
where A 0; 1. Approximate MCP is concatenated by the part of
MCP over 0; 1 and the part of q-norm over 1 ; 1
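A direct implementation of Eq. (14) as reconstructed above (ours; the parameter defaults are illustrative) makes the junction conditions easy to check numerically:

```python
import numpy as np

def r_approx_mcp(u, lam=1.0, gamma=3.0, delta=0.5):
    """Approximate MCP, Eq. (14): MCP on [0, (1-delta)*gamma*lam],
    continued by a power (lq-type) tail with exponent 2*delta/(1+delta),
    chosen so that value and derivative agree at the junction."""
    u1 = (1.0 - delta) * gamma * lam                    # junction point
    head = u - u ** 2 / (2.0 * gamma * lam)             # MCP part
    tail = ((1.0 - delta ** 2) * gamma * lam / 2.0
            * (u / u1) ** (2.0 * delta / (1.0 + delta)))
    return np.where(u <= u1, head, tail)
```

Unlike MCP, which is constant for u > γλ, this function is strictly increasing and hence invertible on [0, ∞).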
Fig. 1. ρ̃₊(t), ρ̃₋(t) and ρ̃₊(t)/ρ̃₋(t) for Gaussian random matrices with p = 10 000, n = 500, 1000, 1500, 2000 and t ranging from 1 to n. The solid lines are the average values over the 100 trials and the two dashed lines around each solid line are the maximum and the minimum of the 100 trials.

Fig. 2. The upper bounds of the SE conditions for LSP, approximate MCP (AMCP) and the ℓq-norm, with η = 1.01 and t = 2s. In each subfigure, we also plot the upper bound of the SE condition for the ℓ1-norm, i.e., the right hand side of Eq. (16) with q = 1. (a) Approximate MCP vs. L1. (b) LSP vs. L1. (c) ℓq-norm vs. L1.
Writing η = (1+ε)/(1−ε), the SE condition (9) can be instantiated for specific regularizers. For the ℓq-norm, H_r(θ₀; ε, s, t) = η^{−1/q}(t/s)^{1/q−1/2}, and the SE condition can be written as

ρ₊(2t)/ρ₋(2t) < 4(√2 − 1) η^{−1/q} (t/s)^{1/q−1/2} + 1.  (16)

Eqs. (15) and (17) give the corresponding forms for approximate MCP and LSP.

It should be noted that H_r(θ₀; ε, s, t) grows without bound as the regularizer parameter tends to 0 (θ → 0 for LSP, q → 0 for the ℓq-norm, δ → 0 for approximate MCP). Fig. 2 shows some special cases of H_r(θ₀; ε, s, t) for these three regularizers and the ℓ1-norm. In Fig. 2, the SE conditions in Eq. (9) are much weaker than that of the ℓ1-norm.

Theorem 3 reveals that the upper bound constraint for ρ₊(2t)/ρ₋(2t) tends to infinity as the regularizer parameter tends to 0 for proper non-convex regularizers. It implies that if

ρ₋(2t) = inf{ ‖XΔ‖₂²/(n‖Δ‖₂²) : ‖Δ‖₀ ≤ 2t } > 0,  (18)

then the SE condition (Eq. (9)) is satisfied for a suitable parameter setting. Based on this observation, we have Corollary 1. In Corollary 1, ρ₋(2s+2) > 0 holds if the columns of X are in general position² and 2s+2 ≤ n, which is almost optimal in the sense that it is the same as the SE condition of ℓ0-regularized regression [38].
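As a worked check of the instantiation used in Eq. (16), the following derivation (ours, under the reconstructed definition of H_r above) computes H_r for the ℓq-norm:

```latex
% For r(u) = u^q with 0 < q < 1, we have r^{-1}(v) = v^{1/q}, so
\[
H_r(\theta_0;\varepsilon,s,t)
  = \sqrt{\tfrac{s}{t}}\,
    \frac{r^{-1}\!\big(r(\theta_0)/s\big)}{r^{-1}\!\big(\eta\,r(\theta_0)/t\big)}
  = \sqrt{\tfrac{s}{t}}\,\Big(\frac{t}{\eta\,s}\Big)^{1/q}
  = \eta^{-1/q}\Big(\frac{t}{s}\Big)^{1/q-1/2},
\]
% which is independent of \theta_0 and grows without bound as q -> 0.
```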
4.5. Comparison between SE and RE

Like SE, RE is also popular for constructing estimation conditions. There are some variants of RE, e.g., the ℓ2-RE [2,21] and the RIF [36,38]. RE yields a simple expression for the parameter estimation error and the corresponding estimation condition.

Definition 5 (ℓ2-RE). For ξ ≥ 1, a regularizer R, an index set S ⊆ {1, …, p} and its complement S̄, the ℓ2-RE is defined as

RE_R(ξ; S) = inf{ ‖XΔ‖₂²/(n‖Δ‖₂²) : R(Δ_S̄) ≤ ξ R(Δ_S) }.  (19)

Definition 6 (Restricted invertibility factor). For ξ ≥ 1, q ≥ 1, a regularizer R and an index set S ⊆ {1, …, p}, the RIF is defined as

RIF_R^q(ξ; S) = inf{ |S|^{1/q} ‖XᵀXΔ‖_∞ / (n‖Δ‖_q) : R(Δ_S̄) ≤ ξ R(Δ_S) }.

The constraint set of RE_R contains

{Δ ∈ ℝᵖ : ṙ(0)‖Δ_S̄‖₁ ≤ ξ ⟨ṙ(|Δ_S|), |Δ_S|⟩},

where |Δ_S| is the vector composed of the absolute values of the components of Δ_S, i.e., |Δ_S| = (|Δᵢ| : i ∈ S), and ṙ(|Δ_S|) = (ṙ(|Δᵢ|) : i ∈ S). Thus, we obtain an upper bound for RE_R(ξ; S): as the scale of Δ goes to 0, this constraint becomes ‖Δ_S̄‖₁ ≤ ξ‖Δ_S‖₁, and therefore

RE_R(ξ; S) ≤ RE_{ℓ1}(ξ; S).

RE_R(ξ; S) ≤ RE_{ℓ1}(ξ; S) means that the RE-based condition RE_R(ξ; S) > 0 of non-convex regularized regression is not relaxed. Negahban et al. [24] put an additional constraint U = {Δ : ‖Δ‖ ≥ τ} into the definition of RE. This constraint avoids the bad case ‖Δ‖ → 0. However, it still cannot guarantee a larger RE for non-convex regularizers than for the ℓ1-norm. For example, let t₁, t₂ and t₃ satisfy |t₁| + |t₂| ≤ 2|t₃|, and let ξ = 2, S = {3} and S̄ = {1, 2}. The concavity of r(u) implies that r(|t₁|) + r(|t₂|) ≤ 2r((|t₁| + |t₂|)/2) ≤ 2r(|t₃|). For this case, {Δ : ‖Δ_S̄‖₁ ≤ ξ‖Δ_S‖₁} ⊆ {Δ : R(Δ_S̄) ≤ ξR(Δ_S)}. Thus, RE_R(ξ; S) ≤ RE_{ℓ1}(ξ; S). For RIF, we have the same result. Although non-convex regularizers give better approximations to the ℓ0-norm, the RE of non-convex regularizers cannot be guaranteed to be larger than that of the ℓ1-norm. The framework of RE does not leave space to relax the estimation condition for non-convex regularizers.

The only difference between the definitions of SE and RE lies in the constraints on Δ. The two constraint sets {Δ : ‖Δ‖₀ ≤ 2t} and {Δ : R(Δ_S̄) ≤ ξR(Δ_S)} do not contain each other. However, we observe that ρ₋(2t) ≥ min_{|T|≤s} RE_R((2t−s)/s; T).

² General position means that any n columns of X are linearly independent. The columns of X are in general position with probability 1 if the elements of X are i.i.d. drawn from a continuous distribution, e.g., Gaussian.
5. Sparse estimation of AGAS solutions

For an AGAS solution β̃, the estimation error is governed by ṙ(0), the parameter λ = O(σ/√n) chosen by Eq. (3), the degree ν of approximating the stationary solutions and the degree of approximating the global optima through r⁻¹(δ/(1−ε)). If r(u) = r₀(u/θ) and r₀ has a finite derivative at zero, then ṙ(0) = ṙ₀(0)/θ. Since λ = O(σ/√n) by Eq. (3), the estimation error bound is of the form

‖β̃ − β*‖₂ ≤ O(σ/√n) + O(ν) + O(r⁻¹(δ/(1−ε))).

Theorem 6. Suppose that

1. β̃ is a δ-approximate global (AG) solution and a ν-approximate stationary (AS) solution;
2. r(u) is invertible for u ≥ 0 and r⁻¹(u/s₁)/r⁻¹(u/s₂) is a non-decreasing function of u for any s₂ ≥ s₁ > 0;
3. the regularized regression satisfies ε-null consistency;
4. the following SE condition holds for some integer t ≥ s + 1:

ρ₊(2t)/ρ₋(2t) < 4(√2 − 1) G_r(θ̃₀; ε, s, t) + 1,  (20)

where θ̃₀ is the zero gap associated with β̃ and β*, and

G_r(θ̃₀; ε, s, t) = √(s/(t−1)) · r⁻¹(r(θ̃₀)/s) / r⁻¹(η r(θ̃₀)/(t−1)).

Then ‖β̃ − β*‖₂ ≤ C₄ λ̃ + C₅ r⁻¹(δ/(1−ε)), where λ̃ = (1+ε)λṙ(0) + ν and C₄, C₅ are positive constants defined in Eqs. (39) and (40).

A companion result (Theorem 7) controls the sparseness of AGAS solutions, bounding |supp(β̃)\S| in terms of b = ‖Xᵀe/n‖_∞ and the regularizer through r and ṙ, in analogy with Eq. (12).
To obtain AGAS solutions, we use coordinate descent. The CD methods update one coordinate at a time:

βᵢ^{(k)} = argmin_{u∈ℝ} F(β₁^{(k)}, …, β_{i−1}^{(k)}, u, β_{i+1}^{(k−1)}, …, β_p^{(k−1)}) + (κ/2)(u − βᵢ^{(k−1)})²,  (21)

where k is the iteration counter, i = 1, …, p and κ > 0 is a positive constant. The constant κ balances decreasing F(β) against not moving far from the previous step. The above CD method is also called proximal coordinate descent. For Problem (1), the CD methods iterate as follows:

βᵢ^{(k)} = argmin_{u∈ℝ} ((‖xᵢ‖₂²/n + κ)/2)(u − vᵢ^{(k)})² + λ r(|u|),  (22)

where xᵢ is the i-th column of the design matrix X,

vᵢ^{(k)} = (xᵢᵀ(y − Xβ^{(k,i)})/n + κ βᵢ^{(k−1)}) / (‖xᵢ‖₂²/n + κ),

and β^{(k,i)} = (β₁^{(k)}, …, β_{i−1}^{(k)}, 0, β_{i+1}^{(k−1)}, …, β_p^{(k−1)}). Problem (22) is a non-convex scalar problem. We assume that Problem (22) can be solved exactly; if it has more than one global solution, we take any one of them. The iterations are stopped once

‖β^{(k)} − β^{(k−1)}‖₂ ≤ τ.  (23)

Theorem 10. With the stopping rule (23), CD stops within k ≤ 1 + 2p(√p + p)²(F(β^{(0)}) − F*)/(κν²) iterations and outputs a ν-AS solution with

ν ≤ √p(√p + p) τ,  (24)

where p is the number of columns of the design matrix X.

Theorem 10 shows that CD methods give a further decrease to the value of the AG property and guarantee the ν-AS property, which is necessary for sparse estimation in Theorems 6 and 7.
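The following Python sketch (ours) implements the proximal CD iteration of Eqs. (21)–(23). The paper assumes the scalar problem (22) is solved exactly; here it is approximated by a grid search, an implementation shortcut of ours rather than the paper's method.

```python
import numpy as np

def prox_cd(X, y, r, lam, kappa=0.1, tau=1e-6, max_iter=500,
            grid=np.linspace(-5.0, 5.0, 2001)):
    """Proximal coordinate descent for F(beta) of Eq. (1).

    The scalar subproblem (22) is non-convex; we approximate its exact
    solution by minimizing over a fixed grid of candidate values
    (assumed search range; refine as needed)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n              # ||x_i||_2^2 / n
    resid = y - X @ beta
    for _ in range(max_iter):
        beta_old = beta.copy()
        for i in range(p):
            a = col_sq[i] + kappa
            # center of the quadratic in (22); resid = y - X beta holds here
            v = (X[:, i] @ resid / n + col_sq[i] * beta[i]
                 + kappa * beta[i]) / a
            vals = 0.5 * a * (grid - v) ** 2 + lam * r(np.abs(grid))
            u = grid[np.argmin(vals)]
            resid += X[:, i] * (beta[i] - u)       # keep residual in sync
            beta[i] = u
        if np.linalg.norm(beta - beta_old) <= tau: # stopping rule (23)
            break
    return beta
```

The grid keeps the sketch regularizer-agnostic; for specific penalties such as LSP the scalar problem can instead be solved in closed form by comparing the stationary points of a quadratic with the boundary candidate u = 0.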
6. Experiment

In this section, we experimentally show the performance of CD methods in producing AGAS solutions and the degree of weakness of the estimation conditions required by the sharp concave regularizers.

6.1. AGAS solutions

In Section 5, we prove that the AGAS parameters δ and ν decrease monotonically and tend to 0, and that the zero gap ũ₀ is maintained in each iteration of the CD algorithms. We verify this experimentally in this part.

We set the dimension of the parameter to p = 1000 and the number of non-zero components of β* (the true parameter) to s = ⌈log p⌉. We randomly choose s indices as the non-zero components. The non-zero components are i.i.d. drawn from N(0, 1), and those falling in (−0.1, 0.1) are promoted to ±0.1 according to their signs. The elements of the design matrix X ∈ ℝⁿˣᵖ are i.i.d. drawn from N(0, 1), where n = 10s log p. The noise e is drawn from N(0, I_n) and normalized such that ‖e‖₂ = 0.01. We fix θ = 0.1 and ε = 0.01 for all the non-convex regularizers (LSP, MCP and GP) and use Eq. (3) to choose λ.

For the CD algorithm, we set κ = 0.1. The CD algorithm is initialized with the zero vector and terminated when ν is below 10⁻³ (by Theorem 10, we set τ = 10⁻³/(√p(√p + p))) or when the number of iterations exceeds 500. For each regularizer, we run CD for 100 trials with independent true parameters and design matrices.

We show the boxplots of ũ₀, δ and ν at each iteration in Fig. 3. The left column shows that the CD methods maintain the zero gaps in each iteration, as stated in Theorem 9. The middle column shows that F(β^{(k)}) − F* decreases to zero for most trials within 100 iterations. The right column shows that most of the solutions are very close to stationary solutions within 100 iterations.
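The data-generating protocol of this subsection can be reproduced as follows (our sketch; rounding s = log p up to an integer is our assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1000
s = int(np.ceil(np.log(p)))            # s = log p, rounded up (assumed)
n = int(10 * s * np.log(p))            # sampling size of Section 6.1

beta_star = np.zeros(p)
idx = rng.choice(p, size=s, replace=False)
vals = rng.standard_normal(s)
small = np.abs(vals) < 0.1             # promote (-0.1, 0.1) to +/-0.1
vals[small] = np.where(vals[small] >= 0, 0.1, -0.1)
beta_star[idx] = vals

X = rng.standard_normal((n, p))        # i.i.d. N(0, 1) design
e = rng.standard_normal(n)
e *= 0.01 / np.linalg.norm(e)          # normalize so ||e||_2 = 0.01
y = X @ beta_star + e
```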
6.2. Weaker conditions for sparse estimation

We show the performance of non-convex regularizers for sparse estimation in this part. For an estimate β̃, three criteria are used to describe the performance of sparse estimation: (1) the sparseness ‖β̃‖₀; (2) the relative recovery error (RRE) ‖β̃ − β*‖₂/‖β*‖₂; and (3) the support recovery rate (SRR)

|supp(β̃) ∩ supp(β*)| / |supp(β̃) ∪ supp(β*)|.

A weaker estimation condition than that of convex regularizers can be verified by achieving a more accurate sparseness, a lower RRE or a higher SRR with a smaller sample size.

We fix the dimension of the parameters and the sparseness of the true parameters, and we vary the sampling size n to compare the three criteria between a convex regularizer (the ℓ1-norm, implemented by FISTA [1]) and non-convex regularizers (LSP, MCP and GP).
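The three criteria have direct implementations (our sketch):

```python
import numpy as np

def criteria(beta_hat, beta_star):
    """Sparseness, RRE and SRR as defined in Section 6.2."""
    supp_hat = set(np.flatnonzero(beta_hat))
    supp_star = set(np.flatnonzero(beta_star))
    sparseness = len(supp_hat)                           # ||beta_hat||_0
    rre = (np.linalg.norm(beta_hat - beta_star)
           / np.linalg.norm(beta_star))                  # relative error
    srr = len(supp_hat & supp_star) / len(supp_hat | supp_star)
    return sparseness, rre, srr
```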
Fig. 3. The zero gap ũ₀ (left) and the AGAS parameters δ (middle) and ν (right) in each iteration of the CD algorithms. The three rows correspond to LSP, MCP and GP, respectively. The figures show boxplots over the 100 trials of the CD algorithms. The right column actually shows boxplots of the upper bound for ν in Eq. (24).
Fig. 4. The sparseness (left), RRE (middle) and SRR (right) for the regularizers (LSP, MCP, GP and ℓ1-norm). The true parameters, the design matrices and the noises are generated in the same way as in Section 6.1 except that p = 10 000, s = 100 and n varies from s to 15s. The parameter θ of the regularizers is set to 10⁻⁷. We use OMP [31] to generate an initial solution for CD with at most s non-zero components. The parameter κ = 0.1 and the stopping criterion of CD are the same as in Section 6.1. Every data point is the average of 100 trials of the CD methods. For each regularizer and each n, we select λ from {10⁻⁶, 10⁻⁵, …, 10} so as to obtain the smallest average RRE over the 100 trials.
240
Fig. 5. Comparison of image recovery. (a) The original image. (b), (c) The estimated image by LSP and 1-norm with highest PSNRs in (d). (d) The PSNRs of LSP and 1-norm
for different values of . The results of LSP and 1-norm are obtained by CD ( 0:001, 0 0) and FISTA respectively.
Fig. 5(b) and (c) illustrates the recovered images by LSP and 1norm with the best PSNRs. The image produced by LSP is of better
quality than the one created by 1-norm.
7. Conclusion

This paper establishes a theory for sparse estimation with non-convex regularized regression. The framework of non-convex regularizers in this paper is general and especially suitable for sharp concave regularizers. For proper sharp concave regularizers, both global solutions and AGAS solutions give good parameter estimation and sparseness estimation. The proposed SE-based estimation conditions are weaker than those of the ℓ1-norm. To obtain AGAS solutions, we give a prediction-error-based guarantee for the AG property and prove that CD methods yield the desired AGAS solutions.

Our theory explains the improvement of sparse estimation when moving from ℓ1-regularization to non-convex regularization. Our work can serve as a guideline for further study on designing regularizers and developing algorithms for non-convex regularization.
8. Technical proofs
We first provide two lemmas. The first is Lemma 1 of Zhang and Zhang [38].
For the result on μ* of LSP in Table 1, Section 8.2 uses the concavity bound r(t) ≥ t ṙ(t) to show that μ* ≤ λ√(2 log(1 + 2/(θλ²))).
Lemma 2.
1. r(u) is subadditive, i.e., r(u₁ + u₂) ≤ r(u₁) + r(u₂) for all u₁, u₂ ≥ 0.
2. For any u > 0 and any d ∈ ∂r(u), ṙ(0⁺) ≥ ṙ(u⁻) ≥ d ≥ ṙ(u⁺) ≥ 0.

Proof. 1. Since r(u) is concave and r(0) = 0, for all u₁, u₂ ≥ 0 we have r(u₁) ≥ (u₁/(u₁+u₂)) r(u₁+u₂) + (u₂/(u₁+u₂)) r(0) and r(u₂) ≥ (u₂/(u₁+u₂)) r(u₁+u₂) + (u₁/(u₁+u₂)) r(0). Summing the two inequalities gives r(u₁+u₂) ≤ r(u₁) + r(u₂).
2. Invoking the subadditivity, we have (r(u + Δu) − r(u))/Δu ≤ r(Δu)/Δu for u > 0 and Δu > 0. Letting Δu → 0 gives ṙ(0⁺) ≥ ṙ(u⁺). The concavity of r(u) yields (r(u) − r(u − Δu))/Δu ≥ (r(u + Δu) − r(u))/Δu for Δu > 0. From the definition of the supergradient of a concave function, r(u + Δu) − r(u) ≤ d·Δu and r(u − Δu) − r(u) ≤ −d·Δu for any Δu > 0, which gives ṙ(u⁻) ≥ d ≥ ṙ(u⁺) ≥ 0. □
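Lemma 2(1) is easy to sanity-check numerically; the following snippet (ours) tests subadditivity for LSP at random points:

```python
import numpy as np

# Numerical check of Lemma 2(1) for LSP with theta = 0.1 (our choice):
# a concave r with r(0) = 0 satisfies r(u1 + u2) <= r(u1) + r(u2).
theta = 0.1
r = lambda u: np.log1p(u / theta)
u1, u2 = np.random.default_rng(1).uniform(0.0, 10.0, size=(2, 10000))
assert np.all(r(u1 + u2) <= r(u1) + r(u2) + 1e-12)
```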
For a global solution, a component β̂ᵢ ∈ (0, u₀) would give n β̂ᵢ² ≥ 2n|β̂ᵢ| λ ṙ(|β̂ᵢ|), which contradicts the sharp concavity condition; hence global solutions have no components in (0, u₀).

Lemma 5. Under the conditions of Theorem 3,

max{‖Δ_{T₀}‖₂, ‖Δ_{T₀∪T₁}‖₂} ≤ ((1 + √2)(ρ₊(2t) − ρ₋(2t)) / (2ρ₋(2t))) Σ_{k≥2} ‖Δ_{T_k}‖₂ + √t (1+ε) λ ṙ(0) / ρ₋(2t).  (27)
Proof. By Lemma 1, we have ‖XᵀXΔ/n‖_∞ ≤ ‖Xᵀ(Xβ̂ − y)/n‖_∞ + ‖Xᵀe/n‖_∞ ≤ (1 + ε)λṙ(0). We modify Eq. (12) in Foucart and Lai [15] to the following inequality:

(1/n)⟨XΔ_{T₀∪T₁}, XΔ⟩ ≤ ‖Δ_{T₀∪T₁}‖₁ ‖XᵀXΔ/n‖_∞ ≤ √t (1+ε) λ ṙ(0) (‖Δ_{T₀}‖₂ + ‖Δ_{T₁}‖₂).

Then, following the proof of Theorem 3.1 in Foucart and Lai [15], Eq. (27) follows. □
Let Δ = β̂ − β*. By Lemma 3 in Section 8.5, we have RE_R(ξ; S)‖Δ‖₂² ≤ ‖XΔ‖₂²/n and RIF_R^q(ξ; S)‖Δ‖_q ≤ |S|^{1/q}‖XᵀXΔ‖_∞/n. Invoking ε-null consistency, we have eᵀXΔ/n ≤ ε(‖XΔ‖₂²/(2n) + R(Δ)). Then

0 ≥ L(β̂) − L(β*) + R(β̂) − R(β*)
 ≥ ‖XΔ‖₂²/(2n) − eᵀXΔ/n − R(Δ_S) + R(Δ_S̄)
 ≥ (1−ε)‖XΔ‖₂²/(2n) − (1+ε)R(Δ_S)
 ≥ (1−ε)‖Δ‖₂² RE_R(ξ; S)/2 − (1+ε)√s λ ṙ(0) ‖Δ‖₂.

Hence, we obtain ‖Δ‖₂ ≤ 2η√s λ ṙ(0)/RE_R(ξ; S). By Lemma 1, ‖XᵀXΔ/n‖_∞ ≤ ‖Xᵀ(Xβ̂ − y)/n‖_∞ + ‖Xᵀe/n‖_∞ ≤ (1+ε)λṙ(0), and the analogous bound based on RIF follows from its definition.
The proof needs the following two lemmas, which are extensions of Lemmas 3 and 5. The notations are the same as in Section 8.5 except that Δ = β̃ − β*.

Lemma 6. Suppose that β̃ is a δ-approximate global solution and the regularized regression satisfies the ε-null consistency condition. Then

‖XΔ‖₂²/(2n) + R(Δ_S̄) ≤ η R(Δ_S) + δ/(1−ε).  (30)
30
where
p
1 2
C2
:
2H r
q1
28
n
By the denition of 0 in Eq. (8) and supp^ a supp , there
exists j satisfying jj jZ 0 , which implies RT Z r0 . Since
r 1 u=s=r 1 u=t is a non-decreasing function of u, we have that
r
r 1 RT =s
r 1 r0 =s
t
H r 0 ; ; s; t:
Z
s
r 1 RT =t r 1 r0 =t
r 1 RS =t rr 1 RT =t r C 2 1 ;
Then, following the proof of Theorem 3.1 in Foucart and Lai [15],
Eq. (27) follows.
31
p
n
Hence, we have r t C 2 1 by Lemmas 3 and 4. Invoking
Lemma 5 and 2 r T 2 T 1 2 , the conclusion follows
with some algebra.
Similarly,

δ ≥ L(β̃) − L(β*) + R(β̃) − R(β*)
 ≥ ‖XΔ‖₂²/(2n) − eᵀXΔ/n − R(Δ_S) + R(Δ_S̄)
 ≥ (1−ε)(‖XΔ‖₂²/(2n) + R(Δ_S̄)) − (1+ε)R(Δ_S).

Hence, the conclusion of Lemma 6 follows. □

Lemma 7. Under the conditions of Theorem 6,

max{‖Δ_{T₀}‖₂, ‖Δ_{T₀∪T₁}‖₂} ≤ ((1 + √2)(ρ₊(2t) − ρ₋(2t)) / (2ρ₋(2t))) Σ_{k≥2} ‖Δ_{T_k}‖₂ + √t λ̃ / ρ₋(2t).  (32)

Proof. By Lemma 1, ‖XᵀXΔ/n‖_∞ ≤ ‖Xᵀ(Xβ̃ − y)/n‖_∞ + ‖Xᵀe/n‖_∞ ≤ (1+ε)λṙ(0) + ν = λ̃. Eq. (32) then follows by the same analysis as in the proof of Lemma 5. □
Since r(u) is non-decreasing and concave, r⁻¹(u) is convex. Using this convexity (Eqs. (33) and (34)) and arguing as in the proof of Theorem 3 (Eq. (35)), one bounds r⁻¹(R(Δ_S̄)/(λ(t−1))) by c₄λ̃ + c₅r⁻¹(δ/(1−ε))/(t−1) (Eq. (36)), with constants c₄ and c₅ given in Eqs. (37) and (38) in terms of G_r(θ̃₀; ε, s, t). Hence ‖Δ_T̄‖ is bounded accordingly and, invoking Lemma 7, it follows with some algebra that ‖Δ‖₂ ≤ C₄λ̃ + C₅r⁻¹(δ/(1−ε)), where C₄ and C₅ are the constants of Eqs. (39) and (40).

The zero-gap and sparseness arguments for AGAS solutions proceed analogously: bounding L(β̃) − L(β*) + R(β̃) − R(β*) by terms of order θ₀(s + s₀) and ‖XΔ‖₂/√n yields the sparseness bound of Theorem 7.
Proof of Theorem 10. Write Δ^{(k)} = β^{(k)} − β^{(k−1)} and let dᵢ^{(k)} ∈ xᵢᵀ(Xz^{(k,i)} − y)/n + λ∂r(|βᵢ^{(k)}|) denote a subgradient at the i-th update, where z^{(k,i)} = (β₁^{(k)}, …, βᵢ^{(k)}, β_{i+1}^{(k−1)}, …, β_p^{(k−1)}). Since each scalar problem (22) is solved exactly,

F(z^{(k,i)}) ≤ F(z^{(k,i−1)}) − (κ/2)(βᵢ^{(k)} − βᵢ^{(k−1)})².  (41)

Summing Eq. (41) over i = 1, …, p gives

F(β^{(k)}) ≤ F(β^{(k−1)}) − (κ/2)‖Δ^{(k)}‖₂².  (42)

In particular, F(β^{(k)}) − F* is non-negative and non-increasing, and before the stopping rule (23) is triggered every iteration decreases F by more than κτ²/2. Hence

k ≤ 1 + 2(F(β^{(0)}) − F*)/(κτ²) = 1 + 2p(√p + p)²(F(β^{(0)}) − F*)/(κν²).

It remains to bound the stationarity of the output. The optimality condition of the i-th scalar problem gives dᵢ^{(k)} = −κ(βᵢ^{(k)} − βᵢ^{(k−1)}) − xᵢᵀX(β^{(k)} − z^{(k,i)})/n for a suitable choice of subgradient, so

|dᵢ^{(k)}| ≤ κ|Δᵢ^{(k)}| + ρ̄ ‖Δ^{(k)}‖₁.  (43)

Summing over i and using ‖Δ^{(k)}‖₁ ≤ √p‖Δ^{(k)}‖₂ shows that ‖d^{(k)}‖₂ ≤ √p(√p + p)‖Δ^{(k)}‖₂, so the stopping rule (23) guarantees that the output is a ν-AS solution with ν ≤ √p(√p + p)τ, i.e., Eq. (24). □
Acknowledgements
This work is supported by 973 Program (2013CB329503),
NSFC (Grant No. 91120301) and Tsinghua National Laboratory for
Information Science and Technology (TNList) Cross-discipline
Foundation.
References
[27] X. Shen, W. Pan, Y. Zhu, Likelihood-based selection and sharp parameter estimation, J. Am. Stat. Assoc. 107 (497) (2012) 223–232.
[28] X. Shen, W. Pan, Y. Zhu, H. Zhou, On constrained and regularized high-dimensional regression, Ann. Inst. Stat. Math. (2013) 1–26.
[29] J. Shi, X. Ren, G. Dai, J. Wang, Z. Zhang, A non-convex relaxation approach to sparse dictionary learning, in: CVPR 2011, 2011.
[30] E.Y. Sidky, R. Chartrand, X. Pan, Image reconstruction from few views by non-convex optimization, in: Nuclear Science Symposium Conference Record (NSS'07), vol. 5, IEEE, 2007, pp. 3526–3530.
[31] J.A. Tropp, A.C. Gilbert, Signal recovery from random measurements via orthogonal matching pursuit, IEEE Trans. Inf. Theory 53 (12) (2007) 4655–4666.
[32] J. Trzasko, A. Manduca, Relaxed conditions for sparse signal recovery with general concave priors, IEEE Trans. Signal Process. 57 (11) (2009) 4347–4354.
[33] J. Trzasko, A. Manduca, E. Borisch, Sparse MRI reconstruction via multiscale l0-continuation, in: IEEE/SP 14th Workshop on Statistical Signal Processing (SSP'07), 2007.
[34] J. Trzasko, A. Manduca, E. Borisch, Highly undersampled magnetic resonance image reconstruction via homotopic l0-minimization, IEEE Trans. Med. Imaging 28 (1) (2009) 106–121.
[35] J.D. Trzasko, A. Manduca, A fixed point method for homotopic l0-minimization with application to MR image recovery, in: Medical Imaging, International Society for Optics and Photonics, 2008.
[36] F. Ye, C.H. Zhang, Rate minimaxity of the Lasso and Dantzig selector for the lq loss in lr balls, J. Mach. Learn. Res. 11 (2010) 3519–3540.
[37] C.H. Zhang, Nearly unbiased variable selection under minimax concave penalty, Ann. Stat. 38 (2) (2010) 894–942.
[38] C.H. Zhang, T. Zhang, A general theory of concave regularization for high-dimensional sparse estimation problems, Stat. Sci., 2012.
[39] T. Zhang, Analysis of multi-stage convex relaxation for sparse regularization, J. Mach. Learn. Res. 11 (2010) 1081–1107.
[40] T. Zhang, Sparse recovery with orthogonal matching pursuit under RIP, IEEE Trans. Inf. Theory 57 (9) (2011) 6215–6221.
[41] T. Zhang, Multi-stage convex relaxation for feature selection, Bernoulli 19 (5B) (2013) 2277–2293.
[42] H. Zou, The adaptive Lasso and its oracle properties, J. Am. Stat. Assoc. 101 (476) (2006) 1418–1429.
Zheng Pan received his B.E. degree in Automation from Tsinghua University, Beijing, China, in 2009. He is currently a Ph.D. student at the State Key Laboratory of Intelligent Technology and Systems, Department of Automation, Tsinghua University, Beijing, China. His research interests include machine learning and data mining.

Changshui Zhang received his B.S. degree from Peking University, Beijing, China, in 1986, and his Ph.D. degree from Tsinghua University, Beijing, China, in 1992. He is currently a Professor in the Department of Automation, Tsinghua University. He is an Editorial Board Member of Pattern Recognition. His interests include artificial intelligence, image processing, pattern recognition, machine learning, and evolutionary computation.