
Signal Processing 92 (2012) 999–1009

Contents lists available at SciVerse ScienceDirect

Signal Processing
journal homepage: www.elsevier.com/locate/sigpro

A gradient-based alternating minimization approach for optimization of the measurement matrix in compressive sensing

Vahid Abolghasemi*, Saideh Ferdowsi, Saeid Sanei
Faculty of Engineering and Physical Sciences, University of Surrey, Guildford GU2 7XH, Surrey, UK

* Corresponding author. Tel.: +44 7529343857. E-mail addresses: v.abolghasemi@surrey.ac.uk, vabolghasemi@ieee.org (V. Abolghasemi).

Article info — Article history: Received 7 December 2010; received in revised form 27 July 2011; accepted 14 October 2011; available online 25 October 2011.

Keywords: Basis pursuit (BP); Compressive sensing (CS); Gradient descent; Orthogonal matching pursuit (OMP); Random Gaussian matrices; Sparse representation.

Abstract — In this paper the problem of optimization of the measurement matrix in the compressive (also called compressed) sensing framework is addressed. In compressed sensing, a measurement matrix that has a small coherence with the sparsifying dictionary (or basis) is of interest. Random measurement matrices have been used so far, since they present small coherence with almost any sparsifying dictionary. However, it has recently been shown that optimizing the measurement matrix toward decreasing the coherence is possible and can improve the performance. Based on this conclusion, we propose here an alternating minimization approach for this purpose, which is a variant of Grassmannian frame design modified by a gradient-based technique. The objective is to optimize an initially random measurement matrix into a matrix which presents a smaller coherence than the initial one. We conducted several experiments to measure the performance of the proposed method and compare it with those of the existing approaches. The results are encouraging and indicate improved reconstruction quality when utilizing the proposed method.

doi:10.1016/j.sigpro.2011.10.012. © 2011 Elsevier B.V. All rights reserved.

1. Introduction

The sparse signal recovery problem has been widely researched within the signal and image processing communities in recent years. Sparse signals/images have few non-zero components. A signal or image with an underlying sparse structure can be more efficiently and advantageously compressed, stored or transmitted than normal non-sparse signals. Such an optimum compression requires a novel sampling scheme, followed by an appropriate reconstruction method, different from conventional techniques. Given the compressed measurements y and the sampling pattern D, the main challenge is to solve the following inverse problem:

\[ \min_{\alpha} \|\alpha\|_0 \quad \mathrm{s.t.} \quad y = D\alpha, \qquad (1) \]

where y ∈ ℝ^p and the rectangular matrix D ∈ ℝ^{p×m} with p < m are known, and the sparse signal α ∈ ℝ^m is to be found. The ℓ₀-norm ‖·‖₀ counts the number of non-zero components and is a measure of the sparsity degree of the signal. Note that in most cases the original signal is sparse in a different domain (e.g. Fourier, wavelet, or discrete cosine transform (DCT)). In such cases the notation changes to y = Φx = ΦΨα = Dα, where Φ ∈ ℝ^{p×n} and Ψ ∈ ℝ^{n×m} are called the measurement (sensing) matrix and the sparsifying transform, respectively. Note that Φ is always an overcomplete matrix (p ≪ n) and Ψ can either be overcomplete (n < m), called a dictionary, or complete (n = m), which is called a basis.

One of the recent emerging fields having a direct connection to the sparse recovery problem is compressed sensing (CS) [1,2]. In CS we not only attempt to find a sparse solution to (1), but also discuss the possible ways to generate Φ so as to compressively acquire the input signal and yet not violate the uniqueness conditions for sparse recovery in (1). In other words, CS targets Shannon's theorem [3] and takes advantage of sparsity to decrease the sampling rate.

The major difficulty in solving (1) is its non-convexity. Its solution requires a combinatorial search through all possible sparse α, which is not feasible.
However, several approaches have been reported that provide a solution to the sparse recovery problem, by either relaxing it to

\[ \min_{\alpha} \|\alpha\|_1 \quad \mathrm{s.t.} \quad y = D\alpha, \qquad (2) \]

where ‖α‖₁ = Σᵢ |αᵢ| refers to the ℓ₁-norm, or using a greedy strategy. Note that in addition to (1) and (2) there are other formulations as well, which are not presented here. Some of the well known recovery methods are basis pursuit (BP) [4], orthogonal matching pursuit (OMP) [5], least angle regression (LARS) [6], and the least absolute shrinkage and selection operator (LASSO) [7].

Although sparsity is an essential requirement for the signals in the CS framework, it is not sufficient for a signal to be successfully reconstructed. Thus far, it has been realized that in order to attain a reasonably small p and yet have a stable reconstruction, Φ is required to have specific properties [8]; the one focused on here is mutual coherence. In CS theory, a suitable sensing matrix Φ is desired to be as incoherent as possible with the sparsifying matrix Ψ. In other words, the correlation between any distinct pair of columns in the equivalent matrix D should be very small, which means having a nearly orthogonal matrix D. It has been shown that random matrices with Gaussian or Bernoulli distributions are appropriate choices for Φ [9], as they satisfy this property with high probability and can be generated non-adaptively. Although Φ is normally selected to be random, some recent works have been reported in which the authors attempt to find an optimal structure for Φ (explicitly or implicitly) with the hope of increasing the reconstruction quality and taking fewer measurements [10–15]. Elad [10] attempts to iteratively decrease the average mutual coherence using a shrinkage operation followed by a singular value decomposition (SVD) step. In [11] the authors apply a kind of non-uniform sampling by segmenting the input signal and taking samples at different rates from each segment. In [12], which is an application to magnetic resonance imaging (MRI), the authors define an incoherence criterion based on the point spread function (PSF) and propose a Monte Carlo scheme for random incoherent sampling of this type of data. Wang et al. [13] propose a variable density sampling strategy by exploiting prior information about the statistical distributions of natural images in the wavelet domain. Their proposed method is computationally efficient and can be applied to several transform domains. In another work [14], Wang et al. propose to generate colored random projections using an adaptive scheme. Duarte-Carvajalino et al. [15] take advantage of an eigenvalue decomposition process followed by a KSVD-based algorithm (see also [16]) to optimize Φ and learn the dictionary Ψ, respectively. The results of all previous methods reveal improved performance, which is evidence of the benefits that optimal sampling can provide for this framework.

Our contribution in this paper is to develop a novel approach for optimizing the measurement matrix. We propose a gradient-based method to optimize a randomly selected measurement matrix so as to decrease its coherence with the given sparsifying matrix. The proposed algorithm, which is related to the design of Grassmannian frames [17–19], aims at minimizing a cost function in an alternating scheme. The results of our experiments and simulations confirm the ability of the proposed method to decrease the coherence, leading to higher reconstruction quality.

The rest of the paper is organized as follows. In the next section, we first define the criteria that can be used for measuring the coherence of a matrix. Then, the proposed approach for optimization of the measurement matrix is described. In Section 3 an adaptive scheme to choose the optimum stepsize is proposed. In Section 4, some simulations are given to examine the proposed method from practical perspectives. The discussion and conclusion are presented in Sections 5 and 6, respectively.

2. Measurement matrix optimization

2.1. Problem formulation

Following the notation stated earlier, assume α has s ≪ m non-zero samples and is called s-sparse. A successful CS requires Φ and Ψ to be incoherent. A good measure of coherence between these two matrices (or, equivalently, between the columns of D) can be obtained by referring to the definition of mutual coherence [20,21]. Here, we quote this definition as presented by Elad [10]:

Definition 1. For a matrix D, the mutual coherence is defined as the maximum absolute value of the normalized inner products between all distinct columns of D, which can be described as

\[ \mu_{mx}(D) = \max_{i \neq j,\ 1 \le i,j \le m} \frac{|d_i^T d_j|}{\|d_i\|_2 \, \|d_j\|_2}. \qquad (3) \]

A desired Φ has a small μ_mx with respect to Ψ. An alternative description of μ_mx is obtained by referring to the corresponding Gram matrix G̃ = D̃ᵀD̃, in which D̃ is the column-normalized version of D. We define two coherence measures based on this matrix, which are used to study the performance of the different methods later in the Results section. These parameters are the maximum and the average of the off-diagonal elements of G̃, denoted respectively as

\[ \mu_{mx} = \max_{i \neq j,\ 1 \le i,j \le m} |\tilde{g}_{ij}|, \qquad (4) \]

\[ \mu_{av} = \frac{\sum_{i \neq j} |\tilde{g}_{ij}|}{m(m-1)}. \qquad (5) \]
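To make these measures concrete, the following is a minimal NumPy sketch (ours, not from the paper; names are illustrative) that computes μ_mx and μ_av exactly as in (4) and (5), by column-normalizing D and inspecting the off-diagonal entries of the resulting Gram matrix:

    import numpy as np

    def coherence_measures(D):
        """Return (mu_mx, mu_av) of D, as defined in Eqs. (4) and (5)."""
        D_tilde = D / np.linalg.norm(D, axis=0, keepdims=True)  # column-normalize
        G = D_tilde.T @ D_tilde                                 # Gram matrix
        m = G.shape[0]
        off_diag = np.abs(G[~np.eye(m, dtype=bool)])            # the m(m-1) off-diagonal entries
        return off_diag.max(), off_diag.sum() / (m * (m - 1))

    # Example with a random Gaussian measurement matrix and dictionary:
    rng = np.random.default_rng(0)
    Phi = rng.standard_normal((30, 200))    # p x n measurement matrix
    Psi = rng.standard_normal((200, 400))   # n x m dictionary
    mu_mx, mu_av = coherence_measures(Phi @ Psi)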
There are important reasons which motivate us to search for measurement matrices with small coherence. The first reason is the influence that the mutual coherence has on the achievable sparsity bound for a given signal. A detailed discussion of this fact has been documented in [21,22]. In brief, if we suppose that the following inequality holds in a noiseless setting:

\[ \|\alpha\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu_{mx}(D)}\right) \qquad (6) \]

then α is necessarily the sparsest signal when y and D are known. More importantly, both BP and OMP are guaranteed to successfully recover it [21,22]. For example, if μ_mx(D) = 0.2, then (6) guarantees uniqueness and recovery of any solution with at most two non-zeros, since ½(1 + 1/0.2) = 3. This implies that as long as the mutual coherence is small, a successful reconstruction is achievable for a wider range of sparsity degrees.
Another significant requirement is the restricted isometry property (RIP) [9,20], which is defined with respect to an isometry constant 0 < δ ≤ 1. For an s-sparse signal α and any integer t ≤ s, the isometry constant of D is the smallest δ_s that satisfies

\[ (1 - \delta_s)\,\|\alpha_t\|_2^2 \;\le\; \|D\alpha_t\|_2^2 \;\le\; (1 + \delta_s)\,\|\alpha_t\|_2^2. \qquad (7) \]

This property implies that for a proper isometry constant (the ideal δ is 0), RIP ensures that any subset of columns of D with cardinality less than s is nearly orthogonal. This orthogonality can be related to the coherence of the columns; less coherence corresponds to a higher degree of orthogonality. This results in a better CS behavior and guarantees the identifiability of the original signal by both OMP and BP [4]. Note that in noisy circumstances more severe conditions are applied to ensure the stability of the reconstruction, which are not dealt with here.

2.2. The proposed method

Recall from the previous section that an incoherent D has a small mutual coherence. In other words, the corresponding Gram matrix has its absolute off-diagonal elements close to zero and its diagonal elements equal to one. This indicates a Gram matrix which is equal to the identity matrix. In a previous work [23], we proposed a method to minimize the difference between the Gram matrix and the identity matrix in the sense of the Frobenius norm:

\[ \min_{G} \|G - I\|_F^2, \qquad (8) \]

where ‖·‖²_F denotes the squared Frobenius norm and I is the identity matrix. The proposed method in [23] is able to optimize an initial Φ into a less coherent matrix. However, further study revealed that choosing the identity matrix in the above objective function is a very strict constraint: a Gram matrix equal to I is only achievable for p = m, which is not the case here. We believe that we are able to optimize Φ not with respect to an identity matrix, but with respect to a matrix which is updated inside the algorithm accordingly. We shortly propose a method in this paper which follows this idea.

Another interesting description of matrices with small coherence is inferred from equiangular tight frame (ETF) design, which is a special type of Grassmannian frame [18,19]. A tight frame is a generalization of an orthonormal basis. In orthonormal bases all vectors have unit norm and they are perpendicular to each other. In more general terms, every orthonormal basis is equiangular [18], and that causes each pair of distinct vectors to have zero inner product. An extended and more general type of orthonormal basis is called a tight frame, which contains vectors maximally spread in space. Now, we study ETFs in detail. Consider m unit-norm vectors in a p-dimensional space, where p ≤ m. An ETF is an arrangement of such vectors that have the same—normally minimum—absolute inner product with respect to each other (in an orthonormal basis this value is zero, equivalent to a 90° Euclidean angle). Suppose we arrange these vectors as the columns of a matrix D̃. The inner products between all vectors are simply obtained by multiplying D̃ by its transpose: D̃ᵀD̃. This is equivalent to the Gram matrix G̃ which has already been described. Note that the diagonal elements of G̃ are not of interest, since they indicate the Euclidean norms of the columns of D̃. In particular, we are only interested in the off-diagonal elements of G̃, which correspond to the inner products between any two distinct vectors, defined as

\[ \tilde{g}_{ij} = \cos(\theta_{ij}) := \langle \tilde{d}_i, \tilde{d}_j \rangle = \tilde{d}_i^T \tilde{d}_j \quad \text{for } i \neq j, \qquad (9) \]

where ⟨·,·⟩ denotes the vector inner product. It is important to note that the minimum possible absolute off-diagonals of G̃ are zero, which can occur only if D̃ is orthonormal with p = m. This is not the case here, since we only deal with p < m. It can be shown that the minimum absolute inner product for each vector pair in an equiangular tight frame is [18]:

\[ \mu_E = \sqrt{\frac{m - p}{p(m - 1)}}. \qquad (10) \]

In fact, (10) is also the minimum achievable mutual coherence of D. We now summarize our discussion in the following definition [18]:

Definition 2. A matrix D is called an equiangular tight frame (ETF) if its corresponding Gram matrix has unit diagonal elements and off-diagonals equal to ±√((m−p)/(p(m−1))):

\[ \tilde{g}_{ij} = \begin{cases} \pm\sqrt{\dfrac{m-p}{p(m-1)}} & \text{for } i \neq j, \\[4pt] 1 & \text{otherwise.} \end{cases} \qquad (11) \]

In general, designing an exact ETF is a complicated procedure [18], and such frames do not exist for arbitrary p and m—it is known that a real equiangular tight frame can exist only if m ≤ p(p+1)/2 [18]. Nonetheless, there are established methods for finding ETFs, such as those in [17,18]. The problem we face here is more challenging: we require a measurement matrix Φ, for any arbitrary p < m, having small coherence with a known fixed dictionary/basis Ψ. In other words, D = ΦΨ is desired to be as close as possible (and not exactly equal) to an ETF, where Φ, and potentially Ψ, are overcomplete matrices. What follows is the proposed approach to find such a measurement matrix.

Consider a convex set H_{μE} which contains the ideal ETFs [18]:

\[ \mathcal{H}_{\mu_E} = \left\{ H \in \mathbb{R}^{m \times m} : H = H^T,\ \mathrm{diag}(H) = \mathbf{1},\ \max_{i \neq j} |h_{ij}| \le \mu_E \right\}, \qquad (12) \]

where 1 is a column vector of all ones. As mentioned earlier, we cannot guarantee to find a Φ leading to a D with coherence exactly equal to μ_E. In other words, it is not always possible to find a Φ with a Gram matrix lying in the set H_{μE}, for two reasons: (1) we deal with arbitrary dimensions, not necessarily m ≤ p(p+1)/2, and (2) we are constrained by the fixed matrix Ψ. As a solution, we relax this requirement and attempt to optimize a random Φ to have a Gram matrix lying in a set H_μ with μ ≥ μ_E. In order to find such a matrix, we define the following minimization problem, which needs to be solved alternately:

\[ \min_{\Phi,\ H \in \mathcal{H}_\mu} \|\Psi^T \Phi^T \Phi \Psi - H\|_F^2. \qquad (13) \]
The reason for introducing μ is to emphasize that we can use the proposed method to find a matrix with a coherence close to any arbitrary μ, not necessarily μ_E. In the following subsections, we propose an alternating minimization algorithm which iteratively minimizes (13) to find the desired Φ (a partially similar and independent work in parametric dictionary design has also been reported in [24]). The proposed algorithm attempts to force an upper limit on the off-diagonal elements of ΨᵀΦᵀΦΨ and to find the desired Φ accordingly.

2.2.1. Updating Φ: a gradient descent approach

Starting with a Φ constructed from random i.i.d. (independent and identically distributed) columns, we apply a gradient descent method to minimize (13), while keeping H fixed. We denote the ij-th element of Φ by φ_ij and define the cost function as

\[ \mathcal{J} = \|\Psi^T \Phi^T \Phi \Psi - H\|_F^2. \qquad (14) \]

The minimization method can be described as an iterative process such that φ_ij ← φ_ij − η ∂J/∂φ_ij, where η > 0 is the iteration stepsize. In order to apply the proposed method to all elements of the matrix Φ, we compute the gradient of J with respect to Φ (denoted ∇_Φ J = ∂J/∂Φ), where H is considered fixed:

\[ \frac{\partial \mathcal{J}}{\partial \Phi} = \frac{\partial}{\partial \Phi}\, \mathrm{Tr}\{(\Psi^T \Phi^T \Phi \Psi - H)^T (\Psi^T \Phi^T \Phi \Psi - H)\}. \qquad (15) \]

Here, Tr{·} denotes the matrix trace operation. Considering the matrix derivative rules in matrix computation [25], we simplify (15) to

\[ \frac{\partial \mathcal{J}}{\partial \Phi} = 4\, \Phi \Psi (\Psi^T \Phi^T \Phi \Psi - H) \Psi^T. \qquad (16) \]

The full update equation would then be

\[ \Phi_{(k+1)} = \Phi_{(k)} - \beta\, \Phi_{(k)} \Psi (\Psi^T \Phi_{(k)}^T \Phi_{(k)} \Psi - H) \Psi^T, \qquad (17) \]

where k is the iteration index and β = 4η. The scalar stepsize β can either be a fixed value or be adaptively updated during the iterations. We will discuss such an adaptive stepsize shortly. At the k-th iteration, after updating the measurement matrix using (17), we keep Φ fixed and update H to minimize (13). We next describe the procedure for updating H.

2.2.2. Updating H: a component-wise approach

In order to leave no highly coherent element between Φ and Ψ, an upper limit is imposed on the off-diagonal elements of ΨᵀΦᵀΦΨ. This is done while updating H, based on a given threshold μ. Minimization of (13) with respect to H, which is a Hermitian matrix, is known as a matrix nearness problem and is solved element-wise [17,18]. In fact, the closest H to ΨᵀΦᵀΦΨ, in terms of the Frobenius norm, has a similar sign pattern. On the other hand, it is known that the nearest element of a set to a point (in terms of the Frobenius norm) can be obtained by projecting that point onto the set, which is a unique solution [24]. Here also H_μ is a convex set, and therefore the solution is straightforward: it can simply be described as a matrix having unit diagonal elements and off-diagonals satisfying

\[ \forall i,j,\ i \neq j:\quad h_{ij} = \begin{cases} g_{ij} & \text{if } |g_{ij}| \le \mu, \\ \mu \cdot \mathrm{sign}(g_{ij}) & \text{otherwise,} \end{cases} \qquad (18) \]

where g_ij is the ij-th element of G = ΨᵀΦᵀΦΨ and μ is the desired mutual coherence defined by the user. Such an element-wise scheme for updating H assures that in the next update of Φ, the elements with higher coherence will be more intensively constrained. The pseudocode of the algorithm appears in Algorithm 1.

Algorithm 1. The proposed optimization method.

    Input: Sparse representation matrix Ψ, stepsize β, threshold μ, and numbers of iterations L, K.
    Output: Measurement matrix Φ.
    begin
      Initialize Φ to a random matrix.
      Initialize H to an identity matrix.
      for l = 1 to L do
        G ← ΨᵀΦᵀΦΨ
        Update H:
          for all i ≠ j do
            h_ij = g_ij if |g_ij| ≤ μ, and h_ij = μ·sign(g_ij) otherwise
          end
        Update Φ:
          for k = 1 to K do
            Φ ← Φ − βΦΨ(ΨᵀΦᵀΦΨ − H)Ψᵀ
          end
      end
    end

We also note that having unit diagonal elements in H can be thought of as a normalization constraint on the columns of D = ΦΨ when solving (13). Therefore, normalization of the columns of D is not applied in Algorithm 1. This also has the advantage of a lower computational cost. However, in order to be consistent in presenting the results, we always compute the coherence based on the column-normalized D, denoted by D̃, and accordingly G̃ = D̃ᵀD̃, as in (4) and (5).

The proposed algorithm is repeated for a number of iterations, where Φ and H are alternately updated to reduce (13). Regarding the convergence of the proposed method, we mention that constructing ETFs using alternating minimization methods has a global minimum subject to an appropriate initialization [17,18]. It has been proven in [17,18] that there exists at least one accumulation point for min_{G,H} ‖G − H‖²_F, and the minimization leads to a global minimum after an infinite number of iterations (more discussion on the convergence of alternating minimization methods can be found in [17,18]). The proposed method, however, uses a gradient descent step to update the measurement matrix, which guarantees gradual reduction of (13) and convergence to a local minimum. Regarding the stopping criterion, we used a fixed number of iterations. However, other stopping criteria can also be used: for instance, one may continue the algorithm until reaching a desired coherence, or until the value of the cost function does not change significantly between two successive iterations.
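For concreteness, here is a minimal NumPy rendering of Algorithm 1 (a sketch under our own naming conventions, not the authors' released code). The H-update is the element-wise projection (18), which reduces to clipping the off-diagonals of G at ±μ, and the Φ-update is the gradient step (17):

    import numpy as np

    def optimize_measurement_matrix(Psi, p, mu, beta=0.01, L=100, K=50, seed=0):
        """Alternating minimization of (13), following Algorithm 1.

        Psi : n x m sparsifying dictionary (fixed)
        p   : number of measurements (rows of Phi)
        mu  : coherence threshold, ideally close to the ETF bound mu_E
        beta: gradient stepsize; L, K: outer/inner iteration counts
        """
        n, m = Psi.shape
        rng = np.random.default_rng(seed)
        Phi = rng.standard_normal((p, n))   # random i.i.d. initialization
        for _ in range(L):
            # Update H, Eq. (18): clip off-diagonals of G at +/- mu, unit diagonal.
            G = Psi.T @ Phi.T @ Phi @ Psi
            H = np.clip(G, -mu, mu)
            np.fill_diagonal(H, 1.0)
            # Update Phi, Eq. (17): K gradient descent steps with H fixed.
            for _ in range(K):
                G = Psi.T @ Phi.T @ Phi @ Psi
                Phi -= beta * Phi @ Psi @ (G - H) @ Psi.T
        return Phi

Note that, consistent with the remark above, the columns of D = ΦΨ are not normalized inside the loop; normalization is only applied afterwards when reporting the coherence via (4) and (5).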
Next, we describe a procedure for choosing an adaptive stepsize, which improves the convergence behavior of the proposed method compared with using a fixed stepsize.

3. Adaptive stepsize selection

Although we mentioned that a fixed stepsize can be used in the proposed algorithm, it may not be an optimum choice. Hence, we describe a procedure for selecting an adaptive stepsize to achieve higher performance and faster initial adaptation. An adaptive stepsize varies during the iterations of the algorithm based on some criterion, and a stochastic gradient stepsize has been shown to be advantageous over a fixed one [26–28]. In order to find such a stepsize at each iteration of the algorithm, we can use the steepest descent method as follows:

\[ \beta_{(k+1)} = \beta_{(k)} - \xi\, \frac{\partial \mathcal{J}(\Phi_{(k+1)})}{\partial \beta_{(k)}}, \qquad (19) \]

where ξ is a constant. It is seen from (19) that the stepsize is adapted based on the gradient of the total cost function and the stepsize of the previous iteration. In other words, it is adapted in a way that minimizes (14). In order to calculate the derivative in (19), we substitute (17) into (14) and obtain the following equation:

\[ \mathcal{J}(\Phi_{(k+1)}) = \big\| \Psi^T \big(\Phi_{(k)} - \beta_{(k)} \Phi_{(k)} \Psi (\Psi^T \Phi_{(k)}^T \Phi_{(k)} \Psi - H_{(k)}) \Psi^T\big)^T \big(\Phi_{(k)} - \beta_{(k)} \Phi_{(k)} \Psi (\Psi^T \Phi_{(k)}^T \Phi_{(k)} \Psi - H_{(k)}) \Psi^T\big) \Psi - H_{(k+1)} \big\|_F^2. \qquad (20) \]

For notational simplicity, and to avoid very long equations, we define

\[ M_{(k)} \triangleq \Phi_{(k)} \Psi (\Psi^T \Phi_{(k)}^T \Phi_{(k)} \Psi - H_{(k)}) \Psi^T \qquad (21) \]

and then simplify (20) to

\[ \mathcal{J}(\Phi_{(k+1)}) = \|\Psi^T (\Phi_{(k)} - \beta_{(k)} M_{(k)})^T (\Phi_{(k)} - \beta_{(k)} M_{(k)}) \Psi - H_{(k+1)}\|_F^2. \qquad (22) \]

In order to calculate the derivative of (22) with respect to β_(k), we expand ‖·‖²_F using the trace operator and obtain:

\[ \frac{\partial \mathcal{J}(\Phi_{(k+1)})}{\partial \beta_{(k)}} = -2\, \mathrm{Tr}\{(A_{(k)} - H_{(k+1)})^T (B_{(k)} + B_{(k)}^T)\} + 2\, \mathrm{Tr}\{2 C_{(k)} (A_{(k)} - H_{(k+1)}) + (B_{(k)} + B_{(k)}^T)^2\}\, \beta_{(k)} - 6\, \mathrm{Tr}\{C_{(k)} (B_{(k)} + B_{(k)}^T)\}\, \beta_{(k)}^2 + 4\, \mathrm{Tr}\{C_{(k)}^2\}\, \beta_{(k)}^3, \qquad (23) \]

where

\[ A_{(k)} \triangleq \Psi^T \Phi_{(k)}^T \Phi_{(k)} \Psi, \qquad B_{(k)} \triangleq \Psi^T \Phi_{(k)}^T M_{(k)} \Psi, \qquad C_{(k)} \triangleq \Psi^T M_{(k)}^T M_{(k)} \Psi. \qquad (24) \]

Now, the stepsize value for the (k+1)-th iteration can be found from (19). This adaptive update of the stepsize should be executed at each iteration of the inner loop in Algorithm 1. We still need to choose ξ (the "stepsize of the stepsize") manually. However, we observed that the overall performance is relatively insensitive to its value, a fact that has also been indicated in the corresponding literature [28].
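In code, one inner iteration with the adaptive stepsize could look as follows (our sketch, using the notation of (21)–(24); within the inner loop H is fixed, so H_(k+1) is taken equal to the current H):

    import numpy as np

    def gradient_step_adaptive(Phi, Psi, H, beta, xi):
        """One Phi update, Eq. (17), plus the stepsize update, Eq. (19)."""
        A = Psi.T @ Phi.T @ Phi @ Psi        # A_(k), Eq. (24)
        M = Phi @ Psi @ (A - H) @ Psi.T      # M_(k), Eq. (21)
        B = Psi.T @ Phi.T @ M @ Psi          # B_(k), Eq. (24)
        C = Psi.T @ M.T @ M @ Psi            # C_(k), Eq. (24)
        E, S = A - H, B + B.T
        # Derivative of the cost with respect to beta, Eq. (23):
        dJ_dbeta = (-2 * np.trace(E.T @ S)
                    + 2 * np.trace(2 * C @ E + S @ S) * beta
                    - 6 * np.trace(C @ S) * beta**2
                    + 4 * np.trace(C @ C) * beta**3)
        Phi_new = Phi - beta * M             # gradient update, Eq. (17)
        beta_new = beta - xi * dJ_dbeta      # stepsize update, Eq. (19)
        return Phi_new, beta_new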
4. Results

In this section we examine the performance of the proposed method and compare it with well-established similar methods by presenting empirical results obtained from an extensive set of experiments. In the first experiment, we applied the proposed algorithm to a random 30×200 Φ drawn from a Gaussian distribution, where a random dictionary Ψ of size 200×400 was used. Then, Elad's algorithm [10], accessible via the SparseLab toolbox [29], was applied to the same matrices. The distributions of the absolute off-diagonal elements of G̃ for the proposed method, with μ = 0.2, L = 100, K = 50 and a fixed β = 0.01, and for Elad's method [10], with t = 0.2, 1000 iterations, and three different values of γ (γ₁ = 0.5, γ₂ = 0.7, γ₃ = 0.9), are depicted in Fig. 1(a) and (b). We also added the distribution graph of our previous work [23] for comparison. It is seen in Fig. 1(a) and (b) that the proposed method has a better response in decreasing both μ_av and μ_mx. Also, Fig. 1(b) shows that the proposed method is superior to [23] in reducing μ_mx, but the method in [23] is more powerful in pulling the center of the distribution toward zero (equivalent to reducing μ_av). Further, we repeated this experiment with the same settings but with an overcomplete DCT dictionary used for Ψ. The results are shown in Fig. 1(c). The results in Fig. 1(b) and (c) indicate that the proposed method performs almost identically for both random and DCT dictionaries. This implies that the performance of the proposed method does not depend highly on the type of dictionary used, which indicates the universality of the obtained measurement matrix. For better illustration, the numerical values of μ_mx and μ_av obtained from this experiment are given in Table 1.

In order to observe the convergence behavior of the proposed method, and to evaluate the influence of using an adaptive stepsize, we calculated the objective function (14) and the evolution of μ_mx as the iterations proceeded. In this experiment both Φ and Ψ were generated randomly, and the other parameters were as follows: p = 50, n = 80, m = 100, μ = μ_E = 0.082, ξ = 10⁻⁵, and β_(0) = 0.05. We also applied the methods in [10,23] to this data. The resulting graphs appear in Fig. 2. It is seen from Fig. 2(a) that the proposed method (for both fixed and adaptive stepsize) achieves a lower mutual coherence compared with the other methods. Moreover, the mutual coherence achieved when using the adaptive stepsize is even smaller. Similar superiority is observed by inspecting Fig. 2(b), where the objective function error is depicted (note that since there exists no explicit objective function in [10], we were unable to plot such a graph for it in Fig. 2(b)). It is seen that an adaptive stepsize leads to a smaller steady-state error and a faster convergence. This behavior is due to the adaptation of the stepsize β toward an optimum value, which is clearly seen from the evolution graph in Fig. 2(c).
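Tying the earlier sketches together (assuming `coherence_measures` and `optimize_measurement_matrix` from the previous code blocks are in scope), the flavor of this first experiment can be reproduced as follows; the exact numbers will of course differ from the paper's:

    import numpy as np

    rng = np.random.default_rng(1)
    Psi = rng.standard_normal((200, 400))        # random 200 x 400 dictionary

    Phi0 = rng.standard_normal((30, 200))        # unoptimized 30 x 200 Gaussian Phi
    print("before:", coherence_measures(Phi0 @ Psi))

    Phi = optimize_measurement_matrix(Psi, p=30, mu=0.2, beta=0.01, L=100, K=50)
    print("after: ", coherence_measures(Phi @ Psi))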
[Fig. 1. Comparing the distributions of the absolute off-diagonal elements of G̃: (a) optimization using Elad's method with random dictionary (γ = 0.5, 0.7, 0.9), (b) optimization with random dictionary, and (c) optimization with overcomplete DCT dictionary; each panel also shows the distribution with no optimization.]

[Fig. 2. The convergence results: (a) evolution of μ_mx, (b) the objective function error, and (c) the evolution of the adaptive stepsize β, all versus iteration number.]

Table 1
Effects of different methods on mutual coherence when using two types of dictionaries: random and DCT.

    Dictionary type     Before         Method     Proposed   Elad's method
                        optimization   in [23]    method     γ₁ = 0.5   γ₂ = 0.7   γ₃ = 0.9

    Random dictionary:
      μ_mx              0.7326         0.6803     0.4298     0.7831     0.8552     0.9281
      μ_av              0.1557         0.1436     0.1475     0.1554     0.1548     0.1556
    DCT dictionary:
      μ_mx              0.9334         0.7429     0.4855     0.9593     0.9146     0.8932
      μ_av              0.1527         0.1430     0.1471     0.1551     0.1538     0.1552

[Fig. 3. Empirical phase transition in the (ρ, δ) plane, with ρ = s/p on the vertical axis and δ = p/n on the horizontal axis, when (a) LASSO and (b) OMP with random Φ, and (c) LASSO and (d) OMP with the proposed method were used.]

In the next part of the experiments we investigated the advantages of measurement matrix optimization in sparse signal recovery. We selected OMP, a greedy algorithm which iteratively builds up an approximation of the sparse signal and is widely used as a reconstruction algorithm in this framework. Further, LASSO (a continuation of BP [4]) was selected as a minimization approach, which attempts to fit a regression model while imposing an ℓ₁-norm constraint on the regression coefficients, solving:

\[ \min_{\alpha} \|y - D\alpha\|_2^2 \quad \mathrm{s.t.} \quad \|\alpha\|_1 \le k, \qquad (25) \]

where k is a nonnegative constant [7,30].
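For illustration only (not the code used in the paper), both recovery routes are available off the shelf, e.g. in scikit-learn; here OMP is given the true sparsity level, and LASSO is used in its penalized (Lagrangian) form, which corresponds to the constrained problem (25) for a suitable penalty:

    import numpy as np
    from sklearn.linear_model import Lasso, OrthogonalMatchingPursuit

    rng = np.random.default_rng(0)
    p, m, s = 25, 120, 4
    D = rng.standard_normal((p, m))               # equivalent matrix D = Phi @ Psi
    alpha = np.zeros(m)
    alpha[rng.choice(m, size=s, replace=False)] = rng.standard_normal(s)
    y = D @ alpha                                 # noiseless measurements

    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=s, fit_intercept=False).fit(D, y)
    lasso = Lasso(alpha=1e-4, fit_intercept=False, max_iter=100_000).fit(D, y)

    for name, a_hat in (("OMP", omp.coef_), ("LASSO", lasso.coef_)):
        print(name, np.linalg.norm(alpha - a_hat) / np.linalg.norm(alpha))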
We present the results of applying these methods as they are well known in sparse recovery problems. Extensive simulations were run to sketch the phase transition diagram (details about the transition diagram can be found in [29,30]) in a noiseless setting. One hundred trials were generated at each point on a grid of δ = p/n and ρ = s/p axes, for n = m = 500, δ ∈ [0.1, 1], and ρ ∈ [0.05, 1]. The (ρ, δ) plane is made of 30×30 equispaced points in total. Ψ was chosen to be a random matrix of size 500×500. The non-zeros of the sparse signals in each trial were drawn from a random Gaussian distribution. The resulting diagrams are shown in Fig. 3 for two different recovery algorithms, namely OMP [5] and LASSO [7] (other methods such as BP [4] and LARS [6] were also tested, and similar behaviors were observed). The shaded attributes are the proportion of successful outcomes over all trials. An outcome is considered successful if the reconstruction error ‖x − x̂‖₂/‖x‖₂ is less than 0.01. The overlaid curves show, for each δ, the estimated transition at which the reconstruction is successful with probability 1 − γ; γ is the estimator's threshold and was set to γ = 0.25 (see [29] for more details about the estimator's threshold).

Preliminary inspection of the results, especially those in Fig. 4, where the curves for the different methods are depicted on the same graph, reveals the success of the proposed approach. It is seen that the proposed approach does give improvements over random matrices and Elad's method. However, it has been observed that these advantages decrease as the problem size grows (as n increases from around 800–1000 and beyond). Nevertheless, the proposed method outperforms both Elad's method and pure random matrices at large dimensions, though not very noticeably.

Following the previous test in the analysis of the reconstruction performance, we set up a new experiment where a redundant 80×120 DCT dictionary Ψ was used. We conducted two separate CS experiments, first by fixing p = 25 and varying s from 1 to 10, and second by fixing s = 4 and varying p from 15 to 40 (see also [10,29]). Each experiment was performed for 50,000 random sparse ensembles, and the average reconstruction error was recorded. The corresponding graphs, shown in Fig. 5, indicate the improved performance when using an optimized Φ instead of a random Φ. Moreover, it is seen that the proposed method performs better, especially when OMP is used as the recovery method. The reason can be inferred from the fact that OMP works based on orthogonality, which is improved when the proposed method is applied.

More importantly, Fig. 5(a) reveals that by using the optimized measurement matrices one can achieve the same performance as that obtained using pure random matrices, but with a smaller number of measurements.
[Fig. 4. Empirical phase transition curves of (a) LASSO and (b) OMP when different measurement matrices (random Φ, Elad's method, and the proposed method) were used.]

[Fig. 5. Relative number of errors, for BP and OMP with random Φ, Elad's method, and the proposed method, as a function of (a) the number of measurements p and (b) the number of non-zeros.]

For instance, notice the recovery error of BP using the optimized measurement matrix at p = 30 in Fig. 5(a). It is almost as small as the recovery rate at p = 36 for a pure random Φ. The same behavior can also be noticed for the recovery using OMP.

[Fig. 6. Relative reconstruction error using OMP versus the number of non-zeros, for a Bernoulli random Φ, Elad's method, and the proposed method with fixed and adaptive stepsize.]

In this part of the simulations, the possibility of using the proposed approach for optimizing Bernoulli projections is studied. The main advantages of using random Bernoulli matrices are the reduction of hardware complexity and storage due to their binary nature. Assuming the same settings as in the previous experiment, we first generated a measurement matrix with a random Bernoulli distribution. Then, the proposed method was applied to optimize it. The optimized matrix obviously would no longer have ±1 values, and includes values of any amplitude. One preliminary way to quantize the optimized elements back to the Bernoulli type, i.e. ±1, is to apply the nonlinear sign(·) operation. The recovery error rates using this scheme are given in Fig. 6, which indicates its superiority over pure random Bernoulli matrices. However, applying such a nonlinear operation is not optimum, and further work is required to better optimize random Bernoulli matrices.
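The quantization just described is a one-line operation; a sketch (ours):

    import numpy as np

    def quantize_to_bernoulli(Phi):
        """Map an optimized measurement matrix back to +/-1 (Bernoulli-type) entries."""
        return np.where(Phi >= 0.0, 1.0, -1.0)

As noted above, this hard quantization discards the amplitude information introduced by the optimization, which is why it is only a preliminary solution.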
As previously mentioned, the parameter μ in Algorithm 1 can theoretically be chosen as any μ ≥ μ_E. However, we empirically observed that the best performance is achieved when μ ≈ μ_E. This is a reasonable choice, since μ_E is the optimum mutual coherence and the aim is to approach this value. In order to support the effectiveness of this choice, and to demonstrate the influence of choosing different μ, we measured the reconstruction error of 10,000 signal ensembles while optimizing Φ with different values of μ. In this experiment, the signal ensembles were of length m = 100 with five non-zeros, Ψ was an overcomplete DCT dictionary, and OMP was used as the recovery method. The other parameters of the algorithm were p = 30, n = 64, L = 100 and K = 20. Based on the given parameters, the optimum mutual coherence can be calculated as μ_E = √((m−p)/(p(m−1))) = 0.1535.
[Fig. 7. Relative reconstruction error versus μ, the threshold used in Algorithm 1 for optimization of the measurement matrix, for no optimization (pure random Φ), an optimized Φ using a fixed stepsize, and an optimized Φ using the adaptive stepsize; the value μ_E = 0.1535 is marked.]

Fig. 7 represents the relative reconstruction error as a function of μ for three cases: no optimization, optimization with a fixed stepsize β = 0.01, and optimization with an adaptive stepsize with ξ = 10⁻⁶ and β_(0) = 0.02. It is seen from the graphs in Fig. 7 that the minimum error is achieved when μ ≈ 0.1–0.3, and hence μ ≈ μ_E is a suitable choice. It is also noticeable that for μ < μ_E the error is still relatively small, but it increases as μ approaches 0. It is worth noting that the choice μ = 0 is equivalent to using the method in [23], which means replacing H with the identity matrix I. We have also observed that as μ decreases, the steady-state error of the objective function (14) becomes higher, and vice versa.

All the aforementioned experiments were conducted under noiseless settings. In order to show the robustness of the proposed method in noisy situations, we considered the noisy model y = ΦΨα + v, where v ∈ ℝ^p is a vector of additive Gaussian noise with zero mean. The main parameter settings for this simulation were: p = 25, n = 64, m = 100, number of non-zeros s = 5, and μ = 0.17. First, the proposed method was applied to a random measurement matrix, where a DCT dictionary was used. Then, noisy measurements were generated with different signal to noise ratios (SNRs) from 0 to 60 dB. We applied OMP and LASSO to these noisy measurements to recover the sparse signals. The relative reconstruction errors under different noise levels are given in Fig. 8. It is seen from Fig. 8 that the reconstruction performance improves as the SNR increases, and vice versa. However, the optimized measurement matrix is more robust to noise compared with the pure random matrix.

[Fig. 8. Relative reconstruction error versus the SNR of the measurements in dB using (a) OMP and (b) LASSO, for a pure random Φ, Elad's method, and the proposed method with fixed and adaptive stepsize.]
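A sketch (ours, not the paper's code) of how the noisy measurements for such a test can be generated: the noise variance is set from the empirical power of the clean measurements and the target SNR:

    import numpy as np

    def noisy_measurements(D, alpha, snr_db, rng):
        """Return y = D @ alpha + v, with white Gaussian v at the requested SNR (dB)."""
        y_clean = D @ alpha
        noise_power = np.mean(y_clean**2) / 10.0**(snr_db / 10.0)
        v = np.sqrt(noise_power) * rng.standard_normal(y_clean.shape)
        return y_clean + v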
Finally, we compare the computation time of the proposed method with those of the other methods. We set up a simulation to optimize measurement matrices of different dimensions. A desktop computer with a 3.00 GHz Core 2 Duo CPU and 2 GB of RAM was used for this experiment. Table 2 reports the running time per iteration for different dimensions of Φ and Ψ. As seen from the table, the computation time increases with the dimensions. The methods in [23] and [10] have the lowest and highest computation times, respectively. For large dimensions, the computation time of Elad's method [10] is almost twice that of the proposed method with fixed stepsize (notice the last column of Table 2). This can be attributed to the use of SVD in Elad's method, which is computationally expensive, especially in high dimensions. In addition, it is seen that using the adaptive stepsize in the proposed method incurs more complexity and hence increases the computation time. In general, high computation time in very large-scale problems is a drawback of such optimization techniques.
Table 2
The computation time (in seconds) per iteration for different methods.

    Matrix dimensions:
      Φ:  60×100    120×200   180×300   240×400   300×500
      Ψ:  100×200   200×400   300×600   400×800   500×1000

    Elad's method [10]                        0.0772    0.4533    1.2673    2.8761    5.2674
    The method in [23]                        0.0079    0.0591    0.2453    0.6048    1.1691
    The proposed method (fixed stepsize)      0.0565    0.2562    0.7069    1.4385    2.5095
    The proposed method (adaptive stepsize)   0.0673    0.3526    0.9854    2.0480    3.6545

5. Discussion

In this short section we discuss the characteristics of the proposed method with respect to a given class of signals or images. It may have been noticed that the proposed method is non-adaptive and only requires knowledge of the sparsifying dictionary. In contrast to some methods like [14], it does not incorporate any prior knowledge about the signal of interest (e.g. its frequency spectrum, or the locations of its non-zeros) into the optimization process. Due to such independence from the original signal, the optimized measurement matrix is not affected by the properties (whether known or unknown) of the sparse signal. In particular, the power spectral density of the optimized measurement matrix, which is originally white for a pure random matrix, does not change significantly and remains white, even if the signal to be sampled has a particular spectrum. We have observed such behavior in other non-adaptive methods too, such as [10,23]. This non-adaptivity has the advantage that the proposed method can be applied to a wide range of signals and images without any restriction. On the other hand, the disadvantage is that it is not able to incorporate prior knowledge into the optimization process in cases where such a priori information is available. However, when knowledge about the spectrum of the sparse signal is available (e.g. the signal is bandpass or highpass), one reasonable approach can be to first generate a colored random measurement matrix using the method described in [14], and then apply the method proposed in this paper to optimize it and decrease the mutual coherence. This is a direction to be studied further in the future.

6. Conclusion

In this paper, optimization of the measurement matrix in compressed sensing was addressed. Although random sampling has been widely used in this framework so far, it has recently been shown that optimized projections, having smaller mutual coherence with respect to the sparsifying matrix, can significantly improve the performance. We proposed a gradient-based alternating optimization method which iteratively optimizes the measurement matrix to decrease the mutual coherence. The advantages are higher performance in sparse recovery, and also the possibility of taking fewer measurements while achieving the same performance level as with pure random matrices. We also described a steepest descent method for obtaining an adaptive stepsize to improve the performance. The presented numerical results verify the success of the proposed approach. In addition, the experimental results indicate that the proposed method is computationally less expensive than some current methods in the literature.

Acknowledgments

The authors would like to thank the anonymous reviewers for their thorough remarks and fruitful suggestions.

References

[1] D. Donoho, Compressed sensing, IEEE Transactions on Information Theory 52 (4) (2006) 1289–1306.
[2] E.J. Candes, J. Romberg, T. Tao, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, IEEE Transactions on Information Theory 52 (2) (2006) 489–509.
[3] C.E. Shannon, Communication in the presence of noise, Proceedings of the IRE 37 (1) (1949) 10–21.
[4] S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by basis pursuit, SIAM Journal on Scientific Computing 20 (1) (1999) 33–61.
[5] J.A. Tropp, A.C. Gilbert, Signal recovery from random measurements via orthogonal matching pursuit, IEEE Transactions on Information Theory 53 (12) (2007) 4655–4666.
[6] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Annals of Statistics 32 (2004) 407–499.
[7] R. Tibshirani, Regression shrinkage and selection via the LASSO, Journal of the Royal Statistical Society (Series B) 58 (1) (1996) 267–288.
[8] E.J. Candès, M.B. Wakin, An introduction to compressive sampling, IEEE Signal Processing Magazine 25 (2) (2008) 21–30.
[9] E.J. Candès, J.K. Romberg, T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Communications on Pure and Applied Mathematics 59 (8) (2006) 1207–1223.
[10] M. Elad, Optimized projections for compressed sensing, IEEE Transactions on Signal Processing 55 (12) (2007) 5695–5702.
[11] V. Abolghasemi, S. Sanei, S. Ferdowsi, F. Ghaderi, A. Belcher, Segmented compressive sensing, in: IEEE/SP 15th Workshop on Statistical Signal Processing SSP 2009, September 2009.
[12] M. Lustig, D. Donoho, J.M. Pauly, Sparse MRI: the application of compressed sensing for rapid MR imaging, Magnetic Resonance in Medicine 58 (6) (2007) 1182–1195.
[13] Z. Wang, G.R. Arce, Variable density compressed image sampling, IEEE Transactions on Image Processing 19 (1) (2010) 264–270.
[14] Z. Wang, G.R. Arce, J.L. Paredes, Colored random projections for compressed sensing, in: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2007, vol. 3, April 2007, pp. III-873–III-876.
[15] J.M. Duarte-Carvajalino, G. Sapiro, Learning to sense sparse signals: simultaneous sensing matrix and sparsifying dictionary optimization, IEEE Transactions on Image Processing 18 (7) (2009) 1395–1408.
[16] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Transactions on Signal Processing 54 (11) (2006) 4311–4322.
[17] I.S. Dhillon, R.W. Heath, T. Strohmer, J.A. Tropp, Constructing packings in Grassmannian manifolds via alternating projection, Experimental Mathematics 17 (1) (2008) 9–35.
[18] J.A. Tropp, I.S. Dhillon, R.W. Heath, T. Strohmer, Designing structured tight frames via an alternating projection method, IEEE Transactions on Information Theory 51 (1) (2005) 188–209.
[19] T. Strohmer, R.W. Heath, Grassmannian frames with applications to coding and communication, Applied and Computational Harmonic Analysis 14 (3) (2003) 257–275.
[20] D.L. Donoho, P.B. Stark, Uncertainty principles and signal recovery, SIAM Journal on Applied Mathematics 49 (3) (1989) 906–931.
[21] D.L. Donoho, M. Elad, Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ₁-minimization, Proceedings of the National Academy of Sciences 100 (5) (2003) 2197–2202.
[22] J.A. Tropp, Greed is good: algorithmic results for sparse approximation, IEEE Transactions on Information Theory 50 (2004) 2231–2242.
[23] V. Abolghasemi, S. Ferdowsi, B. Makkiabadi, S. Sanei, On optimization of the measurement matrix for compressive sensing, in: European Signal Processing Conference EUSIPCO, Aalborg, Denmark, August 2010.
[24] M. Yaghoobi, L. Daudet, M.E. Davies, Parametric dictionary design for sparse coding, IEEE Transactions on Signal Processing 57 (12) (2009) 4800–4810.
[25] K.B. Petersen, M.S. Pedersen, The Matrix Cookbook, Version 20081110, October 2008.
[26] S. Amari, A theory of adaptive pattern classifiers, IEEE Transactions on Electronic Computers EC-16 (3) (1967) 299–307.
[27] V.J. Mathews, Z. Xie, A stochastic gradient adaptive filter with gradient adaptive step size, IEEE Transactions on Signal Processing 41 (6) (1993) 2075–2087.
[28] S.C. Douglas, Generalized gradient adaptive step sizes for stochastic gradient adaptive filters, in: International Conference on Acoustics, Speech, and Signal Processing, ICASSP 1995, vol. 2, May 1995, pp. 1396–1399.
[29] D.L. Donoho, I. Drori, V. Stodden, Y. Tsaig, SparseLab, http://sparselab.stanford.edu/, December 2005.
[30] D.L. Donoho, Y. Tsaig, Fast solution of ℓ₁-norm minimization problems when the solution may be sparse, IEEE Transactions on Information Theory 54 (11) (2008) 4789–4812.
