AMP_paper
Abstract—High-dimensional signal recovery of standard linear regression is a key challenge in many engineering fields, such as communications, compressed sensing, and image processing. The approximate message passing (AMP) algorithm proposed by Donoho et al. is a computationally efficient method for such problems.

The problem can be formalized as a least absolute shrinkage and selection operator (LASSO) [1] inference problem
\[
\hat{x}_{\rm LASSO}=\arg\min_{x}\ \frac{1}{2}\|y-Hx\|_2^2+\lambda\|x\|_1.
\tag{2}
\]
Fig. 1. The relations between the message passing based algorithms in the standard linear regression inference problem. (The figure relates message passing (2005), AMP [15] (2009), OAMP [24] (2016), EP [27] (2001), EC (single-loop) [28], CAMP [25] (2020), and MAMP [26] (2021) through the labels: Gaussian approximation, Onsager term, IID sub-Gaussian, LMMSE de-correlated matrix, divergence-free denoiser, unitarily-invariant, scalar variance, Taylor series, and three orthogonality.)
The approximate message passing (AMP) [15] algorithm, the main focus of this paper, is a celebrated implementation of Bayes estimation. Through a postulated posterior/MMSE, in which the postulated prior and likelihood function differ from the true ones, AMP can provide the exact sparse solution to the LASSO inference problem using the Laplace method of integration. In general, we call algorithms that rely on the Bayesian formula Bayesian algorithms.

On the other hand, in the Bayes-optimal setting (with M ≥ N possible), where both the prior and the likelihood function are known, the MMSE and MAP estimators give a much better performance than convex relaxation. However, due to the high-dimensional integration, the exact MMSE is hard to obtain. Fortunately, existing works [16] showed that AMP can achieve the Bayes-optimal MSE performance with affordable complexity in the independent identically distributed (IID) sub-Gaussian random measurement matrix region [17]. For convenience, we depict Fig. 1 to show the relations between AMP and its related algorithms. AMP derives from the message passing [18] algorithm in coding theory, which is also known as belief propagation [19] in computer science or the cavity method [20] in statistical mechanics. The AMP algorithm is closely related to the Thouless-Anderson-Palmer (TAP) [21] equations, which are used to approximate marginal moments in large probabilistic models. In [22], the first AMP algorithm was proposed for the code division multiple access (CDMA) multi-user detection problem. A significant feature of the AMP algorithm is that the dynamics of AMP can be fully predicted by a scalar equation termed state evolution (SE) [16], which perfectly agrees with the fixed point of the exact MMSE estimator obtained by the replica method [23]. The AMP algorithm is also related to ISTA; the difference between them is the Onsager term, which makes AMP converge faster than ISTA but does not change its fixed points. When the measurement matrix is beyond the IID sub-Gaussian region, AMP often fails to converge. Beyond the IID sub-Gaussian region, orthogonal AMP (OAMP) [24] can be applied to more general unitarily-invariant matrices via an LMMSE de-correlated matrix and a divergence-free denoiser, but it pays a higher computational cost due to the matrix inversion. To balance the complexity and the region of the random measurement matrix, some long memory algorithms, such as convolutional AMP (CAMP) [25] and memory AMP (MAMP) [26], were recently proposed. Different from OAMP, CAMP only modifies the Onsager term of AMP. The Onsager term of CAMP includes all preceding messages to ensure the Gaussianity of the input signal of the denoiser. However, CAMP may fail to converge in the case of a large condition number. Following CAMP and OAMP, the MAMP algorithm applies finite terms of a matrix Taylor series to approximate the matrix inversion of OAMP and involves all previous messages to ensure three orthogonality conditions.

Another efficient algorithm related to AMP is called expectation propagation (EP) [27]. EP is earlier than AMP; it approximates the factorable factors by choosing a distribution from the Gaussian family via minimizing the Kullback-Leibler (KL) divergence. EP-related methods include the expectation consistent (EC) approximation [28, Appendix D] (single-loop), OAMP [24], and vector AMP (VAMP) [29]. They were proposed independently in different manners but share the same algorithm. Actually, EP/EC (single-loop) have a slight difference from OAMP/VAMP, since EP/EC keep element-wise variances, and they can be reduced to OAMP/VAMP by taking the mean of the element-wise variances. Among them, the EC approximation is based on the minimum Gibbs free energy. It means that those methods can be regarded as examples of solving the fixed point of the Gibbs free energy. Almost at the same time as OAMP, VAMP was proposed using an EP-type message passing, and the dynamics of VAMP were rigorously analyzed in [29]. Recently, [30] proved that VAMP and AMP have identical fixed points in their state evolutions on their overlapping random matrices. We also note that under the mismatched case [31], where the prior and likelihood function applied to the inference problem are different from the true prior and likelihood function, AMP as well as its related algorithms may not converge, although the corresponding SE converges to a fixed point predicted by the replica method. Actually, AMP for LASSO is one case of a mismatched model, but its convergence is guaranteed due to the convex nature of LASSO [32]. The failure of AMP can occur when the mismatched models are defined by a non-convex cost function [33].
The AMP algorithm [15] posted below is related to ISTA:
\[
z^{(t)}=y-H\hat{x}^{(t)}+\cdots
\]
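As a point of reference for the ISTA connection above, the following is a minimal NumPy sketch of plain ISTA for the LASSO problem (2); the step size 1/L and the iteration count are illustrative assumptions, and the Onsager-corrected AMP recursion that the paper derives appears later in (44) and (48).

```python
import numpy as np

def soft_threshold(r, gamma):
    # Soft-thresholding denoiser: sign(r) * max(|r| - gamma, 0)
    return np.sign(r) * np.maximum(np.abs(r) - gamma, 0.0)

def ista(y, H, lam, num_iter=200):
    # Plain ISTA for min_x 0.5*||y - H x||_2^2 + lam*||x||_1 (no Onsager term)
    x = np.zeros(H.shape[1])
    L = np.linalg.norm(H, 2) ** 2          # Lipschitz constant of the gradient
    for _ in range(num_iter):
        r = x + H.T @ (y - H @ x) / L      # gradient step on the quadratic term
        x = soft_threshold(r, lam / L)     # proximal step on the l1 term
    return x
```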
...while the mean of the approximated posterior $\hat{q}^{(t+1)}(x_i|y)$ will serve as an approximation of the MMSE estimator.

To reduce the complexity of the sum-product message passing shown in (12), we first simplify the message $\mu_{i\leftarrow a}^{(t)}(x_i)$ as below
\[
\begin{aligned}
\mu_{i\leftarrow a}^{(t)}(x_i)
&\propto \int_{x_{\backslash i}}\int_{z_a} q(y_a|z_a)\,\delta\Big(z_a-\sum_{k=1}^N h_{ak}x_k\Big)\,\mathrm{d}z_a \prod_{j\neq i}\mu_{j\rightarrow a}^{(t)}(x_j)\,\mathrm{d}x_{\backslash i}\\
&\propto \int_{z_a} q(y_a|z_a)\,\mathrm{E}\Big\{\delta\Big(z_a-\sum_{j\neq i} h_{aj}x_j-h_{ai}x_i\Big)\Big\}\,\mathrm{d}z_a,
\end{aligned}
\tag{14}
\]
where the expectation is over $\prod_{j\neq i}\mu_{j\rightarrow a}^{(t)}(x_j)$. We define a random variable (RV) $\zeta_{i\leftarrow a}^{(t)}$ associated with $z_a$ and RVs $\xi_{j\rightarrow a}^{(t)}$, following $\mu_{j\rightarrow a}^{(t)}(x_j)$, associated with $x_j$. Denote the mean and variance of $\xi_{j\rightarrow a}^{(t)}$ as $\hat{x}_{j\rightarrow a}^{(t)}$ and $\hat{v}_{j\rightarrow a}^{(t)}/\beta$, respectively. From (14), as the dimension $N$ tends to infinity, by the central limit theorem (CLT) the RV $\zeta_{i\leftarrow a}^{(t)}$ converges to a Gaussian RV with mean and variance
\[
\mathrm{E}\{\zeta_{i\leftarrow a}^{(t)}\}=Z_{i\leftarrow a}^{(t)}+h_{ai}x_i,\qquad \mathrm{Var}\{\zeta_{i\leftarrow a}^{(t)}\}=\frac{1}{\beta}V_{i\leftarrow a}^{(t)},
\tag{15}
\]
where
\[
Z_{i\leftarrow a}^{(t)}=\sum_{j\neq i}h_{aj}\hat{x}_{j\rightarrow a}^{(t)},\qquad V_{i\leftarrow a}^{(t)}=\sum_{j\neq i}|h_{aj}|^2\hat{v}_{j\rightarrow a}^{(t)}.
\tag{16}
\]
Based on this Gaussian approximation, the term $\mathrm{E}\{\delta(z_a-\sum_{j\neq i}h_{aj}x_j-h_{ai}x_i)\}$ in (14) is replaced by $\mathcal{N}(z_a|h_{ai}x_i+Z_{i\leftarrow a}^{(t)},\frac{1}{\beta}V_{i\leftarrow a}^{(t)})$. By the Gaussian reproduction lemma$^1$, the message $\mu_{i\leftarrow a}^{(t)}(x_i)$ is approximated as
\[
\begin{aligned}
\mu_{i\leftarrow a}^{(t)}(x_i)
&\propto \mathcal{N}\Big(0\,\Big|\,y_a-h_{ai}x_i-Z_{i\leftarrow a}^{(t)},\ \frac{1}{\beta}\big(1+V_{i\leftarrow a}^{(t)}\big)\Big)\\
&\propto \mathcal{N}\Big(x_i\,\Big|\,\frac{y_a-Z_{i\leftarrow a}^{(t)}}{h_{ai}},\ \frac{1+V_{i\leftarrow a}^{(t)}}{\beta|h_{ai}|^2}\Big).
\end{aligned}
\tag{17}
\]
In the sequel, the means and variances of the messages $\mu_{i\leftarrow b}^{(t)}(x_i)$, $b\neq a$, are combined, where
\[
\Sigma_{i\rightarrow a}^{(t)}=\Bigg(\sum_{b\neq a}\frac{|h_{bi}|^2}{1+V_{i\leftarrow b}^{(t)}}\Bigg)^{-1},
\tag{20}
\]
\[
r_{i\rightarrow a}^{(t)}=\Sigma_{i\rightarrow a}^{(t)}\sum_{b\neq a}\frac{h_{bi}^*(y_b-Z_{i\leftarrow b}^{(t)})}{1+V_{i\leftarrow b}^{(t)}}.
\tag{21}
\]
Note that the zero-valued elements of $H$ have no effect on $\Sigma_{i\rightarrow a}^{(t)}$, $r_{i\rightarrow a}^{(t)}$, or the remaining parameters in the derivation of AMP.

As a result, the message $\mu_{i\rightarrow a}^{(t+1)}(x_i)$ is approximated as the product of a Laplace prior and a Gaussian likelihood function
\[
\mu_{i\rightarrow a}^{(t+1)}(x_i)=\frac{1}{Z_\beta}e^{-\beta\lambda|x_i|}\,\mathcal{N}\big(x_i\big|r_{i\rightarrow a}^{(t)},\Sigma_{i\rightarrow a}^{(t)}\big),
\tag{22}
\]
where $Z_\beta$ is a normalization constant.

For convenience, define a distribution
\[
f_\beta(x;r,\Sigma)=\frac{1}{Z_\beta}\exp\Big(-\beta\Big(\lambda|x|+\frac{1}{2\Sigma}(x-r)^2\Big)\Big),
\tag{23}
\]
and its mean and variance
\[
F_\beta(x;r,\Sigma)=\int x f_\beta(x;r,\Sigma)\,\mathrm{d}x,
\tag{24}
\]
\[
G_\beta(x;r,\Sigma)=\int x^2 f_\beta(x;r,\Sigma)\,\mathrm{d}x-|F_\beta(x;r,\Sigma)|^2.
\tag{25}
\]
The mean and variance of the message $\mu_{i\rightarrow a}^{(t+1)}(x_i)$ are represented as
\[
\hat{x}_{i\rightarrow a}^{(t+1)}=F_\beta\big(x_i;r_{i\rightarrow a}^{(t)},\Sigma_{i\rightarrow a}^{(t)}\big),
\tag{26}
\]
\[
\hat{v}_{i\rightarrow a}^{(t+1)}=\beta G_\beta\big(x_i;r_{i\rightarrow a}^{(t)},\Sigma_{i\rightarrow a}^{(t)}\big).
\tag{27}
\]

$^1$ $\mathcal{N}(x|a,A)\,\mathcal{N}(x|b,B)=\mathcal{N}(x|c,C)\,\mathcal{N}(0|a-b,A+B)$ with $C=(A^{-1}+B^{-1})^{-1}$ and $c=C\big(\frac{a}{A}+\frac{b}{B}\big)$.
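To make (23)-(27) concrete, here is a small numerical sketch that evaluates the mean $F_\beta$ and variance $G_\beta$ of the tilted distribution $f_\beta$ by quadrature on a grid; the grid range and resolution are ad-hoc assumptions, and for large $\beta$ the returned mean should approach the soft-threshold limit derived later in (45)-(46).

```python
import numpy as np

def f_beta_moments(r, Sigma, lam, beta, lo=-20.0, hi=20.0, num=200001):
    # Mean F_beta and variance G_beta of
    #   f_beta(x; r, Sigma) proportional to exp(-beta*(lam*|x| + (x - r)^2 / (2*Sigma)))
    x = np.linspace(lo, hi, num)
    log_w = -beta * (lam * np.abs(x) + (x - r) ** 2 / (2.0 * Sigma))
    w = np.exp(log_w - log_w.max())            # stabilize before normalizing
    w /= np.trapz(w, x)                        # normalized density on the grid
    mean = np.trapz(x * w, x)                  # F_beta(x; r, Sigma), eq. (24)
    var = np.trapz(x ** 2 * w, x) - mean ** 2  # G_beta(x; r, Sigma), eq. (25)
    return mean, var
```

For instance, with $r=1$, $\Sigma=1$, $\lambda=0.5$ and a large $\beta$ such as 200, the returned mean should be close to the soft-threshold value $\mathrm{sign}(1)\max(1-0.5,0)=0.5$.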
Recalling the approximated posterior $\hat{q}^{(t+1)}(x_i|y)$ in (13), we define
\[
\Sigma_i^{(t)}=\Bigg(\sum_{a=1}^{M}\frac{|h_{ai}|^2}{1+V_{i\leftarrow a}^{(t)}}\Bigg)^{-1},
\tag{28}
\]
\[
r_i^{(t)}=\Sigma_i^{(t)}\sum_{a=1}^{M}\frac{h_{ai}^*(y_a-Z_{i\leftarrow a}^{(t)})}{1+V_{i\leftarrow a}^{(t)}}.
\tag{29}
\]
The term $\prod_{a=1}^{M}\mu_{i\leftarrow a}^{(t)}(x_i)$ is proportional to $\mathcal{N}(x_i|r_i^{(t)},\Sigma_i^{(t)})$. Accordingly, the mean and variance of the approximated posterior $\hat{q}^{(t+1)}(x_i|y)$ can be denoted as
\[
\hat{x}_i^{(t+1)}=F_\beta\big(x_i;r_i^{(t)},\Sigma_i^{(t)}\big),
\tag{30}
\]
\[
\hat{v}_i^{(t+1)}=\beta G_\beta\big(x_i;r_i^{(t)},\Sigma_i^{(t)}\big).
\tag{31}
\]
Also define
\[
Z_a^{(t)}=\sum_{i=1}^{N}h_{ai}\hat{x}_{i\rightarrow a}^{(t)},
\tag{32}
\]
\[
V_a^{(t)}=\sum_{i=1}^{N}|h_{ai}|^2\hat{v}_{i\rightarrow a}^{(t)}\approx V_{i\leftarrow a}^{(t)},
\tag{33}
\]
where $V_a^{(t)}=V_{i\leftarrow a}^{(t)}$ holds by ignoring infinitesimal terms.

Applying a first-order Taylor series$^2$ to $\hat{x}_{i\rightarrow a}^{(t+1)}$ in (26), we have
\[
\hat{x}_{i\rightarrow a}^{(t+1)}\approx\hat{x}_i^{(t+1)}
+\triangle r\,\frac{\partial}{\partial r}F_\beta\big(x_i;r_i^{(t)},\Sigma_i^{(t)}\big)
+\triangle\Sigma\,\frac{\partial}{\partial\Sigma}F_\beta\big(x_i;r_i^{(t)},\Sigma_i^{(t)}\big),
\tag{34}
\]
where
\[
\triangle\Sigma=\Sigma_{i\rightarrow a}^{(t)}-\Sigma_i^{(t)}
=\frac{\frac{|h_{ai}|^2}{1+V_a^{(t)}}}{\Big(\sum_{a=1}^{M}\frac{|h_{ai}|^2}{1+V_{i\leftarrow a}^{(t)}}\Big)\Big(\sum_{b\neq a}\frac{|h_{bi}|^2}{1+V_{i\leftarrow b}^{(t)}}\Big)}
\approx 0,
\tag{35}
\]
\[
\triangle r=r_{i\rightarrow a}^{(t)}-r_i^{(t)}
\approx-\Sigma_i^{(t)}\frac{h_{ai}^*(y_a-Z_{i\leftarrow a}^{(t)})}{1+V_a^{(t)}},
\tag{36}
\]
where we use the approximations $V_a^{(t)}=V_{i\leftarrow a}^{(t)}+O(1/N)$ and $\Sigma_i^{(t)}=\Sigma_{i\rightarrow a}^{(t)}+O(1/N)$ to obtain $\triangle r$. Applying the fact$^3$
\[
\frac{\partial}{\partial r}F_\beta\big(x_i;r,\Sigma_i^{(t)}\big)\Big|_{r=r_i^{(t)}}
=\frac{\beta}{\Sigma_i^{(t)}}G_\beta\big(x;r_i^{(t)},\Sigma_i^{(t)}\big)
=\frac{\hat{v}_i^{(t+1)}}{\Sigma_i^{(t)}},
\]
(34) can be simplified as
\[
\hat{x}_{i\rightarrow a}^{(t+1)}\approx\hat{x}_i^{(t+1)}
-\hat{v}_i^{(t+1)}\frac{h_{ai}^*(y_a-Z_{i\leftarrow a}^{(t)})}{1+V_a^{(t)}}.
\tag{37}
\]

Applying a Taylor series to $\hat{v}_{i\rightarrow a}^{(t+1)}$ in (27), we have
\[
\hat{v}_{i\rightarrow a}^{(t+1)}\approx\hat{v}_i^{(t+1)}
+\triangle r\,\frac{\partial}{\partial r}\beta G_\beta\big(x_i;r_i^{(t)},\Sigma_i^{(t)}\big).
\tag{38}
\]
Combining (36) with (38) into (33) obtains
\[
\begin{aligned}
V_a^{(t)}&=\sum_{i=1}^{N}|h_{ai}|^2\hat{v}_i^{(t)}
-\sum_{i=1}^{N}|h_{ai}|^2\,\Sigma_i^{(t)}\frac{h_{ai}^*(y_a-Z_{i\leftarrow a}^{(t)})}{1+V_a^{(t)}}\,\frac{\partial}{\partial r}\beta G_\beta\big(x_i;r,\Sigma_i^{(t)}\big)\\
&\approx\sum_{i=1}^{N}|h_{ai}|^2\hat{v}_i^{(t)}
-\sum_{i=1}^{N}\frac{|h_{ai}|^3(y_a-Z_{i\leftarrow a}^{(t)})}{\sum_{a=1}^{M}|h_{ai}|^2}\,\frac{\partial}{\partial r}\beta G_\beta\big(x_i;r,\Sigma_i^{(t)}\big)\\
&=\sum_{i=1}^{N}|h_{ai}|^2\hat{v}_i^{(t)}+O(1/\sqrt{N})
\approx\sum_{i=1}^{N}|h_{ai}|^2\hat{v}_i^{(t)}.
\end{aligned}
\tag{39}
\]
Substituting (37) into (32) gets
\[
\begin{aligned}
Z_a^{(t)}&\approx\sum_{i=1}^{N}h_{ai}\hat{x}_i^{(t)}
-\sum_{i=1}^{N}\frac{|h_{ai}|^2\hat{v}_i^{(t)}(y_a-Z_{i\leftarrow a}^{(t-1)})}{1+V_a^{(t-1)}}\\
&=\sum_{i=1}^{N}h_{ai}\hat{x}_i^{(t)}
-\sum_{i=1}^{N}\frac{|h_{ai}|^2\hat{v}_i^{(t)}(y_a-Z_a^{(t-1)}+h_{ai}\hat{x}_i^{(t-1)})}{1+V_a^{(t-1)}}\\
&\approx\sum_{i=1}^{N}h_{ai}\hat{x}_i^{(t)}
-\frac{V_a^{(t)}(y_a-Z_a^{(t-1)})}{1+V_a^{(t-1)}}.
\end{aligned}
\tag{40}
\]
Inserting (37) into (29) yields
\[
\begin{aligned}
r_i^{(t)}&\approx\Sigma_i^{(t)}\sum_{a=1}^{M}\frac{h_{ai}^*(y_a-Z_a^{(t)}+h_{ai}\hat{x}_i^{(t)})}{1+V_a^{(t)}}\\
&=\hat{x}_i^{(t)}+\Sigma_i^{(t)}\sum_{a=1}^{M}\frac{h_{ai}^*(y_a-Z_a^{(t)})}{1+V_a^{(t)}}.
\end{aligned}
\tag{41}
\]
Up to now, the derivation of AMP for LASSO is complete. The AMP algorithm is shown in Algorithm 1.

To be in line with Donoho's AMP, we still need to carry out the following simplifications using the fact $|h_{ai}|^2=O(1/M)$:
\[
V_a^{(t)}=\frac{1}{M}\sum_{i=1}^{N}\hat{v}_i^{(t)}\triangleq V^{(t)},
\tag{43a}
\]
\[
Z_a^{(t)}=\sum_{i=1}^{N}h_{ai}\hat{x}_i^{(t)}-\frac{V^{(t)}(y_a-Z_a^{(t-1)})}{1+V^{(t-1)}},
\tag{43b}
\]
\[
\Sigma_i^{(t)}=1+V^{(t)}\triangleq\Sigma^{(t)},
\tag{43c}
\]
\[
r_i^{(t)}=\hat{x}_i^{(t)}+\sum_{a=1}^{M}h_{ai}^*(y_a-Z_a^{(t)}),
\tag{43d}
\]
\[
\hat{x}_i^{(t+1)}=F_\beta\big(x_i;r_i^{(t)},\Sigma^{(t)}\big),
\tag{43e}
\]
\[
\hat{v}_i^{(t+1)}=\Sigma^{(t)}F'_\beta\big(x_i;r_i^{(t)},\Sigma^{(t)}\big),
\tag{43f}
\]

$^2$ $f(x+\triangle x,y+\triangle y)=f(x,y)+\triangle x\,f'_x(x,y)+\triangle y\,f'_y(x,y)$, where $f'_x$ and $f'_y$ are the partial derivatives of $f(x,y)$ w.r.t. $x$ and $y$, respectively.

$^3$ Provided that $f(x)$ is an arbitrary bounded and non-negative function, define a distribution $\mathcal{P}(x)=\frac{f(x)\mathcal{N}(x|m,v)}{\int f(x)\mathcal{N}(x|m,v)\mathrm{d}x}$. Denote its mean and variance as $\mathrm{E}\{x\}=\int x\mathcal{P}(x)\mathrm{d}x$ and $\mathrm{Var}\{x\}=\int(x-\mathrm{E}\{x\})^2\mathcal{P}(x)\mathrm{d}x$. We have
$\frac{\partial\int x\mathcal{P}(x)\mathrm{d}x}{\partial m}
=\frac{\int x\frac{x-m}{v}f(x)\mathcal{N}(x|m,v)\mathrm{d}x\cdot\int f(x)\mathcal{N}(x|m,v)\mathrm{d}x-\int xf(x)\mathcal{N}(x|m,v)\mathrm{d}x\cdot\int\frac{x-m}{v}f(x)\mathcal{N}(x|m,v)\mathrm{d}x}{\big[\int f(x)\mathcal{N}(x|m,v)\mathrm{d}x\big]^2}
=\frac{\mathrm{Var}\{x\}}{v}$.
where $F'_\beta(x_i;r_i^{(t)},\Sigma^{(t)})$ is the partial derivative of $F_\beta(x_i;r_i^{(t)},\Sigma^{(t)})$ w.r.t. $r_i^{(t)}$.

Defining $z^{(t)}=y-Z^{(t)}$ with $Z^{(t)}=\{Z_a^{(t)},\forall a\}$, we have
\[
z^{(t)}=y-H\hat{x}^{(t)}
+\frac{1}{\alpha}z^{(t-1)}\Big\langle F'_\beta\big(x;\hat{x}^{(t-1)}+H^Tz^{(t-1)},\Sigma^{(t-1)}\big)\Big\rangle,
\tag{44a}
\]
\[
\hat{x}^{(t+1)}=F_\beta\big(x;\hat{x}^{(t)}+H^Tz^{(t)},\Sigma^{(t)}\big),
\tag{44b}
\]
\[
\Sigma^{(t+1)}=\Sigma^{(t)}\Big\langle F'_\beta\big(x;\hat{x}^{(t)}+H^Tz^{(t)},\Sigma^{(t)}\big)\Big\rangle.
\tag{44c}
\]
For large $\beta$, by the Laplace method of integration we have
\[
\lim_{\beta\rightarrow\infty}F_\beta\big(x_i;r_i^{(t)},\Sigma^{(t)}\big)
=\lim_{\beta\rightarrow\infty}\int x_i\frac{1}{Z^{\rm pos}}\exp\Big(-\beta\Big(\lambda|x_i|+\frac{1}{2\Sigma^{(t)}}(x_i-r_i^{(t)})^2\Big)\Big)\mathrm{d}x_i
=\arg\min_{x_i}\ \frac{1}{2\Sigma^{(t)}}(x_i-r_i^{(t)})^2+\lambda|x_i|.
\tag{45}
\]
Similar to (5)-(6), we get
\[
\lim_{\beta\rightarrow\infty}F_\beta\big(x_i;r_i^{(t)},\Sigma^{(t)}\big)=\mathrm{sign}(r_i^{(t)})\max\big(|r_i^{(t)}|-\lambda\Sigma^{(t)},0\big),
\tag{46}
\]
\[
\lim_{\beta\rightarrow\infty}F'_\beta\big(x_i;r_i^{(t)},\Sigma^{(t)}\big)=
\begin{cases}
1 & |r_i^{(t)}|\geq\lambda\Sigma^{(t)}\\
0 & \text{otherwise}
\end{cases}.
\tag{47}
\]
Defining $\eta(r,\gamma)=\mathrm{sign}(r)\max(|r|-\gamma,0)$ and $\hat{\tau}^{(t)}=\lambda V^{(t)}$, we have
\[
z^{(t)}=y-H\hat{x}^{(t)}
+\frac{1}{\alpha}z^{(t-1)}\Big\langle\eta'\big(\hat{x}^{(t-1)}+H^Tz^{(t-1)},\lambda+\hat{\tau}^{(t-1)}\big)\Big\rangle,
\tag{48a}
\]
\[
\hat{x}^{(t+1)}=\eta\big(\hat{x}^{(t)}+H^Tz^{(t)},\lambda+\hat{\tau}^{(t)}\big),
\tag{48b}
\]
\[
\hat{\tau}^{(t+1)}=\frac{\lambda+\hat{\tau}^{(t)}}{\alpha}\Big\langle\eta'\big(\hat{x}^{(t)}+H^Tz^{(t)},\lambda+\hat{\tau}^{(t)}\big)\Big\rangle.
\tag{48c}
\]
By abusing $\eta$, we get the original AMP (10) for the LASSO inference problem.

C. Bayes-optimal AMP

In the LASSO inference problem, both the prior and the likelihood are unknown. However, in the Bayes-optimal setting, where both the prior and the likelihood function are perfectly given, the MMSE estimator can achieve the Bayes-optimal error. Actually, this situation is common in communications. In those cases, it is assumed that each element of $x$ follows an IID distribution $P_X$. The joint distribution is then represented as
\[
\mathrm{P}(x,y)=\mathrm{P}(y|x)\mathrm{P}(x)=\prod_{a=1}^{M}\mathrm{P}(y_a|x)\prod_{i=1}^{N}P_X(x_i).
\tag{50}
\]
Similar to the derivation of AMP for LASSO, we get the Bayes-optimal AMP as depicted in Algorithm 2, where the expectation in (49e) and (49f) is taken over
\[
\hat{\mathrm{P}}^{(t)}(x_i|y)=\frac{P_X(x_i)\,\mathcal{N}(x_i|r_i^{(t)},\Sigma_i^{(t)})}{\int P_X(x)\,\mathcal{N}(x|r_i^{(t)},\Sigma_i^{(t)})\,\mathrm{d}x}.
\tag{51}
\]
This form of AMP is widely applied in many engineering fields. We call it Bayes-optimal AMP since (1) this algorithm is based on the Bayes-optimal setting, and (2) the SE of this algorithm perfectly matches the fixed point of the exact MMSE estimator predicted by the replica method. Similar to AMP for LASSO, the form of Bayes-optimal AMP can also be written as (48) with $\eta(\cdot)$ being the MMSE denoiser.
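A compact NumPy sketch of the recursion (48) is given below; the denoiser is the soft threshold $\eta$ for the LASSO setting, and swapping in an MMSE denoiser (with its empirical divergence) gives the Bayes-optimal variant discussed above. The initialization and iteration count are illustrative assumptions, not part of the original algorithm statement.

```python
import numpy as np

def soft_threshold(r, gamma):
    return np.sign(r) * np.maximum(np.abs(r) - gamma, 0.0)

def soft_threshold_deriv(r, gamma):
    # eta'(r, gamma): 1 where |r| exceeds the threshold, 0 otherwise, cf. (47)
    return (np.abs(r) > gamma).astype(float)

def amp_lasso(y, H, lam, num_iter=30):
    # AMP for LASSO following (48a)-(48c); alpha = M/N is the measurement ratio
    M, N = H.shape
    alpha = M / N
    x, z, tau = np.zeros(N), y.copy(), 0.0         # assumed initialization
    for _ in range(num_iter):
        r = x + H.T @ z                            # pseudo-data fed to the denoiser
        x_new = soft_threshold(r, lam + tau)       # (48b)
        div = np.mean(soft_threshold_deriv(r, lam + tau))
        z = y - H @ x_new + z * div / alpha        # (48a): residual with Onsager term
        tau = (lam + tau) * div / alpha            # (48c)
        x = x_new
    return x
```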
D. State Evolution

In this subsection, we only give a sketch of the proof of AMP's SE in [16]. Let us introduce the following general iterations:
\[
h^{(t+1)}=H^Tm^{(t)}-\xi_t q^{(t)},
\tag{52a}
\]
\[
b^{(t)}=Hq^{(t)}-\lambda_t m^{(t-1)},
\tag{52b}
\]
where $m^{(t)}=g_t(b^{(t)},n)$, $q^{(t)}=f_t(h^{(t)},x)$, $\xi_t=\langle g'_t(b^{(t)},n)\rangle$, and $\lambda_t=\frac{1}{\alpha}\langle f'_t(h^{(t)},x)\rangle$.

Pertaining to these general iterations, the following conclusions can be established. In the large system limit, for any pseudo-Lipschitz function $\varphi:\mathbb{R}^2\mapsto\mathbb{R}$ of order $k$ and all $t\geq 0$,
\[
\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}\varphi\big(h_i^{(t+1)},x_i\big)\overset{\rm a.s.}{=}\mathrm{E}_{Z,X}\{\varphi(\tau_t Z,X)\},
\tag{53a}
\]
\[
\lim_{M\rightarrow\infty}\frac{1}{M}\sum_{i=1}^{M}\varphi\big(b_i^{(t)},n_i\big)\overset{\rm a.s.}{=}\mathrm{E}_{Z,N}\{\varphi(\sigma_t Z,N)\},
\tag{53b}
\]
where
\[
\tau_t^2=\mathrm{E}\big\{g_t(\sigma_t Z,N)^2\big\},
\tag{54}
\]
\[
\sigma_t^2=\frac{1}{\alpha}\mathrm{E}\big\{f_t(\tau_{t-1}Z,X)^2\big\},
\tag{55}
\]
where $N\sim P_N$ and $X\sim P_X$ are independent of $Z\sim\mathcal{N}(0,1)$. Specially, $\sigma_0^2=\lim_{N\rightarrow\infty}\frac{1}{N\alpha}\|q^{(0)}\|^2$.

Define
\[
g_t(b^{(t)},n)=b^{(t)}-n,
\tag{56}
\]
\[
f_t(h^{(t)},x)=\eta_{t-1}(x-h^{(t)})-x.
\tag{57}
\]
Then $\xi_t=1$ and $\lambda_t=-\frac{1}{\alpha}\langle\eta'_{t-1}(x-h^{(t)})\rangle$. To coincide with AMP (Donoho) in (10), it implies that $x-h^{(t+1)}=H^Tz^{(t)}+x^{(t)}$. We thus have ...

The equations (53) show that in the large system limit, each entry of $h^{(t+1)}$ and $b^{(t)}$ tends to a Gaussian RV. Regarding $h^{(t)}$ and $b^{(t)}$ as column vectors, then for $t\geq 0$, from (52), we have
\[
\underbrace{\big[h^{(1)}+\xi_0q^{(0)},\cdots,h^{(t)}+\xi_{t-1}q^{(t-1)}\big]}_{\triangleq X_t}
=H^T\underbrace{\big[m^{(0)},\cdots,m^{(t-1)}\big]}_{\triangleq M_t},
\tag{61}
\]
\[
\underbrace{\big[b^{(0)},b^{(1)}+\lambda_1m^{(0)},\cdots,b^{(t-1)}+\lambda_{t-1}m^{(t-2)}\big]}_{\triangleq Y_t}
=H\underbrace{\big[q^{(0)},\cdots,q^{(t-1)}\big]}_{\triangleq Q_t}.
\tag{62}
\]
Let $G_{t_1,t_2}$ denote the event that $H$ satisfies the linear constraints $X_{t_1}=H^TM_{t_1}$ and $Y_{t_2}=HQ_{t_2}$. Then the conditional distributions of $h^{(t+1)}$ and $b^{(t)}$ can be expressed as
\[
h^{(t+1)}\big|_{G_{t+1,t}}\overset{\rm d}{=}H\big|_{G_{t+1,t}}\,m^{(t)}-\xi_tq^{(t)},
\tag{63}
\]
\[
b^{(t)}\big|_{G_{t,t}}\overset{\rm d}{=}H\big|_{G_{t,t}}\,q^{(t)}-\lambda_tm^{(t-1)},
\tag{64}
\]
where $A|_{G}\overset{\rm d}{=}B$ means that, conditioned on the event $G$, $A$ is equal to $B$ in distribution. The approximated expressions are shown in [16, Lemma 1], where the $t$-iteration $h^{(t+1)}$ (or $b^{(t)}$) conditioned on $G_{t+1,t}$ (or $G_{t,t}$) is expressed as a combination of all preceding $\{h^{(\tau)},\forall\tau\leq t\}$ (or $\{b^{(\tau)},\tau<t\}$). The proof of Lemma 1 is rigorous since the induction on $t$ is rigorous. Be aware that during the proof of Lemma 1, the fact that $H$ has IID Gaussian entries is applied to derive the Gaussianity of $h^{(t+1)}$ and $b^{(t)}$.
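The conclusions (53)-(55) reduce the analysis of AMP to a scalar recursion. The following Monte-Carlo sketch iterates that recursion for the soft-threshold denoiser; the threshold schedule, initialization, sample size, and signal model are assumptions made for illustration only.

```python
import numpy as np

def amp_state_evolution(alpha, thresh, sigma_w, x_sampler, num_iter=20, mc=200000, seed=0):
    # Scalar SE: tau_{t+1}^2 = sigma_w^2 + (1/alpha) * E[(eta(X + tau_t Z) - X)^2],
    # which follows from (54)-(55) with g_t(b, n) = b - n and f_t(h, x) = eta(x - h) - x.
    rng = np.random.default_rng(seed)
    x = x_sampler(mc, rng)                            # samples of X ~ P_X
    tau2 = sigma_w ** 2 + np.mean(x ** 2) / alpha     # assumed initialization
    mse_track = []
    for _ in range(num_iter):
        r = x + np.sqrt(tau2) * rng.standard_normal(mc)   # effective channel R = X + tau*Z
        x_hat = np.sign(r) * np.maximum(np.abs(r) - thresh, 0.0)
        mse = np.mean((x_hat - x) ** 2)
        mse_track.append(mse)
        tau2 = sigma_w ** 2 + mse / alpha
    return mse_track

# Example: Bernoulli-Gaussian signal with sparsity 0.1 (hypothetical parameters)
bg = lambda n, rng: rng.standard_normal(n) * (rng.random(n) < 0.1)
mse_per_iter = amp_state_evolution(alpha=0.5, thresh=0.3, sigma_w=0.1, x_sampler=bg)
```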
Fig. 7. Iterative behavior of Bayes-optimal AMP and its SE in compressed sensing. $H$ has IID Gaussian entries with zero mean and $1/M$ variance. $M=\alpha N$, $N=1024$ and ${\rm SNR}=1/\sigma_w^2$. The signal of interest $x$ has IID entries following ${\rm BG}(0,0.05)$.

...condition number, non-zero mean). To extend the scope of AMP to more general random matrices (unitarily-invariant matrices$^4$), a modified AMP algorithm termed OAMP [24] was proposed. Different from AMP, the denoiser of OAMP is divergence-free$^6$ so that the Onsager term vanishes, and the LMMSE de-correlated matrix is applied to ensure the orthogonality$^5$ of the input and output errors of the denoiser.

A. Orthogonality of input and output errors

Let us consider the following general iterations containing a linear estimation (LE) and a nonlinear estimation (NLE):
\[
{\rm LE:}\quad r^{(t)}=\hat{x}^{(t)}+W_t\big(y-H\hat{x}^{(t)}\big)+r_{\rm Onsager}^{(t)},
\tag{65a}
\]
where $\eta_t(\cdot)$ can be an arbitrary pseudo-Lipschitz function and $C$ is a constant. In this case, we have $\tilde{\eta}'_t(r^{(t)})=0$.

For convenience, we define the input and output errors
\[
q^{(t)}=\hat{x}^{(t)}-x,
\tag{67}
\]
\[
h^{(t)}=r^{(t)}-x.
\tag{68}
\]
Substituting the system model $y=Hx+n$ and (65) into the equations above, we have
\[
{\rm LE:}\quad h^{(t)}=(I-W_tH)q^{(t)}+W_tn,
\tag{69a}
\]
\[
{\rm NLE:}\quad q^{(t+1)}=\tilde{\eta}_t\big(x+h^{(t)}\big)-x.
\tag{69b}
\]
Also, we define the error-related parameters
\[
\hat{v}^{(t)}=\lim_{N\rightarrow\infty}\frac{1}{N}\|q^{(t)}\|_2^2,\qquad
\tau_t^2=\lim_{N\rightarrow\infty}\frac{1}{N}\|h^{(t)}\|_2^2.
\tag{70}
\]
Similar to AMP, we assume that the following assumptions hold:
• Assumption 1: the input error $h^{(t)}$ consists of IID zero-mean Gaussian entries independent of $x$, i.e., $R^{(t)}=X+\tau_tZ$ with $Z$ being a standard Gaussian RV.
• Assumption 2: the output error $q^{(t+1)}$ consists of IID entries independent of $H$ and the noise $n$.
We will show that, based on the assumptions above, the de-correlated matrix $W_t$ and the divergence-free denoiser imply the orthogonality between the input error $h^{(t)}$ and the output error $q^{(t+1)}$. We say the LE is a de-correlated one if ${\rm Tr}(I-W_tH)=0$, which implies
\[
W_t=\frac{N}{{\rm Tr}(\hat{W}_tH)}\hat{W}_t,
\tag{71}
\]

$^4$ We say $A=U\Sigma V^T$ is unitarily-invariant if $U$, $V$, and $\Sigma$ are mutually independent, and $U$, $V$ are Haar-distributed.

$^5$ Given two random variables $X$, $Y$, we say $X$ is orthogonal to $Y$ if $\mathrm{E}\{XY\}=0$. Provided that $x\in\mathbb{R}^N$ and $y\in\mathbb{R}^N$ are generated by $X$ and $Y$, respectively, then $\frac{1}{N}x^Ty=\frac{1}{N}\sum_{i=1}^{N}x_iy_i\overset{\rm a.s.}{=}\mathrm{E}\{XY\}=0$.

$^6$ We say $\eta:\mathbb{R}\mapsto\mathbb{R}$ is divergence-free if $\mathrm{E}\{\eta'(R)\}=0$.
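The divergence-free property in footnote 6 can be checked numerically. The sketch below builds a divergence-free version of an arbitrary denoiser by subtracting its empirical divergence, $\tilde\eta(r)=C(\eta(r)-\langle\eta'(r)\rangle r)$; this construction and the normalization $C$ are assumptions made for illustration (the excerpt does not reproduce the paper's own definition of $\tilde\eta_t$), and the divergence is estimated by central differences.

```python
import numpy as np

def soft_threshold(r, gamma):
    return np.sign(r) * np.maximum(np.abs(r) - gamma, 0.0)

def empirical_divergence(fn, r, eps=1e-4):
    # Monte-Carlo estimate of <eta'(r)> via central differences
    return np.mean((fn(r + eps) - fn(r - eps)) / (2.0 * eps))

rng = np.random.default_rng(0)
r = rng.standard_normal(1_000_000)          # plays the role of R = X + tau*Z

eta = lambda t: soft_threshold(t, 0.5)
div = empirical_divergence(eta, r)
C = 1.0 / (1.0 - div)                       # one possible scaling choice (assumption)
eta_df = lambda t: C * (eta(t) - div * t)   # divergence-free denoiser

print(empirical_divergence(eta_df, r))      # ~0, i.e., E{eta'(R)} = 0 as in footnote 6
```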
Since ${\rm Tr}(B_t)=0$, we then have $\mathrm{E}\{B_t\}=0$ and further
\[
\mathrm{E}\big\{h^{(t)}(q^{(t)})^T\big\}=0.
\tag{85}
\]
This completes the proof of the orthogonality of the input and output errors.
\[
\cdots
=\frac{\gamma_2\Big(N-\gamma_2\sum_{i=1}^{N}\frac{\lambda_i}{\lambda_i\gamma_2+\sigma_w^2}\Big)}{\sum_{i=1}^{N}\frac{\lambda_i\gamma_2}{\lambda_i\gamma_2+\sigma_w^2}}.
\tag{89}
\]
\[
\begin{aligned}
r^{(t)}&=\hat{x}^{(t)}+\frac{N}{{\rm Tr}(\hat{W}_tH)}\Big(H^TH+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\Big)^{-1}H^T\big(y-H\hat{x}^{(t)}\big)\\
&=\frac{N}{{\rm Tr}(\hat{W}_tH)}\Big(H^TH+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\Big)^{-1}H^Ty
+\frac{N}{{\rm Tr}(\hat{W}_tH)}\Big(H^TH+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\Big)^{-1}
\Big[\frac{{\rm Tr}(\hat{W}_tH)}{N}\Big(H^TH+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\Big)\hat{x}^{(t)}-H^TH\hat{x}^{(t)}\Big]\\
&=\frac{\big(H^TH+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\big)^{-1}H^Ty}{\frac{1}{N}\sum_{i=1}^{N}\frac{\lambda_i\hat{v}^{(t)}}{\hat{v}^{(t)}\lambda_i+\sigma_w^2}}
+\cdots
\end{aligned}
\tag{91}
\]
N v̂ λi +σw N v̂ λi +σw
γ2 x̂1 v̂1 r2
r1 = −
γ2 − v̂1 γ2 − v̂1
−1
−1 1 −2 T 1
γ2 σw −2 T
H H + γ2 I 1 −2 T
σw H y + γ22r
N Tr σ w H H + γ2 I r2
= −1 − −1
−2 T −2 T
γ2 − N1 Tr σw H H + γ12 I γ2 − N1 Tr σw H H + γ12 I
−1
σ2 σ2
2 P
σw N
γ2 HT H + γw2 I HT y + γw2 r2 N
γ2
i=1 λi γ2 +σw 2 r2
= σ 2 N
− σ 2 N
γ2 − Nw i=1 λi γ2γ+σ γ2 − Nw i=1 λi γ2γ+σ
P 2
P 2
2 2
w w
2
−1 2
σ 2
1 PN σw
HT H + γw2 I HT y σw2
−1 σw 2
σw N i=1 λi γ2 +σw 2
T γ2 T
= 1
PN λi γ2
+ H H+ I PN λ γ
r2 − H H + I 1 PN λ i γ2
r2
i=1 λi γ2 +σw 2
γ2 1 i
i=1 λi γ2 +σw
2
2
γ2 i=1 2
N N N γ2 λi +σw
2
−1
σ N
HT H + γw2 I HT y −1 2 1 1
P
σw2
σw2 σw N i=1 λi γ2 +σw2
T
= 1 N λ γ
+ H H + I r2 − N λi
HT Hr2 . (92)
γ γ 1
P i 2
P
N i=1 2
λi γ2 +σw
2 2 γ2 i=1
N 2
γ2 λi +σw
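The point of (91)-(92) is that the OAMP LE and the VAMP/EP extrinsic update produce the same output. A quick numerical check of this identity is sketched below; the dimensions, noise level, and the choice of $\gamma_2$ (playing the role of $\hat v^{(t)}$) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, sigma_w, gamma2 = 60, 40, 0.3, 0.8        # hypothetical sizes and variances
H = rng.standard_normal((M, N)) / np.sqrt(M)
y = rng.standard_normal(M)
r2 = rng.standard_normal(N)                     # common input (x_hat in OAMP, r_2 in VAMP)

A = np.linalg.inv(H.T @ H + (sigma_w ** 2 / gamma2) * np.eye(N))

# OAMP LE as in (91): de-correlated LMMSE step
W_hat = A @ H.T
r_oamp = r2 + (N / np.trace(W_hat @ H)) * (W_hat @ (y - H @ r2))

# VAMP/EP LE as in (92): LMMSE posterior followed by the extrinsic update
x1 = A @ (H.T @ y + (sigma_w ** 2 / gamma2) * r2)
v1 = (sigma_w ** 2 / N) * np.trace(A)
r_vamp = (gamma2 * x1 - v1 * r2) / (gamma2 - v1)

print(np.max(np.abs(r_oamp - r_vamp)))          # ~1e-12: the two LE outputs coincide
```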
where the last equation holds by Assumption 1. As can be observed from OAMP in Algorithm 3, in the large system limit, the variance of the OAMP estimator can be written as
\[
\hat{v}_{\rm mmse}^{(t+1)}=\frac{1}{N}\sum_{i=1}^{N}{\rm Var}\big\{x_i\big|r_i^{(t)},\tau_t\big\}
\overset{\rm a.s.}{=}\mathrm{E}_{X,Z}\big\{\big(\eta_t^{\rm mmse}(X+\tau_tZ)-X\big)^2\big\}.
\tag{94}
\]
Combining (93) and (94) proves that the variance of the OAMP estimator is equal to the asymptotic MSE of OAMP almost surely, i.e., $\hat{v}_{\rm mmse}^{(t+1)}\overset{\rm a.s.}{=}{\rm mse}(x,t+1)$. Note that $\hat{v}_{\rm mmse}^{(t+1)}$ in (94) only relies on the parameter $\tau_t^2$, and this parameter can be obtained by
\[
\hat{v}^{(t)}=\Big(\frac{1}{\hat{v}_{\rm mmse}^{(t)}}-\frac{1}{\tau_{t-1}^2}\Big)^{-1},
\tag{95}
\]
\[
\tau_t^2=\hat{v}^{(t)}\Big(\frac{N}{{\rm Tr}(\hat{W}_tH)}-1\Big),
\tag{96}
\]
where, by the SVD $H=U\Sigma V^T$,
\[
\frac{1}{N}{\rm Tr}(\hat{W}_tH)
=\frac{1}{N}{\rm Tr}\Big(H^T\Big(HH^T+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\Big)^{-1}H\Big)
=\frac{1}{N}{\rm Tr}\Big(\Sigma^T\Big(\Sigma\Sigma^T+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\Big)^{-1}\Sigma\Big)
=\frac{1}{N}\sum_{i=1}^{N}\frac{\sigma_i^2}{\sigma_i^2+\frac{\sigma_w^2}{\hat{v}^{(t)}}}
\overset{\rm a.s.}{=}\mathrm{E}\Big\{\frac{\lambda}{\lambda+\frac{\sigma_w^2}{\hat{v}^{(t)}}}\Big\},
\tag{97}
\]
where $\sigma_i$ is the $i$-th diagonal element of $\Sigma$, and the expectation over $\lambda$ is taken over the asymptotic eigenvalue distribution of $H^TH$.

In the sequel, we obtain the SE of OAMP as below:
\[
{\rm LE:}\quad \tau_t^2=\hat{v}^{(t)}\Bigg(\mathrm{E}\Big\{\frac{\lambda^2}{\lambda^2+\frac{\sigma_w^2}{\hat{v}^{(t)}}}\Big\}^{-1}-1\Bigg),
\tag{98}
\]
\[
{\rm NLE:}\quad
\hat{v}_{\rm mmse}^{(t+1)}=\mathrm{E}_{X,Z}\big\{\big(\eta_t^{\rm mmse}(X+\tau_tZ)-X\big)^2\big\},\qquad
\hat{v}^{(t+1)}=\Big(\frac{1}{\hat{v}_{\rm mmse}^{(t+1)}}-\frac{1}{\tau_t^2}\Big)^{-1}.
\tag{99}
\]
Be aware that in the NLE part, $\hat{v}_{\rm mmse}^{(t+1)}$ is the output MSE rather than $\hat{v}^{(t+1)}$.
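A Monte-Carlo sketch of the OAMP SE (98)-(99) is given below; it assumes the eigenvalues of $H^TH$ are available, and the denoiser interface, initialization, and sample size are illustrative assumptions.

```python
import numpy as np

def oamp_state_evolution(eigs, sigma_w, denoiser, x_sampler, num_iter=20, mc=200000, seed=0):
    # eigs: the N eigenvalues of H^T H (zeros included when M < N)
    rng = np.random.default_rng(seed)
    x = x_sampler(mc, rng)
    v = np.mean(x ** 2)                               # assumed initialization of v^(t)
    track = []
    for _ in range(num_iter):
        # LE, cf. (96)-(98): effective noise level after the de-correlated LMMSE step
        gain = np.mean(eigs / (eigs + sigma_w ** 2 / v))
        tau2 = v * (1.0 / gain - 1.0)
        # NLE, cf. (99): MSE of the (assumed MMSE-style) denoiser on R = X + tau*Z
        r = x + np.sqrt(tau2) * rng.standard_normal(mc)
        v_mmse = np.mean((denoiser(r, tau2) - x) ** 2)
        v = 1.0 / (1.0 / v_mmse - 1.0 / tau2)         # extrinsic variance fed back to the LE
        track.append(v_mmse)
    return track
```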
Fig. 8. Iterative behavior of OAMP, AMP and their SEs in compressed sensing.

IV. LONG MEMORY AMP

Although OAMP can be applied to more general random matrices, its complexity of roughly $O(N^3)$ is larger than that of AMP, which is roughly $O(N^2)$. To balance the computational complexity and the region of the random measurement matrix, several long memory algorithms have been proposed, such as convolutional AMP (CAMP) [25] and memory AMP (MAMP) [26]. CAMP only adjusts the Onsager term, where all preceding messages are involved to ensure the Gaussianity of the input error. However, CAMP may fail to converge for ill-conditioned measurement matrices, such as those with a large condition number. Following CAMP and OAMP, MAMP applies a few terms of a matrix Taylor series to carry out the matrix inversion in OAMP and modifies the structure of the input signal of the denoiser to ensure (a) the orthogonality of all preceding input errors and the $t$-th output error, (b) the orthogonality of the $t$-th input error and the original signal $x$, and (c) the orthogonality of the $t$-th input error and all preceding output errors.

It can be verified that $\theta_t=(\lambda^{\dagger}+\varsigma_t)^{-1}$ with $\lambda^{\dagger}=\frac{\lambda_{\max}+\lambda_{\min}}{2}$, where $B=\lambda^{\dagger}I-HH^T$. Note that $\hat{x}_{\rm mmse}^{(t+1)}$ is the output estimator rather than $\hat{x}^{(t+1)}$.

Remark 2. As can be seen from the MAMP algorithm in (104a)-(104b), the parameter $\lambda^{\dagger}=\frac{\lambda_{\min}+\lambda_{\max}}{2}$ of the MAMP algorithm relies on the eigenvalues of $HH^T$, whose computation roughly costs $O(N^3)$. Although some works give approximations to the maximum or minimum singular value of $H$, the complexity is still huge. We also note that in the long version [50], a simple bound on the maximum and minimum eigenvalues is applied and provides performance close to that with perfect eigenvalues, especially for low condition numbers. In the case of given eigenvalues of $HH^T$, MAMP balances the computational complexity and the random measurement region well.

A. Derivation of MAMP

Similar to OAMP, the following assumptions are applied:
• Assumption 3: the input error $h^{(t)}$ consists of IID zero-mean Gaussian entries independent of $x$, i.e., $R^{(t)}=X+\tau_{t,t}Z_t$ with $Z_t$ being a standard Gaussian RV.
Let us define $\eta_t=\tau_{t,t}Z_t$. Different from OAMP, MAMP assumes that $[\eta_1,\cdots,\eta_t]^T$ follows a joint Gaussian distribution $\mathcal{N}(0,V_t)$ with $V_t=[\tau_{i,j}^2]_{t\times t}$.
• Assumption 4: the output error $q^{(t+1)}$ consists of IID entries independent of $H$ and the noise $n$.

Using the initial conditions $z^{(0)}=\hat{x}^{(0)}=0$, from (104a)
\[
z^{(t)}=\sum_{i=1}^{t}\xi_i\prod_{j=i+1}^{t}\theta_j\,B^{t-i}\big(y-H\hat{x}^{(i)}\big).
\tag{105}
\]
Defining $\theta_{t,i}=\prod_{j=i+1}^{t}\theta_j$ ($\theta_{t,i}=1$ for $i\geq t$), we have
\[
r^{(t)}=\frac{1}{\varepsilon_t}\Big(Q_ty+\sum_{i=1}^{t}H_i^t\hat{x}^{(i)}\Big),
\tag{106}
\]
where
\[
Q_t=\sum_{i=1}^{t}\xi_i\theta_{t,i}H^TB^{t-i},
\tag{107}
\]
\[
H_i^t=p_{t,i}I-\xi_i\theta_{t,i}H^TB^{t-i}H.
\tag{108}
\]
From the orthogonality of the input error and the original signal, i.e., $\frac{1}{N}(h^{(t)})^Tx\overset{\rm a.s.}{=}0$, we have
\[
\begin{aligned}
\frac{1}{N}(h^{(t)})^Tx
&=\frac{1}{N}\Big(\frac{1}{\varepsilon_t}Q_t(Hx+n)+\frac{1}{\varepsilon_t}\sum_{i=1}^{t}H_i^t\big(q^{(i)}+x\big)-x\Big)^Tx\\
&=\frac{1}{N}x^T\Big[\frac{1}{\varepsilon_t}\Big((Q_tH)^T+\sum_{i=1}^{t}(H_i^t)^T\Big)\Big]x-\frac{1}{N}x^Tx,
\end{aligned}
\tag{109}
\]
where $\frac{1}{N}(q^{(i)})^Tx\overset{\rm a.s.}{=}0$ is applied. Then we get
\[
\frac{1}{N\varepsilon_t}{\rm Tr}\Big\{Q_tH+\sum_{i=1}^{t}H_i^t\Big\}=1.
\tag{110}
\]
From the orthogonality of the input error and the output errors, i.e., $\frac{1}{N}(h^{(t)})^Tq^{(i)}\overset{\rm a.s.}{=}0$, we have
\[
{\rm Tr}\{H_i^t\}=0.
\tag{111}
\]
Combining (110) and (111), we have
\[
p_{t,i}=\frac{1}{N}\xi_i\theta_{t,i}{\rm Tr}\big\{H^TB^{t-i}H\big\},
\tag{112}
\]
\[
\varepsilon_t=\sum_{i=1}^{t}p_{t,i},
\tag{113}
\]
where the parameters $p_{t,i}$ and $\varepsilon_t$ can be determined once the parameters $\{\theta_t\}$ and $\{\xi_t\}$ are determined, and $\xi_t$ is obtained by minimizing the averaged input error
\[
\tau_{t,t}^2=\lim_{N\rightarrow\infty}\frac{1}{N}\|r^{(t)}-x\|_2^2.
\tag{114}
\]

Using the facts $\frac{1}{N}(q^{(i)})^Tx\overset{\rm a.s.}{=}0$ and the independence of $n$ and $x$, we have
\[
\tau_{t,t}^2=\frac{1}{N\varepsilon_t^2}\Big\|Q_tn+\sum_{i=1}^{t}H_i^tq^{(i)}\Big\|_2^2
=\frac{1}{N\varepsilon_t^2}\Bigg[\sum_{i=1}^{t}\sum_{j=1}^{t}\xi_i\xi_j\theta_{t,i}\theta_{t,j}\,\sigma_w^2\,{\rm Tr}\big(H^TB^{2t-i-j}H\big)
+\sum_{i=1}^{t}\sum_{j=1}^{t}\hat{v}_{i,j}\,{\rm Tr}\big((H_i^t)^TH_j^t\big)\Bigg],
\tag{115}
\]
where $\hat{v}_{i,j}=\frac{1}{N}(q^{(i)})^Tq^{(j)}$ and $\hat{v}_{i,j}=\hat{v}_{j,i}$. Defining
\[
\vartheta_{t,i}=\xi_i\theta_{t,i},
\tag{116}
\]
\[
W_t=H^TB^tH,\qquad w_t=\frac{1}{N}{\rm Tr}(W_t),
\tag{117}
\]
\[
N_{i,j}=W_iW_j,\qquad w_{i,j}=\frac{1}{N}{\rm Tr}\{N_{i,j}\}-w_iw_j,
\tag{118}
\]
we get $p_{t,i}=\vartheta_{t,i}w_{t-i}$ and
\[
\tau_{t,t}^2=\frac{1}{\varepsilon_t^2}\sum_{i=1}^{t}\sum_{j=1}^{t}\vartheta_{t,i}\vartheta_{t,j}\big(\sigma_w^2w_{2t-i-j}+\hat{v}_{i,j}w_{t-i,t-j}\big)
=\frac{c_{t,1}\xi_t^2-2c_{t,2}\xi_t+c_{t,3}}{w_0^2(\xi_t+c_{t,0})^2},
\tag{119}
\]
where
\[
c_{t,0}=\sum_{i=1}^{t-1}\frac{p_{t,i}}{w_0},\qquad
c_{t,1}=\sigma_w^2w_0+\hat{v}_{t,t}w_{0,0},
\]
\[
c_{t,2}=-\sum_{i=1}^{t-1}\vartheta_{t,i}\big(\sigma_w^2w_{t-i}+\hat{v}_{t,i}w_{0,t-i}\big),\qquad
c_{t,3}=\sum_{i=1}^{t-1}\sum_{j=1}^{t-1}\vartheta_{t,i}\vartheta_{t,j}\big(\sigma_w^2w_{2t-i-j}+\hat{v}_{i,j}w_{t-i,t-j}\big).
\]
The parameter $\xi_t$ is obtained by minimizing $\tau_{t,t}^2$. Zeroing $\frac{\partial\tau_{t,t}^2}{\partial\xi_t}$ gives two stationary points, $\xi_t=-c_{t,0}$ and $\xi_t=\frac{c_{t,2}c_{t,0}+c_{t,3}}{c_{t,1}c_{t,0}+c_{t,2}}$, where $\xi_t=-c_{t,0}$ is the maximizing point, while
\[
\xi_t^{\star}=
\begin{cases}
\dfrac{c_{t,2}c_{t,0}+c_{t,3}}{c_{t,1}c_{t,0}+c_{t,2}} & c_{t,1}c_{t,0}+c_{t,2}\neq 0\\
+\infty & \text{otherwise}
\end{cases}.
\tag{120}
\]
Defining the residual error $\tilde{z}^{(t)}=y-H\hat{x}^{(t)}$, the crossed variance $\hat{v}_{i,j}$ can be provided by
\[
\frac{1}{N}(\tilde{z}^{(i)})^T\tilde{z}^{(j)}
=\frac{1}{N}\big[H(x-\hat{x}^{(i)})+n\big]^T\big[H(x-\hat{x}^{(j)})+n\big]
=\frac{1}{N}\big(-Hq^{(i)}+n\big)^T\big(-Hq^{(j)}+n\big)
=\frac{1}{N}{\rm Tr}\big(H^TH\big)\hat{v}_{i,j}+\alpha\sigma_w^2.
\tag{121}
\]
It implies $\hat{v}_{i,j}=\big(\frac{1}{N}(\tilde{z}^{(i)})^T\tilde{z}^{(j)}-\alpha\sigma_w^2\big)/w_0$.

From (116), we get
\[
\vartheta_{t,i}=
\begin{cases}
\theta_t\vartheta_{t-1,i} & 0\leq i<t-1\\
\xi_{t-1}\theta_t & i=t-1\\
\xi_t & i=t
\end{cases}.
\tag{122}
\]
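The MAMP quantities above are driven entirely by the spectral moments $w_t=\frac{1}{N}{\rm Tr}(H^TB^tH)$ with $B=\lambda^{\dagger}I-HH^T$. Given the eigenvalues of $HH^T$, they can be computed cheaply, as in the following sketch; the $\lambda^{\dagger}$ rule follows Remark 2, while the example sizes are illustrative assumptions.

```python
import numpy as np

def mamp_spectral_moments(eigs_HHt, N, t_max):
    # w_t = (1/N) Tr(H^T B^t H) = (1/N) sum_i (lam_dag - lam_i)^t * lam_i,
    # where lam_i are the eigenvalues of H H^T and B = lam_dag*I - H H^T.
    lam_dag = (eigs_HHt.max() + eigs_HHt.min()) / 2.0
    w = np.array([np.sum((lam_dag - eigs_HHt) ** t * eigs_HHt) / N
                  for t in range(t_max + 1)])
    return lam_dag, w

# Example with a random 512 x 1024 matrix (hypothetical sizes)
rng = np.random.default_rng(0)
H = rng.standard_normal((512, 1024)) / np.sqrt(512)
eigs = np.linalg.eigvalsh(H @ H.T)
lam_dag, w = mamp_spectral_moments(eigs, N=1024, t_max=10)
```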
...damping, especially in the case of a large condition number (e.g., $\kappa(H)>10^2$). To ensure the convergence of MAMP, a damping factor is applied to the parameters $\hat{x}^{(t+1)}$, $\hat{v}_{t+1,t+1}$, and $\tilde{z}^{(t+1)}$:
\[
\hat{x}^{(t+1)}=\beta_1^{(t)}\hat{x}^{(t+1)}+\big(1-\beta_1^{(t)}\big)\hat{x}^{(t)},
\]
\[
\tilde{z}^{(t+1)}=\beta_1^{(t)}\tilde{z}^{(t+1)}+\big(1-\beta_1^{(t)}\big)\tilde{z}^{(t)},
\]
\[
\hat{v}_{t+1,i}=\beta_2^{(t)}\hat{v}_{t+1,i}+\big(1-\beta_2^{(t)}\big)\hat{v}_{t+1,i-1},
\]
for $1<i<t+1$. Different from the damping presented here, [26] shows another kind of damping. But, in fact, the damping factor only affects the convergence speed if the algorithm converges.

Fig. 9. Comparison of MAMP and OAMP in different condition numbers. Each entry of $x$ is generated from the BG distribution ${\rm BG}(0,0.1)$. $(M,N)=(1024,512)$ and ${\rm SNR}=1/\sigma_w^2=20\,{\rm dB}$. The measurement matrix is generated by $H=U\Sigma V^T$, where both $U$ and $V$ are Haar-distributed and $\Sigma$ is a rectangular matrix whose diagonal elements are $\sigma_1,\cdots,\sigma_M$ with $\sigma_i/\sigma_{i+1}=\kappa(H)^{1/M}$ and $\sum_{i=1}^{M}\sigma_i^2=N$, where $\kappa(H)=\sigma_{\max}(H)/\sigma_{\min}(H)$ with $\sigma_{\max}(H)$ and $\sigma_{\min}(H)$ being the maximum and minimum singular values of $H$, respectively. The damping factors $\beta_1^{(t)}=0.7$ and $\beta_2^{(t)}=0.8$ are applied to the cases of $\kappa(H)=1$ and $\kappa(H)=10$, while $\beta_1^{(t)}=\beta_2^{(t)}=0.4$ are applied to the case of $\kappa(H)=50$.

B. State Evolution

Similar to other AMP-like algorithms, the MSE of MAMP can also be predicted by its SE. The asymptotic MSE of MAMP is defined as
\[
{\rm mse}(x,t+1)=\frac{1}{N}\|\hat{x}_{\rm mmse}^{(t+1)}-x\|_2^2
\overset{\rm a.s.}{=}\mathrm{E}\big\{\big(\eta_t^{\rm mmse}(X+\tau_{t,t}Z)-X\big)^2\big\}
\overset{\rm a.s.}{=}\hat{v}_{t+1,t+1}.
\tag{123}
\]
This term only relies on the parameter $\tau_{t,t}^2$, which can be obtained by (119). In $\tau_{t,t}^2$, the parameter $\hat{v}_{i,j}\overset{\rm a.s.}{=}\frac{1}{N}(q^{(i)})^Tq^{(j)}$ is obtained numerically by generating $x$ following $P_X(x)=\rho\mathcal{N}(x|0,\rho^{-1})+(1-\rho)\delta(x)$ and $r^{(t)}=x+n_t$ with $[n_1,\cdots,n_t]\sim\mathcal{N}(0,\Xi_t)$, where $\Xi_t=[\tau_{i,j}^2]_{t\times t}$ and
\[
\tau_{t,\tau}^2=\lim_{N\rightarrow\infty}\frac{1}{N}\big(r^{(t)}-x\big)^T\big(r^{(\tau)}-x\big)
=\frac{1}{\varepsilon_t\varepsilon_\tau}\sum_{i=1}^{t}\sum_{j=1}^{\tau}\vartheta_{t,i}\vartheta_{\tau,j}\big(\sigma_w^2w_{t+\tau-i-j}+\hat{v}_{ij}w_{t-i,\tau-j}\big),
\]
with $\tau_{t,\tau}=\tau_{\tau,t}$. Then,
\[
\forall\tau<t:\quad \hat{v}_{t,\tau}=\mathrm{E}\big\{(\hat{x}^{(t)}-x)(\hat{x}^{(\tau)}-x)\big\}.
\]

C. Numeric Simulation

In Fig. 9, we show the per-iteration behavior of MAMP and OAMP under varying condition numbers in a compressed sensing application. As can be observed from the figure, MAMP and OAMP converge to the same fixed point. For $\kappa(H)=1$, MAMP has a convergence speed comparable to that of OAMP. However, as $\kappa(H)$ increases, MAMP needs more iterations to converge to the same fixed point as OAMP. Also, we note that the convergence speed and the NMSE performance of both MAMP and OAMP tend to worsen for large condition numbers.

V. CONCLUSIONS

In this paper, we reviewed several AMP-like algorithms: AMP, OAMP, VAMP, and MAMP. We began by introducing the AMP algorithm, which was originally proposed to provide a sparse solution to the LASSO inference problem but is widely applied in many engineering fields under the Bayes-optimal setting. In the IID sub-Gaussian random measurement matrix region, the AMP algorithm can achieve the Bayes-optimal MSE performance, but it may fail to converge if the random measurement matrix is beyond IID sub-Gaussian. Following AMP, we introduced a modified AMP algorithm termed OAMP, which modifies AMP in two aspects: the LMMSE de-correlated matrix and the divergence-free denoiser. The OAMP algorithm can be applied to a more general region, the unitarily-invariant matrices, but it pays a higher computational cost due to the matrix inversion. To balance the computational complexity and the random measurement region, the MAMP algorithm applies several terms of a matrix Taylor series to approximate the matrix inversion and applies all preceding messages to ensure three orthogonality conditions. The MAMP algorithm relies on the given spectrum of a sample of the random measurement matrix; although several works gave approximations to it, the complexity is still huge. In addition, the convergence speed of MAMP is slower than that of OAMP, especially in the case of a large condition number. On the other hand, a significant feature of AMP-like algorithms is that their asymptotic MSE performance can be fully predicted by their SEs. We also gave a brief derivation of their SEs.

VI. ACKNOWLEDGEMENTS

We are grateful to Y. Kabashima, D. Cai, and Y. Fu for valuable comments and useful discussions.