AMP_paper
Abstract—High-dimensional signal recovery of standard linear regression is a key challenge in many engineering fields, such as communications, compressed sensing, and image processing. The approximate message passing (AMP) algorithm proposed by Donoho et al. is a computationally efficient method for such problems.

The problem can be formalized as a least absolute shrinkage and selection operator (LASSO) [1] inference problem
\[
\hat{x}_{\rm LASSO}=\arg\min_{x}\ \frac{1}{2}\|y-Hx\|_2^2+\lambda\|x\|_1.
\tag{2}
\]
Fig. 1. The relations between the message passing based algorithms in the standard linear regression inference problem. (The figure relates message passing (2005), AMP [15] (2009), OAMP [24] (2016), EP [27] (2001), EC (single-loop) [28], CAMP [25] (2020), and MAMP [26] (2021) through the labels: Gaussian approximation, Onsager term, IID sub-Gaussian, LMMSE de-correlated matrix, divergence-free denoiser, unitarily-invariant, scalar variance, Taylor series, and three orthogonality.)
The approximate message passing (AMP) [15] algorithm, the main focus of this paper, is a celebrated implementation of Bayes estimation. Through a postulated posterior/MMSE, in which the postulated prior and likelihood function differ from the true ones, AMP can provide the exact sparse solution to the LASSO inference problem using the Laplace method of integration. In general, we call algorithms that rely on the Bayesian formula Bayesian algorithms.

On the other hand, in the Bayes-optimal setting (with M ≥ N possible), where both the prior and the likelihood function are known, the MMSE and MAP estimators give a much better performance than convex relaxation. However, due to the high-dimensional integration, the exact MMSE is hard to obtain. Fortunately, existing works [16] showed that AMP can achieve the Bayes-optimal MSE performance with affordable complexity in the independent identically distributed (IID) sub-Gaussian random measurement matrix region [17]. For convenience, we depict Fig. 1 to show the relations between AMP and its related algorithms. AMP derives from the message passing [18] algorithm in coding theory, which is also known as belief propagation [19] in computer science or the cavity method [20] in statistical mechanics. The AMP algorithm is closely related to the Thouless-Anderson-Palmer (TAP) [21] equations, which are used to approximate marginal moments in large probabilistic models. In [22], the first AMP algorithm was proposed for the code division multiple access (CDMA) multi-user detection problem. A significant feature of the AMP algorithm is that the dynamics of AMP can be fully predicted by a scalar equation termed state evolution (SE) [16], which perfectly agrees with the fixed point of the exact MMSE estimator obtained by the replica method [23]. The AMP algorithm is also related to ISTA; the difference between them is the Onsager term, which makes AMP converge faster than ISTA but does not change its fixed points. When the measurement matrix is beyond the IID sub-Gaussian region, AMP often fails to converge. Beyond the IID sub-Gaussian region, orthogonal AMP (OAMP) [24] can be applied to more general unitarily-invariant matrices via an LMMSE de-correlated matrix and a divergence-free denoiser, but it pays a higher computational cost due to the matrix inversion. To balance the complexity and the region of the random measurement matrix, some long memory algorithms, such as convolutional AMP (CAMP) [25] and memory AMP (MAMP) [26], were recently proposed. Different from OAMP, CAMP only modifies the Onsager term of AMP. The Onsager term of CAMP includes all preceding messages to ensure the Gaussianity of the input signal of the denoiser. However, CAMP may fail to converge in the case of a large condition number. Following CAMP and OAMP, the MAMP algorithm applies finite terms of a matrix Taylor series to approximate the matrix inversion of OAMP and involves all previous messages to ensure three orthogonality conditions.

Another efficient algorithm related to AMP is called expectation propagation (EP) [27]. EP is earlier than AMP; it approximates the factorable factors by choosing a distribution from the Gaussian family via minimizing the Kullback-Leibler (KL) divergence. EP-related methods include the expectation consistent (EC) approximation [28, Appendix D] (single-loop), OAMP [24], and vector AMP (VAMP) [29]. They were proposed independently in different manners but share the same algorithm. Actually, EP/EC (single-loop) have a slight difference from OAMP/VAMP, since EP/EC keep element-wise variances, and they can be reduced to OAMP/VAMP by taking the mean of the element-wise variances. Among them, the EC approximation is based on the minimum Gibbs free energy. It means that those methods can be regarded as examples of solving the fixed point of the Gibbs free energy. Almost at the same time as OAMP, VAMP was proposed using an EP-type message passing, and the dynamics of VAMP were rigorously analyzed in [29]. Recently, [30] proved that VAMP and AMP have identical fixed points in their state evolutions on their overlapping random matrices. We also note that under the mismatched case [31], where the prior and likelihood function applied to the inference problem are different from the true prior and likelihood function, AMP as well as its related algorithms may not converge, although the corresponding SE converges to a fixed point predicted by the replica method. Actually, AMP for LASSO is one case of a mismatched model, but its convergence is guaranteed due to the convex nature of LASSO [32]. The failure of AMP can occur when the mismatched models are defined by a non-convex cost function [33].
The AMP algorithm [15] posted below is related to ISTA:
\[
z^{(t)}=y-H\hat{x}^{(t)}+\cdots
\]
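As a point of reference for the ISTA connection above, the following is a minimal NumPy sketch of plain ISTA for the LASSO problem (2); the step size 1/L and the iteration count are illustrative assumptions, and the Onsager-corrected AMP recursion that the paper derives appears later in (44) and (48).

```python
import numpy as np

def soft_threshold(r, gamma):
    # Soft-thresholding denoiser: sign(r) * max(|r| - gamma, 0)
    return np.sign(r) * np.maximum(np.abs(r) - gamma, 0.0)

def ista(y, H, lam, num_iter=200):
    # Plain ISTA for min_x 0.5*||y - H x||_2^2 + lam*||x||_1 (no Onsager term)
    x = np.zeros(H.shape[1])
    L = np.linalg.norm(H, 2) ** 2          # Lipschitz constant of the gradient
    for _ in range(num_iter):
        r = x + H.T @ (y - H @ x) / L      # gradient step on the quadratic term
        x = soft_threshold(r, lam / L)     # proximal step on the l1 term
    return x
```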
...while the mean of the approximated posterior $\hat{q}^{(t+1)}(x_i|y)$ will serve as an approximation of the MMSE estimator.

To reduce the complexity of the sum-product message passing shown in (12), we first simplify the message $\mu_{i\leftarrow a}^{(t)}(x_i)$ as below
\[
\begin{aligned}
\mu_{i\leftarrow a}^{(t)}(x_i)
&\propto \int_{x_{\backslash i}}\int_{z_a} q(y_a|z_a)\,\delta\Big(z_a-\sum_{k=1}^N h_{ak}x_k\Big)\,\mathrm{d}z_a \prod_{j\neq i}\mu_{j\rightarrow a}^{(t)}(x_j)\,\mathrm{d}x_{\backslash i}\\
&\propto \int_{z_a} q(y_a|z_a)\,\mathrm{E}\Big\{\delta\Big(z_a-\sum_{j\neq i} h_{aj}x_j-h_{ai}x_i\Big)\Big\}\,\mathrm{d}z_a,
\end{aligned}
\tag{14}
\]
where the expectation is over $\prod_{j\neq i}\mu_{j\rightarrow a}^{(t)}(x_j)$. We define a random variable (RV) $\zeta_{i\leftarrow a}^{(t)}$ associated with $z_a$ and RVs $\xi_{j\rightarrow a}^{(t)}$, following $\mu_{j\rightarrow a}^{(t)}(x_j)$, associated with $x_j$. Denote the mean and variance of $\xi_{j\rightarrow a}^{(t)}$ as $\hat{x}_{j\rightarrow a}^{(t)}$ and $\hat{v}_{j\rightarrow a}^{(t)}/\beta$, respectively. From (14), as the dimension $N$ tends to infinity, by the central limit theorem (CLT) the RV $\zeta_{i\leftarrow a}^{(t)}$ converges to a Gaussian RV with mean and variance
\[
\mathrm{E}\{\zeta_{i\leftarrow a}^{(t)}\}=Z_{i\leftarrow a}^{(t)}+h_{ai}x_i,\qquad \mathrm{Var}\{\zeta_{i\leftarrow a}^{(t)}\}=\frac{1}{\beta}V_{i\leftarrow a}^{(t)},
\tag{15}
\]
where
\[
Z_{i\leftarrow a}^{(t)}=\sum_{j\neq i}h_{aj}\hat{x}_{j\rightarrow a}^{(t)},\qquad V_{i\leftarrow a}^{(t)}=\sum_{j\neq i}|h_{aj}|^2\hat{v}_{j\rightarrow a}^{(t)}.
\tag{16}
\]
Based on this Gaussian approximation, the term $\mathrm{E}\{\delta(z_a-\sum_{j\neq i}h_{aj}x_j-h_{ai}x_i)\}$ in (14) is replaced by $\mathcal{N}(z_a|h_{ai}x_i+Z_{i\leftarrow a}^{(t)},\frac{1}{\beta}V_{i\leftarrow a}^{(t)})$. By the Gaussian reproduction lemma$^1$, the message $\mu_{i\leftarrow a}^{(t)}(x_i)$ is approximated as
\[
\begin{aligned}
\mu_{i\leftarrow a}^{(t)}(x_i)
&\propto \mathcal{N}\Big(0\,\Big|\,y_a-h_{ai}x_i-Z_{i\leftarrow a}^{(t)},\ \frac{1}{\beta}\big(1+V_{i\leftarrow a}^{(t)}\big)\Big)\\
&\propto \mathcal{N}\Big(x_i\,\Big|\,\frac{y_a-Z_{i\leftarrow a}^{(t)}}{h_{ai}},\ \frac{1+V_{i\leftarrow a}^{(t)}}{\beta|h_{ai}|^2}\Big).
\end{aligned}
\tag{17}
\]
In the sequel, the means and variances of the messages $\mu_{i\leftarrow b}^{(t)}(x_i)$, $b\neq a$, are combined, where
\[
\Sigma_{i\rightarrow a}^{(t)}=\Bigg(\sum_{b\neq a}\frac{|h_{bi}|^2}{1+V_{i\leftarrow b}^{(t)}}\Bigg)^{-1},
\tag{20}
\]
\[
r_{i\rightarrow a}^{(t)}=\Sigma_{i\rightarrow a}^{(t)}\sum_{b\neq a}\frac{h_{bi}^*(y_b-Z_{i\leftarrow b}^{(t)})}{1+V_{i\leftarrow b}^{(t)}}.
\tag{21}
\]
Note that the zero-valued elements of $H$ have no effect on $\Sigma_{i\rightarrow a}^{(t)}$, $r_{i\rightarrow a}^{(t)}$, or the remaining parameters in the derivation of AMP.

As a result, the message $\mu_{i\rightarrow a}^{(t+1)}(x_i)$ is approximated as the product of a Laplace prior and a Gaussian likelihood function
\[
\mu_{i\rightarrow a}^{(t+1)}(x_i)=\frac{1}{Z_\beta}e^{-\beta\lambda|x_i|}\,\mathcal{N}\big(x_i\big|r_{i\rightarrow a}^{(t)},\Sigma_{i\rightarrow a}^{(t)}\big),
\tag{22}
\]
where $Z_\beta$ is a normalization constant.

For convenience, define a distribution
\[
f_\beta(x;r,\Sigma)=\frac{1}{Z_\beta}\exp\Big(-\beta\Big(\lambda|x|+\frac{1}{2\Sigma}(x-r)^2\Big)\Big),
\tag{23}
\]
and its mean and variance
\[
F_\beta(x;r,\Sigma)=\int x f_\beta(x;r,\Sigma)\,\mathrm{d}x,
\tag{24}
\]
\[
G_\beta(x;r,\Sigma)=\int x^2 f_\beta(x;r,\Sigma)\,\mathrm{d}x-|F_\beta(x;r,\Sigma)|^2.
\tag{25}
\]
The mean and variance of the message $\mu_{i\rightarrow a}^{(t+1)}(x_i)$ are represented as
\[
\hat{x}_{i\rightarrow a}^{(t+1)}=F_\beta\big(x_i;r_{i\rightarrow a}^{(t)},\Sigma_{i\rightarrow a}^{(t)}\big),
\tag{26}
\]
\[
\hat{v}_{i\rightarrow a}^{(t+1)}=\beta G_\beta\big(x_i;r_{i\rightarrow a}^{(t)},\Sigma_{i\rightarrow a}^{(t)}\big).
\tag{27}
\]

$^1$ $\mathcal{N}(x|a,A)\,\mathcal{N}(x|b,B)=\mathcal{N}(x|c,C)\,\mathcal{N}(0|a-b,A+B)$ with $C=(A^{-1}+B^{-1})^{-1}$ and $c=C\big(\frac{a}{A}+\frac{b}{B}\big)$.
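To make (23)-(27) concrete, here is a small numerical sketch that evaluates the mean $F_\beta$ and variance $G_\beta$ of the tilted distribution $f_\beta$ by quadrature on a grid; the grid range and resolution are ad-hoc assumptions, and for large $\beta$ the returned mean should approach the soft-threshold limit derived later in (45)-(46).

```python
import numpy as np

def f_beta_moments(r, Sigma, lam, beta, lo=-20.0, hi=20.0, num=200001):
    # Mean F_beta and variance G_beta of
    #   f_beta(x; r, Sigma) proportional to exp(-beta*(lam*|x| + (x - r)^2 / (2*Sigma)))
    x = np.linspace(lo, hi, num)
    log_w = -beta * (lam * np.abs(x) + (x - r) ** 2 / (2.0 * Sigma))
    w = np.exp(log_w - log_w.max())            # stabilize before normalizing
    w /= np.trapz(w, x)                        # normalized density on the grid
    mean = np.trapz(x * w, x)                  # F_beta(x; r, Sigma), eq. (24)
    var = np.trapz(x ** 2 * w, x) - mean ** 2  # G_beta(x; r, Sigma), eq. (25)
    return mean, var
```

For instance, with $r=1$, $\Sigma=1$, $\lambda=0.5$ and a large $\beta$ such as 200, the returned mean should be close to the soft-threshold value $\mathrm{sign}(1)\max(1-0.5,0)=0.5$.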
Recalling the approximated posterior $\hat{q}^{(t+1)}(x_i|y)$ in (13), we define
\[
\Sigma_i^{(t)}=\Bigg(\sum_{a=1}^{M}\frac{|h_{ai}|^2}{1+V_{i\leftarrow a}^{(t)}}\Bigg)^{-1},
\tag{28}
\]
\[
r_i^{(t)}=\Sigma_i^{(t)}\sum_{a=1}^{M}\frac{h_{ai}^*(y_a-Z_{i\leftarrow a}^{(t)})}{1+V_{i\leftarrow a}^{(t)}}.
\tag{29}
\]
The term $\prod_{a=1}^{M}\mu_{i\leftarrow a}^{(t)}(x_i)$ is proportional to $\mathcal{N}(x_i|r_i^{(t)},\Sigma_i^{(t)})$. Accordingly, the mean and variance of the approximated posterior $\hat{q}^{(t+1)}(x_i|y)$ can be denoted as
\[
\hat{x}_i^{(t+1)}=F_\beta\big(x_i;r_i^{(t)},\Sigma_i^{(t)}\big),
\tag{30}
\]
\[
\hat{v}_i^{(t+1)}=\beta G_\beta\big(x_i;r_i^{(t)},\Sigma_i^{(t)}\big).
\tag{31}
\]
Also define
\[
Z_a^{(t)}=\sum_{i=1}^{N}h_{ai}\hat{x}_{i\rightarrow a}^{(t)},
\tag{32}
\]
\[
V_a^{(t)}=\sum_{i=1}^{N}|h_{ai}|^2\hat{v}_{i\rightarrow a}^{(t)}\approx V_{i\leftarrow a}^{(t)},
\tag{33}
\]
where $V_a^{(t)}=V_{i\leftarrow a}^{(t)}$ holds by ignoring infinitesimal terms.

Applying a first-order Taylor series$^2$ to $\hat{x}_{i\rightarrow a}^{(t+1)}$ in (26), we have
\[
\hat{x}_{i\rightarrow a}^{(t+1)}\approx\hat{x}_i^{(t+1)}
+\triangle r\,\frac{\partial}{\partial r}F_\beta\big(x_i;r_i^{(t)},\Sigma_i^{(t)}\big)
+\triangle\Sigma\,\frac{\partial}{\partial\Sigma}F_\beta\big(x_i;r_i^{(t)},\Sigma_i^{(t)}\big),
\tag{34}
\]
where
\[
\triangle\Sigma=\Sigma_{i\rightarrow a}^{(t)}-\Sigma_i^{(t)}
=\frac{\frac{|h_{ai}|^2}{1+V_a^{(t)}}}{\Big(\sum_{a=1}^{M}\frac{|h_{ai}|^2}{1+V_{i\leftarrow a}^{(t)}}\Big)\Big(\sum_{b\neq a}\frac{|h_{bi}|^2}{1+V_{i\leftarrow b}^{(t)}}\Big)}
\approx 0,
\tag{35}
\]
\[
\triangle r=r_{i\rightarrow a}^{(t)}-r_i^{(t)}
\approx-\Sigma_i^{(t)}\frac{h_{ai}^*(y_a-Z_{i\leftarrow a}^{(t)})}{1+V_a^{(t)}},
\tag{36}
\]
where we use the approximations $V_a^{(t)}=V_{i\leftarrow a}^{(t)}+O(1/N)$ and $\Sigma_i^{(t)}=\Sigma_{i\rightarrow a}^{(t)}+O(1/N)$ to obtain $\triangle r$. Applying the fact$^3$
\[
\frac{\partial}{\partial r}F_\beta\big(x_i;r,\Sigma_i^{(t)}\big)\Big|_{r=r_i^{(t)}}
=\frac{\beta}{\Sigma_i^{(t)}}G_\beta\big(x;r_i^{(t)},\Sigma_i^{(t)}\big)
=\frac{\hat{v}_i^{(t+1)}}{\Sigma_i^{(t)}},
\]
(34) can be simplified as
\[
\hat{x}_{i\rightarrow a}^{(t+1)}\approx\hat{x}_i^{(t+1)}
-\hat{v}_i^{(t+1)}\frac{h_{ai}^*(y_a-Z_{i\leftarrow a}^{(t)})}{1+V_a^{(t)}}.
\tag{37}
\]

Applying a Taylor series to $\hat{v}_{i\rightarrow a}^{(t+1)}$ in (27), we have
\[
\hat{v}_{i\rightarrow a}^{(t+1)}\approx\hat{v}_i^{(t+1)}
+\triangle r\,\frac{\partial}{\partial r}\beta G_\beta\big(x_i;r_i^{(t)},\Sigma_i^{(t)}\big).
\tag{38}
\]
Combining (36) with (38) into (33) obtains
\[
\begin{aligned}
V_a^{(t)}&=\sum_{i=1}^{N}|h_{ai}|^2\hat{v}_i^{(t)}
-\sum_{i=1}^{N}|h_{ai}|^2\,\Sigma_i^{(t)}\frac{h_{ai}^*(y_a-Z_{i\leftarrow a}^{(t)})}{1+V_a^{(t)}}\,\frac{\partial}{\partial r}\beta G_\beta\big(x_i;r,\Sigma_i^{(t)}\big)\\
&\approx\sum_{i=1}^{N}|h_{ai}|^2\hat{v}_i^{(t)}
-\sum_{i=1}^{N}\frac{|h_{ai}|^3(y_a-Z_{i\leftarrow a}^{(t)})}{\sum_{a=1}^{M}|h_{ai}|^2}\,\frac{\partial}{\partial r}\beta G_\beta\big(x_i;r,\Sigma_i^{(t)}\big)\\
&=\sum_{i=1}^{N}|h_{ai}|^2\hat{v}_i^{(t)}+O(1/\sqrt{N})
\approx\sum_{i=1}^{N}|h_{ai}|^2\hat{v}_i^{(t)}.
\end{aligned}
\tag{39}
\]
Substituting (37) into (32) gets
\[
\begin{aligned}
Z_a^{(t)}&\approx\sum_{i=1}^{N}h_{ai}\hat{x}_i^{(t)}
-\sum_{i=1}^{N}\frac{|h_{ai}|^2\hat{v}_i^{(t)}(y_a-Z_{i\leftarrow a}^{(t-1)})}{1+V_a^{(t-1)}}\\
&=\sum_{i=1}^{N}h_{ai}\hat{x}_i^{(t)}
-\sum_{i=1}^{N}\frac{|h_{ai}|^2\hat{v}_i^{(t)}(y_a-Z_a^{(t-1)}+h_{ai}\hat{x}_i^{(t-1)})}{1+V_a^{(t-1)}}\\
&\approx\sum_{i=1}^{N}h_{ai}\hat{x}_i^{(t)}
-\frac{V_a^{(t)}(y_a-Z_a^{(t-1)})}{1+V_a^{(t-1)}}.
\end{aligned}
\tag{40}
\]
Inserting (37) into (29) yields
\[
\begin{aligned}
r_i^{(t)}&\approx\Sigma_i^{(t)}\sum_{a=1}^{M}\frac{h_{ai}^*(y_a-Z_a^{(t)}+h_{ai}\hat{x}_i^{(t)})}{1+V_a^{(t)}}\\
&=\hat{x}_i^{(t)}+\Sigma_i^{(t)}\sum_{a=1}^{M}\frac{h_{ai}^*(y_a-Z_a^{(t)})}{1+V_a^{(t)}}.
\end{aligned}
\tag{41}
\]
Up to now, the derivation of AMP for LASSO is complete. The AMP algorithm is shown in Algorithm 1.

To be in line with Donoho's AMP, we still need to carry out the following simplifications using the fact $|h_{ai}|^2=O(1/M)$:
\[
V_a^{(t)}=\frac{1}{M}\sum_{i=1}^{N}\hat{v}_i^{(t)}\triangleq V^{(t)},
\tag{43a}
\]
\[
Z_a^{(t)}=\sum_{i=1}^{N}h_{ai}\hat{x}_i^{(t)}-\frac{V^{(t)}(y_a-Z_a^{(t-1)})}{1+V^{(t-1)}},
\tag{43b}
\]
\[
\Sigma_i^{(t)}=1+V^{(t)}\triangleq\Sigma^{(t)},
\tag{43c}
\]
\[
r_i^{(t)}=\hat{x}_i^{(t)}+\sum_{a=1}^{M}h_{ai}^*(y_a-Z_a^{(t)}),
\tag{43d}
\]
\[
\hat{x}_i^{(t+1)}=F_\beta\big(x_i;r_i^{(t)},\Sigma^{(t)}\big),
\tag{43e}
\]
\[
\hat{v}_i^{(t+1)}=\Sigma^{(t)}F'_\beta\big(x_i;r_i^{(t)},\Sigma^{(t)}\big),
\tag{43f}
\]

$^2$ $f(x+\triangle x,y+\triangle y)=f(x,y)+\triangle x\,f'_x(x,y)+\triangle y\,f'_y(x,y)$, where $f'_x$ and $f'_y$ are the partial derivatives of $f(x,y)$ w.r.t. $x$ and $y$, respectively.

$^3$ Provided that $f(x)$ is an arbitrary bounded and non-negative function, define a distribution $\mathcal{P}(x)=\frac{f(x)\mathcal{N}(x|m,v)}{\int f(x)\mathcal{N}(x|m,v)\mathrm{d}x}$. Denote its mean and variance as $\mathrm{E}\{x\}=\int x\mathcal{P}(x)\mathrm{d}x$ and $\mathrm{Var}\{x\}=\int(x-\mathrm{E}\{x\})^2\mathcal{P}(x)\mathrm{d}x$. We have
$\frac{\partial\int x\mathcal{P}(x)\mathrm{d}x}{\partial m}
=\frac{\int x\frac{x-m}{v}f(x)\mathcal{N}(x|m,v)\mathrm{d}x\cdot\int f(x)\mathcal{N}(x|m,v)\mathrm{d}x-\int xf(x)\mathcal{N}(x|m,v)\mathrm{d}x\cdot\int\frac{x-m}{v}f(x)\mathcal{N}(x|m,v)\mathrm{d}x}{\big[\int f(x)\mathcal{N}(x|m,v)\mathrm{d}x\big]^2}
=\frac{\mathrm{Var}\{x\}}{v}$.
where $F'_\beta(x_i;r_i^{(t)},\Sigma^{(t)})$ is the partial derivative of $F_\beta(x_i;r_i^{(t)},\Sigma^{(t)})$ w.r.t. $r_i^{(t)}$.

Defining $z^{(t)}=y-Z^{(t)}$ with $Z^{(t)}=\{Z_a^{(t)},\forall a\}$, we have
\[
z^{(t)}=y-H\hat{x}^{(t)}
+\frac{1}{\alpha}z^{(t-1)}\Big\langle F'_\beta\big(x;\hat{x}^{(t-1)}+H^Tz^{(t-1)},\Sigma^{(t-1)}\big)\Big\rangle,
\tag{44a}
\]
\[
\hat{x}^{(t+1)}=F_\beta\big(x;\hat{x}^{(t)}+H^Tz^{(t)},\Sigma^{(t)}\big),
\tag{44b}
\]
\[
\Sigma^{(t+1)}=\Sigma^{(t)}\Big\langle F'_\beta\big(x;\hat{x}^{(t)}+H^Tz^{(t)},\Sigma^{(t)}\big)\Big\rangle.
\tag{44c}
\]
For large $\beta$, by the Laplace method of integration we have
\[
\lim_{\beta\rightarrow\infty}F_\beta\big(x_i;r_i^{(t)},\Sigma^{(t)}\big)
=\lim_{\beta\rightarrow\infty}\int x_i\frac{1}{Z^{\rm pos}}\exp\Big(-\beta\Big(\lambda|x_i|+\frac{1}{2\Sigma^{(t)}}(x_i-r_i^{(t)})^2\Big)\Big)\mathrm{d}x_i
=\arg\min_{x_i}\ \frac{1}{2\Sigma^{(t)}}(x_i-r_i^{(t)})^2+\lambda|x_i|.
\tag{45}
\]
Similar to (5)-(6), we get
\[
\lim_{\beta\rightarrow\infty}F_\beta\big(x_i;r_i^{(t)},\Sigma^{(t)}\big)=\mathrm{sign}(r_i^{(t)})\max\big(|r_i^{(t)}|-\lambda\Sigma^{(t)},0\big),
\tag{46}
\]
\[
\lim_{\beta\rightarrow\infty}F'_\beta\big(x_i;r_i^{(t)},\Sigma^{(t)}\big)=
\begin{cases}
1 & |r_i^{(t)}|\geq\lambda\Sigma^{(t)}\\
0 & \text{otherwise}
\end{cases}.
\tag{47}
\]
Defining $\eta(r,\gamma)=\mathrm{sign}(r)\max(|r|-\gamma,0)$ and $\hat{\tau}^{(t)}=\lambda V^{(t)}$, we have
\[
z^{(t)}=y-H\hat{x}^{(t)}
+\frac{1}{\alpha}z^{(t-1)}\Big\langle\eta'\big(\hat{x}^{(t-1)}+H^Tz^{(t-1)},\lambda+\hat{\tau}^{(t-1)}\big)\Big\rangle,
\tag{48a}
\]
\[
\hat{x}^{(t+1)}=\eta\big(\hat{x}^{(t)}+H^Tz^{(t)},\lambda+\hat{\tau}^{(t)}\big),
\tag{48b}
\]
\[
\hat{\tau}^{(t+1)}=\frac{\lambda+\hat{\tau}^{(t)}}{\alpha}\Big\langle\eta'\big(\hat{x}^{(t)}+H^Tz^{(t)},\lambda+\hat{\tau}^{(t)}\big)\Big\rangle.
\tag{48c}
\]
By abusing $\eta$, we get the original AMP (10) for the LASSO inference problem.

C. Bayes-optimal AMP

In the LASSO inference problem, both the prior and the likelihood are unknown. However, in the Bayes-optimal setting, where both the prior and the likelihood function are perfectly given, the MMSE estimator can achieve the Bayes-optimal error. Actually, this situation is common in communications. In those cases, it is assumed that each element of $x$ follows an IID distribution $P_X$. The joint distribution is then represented as
\[
\mathrm{P}(x,y)=\mathrm{P}(y|x)\mathrm{P}(x)=\prod_{a=1}^{M}\mathrm{P}(y_a|x)\prod_{i=1}^{N}P_X(x_i).
\tag{50}
\]
Similar to the derivation of AMP for LASSO, we get the Bayes-optimal AMP as depicted in Algorithm 2, where the expectation in (49e) and (49f) is taken over
\[
\hat{\mathrm{P}}^{(t)}(x_i|y)=\frac{P_X(x_i)\,\mathcal{N}(x_i|r_i^{(t)},\Sigma_i^{(t)})}{\int P_X(x)\,\mathcal{N}(x|r_i^{(t)},\Sigma_i^{(t)})\,\mathrm{d}x}.
\tag{51}
\]
This form of AMP is widely applied in many engineering fields. We call it Bayes-optimal AMP since (1) this algorithm is based on the Bayes-optimal setting, and (2) the SE of this algorithm perfectly matches the fixed point of the exact MMSE estimator predicted by the replica method. Similar to AMP for LASSO, the form of Bayes-optimal AMP can also be written as (48) with $\eta(\cdot)$ being the MMSE denoiser.
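A compact NumPy sketch of the recursion (48) is given below; the denoiser is the soft threshold $\eta$ for the LASSO setting, and swapping in an MMSE denoiser (with its empirical divergence) gives the Bayes-optimal variant discussed above. The initialization and iteration count are illustrative assumptions, not part of the original algorithm statement.

```python
import numpy as np

def soft_threshold(r, gamma):
    return np.sign(r) * np.maximum(np.abs(r) - gamma, 0.0)

def soft_threshold_deriv(r, gamma):
    # eta'(r, gamma): 1 where |r| exceeds the threshold, 0 otherwise, cf. (47)
    return (np.abs(r) > gamma).astype(float)

def amp_lasso(y, H, lam, num_iter=30):
    # AMP for LASSO following (48a)-(48c); alpha = M/N is the measurement ratio
    M, N = H.shape
    alpha = M / N
    x, z, tau = np.zeros(N), y.copy(), 0.0         # assumed initialization
    for _ in range(num_iter):
        r = x + H.T @ z                            # pseudo-data fed to the denoiser
        x_new = soft_threshold(r, lam + tau)       # (48b)
        div = np.mean(soft_threshold_deriv(r, lam + tau))
        z = y - H @ x_new + z * div / alpha        # (48a): residual with Onsager term
        tau = (lam + tau) * div / alpha            # (48c)
        x = x_new
    return x
```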
D. State Evolution

In this subsection, we only give a sketch of the proof of AMP's SE in [16]. Let us introduce the following general iterations:
\[
h^{(t+1)}=H^Tm^{(t)}-\xi_t q^{(t)},
\tag{52a}
\]
\[
b^{(t)}=Hq^{(t)}-\lambda_t m^{(t-1)},
\tag{52b}
\]
where $m^{(t)}=g_t(b^{(t)},n)$, $q^{(t)}=f_t(h^{(t)},x)$, $\xi_t=\langle g'_t(b^{(t)},n)\rangle$, and $\lambda_t=\frac{1}{\alpha}\langle f'_t(h^{(t)},x)\rangle$.

Pertaining to these general iterations, the following conclusions can be established. In the large system limit, for any pseudo-Lipschitz function $\varphi:\mathbb{R}^2\mapsto\mathbb{R}$ of order $k$ and all $t\geq 0$,
\[
\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}\varphi\big(h_i^{(t+1)},x_i\big)\overset{\rm a.s.}{=}\mathrm{E}_{Z,X}\{\varphi(\tau_t Z,X)\},
\tag{53a}
\]
\[
\lim_{M\rightarrow\infty}\frac{1}{M}\sum_{i=1}^{M}\varphi\big(b_i^{(t)},n_i\big)\overset{\rm a.s.}{=}\mathrm{E}_{Z,N}\{\varphi(\sigma_t Z,N)\},
\tag{53b}
\]
where
\[
\tau_t^2=\mathrm{E}\big\{g_t(\sigma_t Z,N)^2\big\},
\tag{54}
\]
\[
\sigma_t^2=\frac{1}{\alpha}\mathrm{E}\big\{f_t(\tau_{t-1}Z,X)^2\big\},
\tag{55}
\]
where $N\sim P_N$ and $X\sim P_X$ are independent of $Z\sim\mathcal{N}(0,1)$. Specially, $\sigma_0^2=\lim_{N\rightarrow\infty}\frac{1}{N\alpha}\|q^{(0)}\|^2$.

Define
\[
g_t(b^{(t)},n)=b^{(t)}-n,
\tag{56}
\]
\[
f_t(h^{(t)},x)=\eta_{t-1}(x-h^{(t)})-x.
\tag{57}
\]
Then $\xi_t=1$ and $\lambda_t=-\frac{1}{\alpha}\langle\eta'_{t-1}(x-h^{(t)})\rangle$. To coincide with AMP (Donoho) in (10), it implies that $x-h^{(t+1)}=H^Tz^{(t)}+x^{(t)}$. We thus have ...

The equations (53) show that in the large system limit, each entry of $h^{(t+1)}$ and $b^{(t)}$ tends to a Gaussian RV. Regarding $h^{(t)}$ and $b^{(t)}$ as column vectors, then for $t\geq 0$, from (52), we have
\[
\underbrace{\big[h^{(1)}+\xi_0q^{(0)},\cdots,h^{(t)}+\xi_{t-1}q^{(t-1)}\big]}_{\triangleq X_t}
=H^T\underbrace{\big[m^{(0)},\cdots,m^{(t-1)}\big]}_{\triangleq M_t},
\tag{61}
\]
\[
\underbrace{\big[b^{(0)},b^{(1)}+\lambda_1m^{(0)},\cdots,b^{(t-1)}+\lambda_{t-1}m^{(t-2)}\big]}_{\triangleq Y_t}
=H\underbrace{\big[q^{(0)},\cdots,q^{(t-1)}\big]}_{\triangleq Q_t}.
\tag{62}
\]
Let $G_{t_1,t_2}$ denote the event that $H$ satisfies the linear constraints $X_{t_1}=H^TM_{t_1}$ and $Y_{t_2}=HQ_{t_2}$. Then the conditional distributions of $h^{(t+1)}$ and $b^{(t)}$ can be expressed as
\[
h^{(t+1)}\big|_{G_{t+1,t}}\overset{\rm d}{=}H\big|_{G_{t+1,t}}\,m^{(t)}-\xi_tq^{(t)},
\tag{63}
\]
\[
b^{(t)}\big|_{G_{t,t}}\overset{\rm d}{=}H\big|_{G_{t,t}}\,q^{(t)}-\lambda_tm^{(t-1)},
\tag{64}
\]
where $A|_{G}\overset{\rm d}{=}B$ means that, conditioned on the event $G$, $A$ is equal to $B$ in distribution. The approximated expressions are shown in [16, Lemma 1], where the $t$-iteration $h^{(t+1)}$ (or $b^{(t)}$) conditioned on $G_{t+1,t}$ (or $G_{t,t}$) is expressed as a combination of all preceding $\{h^{(\tau)},\forall\tau\leq t\}$ (or $\{b^{(\tau)},\tau<t\}$). The proof of Lemma 1 is rigorous since the induction on $t$ is rigorous. Be aware that during the proof of Lemma 1, the fact that $H$ has IID Gaussian entries is applied to derive the Gaussianity of $h^{(t+1)}$ and $b^{(t)}$.
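The conclusions (53)-(55) reduce the analysis of AMP to a scalar recursion. The following Monte-Carlo sketch iterates that recursion for the soft-threshold denoiser; the threshold schedule, initialization, sample size, and signal model are assumptions made for illustration only.

```python
import numpy as np

def amp_state_evolution(alpha, thresh, sigma_w, x_sampler, num_iter=20, mc=200000, seed=0):
    # Scalar SE: tau_{t+1}^2 = sigma_w^2 + (1/alpha) * E[(eta(X + tau_t Z) - X)^2],
    # which follows from (54)-(55) with g_t(b, n) = b - n and f_t(h, x) = eta(x - h) - x.
    rng = np.random.default_rng(seed)
    x = x_sampler(mc, rng)                            # samples of X ~ P_X
    tau2 = sigma_w ** 2 + np.mean(x ** 2) / alpha     # assumed initialization
    mse_track = []
    for _ in range(num_iter):
        r = x + np.sqrt(tau2) * rng.standard_normal(mc)   # effective channel R = X + tau*Z
        x_hat = np.sign(r) * np.maximum(np.abs(r) - thresh, 0.0)
        mse = np.mean((x_hat - x) ** 2)
        mse_track.append(mse)
        tau2 = sigma_w ** 2 + mse / alpha
    return mse_track

# Example: Bernoulli-Gaussian signal with sparsity 0.1 (hypothetical parameters)
bg = lambda n, rng: rng.standard_normal(n) * (rng.random(n) < 0.1)
mse_per_iter = amp_state_evolution(alpha=0.5, thresh=0.3, sigma_w=0.1, x_sampler=bg)
```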
Fig. 7. Iterative behavior of Bayes-optimal AMP and its SE in compressed sensing. $H$ has IID Gaussian entries with zero mean and $1/M$ variance. $M=\alpha N$, $N=1024$ and ${\rm SNR}=1/\sigma_w^2$. The signal of interest $x$ has IID entries following ${\rm BG}(0,0.05)$.

...condition number, non-zero mean). To extend the scope of AMP to more general random matrices (unitarily-invariant matrices$^4$), a modified AMP algorithm termed OAMP [24] was proposed. Different from AMP, the denoiser of OAMP is divergence-free$^6$ so that the Onsager term vanishes, and the LMMSE de-correlated matrix is applied to ensure the orthogonality$^5$ of the input and output errors of the denoiser.

A. Orthogonality of input and output errors

Let us consider the following general iterations containing a linear estimation (LE) and a nonlinear estimation (NLE):
\[
{\rm LE:}\quad r^{(t)}=\hat{x}^{(t)}+W_t\big(y-H\hat{x}^{(t)}\big)+r_{\rm Onsager}^{(t)},
\tag{65a}
\]
where $\eta_t(\cdot)$ can be an arbitrary pseudo-Lipschitz function and $C$ is a constant. In this case, we have $\tilde{\eta}'_t(r^{(t)})=0$.

For convenience, we define the input and output errors
\[
q^{(t)}=\hat{x}^{(t)}-x,
\tag{67}
\]
\[
h^{(t)}=r^{(t)}-x.
\tag{68}
\]
Substituting the system model $y=Hx+n$ and (65) into the equations above, we have
\[
{\rm LE:}\quad h^{(t)}=(I-W_tH)q^{(t)}+W_tn,
\tag{69a}
\]
\[
{\rm NLE:}\quad q^{(t+1)}=\tilde{\eta}_t\big(x+h^{(t)}\big)-x.
\tag{69b}
\]
Also, we define the error-related parameters
\[
\hat{v}^{(t)}=\lim_{N\rightarrow\infty}\frac{1}{N}\|q^{(t)}\|_2^2,\qquad
\tau_t^2=\lim_{N\rightarrow\infty}\frac{1}{N}\|h^{(t)}\|_2^2.
\tag{70}
\]
Similar to AMP, we assume that the following assumptions hold:
• Assumption 1: the input error $h^{(t)}$ consists of IID zero-mean Gaussian entries independent of $x$, i.e., $R^{(t)}=X+\tau_tZ$ with $Z$ being a standard Gaussian RV.
• Assumption 2: the output error $q^{(t+1)}$ consists of IID entries independent of $H$ and the noise $n$.
We will show that, based on the assumptions above, the de-correlated matrix $W_t$ and the divergence-free denoiser imply the orthogonality between the input error $h^{(t)}$ and the output error $q^{(t+1)}$. We say the LE is a de-correlated one if ${\rm Tr}(I-W_tH)=0$, which implies
\[
W_t=\frac{N}{{\rm Tr}(\hat{W}_tH)}\hat{W}_t,
\tag{71}
\]

$^4$ We say $A=U\Sigma V^T$ is unitarily-invariant if $U$, $V$, and $\Sigma$ are mutually independent, and $U$, $V$ are Haar-distributed.

$^5$ Given two random variables $X$, $Y$, we say $X$ is orthogonal to $Y$ if $\mathrm{E}\{XY\}=0$. Provided that $x\in\mathbb{R}^N$ and $y\in\mathbb{R}^N$ are generated by $X$ and $Y$, respectively, then $\frac{1}{N}x^Ty=\frac{1}{N}\sum_{i=1}^{N}x_iy_i\overset{\rm a.s.}{=}\mathrm{E}\{XY\}=0$.

$^6$ We say $\eta:\mathbb{R}\mapsto\mathbb{R}$ is divergence-free if $\mathrm{E}\{\eta'(R)\}=0$.
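The divergence-free property in footnote 6 can be checked numerically. The sketch below builds a divergence-free version of an arbitrary denoiser by subtracting its empirical divergence, $\tilde\eta(r)=C(\eta(r)-\langle\eta'(r)\rangle r)$; this construction and the normalization $C$ are assumptions made for illustration (the excerpt does not reproduce the paper's own definition of $\tilde\eta_t$), and the divergence is estimated by central differences.

```python
import numpy as np

def soft_threshold(r, gamma):
    return np.sign(r) * np.maximum(np.abs(r) - gamma, 0.0)

def empirical_divergence(fn, r, eps=1e-4):
    # Monte-Carlo estimate of <eta'(r)> via central differences
    return np.mean((fn(r + eps) - fn(r - eps)) / (2.0 * eps))

rng = np.random.default_rng(0)
r = rng.standard_normal(1_000_000)          # plays the role of R = X + tau*Z

eta = lambda t: soft_threshold(t, 0.5)
div = empirical_divergence(eta, r)
C = 1.0 / (1.0 - div)                       # one possible scaling choice (assumption)
eta_df = lambda t: C * (eta(t) - div * t)   # divergence-free denoiser

print(empirical_divergence(eta_df, r))      # ~0, i.e., E{eta'(R)} = 0 as in footnote 6
```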
Since ${\rm Tr}(B_t)=0$, we then have $\mathrm{E}\{B_t\}=0$ and further
\[
\mathrm{E}\big\{h^{(t)}(q^{(t)})^T\big\}=0.
\tag{85}
\]
This completes the proof of the orthogonality of the input and output errors.
\[
\cdots
=\frac{\gamma_2\Big(N-\gamma_2\sum_{i=1}^{N}\frac{\lambda_i}{\lambda_i\gamma_2+\sigma_w^2}\Big)}{\sum_{i=1}^{N}\frac{\lambda_i\gamma_2}{\lambda_i\gamma_2+\sigma_w^2}}.
\tag{89}
\]
\[
\begin{aligned}
r^{(t)}&=\hat{x}^{(t)}+\frac{N}{{\rm Tr}(\hat{W}_tH)}\Big(H^TH+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\Big)^{-1}H^T\big(y-H\hat{x}^{(t)}\big)\\
&=\frac{N}{{\rm Tr}(\hat{W}_tH)}\Big(H^TH+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\Big)^{-1}H^Ty
+\frac{N}{{\rm Tr}(\hat{W}_tH)}\Big(H^TH+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\Big)^{-1}
\Big[\frac{{\rm Tr}(\hat{W}_tH)}{N}\Big(H^TH+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\Big)\hat{x}^{(t)}-H^TH\hat{x}^{(t)}\Big]\\
&=\frac{\big(H^TH+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\big)^{-1}H^Ty}{\frac{1}{N}\sum_{i=1}^{N}\frac{\lambda_i\hat{v}^{(t)}}{\hat{v}^{(t)}\lambda_i+\sigma_w^2}}
+\cdots
\end{aligned}
\tag{91}
\]
N v̂ λi +σw N v̂ λi +σw
γ2 x̂1 v̂1 r2
r1 = −
γ2 − v̂1 γ2 − v̂1
−1
−1 1 −2 T 1
γ2 σw −2 T
H H + γ2 I 1 −2 T
σw H y + γ22r
N Tr σ w H H + γ2 I r2
= −1 − −1
−2 T −2 T
γ2 − N1 Tr σw H H + γ12 I γ2 − N1 Tr σw H H + γ12 I
−1
σ2 σ2
2 P
σw N
γ2 HT H + γw2 I HT y + γw2 r2 N
γ2
i=1 λi γ2 +σw 2 r2
= σ 2 N
− σ 2 N
γ2 − Nw i=1 λi γ2γ+σ γ2 − Nw i=1 λi γ2γ+σ
P 2
P 2
2 2
w w
2
−1 2
σ 2
1 PN σw
HT H + γw2 I HT y σw2
−1 σw 2
σw N i=1 λi γ2 +σw 2
T γ2 T
= 1
PN λi γ2
+ H H+ I PN λ γ
r2 − H H + I 1 PN λ i γ2
r2
i=1 λi γ2 +σw 2
γ2 1 i
i=1 λi γ2 +σw
2
2
γ2 i=1 2
N N N γ2 λi +σw
2
−1
σ N
HT H + γw2 I HT y −1 2 1 1
P
σw2
σw2 σw N i=1 λi γ2 +σw2
T
= 1 N λ γ
+ H H + I r2 − N λi
HT Hr2 . (92)
γ γ 1
P i 2
P
N i=1 2
λi γ2 +σw
2 2 γ2 i=1
N 2
γ2 λi +σw
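The point of (91)-(92) is that the OAMP LE and the VAMP/EP extrinsic update produce the same output. A quick numerical check of this identity is sketched below; the dimensions, noise level, and the choice of $\gamma_2$ (playing the role of $\hat v^{(t)}$) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, sigma_w, gamma2 = 60, 40, 0.3, 0.8        # hypothetical sizes and variances
H = rng.standard_normal((M, N)) / np.sqrt(M)
y = rng.standard_normal(M)
r2 = rng.standard_normal(N)                     # common input (x_hat in OAMP, r_2 in VAMP)

A = np.linalg.inv(H.T @ H + (sigma_w ** 2 / gamma2) * np.eye(N))

# OAMP LE as in (91): de-correlated LMMSE step
W_hat = A @ H.T
r_oamp = r2 + (N / np.trace(W_hat @ H)) * (W_hat @ (y - H @ r2))

# VAMP/EP LE as in (92): LMMSE posterior followed by the extrinsic update
x1 = A @ (H.T @ y + (sigma_w ** 2 / gamma2) * r2)
v1 = (sigma_w ** 2 / N) * np.trace(A)
r_vamp = (gamma2 * x1 - v1 * r2) / (gamma2 - v1)

print(np.max(np.abs(r_oamp - r_vamp)))          # ~1e-12: the two LE outputs coincide
```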
where the last equation holds by Assumption 1. As can be observed from OAMP in Algorithm 3, in the large system limit, the variance of the OAMP estimator can be written as
\[
\hat{v}_{\rm mmse}^{(t+1)}=\frac{1}{N}\sum_{i=1}^{N}{\rm Var}\big\{x_i\big|r_i^{(t)},\tau_t\big\}
\overset{\rm a.s.}{=}\mathrm{E}_{X,Z}\big\{\big(\eta_t^{\rm mmse}(X+\tau_tZ)-X\big)^2\big\}.
\tag{94}
\]
Combining (93) and (94) proves that the variance of the OAMP estimator is equal to the asymptotic MSE of OAMP almost surely, i.e., $\hat{v}_{\rm mmse}^{(t+1)}\overset{\rm a.s.}{=}{\rm mse}(x,t+1)$. Note that $\hat{v}_{\rm mmse}^{(t+1)}$ in (94) only relies on the parameter $\tau_t^2$, and this parameter can be obtained by
\[
\hat{v}^{(t)}=\Big(\frac{1}{\hat{v}_{\rm mmse}^{(t)}}-\frac{1}{\tau_{t-1}^2}\Big)^{-1},
\tag{95}
\]
\[
\tau_t^2=\hat{v}^{(t)}\Big(\frac{N}{{\rm Tr}(\hat{W}_tH)}-1\Big),
\tag{96}
\]
where, by the SVD $H=U\Sigma V^T$,
\[
\frac{1}{N}{\rm Tr}(\hat{W}_tH)
=\frac{1}{N}{\rm Tr}\Big(H^T\Big(HH^T+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\Big)^{-1}H\Big)
=\frac{1}{N}{\rm Tr}\Big(\Sigma^T\Big(\Sigma\Sigma^T+\frac{\sigma_w^2}{\hat{v}^{(t)}}I\Big)^{-1}\Sigma\Big)
=\frac{1}{N}\sum_{i=1}^{N}\frac{\sigma_i^2}{\sigma_i^2+\frac{\sigma_w^2}{\hat{v}^{(t)}}}
\overset{\rm a.s.}{=}\mathrm{E}\Big\{\frac{\lambda}{\lambda+\frac{\sigma_w^2}{\hat{v}^{(t)}}}\Big\},
\tag{97}
\]
where $\sigma_i$ is the $i$-th diagonal element of $\Sigma$, and the expectation over $\lambda$ is taken over the asymptotic eigenvalue distribution of $H^TH$.

In the sequel, we obtain the SE of OAMP as below:
\[
{\rm LE:}\quad \tau_t^2=\hat{v}^{(t)}\Bigg(\mathrm{E}\Big\{\frac{\lambda^2}{\lambda^2+\frac{\sigma_w^2}{\hat{v}^{(t)}}}\Big\}^{-1}-1\Bigg),
\tag{98}
\]
\[
{\rm NLE:}\quad
\hat{v}_{\rm mmse}^{(t+1)}=\mathrm{E}_{X,Z}\big\{\big(\eta_t^{\rm mmse}(X+\tau_tZ)-X\big)^2\big\},\qquad
\hat{v}^{(t+1)}=\Big(\frac{1}{\hat{v}_{\rm mmse}^{(t+1)}}-\frac{1}{\tau_t^2}\Big)^{-1}.
\tag{99}
\]
Be aware that in the NLE part, $\hat{v}_{\rm mmse}^{(t+1)}$ is the output MSE rather than $\hat{v}^{(t+1)}$.
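A Monte-Carlo sketch of the OAMP SE (98)-(99) is given below; it assumes the eigenvalues of $H^TH$ are available, and the denoiser interface, initialization, and sample size are illustrative assumptions.

```python
import numpy as np

def oamp_state_evolution(eigs, sigma_w, denoiser, x_sampler, num_iter=20, mc=200000, seed=0):
    # eigs: the N eigenvalues of H^T H (zeros included when M < N)
    rng = np.random.default_rng(seed)
    x = x_sampler(mc, rng)
    v = np.mean(x ** 2)                               # assumed initialization of v^(t)
    track = []
    for _ in range(num_iter):
        # LE, cf. (96)-(98): effective noise level after the de-correlated LMMSE step
        gain = np.mean(eigs / (eigs + sigma_w ** 2 / v))
        tau2 = v * (1.0 / gain - 1.0)
        # NLE, cf. (99): MSE of the (assumed MMSE-style) denoiser on R = X + tau*Z
        r = x + np.sqrt(tau2) * rng.standard_normal(mc)
        v_mmse = np.mean((denoiser(r, tau2) - x) ** 2)
        v = 1.0 / (1.0 / v_mmse - 1.0 / tau2)         # extrinsic variance fed back to the LE
        track.append(v_mmse)
    return track
```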
Fig. 8. Iterative behavior of OAMP, AMP and their SEs in compressed sensing.

IV. LONG MEMORY AMP

Although OAMP can be applied to more general random matrices, its complexity of roughly $O(N^3)$ is larger than that of AMP, which is roughly $O(N^2)$. To balance the computational complexity and the region of the random measurement matrix, several long memory algorithms have been proposed, such as convolutional AMP (CAMP) [25] and memory AMP (MAMP) [26]. CAMP only adjusts the Onsager term, where all preceding messages are involved to ensure the Gaussianity of the input error. However, CAMP may fail to converge for ill-conditioned measurement matrices, such as those with a large condition number. Following CAMP and OAMP, MAMP applies a few terms of a matrix Taylor series to carry out the matrix inversion in OAMP and modifies the structure of the input signal of the denoiser to ensure (a) the orthogonality of all preceding input errors and the $t$-th output error, (b) the orthogonality of the $t$-th input error and the original signal $x$, and (c) the orthogonality of the $t$-th input error and all preceding output errors.

It can be verified that $\theta_t=(\lambda^{\dagger}+\varsigma_t)^{-1}$ with $\lambda^{\dagger}=\frac{\lambda_{\max}+\lambda_{\min}}{2}$, where $B=\lambda^{\dagger}I-HH^T$. Note that $\hat{x}_{\rm mmse}^{(t+1)}$ is the output estimator rather than $\hat{x}^{(t+1)}$.

Remark 2. As can be seen from the MAMP algorithm in (104a)-(104b), the parameter $\lambda^{\dagger}=\frac{\lambda_{\min}+\lambda_{\max}}{2}$ of the MAMP algorithm relies on the eigenvalues of $HH^T$, whose computation roughly costs $O(N^3)$. Although some works give approximations to the maximum or minimum singular value of $H$, the complexity is still huge. We also note that in the long version [50], a simple bound on the maximum and minimum eigenvalues is applied and provides performance close to that with perfect eigenvalues, especially for low condition numbers. In the case of given eigenvalues of $HH^T$, MAMP balances the computational complexity and the random measurement region well.

A. Derivation of MAMP

Similar to OAMP, the following assumptions are applied:
• Assumption 3: the input error $h^{(t)}$ consists of IID zero-mean Gaussian entries independent of $x$, i.e., $R^{(t)}=X+\tau_{t,t}Z_t$ with $Z_t$ being a standard Gaussian RV.
Let us define $\eta_t=\tau_{t,t}Z_t$. Different from OAMP, MAMP assumes that $[\eta_1,\cdots,\eta_t]^T$ follows a joint Gaussian distribution $\mathcal{N}(0,V_t)$ with $V_t=[\tau_{i,j}^2]_{t\times t}$.
• Assumption 4: the output error $q^{(t+1)}$ consists of IID entries independent of $H$ and the noise $n$.

Using the initial conditions $z^{(0)}=\hat{x}^{(0)}=0$, from (104a)
\[
z^{(t)}=\sum_{i=1}^{t}\xi_i\prod_{j=i+1}^{t}\theta_j\,B^{t-i}\big(y-H\hat{x}^{(i)}\big).
\tag{105}
\]
Defining $\theta_{t,i}=\prod_{j=i+1}^{t}\theta_j$ ($\theta_{t,i}=1$ for $i\geq t$), we have
\[
r^{(t)}=\frac{1}{\varepsilon_t}\Big(Q_ty+\sum_{i=1}^{t}H_i^t\hat{x}^{(i)}\Big),
\tag{106}
\]
where
\[
Q_t=\sum_{i=1}^{t}\xi_i\theta_{t,i}H^TB^{t-i},
\tag{107}
\]
\[
H_i^t=p_{t,i}I-\xi_i\theta_{t,i}H^TB^{t-i}H.
\tag{108}
\]
From the orthogonality of the input error and the original signal, i.e., $\frac{1}{N}(h^{(t)})^Tx\overset{\rm a.s.}{=}0$, we have
\[
\begin{aligned}
\frac{1}{N}(h^{(t)})^Tx
&=\frac{1}{N}\Big(\frac{1}{\varepsilon_t}Q_t(Hx+n)+\frac{1}{\varepsilon_t}\sum_{i=1}^{t}H_i^t\big(q^{(i)}+x\big)-x\Big)^Tx\\
&=\frac{1}{N}x^T\Big[\frac{1}{\varepsilon_t}\Big((Q_tH)^T+\sum_{i=1}^{t}(H_i^t)^T\Big)\Big]x-\frac{1}{N}x^Tx,
\end{aligned}
\tag{109}
\]
where $\frac{1}{N}(q^{(i)})^Tx\overset{\rm a.s.}{=}0$ is applied. Then we get
\[
\frac{1}{N\varepsilon_t}{\rm Tr}\Big\{Q_tH+\sum_{i=1}^{t}H_i^t\Big\}=1.
\tag{110}
\]
From the orthogonality of the input error and the output errors, i.e., $\frac{1}{N}(h^{(t)})^Tq^{(i)}\overset{\rm a.s.}{=}0$, we have
\[
{\rm Tr}\{H_i^t\}=0.
\tag{111}
\]
Combining (110) and (111), we have
\[
p_{t,i}=\frac{1}{N}\xi_i\theta_{t,i}{\rm Tr}\big\{H^TB^{t-i}H\big\},
\tag{112}
\]
\[
\varepsilon_t=\sum_{i=1}^{t}p_{t,i},
\tag{113}
\]
where the parameters $p_{t,i}$ and $\varepsilon_t$ can be determined once the parameters $\{\theta_t\}$ and $\{\xi_t\}$ are determined, and $\xi_t$ is obtained by minimizing the averaged input error
\[
\tau_{t,t}^2=\lim_{N\rightarrow\infty}\frac{1}{N}\|r^{(t)}-x\|_2^2.
\tag{114}
\]

Using the facts $\frac{1}{N}(q^{(i)})^Tx\overset{\rm a.s.}{=}0$ and the independence of $n$ and $x$, we have
\[
\tau_{t,t}^2=\frac{1}{N\varepsilon_t^2}\Big\|Q_tn+\sum_{i=1}^{t}H_i^tq^{(i)}\Big\|_2^2
=\frac{1}{N\varepsilon_t^2}\Bigg[\sum_{i=1}^{t}\sum_{j=1}^{t}\xi_i\xi_j\theta_{t,i}\theta_{t,j}\,\sigma_w^2\,{\rm Tr}\big(H^TB^{2t-i-j}H\big)
+\sum_{i=1}^{t}\sum_{j=1}^{t}\hat{v}_{i,j}\,{\rm Tr}\big((H_i^t)^TH_j^t\big)\Bigg],
\tag{115}
\]
where $\hat{v}_{i,j}=\frac{1}{N}(q^{(i)})^Tq^{(j)}$ and $\hat{v}_{i,j}=\hat{v}_{j,i}$. Defining
\[
\vartheta_{t,i}=\xi_i\theta_{t,i},
\tag{116}
\]
\[
W_t=H^TB^tH,\qquad w_t=\frac{1}{N}{\rm Tr}(W_t),
\tag{117}
\]
\[
N_{i,j}=W_iW_j,\qquad w_{i,j}=\frac{1}{N}{\rm Tr}\{N_{i,j}\}-w_iw_j,
\tag{118}
\]
we get $p_{t,i}=\vartheta_{t,i}w_{t-i}$ and
\[
\tau_{t,t}^2=\frac{1}{\varepsilon_t^2}\sum_{i=1}^{t}\sum_{j=1}^{t}\vartheta_{t,i}\vartheta_{t,j}\big(\sigma_w^2w_{2t-i-j}+\hat{v}_{i,j}w_{t-i,t-j}\big)
=\frac{c_{t,1}\xi_t^2-2c_{t,2}\xi_t+c_{t,3}}{w_0^2(\xi_t+c_{t,0})^2},
\tag{119}
\]
where
\[
c_{t,0}=\sum_{i=1}^{t-1}\frac{p_{t,i}}{w_0},\qquad
c_{t,1}=\sigma_w^2w_0+\hat{v}_{t,t}w_{0,0},
\]
\[
c_{t,2}=-\sum_{i=1}^{t-1}\vartheta_{t,i}\big(\sigma_w^2w_{t-i}+\hat{v}_{t,i}w_{0,t-i}\big),\qquad
c_{t,3}=\sum_{i=1}^{t-1}\sum_{j=1}^{t-1}\vartheta_{t,i}\vartheta_{t,j}\big(\sigma_w^2w_{2t-i-j}+\hat{v}_{i,j}w_{t-i,t-j}\big).
\]
The parameter $\xi_t$ is obtained by minimizing $\tau_{t,t}^2$. Zeroing $\frac{\partial\tau_{t,t}^2}{\partial\xi_t}$ gives two stationary points, $\xi_t=-c_{t,0}$ and $\xi_t=\frac{c_{t,2}c_{t,0}+c_{t,3}}{c_{t,1}c_{t,0}+c_{t,2}}$, where $\xi_t=-c_{t,0}$ is the maximizing point, while
\[
\xi_t^{\star}=
\begin{cases}
\dfrac{c_{t,2}c_{t,0}+c_{t,3}}{c_{t,1}c_{t,0}+c_{t,2}} & c_{t,1}c_{t,0}+c_{t,2}\neq 0\\
+\infty & \text{otherwise}
\end{cases}.
\tag{120}
\]
Defining the residual error $\tilde{z}^{(t)}=y-H\hat{x}^{(t)}$, the crossed variance $\hat{v}_{i,j}$ can be provided by
\[
\frac{1}{N}(\tilde{z}^{(i)})^T\tilde{z}^{(j)}
=\frac{1}{N}\big[H(x-\hat{x}^{(i)})+n\big]^T\big[H(x-\hat{x}^{(j)})+n\big]
=\frac{1}{N}\big(-Hq^{(i)}+n\big)^T\big(-Hq^{(j)}+n\big)
=\frac{1}{N}{\rm Tr}\big(H^TH\big)\hat{v}_{i,j}+\alpha\sigma_w^2.
\tag{121}
\]
It implies $\hat{v}_{i,j}=\big(\frac{1}{N}(\tilde{z}^{(i)})^T\tilde{z}^{(j)}-\alpha\sigma_w^2\big)/w_0$.

From (116), we get
\[
\vartheta_{t,i}=
\begin{cases}
\theta_t\vartheta_{t-1,i} & 0\leq i<t-1\\
\xi_{t-1}\theta_t & i=t-1\\
\xi_t & i=t
\end{cases}.
\tag{122}
\]
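The MAMP quantities above are driven entirely by the spectral moments $w_t=\frac{1}{N}{\rm Tr}(H^TB^tH)$ with $B=\lambda^{\dagger}I-HH^T$. Given the eigenvalues of $HH^T$, they can be computed cheaply, as in the following sketch; the $\lambda^{\dagger}$ rule follows Remark 2, while the example sizes are illustrative assumptions.

```python
import numpy as np

def mamp_spectral_moments(eigs_HHt, N, t_max):
    # w_t = (1/N) Tr(H^T B^t H) = (1/N) sum_i (lam_dag - lam_i)^t * lam_i,
    # where lam_i are the eigenvalues of H H^T and B = lam_dag*I - H H^T.
    lam_dag = (eigs_HHt.max() + eigs_HHt.min()) / 2.0
    w = np.array([np.sum((lam_dag - eigs_HHt) ** t * eigs_HHt) / N
                  for t in range(t_max + 1)])
    return lam_dag, w

# Example with a random 512 x 1024 matrix (hypothetical sizes)
rng = np.random.default_rng(0)
H = rng.standard_normal((512, 1024)) / np.sqrt(512)
eigs = np.linalg.eigvalsh(H @ H.T)
lam_dag, w = mamp_spectral_moments(eigs, N=1024, t_max=10)
```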
...damping, especially in the case of a large condition number (e.g., $\kappa(H)>10^2$). To ensure the convergence of MAMP, a damping factor is applied to the parameters $\hat{x}^{(t+1)}$, $\hat{v}_{t+1,t+1}$, and $\tilde{z}^{(t+1)}$:
\[
\hat{x}^{(t+1)}=\beta_1^{(t)}\hat{x}^{(t+1)}+\big(1-\beta_1^{(t)}\big)\hat{x}^{(t)},
\]
\[
\tilde{z}^{(t+1)}=\beta_1^{(t)}\tilde{z}^{(t+1)}+\big(1-\beta_1^{(t)}\big)\tilde{z}^{(t)},
\]
\[
\hat{v}_{t+1,i}=\beta_2^{(t)}\hat{v}_{t+1,i}+\big(1-\beta_2^{(t)}\big)\hat{v}_{t+1,i-1},
\]
for $1<i<t+1$. Different from the damping presented here, [26] shows another kind of damping. But, in fact, the damping factor only affects the convergence speed if the algorithm converges.

Fig. 9. Comparison of MAMP and OAMP in different condition numbers. Each entry of $x$ is generated from the BG distribution ${\rm BG}(0,0.1)$. $(M,N)=(1024,512)$ and ${\rm SNR}=1/\sigma_w^2=20\,{\rm dB}$. The measurement matrix is generated by $H=U\Sigma V^T$, where both $U$ and $V$ are Haar-distributed and $\Sigma$ is a rectangular matrix whose diagonal elements are $\sigma_1,\cdots,\sigma_M$ with $\sigma_i/\sigma_{i+1}=\kappa(H)^{1/M}$ and $\sum_{i=1}^{M}\sigma_i^2=N$, where $\kappa(H)=\sigma_{\max}(H)/\sigma_{\min}(H)$ with $\sigma_{\max}(H)$ and $\sigma_{\min}(H)$ being the maximum and minimum singular values of $H$, respectively. The damping factors $\beta_1^{(t)}=0.7$ and $\beta_2^{(t)}=0.8$ are applied to the cases of $\kappa(H)=1$ and $\kappa(H)=10$, while $\beta_1^{(t)}=\beta_2^{(t)}=0.4$ are applied to the case of $\kappa(H)=50$.

B. State Evolution

Similar to other AMP-like algorithms, the MSE of MAMP can also be predicted by its SE. The asymptotic MSE of MAMP is defined as
\[
{\rm mse}(x,t+1)=\frac{1}{N}\|\hat{x}_{\rm mmse}^{(t+1)}-x\|_2^2
\overset{\rm a.s.}{=}\mathrm{E}\big\{\big(\eta_t^{\rm mmse}(X+\tau_{t,t}Z)-X\big)^2\big\}
\overset{\rm a.s.}{=}\hat{v}_{t+1,t+1}.
\tag{123}
\]
This term only relies on the parameter $\tau_{t,t}^2$, which can be obtained by (119). In $\tau_{t,t}^2$, the parameter $\hat{v}_{i,j}\overset{\rm a.s.}{=}\frac{1}{N}(q^{(i)})^Tq^{(j)}$ is obtained numerically by generating $x$ following $P_X(x)=\rho\mathcal{N}(x|0,\rho^{-1})+(1-\rho)\delta(x)$ and $r^{(t)}=x+n_t$ with $[n_1,\cdots,n_t]\sim\mathcal{N}(0,\Xi_t)$, where $\Xi_t=[\tau_{i,j}^2]_{t\times t}$ and
\[
\tau_{t,\tau}^2=\lim_{N\rightarrow\infty}\frac{1}{N}\big(r^{(t)}-x\big)^T\big(r^{(\tau)}-x\big)
=\frac{1}{\varepsilon_t\varepsilon_\tau}\sum_{i=1}^{t}\sum_{j=1}^{\tau}\vartheta_{t,i}\vartheta_{\tau,j}\big(\sigma_w^2w_{t+\tau-i-j}+\hat{v}_{ij}w_{t-i,\tau-j}\big),
\]
with $\tau_{t,\tau}=\tau_{\tau,t}$. Then,
\[
\forall\tau<t:\quad \hat{v}_{t,\tau}=\mathrm{E}\big\{(\hat{x}^{(t)}-x)(\hat{x}^{(\tau)}-x)\big\}.
\]

C. Numeric Simulation

In Fig. 9, we show the per-iteration behavior of MAMP and OAMP under varying condition numbers in a compressed sensing application. As can be observed from the figure, MAMP and OAMP converge to the same fixed point. For $\kappa(H)=1$, MAMP has a convergence speed comparable to that of OAMP. However, as $\kappa(H)$ increases, MAMP needs more iterations to converge to the same fixed point as OAMP. Also, we note that the convergence speed and the NMSE performance of both MAMP and OAMP tend to worsen for large condition numbers.

V. CONCLUSIONS

In this paper, we reviewed several AMP-like algorithms: AMP, OAMP, VAMP, and MAMP. We began by introducing the AMP algorithm, which was originally proposed to provide a sparse solution to the LASSO inference problem but is widely applied in many engineering fields under the Bayes-optimal setting. In the IID sub-Gaussian random measurement matrix region, the AMP algorithm can achieve the Bayes-optimal MSE performance, but it may fail to converge if the random measurement matrix is beyond IID sub-Gaussian. Following AMP, we introduced a modified AMP algorithm termed OAMP, which modifies AMP in two aspects: the LMMSE de-correlated matrix and the divergence-free denoiser. The OAMP algorithm can be applied to a more general region, the unitarily-invariant matrices, but it pays a higher computational cost due to the matrix inversion. To balance the computational complexity and the random measurement region, the MAMP algorithm applies several terms of a matrix Taylor series to approximate the matrix inversion and applies all preceding messages to ensure three orthogonality conditions. The MAMP algorithm relies on the given spectrum of a sample of the random measurement matrix; although several works gave approximations to it, the complexity is still huge. In addition, the convergence speed of MAMP is slower than that of OAMP, especially in the case of a large condition number. On the other hand, a significant feature of AMP-like algorithms is that their asymptotic MSE performance can be fully predicted by their SEs. We also gave a brief derivation of their SEs.

VI. ACKNOWLEDGEMENTS

We are grateful to Y. Kabashima, D. Cai, and Y. Fu for valuable comments and useful discussions.