A Nonlocal Bayesian Image Denoising Algorithm: M. Lebrun, A. Buades, J. M. Morel

SIAM J.
IMAGING SCIENCES
© 2013 Society for Industrial and Applied Mathematics
Vol. 6, No. 3, pp. 1665–1688
A Nonlocal Bayesian Image Denoising Algorithm∗

M. Lebrun†, A. Buades‡, and J. M. Morel†
Abstract. Recent state-of-the-art image denoising methods use nonparametric estimation processes for 8 × 8
patches and obtain surprisingly good denoising results. The mathematical and experimental evidence
of two recent articles suggests that we might even be close to the best attainable performance in image
denoising ever. This suspicion is supported by a remarkable convergence of all analyzed methods.
Still more interestingly, most patch-based image denoising methods can be summarized in one
paradigm, which unites the transform thresholding method and a Markovian Bayesian estimation.
As the present paper shows, this unification is complete when the patch space is assumed to be
a Gaussian mixture. Each Gaussian distribution is associated with its orthonormal basis of patch
eigenvectors. Thus, transform thresholding (or a Wiener filter) is made on these local orthogonal
bases. In this paper a simple patch-based Bayesian method is proposed, which on the one hand
keeps most interesting features of former methods, and on the other hand slightly improves the
state of the art of color images.
Key words. image denoising, Bayesian, patch-based, block-matching
AMS subject classifications. 68U10, 94A08, 62-04, 62C12, 62F15, 62G07, 60H40
DOI. 10.1137/120874989
1. Introduction. Most digital images and movies are currently obtained by a charge-
coupled device (CCD). The value ũ(i) observed by a sensor at each pixel i is a Poisson random
variable whose mean u(i) would be the ideal image. The difference between the observed image
and the ideal image ũ(i) − u(i) = n(i) is a “shot noise.”
By the use of an Anscombe transform [1] applied to the noisy image, this Poisson noise
can be converted into homoscedastic white noise. (Of course this operation must be done
immediately on the raw image before demosaicing.) With little loss of generality, this paper
will therefore define image denoising as the operation removing additive white Gaussian noise
from digital images.
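As an illustration of the variance stabilization mentioned above, the following Python sketch applies the classical Anscombe transform 2√(x + 3/8) to simulated Poisson counts. The function names and the simple algebraic inverse are assumptions made for this example; more careful (unbiased) inverses exist and are not shown here.

```python
import numpy as np

def anscombe(x):
    """Forward Anscombe transform: Poisson counts -> approximately unit-variance data."""
    return 2.0 * np.sqrt(np.asarray(x, dtype=float) + 3.0 / 8.0)

def inverse_anscombe(y):
    """Plain algebraic inverse (an assumption of this sketch; biased at low counts)."""
    return (np.asarray(y, dtype=float) / 2.0) ** 2 - 3.0 / 8.0

# Empirical check: the standard deviation after the transform should be
# close to 1 whatever the Poisson mean (for means that are not too small).
rng = np.random.default_rng(0)
stabilized_std = {}
for mean in (5.0, 20.0, 100.0):
    counts = rng.poisson(mean, size=200_000)
    stabilized_std[mean] = float(anscombe(counts).std())
```

After the transform, the noise can be treated as approximately additive white Gaussian noise with unit standard deviation, which is the setting assumed in the rest of the paper.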
Getting back a better estimate of u(i) by observing only ũ(i) is impossible. A first solution,
proposed as early as 1966 in [20], consists in multiplying the Fourier transform of the image
by optimal coefficients to attenuate the noise. It results in a convolution of the image with a
low-pass kernel.
From a stochastic viewpoint, the observed regularity of digital images also implies that
values ũ(j) at neighboring pixels j of a pixel i are positively correlated with ũ(i). Thus,
∗ Received by the editors April 25, 2012; accepted for publication (in revised form) March 19, 2013; published
electronically September 10, 2013. This research was partially financed by the MISS project of Centre National
d’Etudes Spatiales, the Office of Naval Research under grant N00014-97-1-0839, the European Research Council,
advanced grant “Twelve labours,” and the Spanish government under TIN2011-27539.
http://www.siam.org/journals/siims/6-3/87498.html
† CMLA, Ecole Normale Supérieure de Cachan, Cachan Cedex 94235, France (marc.lebrun.ik@gmail.com,
morel@cmla.ens-cachan.fr).
‡ Mathematics, Universitat de les Illes Balears, Palma de Mallorca 07122, Spain (toni@buades.uib.es).
Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.

these values can be taken into account to obtain a better estimate of u(i). These values
being nondeterministic, Bayesian approaches were proposed in 1972 [38]. In short, there are
two complementary early approaches to denoising—the Fourier method and the Bayesian
estimation. The Fourier method has been extended in the past 30 years to other linear space-
frequency transforms such as the windowed discrete cosine transform (DCT) [42] or the many
wavelet transforms [31].
Being first parametric and limited to rather restrictive Markov random field models [17],
the Bayesian method is becoming nonparametric. The idea behind the recent nonparametric
Markovian estimation methods comes from a now famous algorithm to synthesize textures from
examples [16]. The underlying Markovian assumption is that, in a textured image, the stochastic
model for a given pixel i can be predicted from a local image neighborhood P of i, a “patch.”
The assumption for recreating new textures from samples is that there are enough pixels
j similar to i in a texture image ũ to recreate a new but similar texture u. The construction
of u is done by nonparametric sampling, which amounts to an iterative copy-paste process.
As was proposed in [4], an adaptation of this synthesis principle yields an image denoising
algorithm. The observed image is the noisy image ũ. The reconstructed image is the denoised
image û. The patch is a square centered at i, and the pasting yielding u(i) is replaced by
a weighted average of values at all pixels ũ(j) similar to i. This simple change leads to the
“nonlocal means” algorithm (Algorithm 1). See also Awate and Whitaker [2] for a method
with a similar principle and [25] for an extended review of patch-based denoising methods.
Algorithm 1 Nonlocal means algorithm

Input: noisy image ũ, σ noise standard deviation.
Output: denoised image û.
Set parameter κ × κ: dimension of patches.
Set parameter λ × λ: dimension of search zone in which similar patches are searched.
Set parameter C.
for each pixel i do
Select a square reference subimage (or “patch”) P̃ around i of size κ × κ.
Call P̂ the denoised version of P̃ obtained as a weighted average of the patches Q̃ in a
square neighborhood of i of size λ × λ. The weights in the average are proportional to
$w(\tilde P, \tilde Q) = e^{-\frac{d^2(\tilde P, \tilde Q)}{C\sigma^2}}$, where d(P̃, Q̃) is the Euclidean distance between patches P̃ and Q̃.
end for
Aggregation: recover a final denoised value û(i) at each pixel i by averaging all values at i
of all denoised patches Q̂ containing i
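The patch-wise step of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `nl_means_patch`, the handling of image borders (candidates leaving the image are skipped) and the requirement that the reference patch lie fully inside the image are assumptions of this example. The final aggregation step is not included.

```python
import numpy as np

def nl_means_patch(noisy, i, j, sigma, kappa, lam, C):
    """Denoise the kappa x kappa patch centered at (i, j).

    The result is the weighted average of all patches Q whose centers lie in
    the lam x lam search zone around (i, j), with weights
    exp(-d^2(P, Q) / (C * sigma^2)), d being the Euclidean patch distance.
    """
    h, s = kappa // 2, lam // 2
    ref = noisy[i - h:i + h + 1, j - h:j + h + 1]
    acc = np.zeros_like(ref, dtype=float)
    total = 0.0
    for a in range(i - s, i + s + 1):
        for b in range(j - s, j + s + 1):
            # Skip candidate centers whose patch would leave the image.
            if a - h < 0 or b - h < 0:
                continue
            if a + h >= noisy.shape[0] or b + h >= noisy.shape[1]:
                continue
            q = noisy[a - h:a + h + 1, b - h:b + h + 1]
            d2 = float(np.sum((ref - q) ** 2))
            w = np.exp(-d2 / (C * sigma ** 2))
            acc += w * q
            total += w
    # The reference patch itself always contributes with weight 1.
    return acc / total
```

Since d² is a sum over κ² pixels, the useful range of C grows with the patch size; the value of C is left to the caller here.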
The algorithm gives the best possible mean square estimation if the image is modeled as
an infinite stationary ergodic spatial process [4]. The algorithm was called “nonlocal” because
it was allowed to use patches Q̃ at some distance from P̃ , and even patches taken from other
images.
Two types of performance bounds proposed in [28] and [8] address the same class of patch-
based algorithms and propose denoising methods approaching these bounds (section 3). The
denoising method proposed in [28] is NL-means, applied with the adequate parameter C to
account for a Bayesian linear minimum mean square estimation (LMMSE) of the noisy patch,
given a huge database of $10^{10}$ patches. According to these authors, at least one algorithm,
BM3D [13], might be close to the best predicted estimation error.
In this paper, we examine several recently proposed Bayesian patch-based denoising meth-
ods and propose a particularly simple one. This new algorithm improves the state of the art
of color denoising algorithms, with a lower complexity. The algorithm, which we shall call
nonlocal Bayes (NL-Bayes), will be compared to seven other algorithms, both in theory and
in practice.
Our plan is as follows. Section 2 gives the general framework for the Bayesian patch-based
method and describes the proposed algorithm. Section 3 summarily describes the most recent
and influential related patch-based methods. In section 4, these methods are all compared
from the theoretical viewpoint. Not all can be compared in practice, since some are not
available or fully reproducible. Nevertheless, a practical comparison is given in section 5 for
seven of them.
2. The Bayesian patch-based method. Given u, the noiseless ideal image, and ũ, the
noisy image corrupted with Gaussian noise of standard deviation σ so that
(1) ũ = u + n,
the conditional distribution P(ũ | u) is

(2) $\mathbb{P}(\tilde u \mid u) = \frac{1}{(2\pi\sigma^2)^{M/2}}\, e^{-\frac{\|u - \tilde u\|^2}{2\sigma^2}}$,
where M is the total number of pixels in the image. In order to compute the probability of
the original image given the degraded one, P(u | ũ), we need to introduce a prior on u.
Recent Bayesian methods have abandoned the global patch models formulated by an a
priori Gibbs energy [17], [3]. Instead, the methods build local nonparametric “patch models”
learned from the image itself, usually as a local Gaussian model around each given patch,
or as a Gaussian mixture. In these nonparametric models, the patches are larger, usually
8 × 8, while the cliques of a priori models are often confined to 3 × 3 neighborhoods. Given a
noiseless patch P of u with dimension κ × κ, and P̃ , an observed noisy version of P , the white
noise model gives by the independence of noise pixel values
(3) $\mathbb{P}(\tilde P \mid P) = c \cdot e^{-\frac{\|\tilde P - P\|^2}{2\sigma^2}}$,
where c is a multiplicative constant, P and P̃ are considered as vectors with κ² components,
and ||P|| denotes the Euclidean norm of P. Knowing P̃, our goal is to deduce P by maximizing
P(P |P̃ ). Using Bayes’s rule, we can compute this last conditional probability as
(4) $\mathbb{P}(P \mid \tilde P) = \frac{\mathbb{P}(\tilde P \mid P)\,\mathbb{P}(P)}{\mathbb{P}(\tilde P)}$.
P̃ being observed, this formula can in principle be used to deduce the patch P maximizing the
right term, viewed as a function of P . This is only possible if we have a probability model for
P , and these models will be generally learned from the image itself, or from a set of images.

Each patch is associated to a subset of patches of the image or set of images and then
denoised by a Bayesian estimator in this cluster, which is treated as a set of Gaussian samples.
Another still more direct way to build a model for a given patch P̃ is to group the patches
similar to P̃ in the image. Assuming that these similar patches are samples of a Gaussian
vector yields a standard Bayesian restoration. Assume that the patches Q similar to P follow a
Gaussian model with (observable, empirical) covariance matrix $C_P$ and (observable, empirical)
mean $\overline{P}$. This means that
(5) $\mathbb{P}(Q) = \alpha \cdot e^{-\frac{(Q - \overline{P})^t C_P^{-1} (Q - \overline{P})}{2}}$,
where α is a normalization constant. From (3), (4), and (5) we obtain for each observed noisy
patch P̃ the following equivalence of problems:
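\[
\begin{aligned}
\arg\max_P \mathbb{P}(P \mid \tilde P) &= \arg\max_P \mathbb{P}(\tilde P \mid P)\,\mathbb{P}(P) \\
&= \arg\max_P e^{-\frac{\|P - \tilde P\|^2}{2\sigma^2}}\, e^{-\frac{(P - \overline{P})^t C_P^{-1} (P - \overline{P})}{2}} \\
&= \arg\min_P \frac{\|P - \tilde P\|^2}{\sigma^2} + (P - \overline{P})^t C_P^{-1} (P - \overline{P}).
\end{aligned}
\]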
Nevertheless, this expression does not immediately yield a denoising algorithm for P̃ . Indeed,
the noiseless patch P and the patches similar to P are not observable. Yet, the noisy version P̃
is observable and the patches Q̃ similar to P̃ can be found. An empirical mean and covariance
matrix can therefore be obtained for the patches Q̃ similar to P̃ . Furthermore, if the set of
selected patches Q̃ is similar to that which would be selected by comparing with the original
patch P , then the following approximations hold:
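(6) $C_{\tilde P} \simeq C_P + \sigma^2 I; \quad \overline{\tilde P} \simeq \overline{P}$.
If the above approximations are correct, our maximum a posteriori (MAP) estimation problem
finally boils down by (6) to the following feasible minimization problem:
\[
\max_P \mathbb{P}(P \mid \tilde P) \;\Leftrightarrow\; \min_P \frac{\|P - \tilde P\|^2}{\sigma^2} + (P - \overline{\tilde P})^t (C_{\tilde P} - \sigma^2 I)^{-1} (P - \overline{\tilde P}).
\]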
Differentiating this quadratic function with respect to P and equating to zero yields
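\[ P - \tilde P + \sigma^2 (C_{\tilde P} - \sigma^2 I)^{-1} (P - \overline{\tilde P}) = 0. \]
Taking into account that $I + \sigma^2 (C_{\tilde P} - \sigma^2 I)^{-1} = (C_{\tilde P} - \sigma^2 I)^{-1} C_{\tilde P}$, this yields
\[ (C_{\tilde P} - \sigma^2 I)^{-1} C_{\tilde P}\, P = \tilde P + \sigma^2 (C_{\tilde P} - \sigma^2 I)^{-1}\, \overline{\tilde P}, \]
and therefore
\[
\begin{aligned}
P &= C_{\tilde P}^{-1} (C_{\tilde P} - \sigma^2 I)\, \tilde P + \sigma^2 C_{\tilde P}^{-1}\, \overline{\tilde P} \\
&= \overline{\tilde P} + \left[ C_{\tilde P} - \sigma^2 I \right] C_{\tilde P}^{-1} (\tilde P - \overline{\tilde P}).
\end{aligned}
\]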

Thus we have proved that a restored patch P̂1 can be obtained from the observed patch P̃ by
the one-step estimation

(7) $\hat P_1 = \overline{\tilde P} + \left[ C_{\tilde P} - \sigma^2 I \right] C_{\tilde P}^{-1} (\tilde P - \overline{\tilde P})$,
which resembles a local Wiener filter. The performance of this strategy depends on the
precision of the approximations (6) and therefore on the ability to select many patches sharing
the same model of P , and only these ones.
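The one-step estimate (7) can be illustrated by the following Python sketch, which denoises a whole group of similar noisy patches at once. It is a sketch under assumptions: the function name `nl_bayes_first_step` and the row-wise layout of the patches are choices made for this example, and the empirical covariance is assumed invertible (more patches than patch pixels). It uses the algebraically equivalent form P̃ − σ² C⁻¹ (P̃ − mean) of the right-hand side of (7).

```python
import numpy as np

def nl_bayes_first_step(patches, sigma):
    """First-step estimate (7) for a group of similar noisy patches.

    patches : (n, d) array, one vectorized noisy patch per row, with n > d.
    Returns an (n, d) array of estimates, one per input patch.
    """
    mean = patches.mean(axis=0)                       # empirical mean of the group
    centered = patches - mean
    cov = centered.T @ centered / (patches.shape[0] - 1)  # empirical covariance
    # mean + (C - s^2 I) C^{-1} (P - mean)  ==  P - s^2 C^{-1} (P - mean)
    correction = np.linalg.solve(cov, centered.T).T   # rows: C^{-1} (P - mean)
    return patches - sigma ** 2 * correction
```

Directions in which the empirical variance is close to σ² are shrunk almost entirely to the mean, while directions carrying much more variance than the noise are left nearly untouched, which is the Wiener-like behavior noted above.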
Sections 2.1, 3.1, 3.2, 3.3, 3.6, and 3.7 will examine no less than six Bayesian algorithms
deriving patch-based denoising algorithms from variants of (7). The first question when
looking at this formula is obviously how the matrix CP̃ can be learned from the image itself.
Each method proposes a different notion to learn the patch model. This is really the only
difference between them.
2.1. Application: Nonlocal Bayesian denoising. The relation (7),

\[ \hat P_1 = \overline{\tilde P} + \left[ C_{\tilde P} - \sigma^2 I \right] C_{\tilde P}^{-1} (\tilde P - \overline{\tilde P}), \]
gives by itself a denoising algorithm, provided we can compute the patch expectations and
patch covariance matrices. We shall now explain how the NL-Bayes algorithm does it. (All
implementation details and an online demo are available in [5].)
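In order to select the set of similar patches $\mathcal{P}(\tilde P)$, the L² distance between the reference
patch P̃ and all patches Q̃ in a neighborhood around P̃ is computed. Then, only a fixed
number of similar patches is kept, namely those with the lowest distance to the reference
one. This fixed number must be larger than the dimension of the patch to obtain an invertible
covariance matrix $C_{\tilde P}$. Then, the empirical mean $\overline{\tilde P}$ and covariance matrix $C_{\tilde P}$ are given by
(8) $C_{\tilde P} \simeq \frac{1}{\#\mathcal{P}(\tilde P) - 1} \sum_{\tilde Q \in \mathcal{P}(\tilde P)} (\tilde Q - \overline{\tilde P})(\tilde Q - \overline{\tilde P})^t, \quad \overline{\tilde P} \simeq \frac{1}{\#\mathcal{P}(\tilde P)} \sum_{\tilde Q \in \mathcal{P}(\tilde P)} \tilde Q.$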
The selection of similar patches can be improved in a second step by using the first estimate
P̂1 as oracle. Table 1 shows that iterating the oracle estimate more than once only brings
a marginal improvement. Indeed, the patch similarity is better estimated with the denoised
patches and a better approximation of P and CP can be obtained. The new set of similar
patches P(P̂1 ) is set to contain neighboring denoised patches Q̂1 with L2 distance to P̂1 below
a certain threshold τ0 . Then,
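(9) $C_{\hat P_1} \simeq \frac{1}{\#\mathcal{P}(\hat P_1) - 1} \sum_{\hat Q_1 \in \mathcal{P}(\hat P_1)} (\hat Q_1 - \overline{\tilde P}^1)(\hat Q_1 - \overline{\tilde P}^1)^t, \quad \overline{\tilde P}^1 \simeq \frac{1}{\#\mathcal{P}(\hat P_1)} \sum_{\hat Q_1 \in \mathcal{P}(\hat P_1)} \tilde Q,$
respectively, approximate $C_P$ and $\overline{P}$. This approximation is different from (6), where $C_P$ was
approximated by using the covariance matrix of noisy patches, $C_{\tilde P} \simeq C_P + \sigma^2 I$. Thus, we
obtain a second Bayesian estimation formula
(10) $\hat P_2 = \overline{\tilde P}^1 + C_{\hat P_1} \left[ C_{\hat P_1} + \sigma^2 I \right]^{-1} (\tilde P - \overline{\tilde P}^1).$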
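The second-step formula (10) can be sketched in Python as follows. This is only an illustration under assumptions: the function name `nl_bayes_second_step` and the row-wise patch layout are choices of this example, and the two input groups are assumed to be already matched (row k of `oracle` is the first-step estimate of row k of `noisy`). Following (9), the mean is taken over the noisy patches and the covariance over the first-step estimates.

```python
import numpy as np

def nl_bayes_second_step(noisy, oracle, sigma):
    """Second-step estimate (10) for a group of similar patches.

    noisy  : (n, d) array of noisy patches selected through the oracle.
    oracle : (n, d) array of the corresponding first-step estimates.
    Returns an (n, d) array of second-step estimates of the noisy patches.
    """
    n, d = noisy.shape
    mean = noisy.mean(axis=0)                   # mean of the noisy patches, as in (9)
    centered_oracle = oracle - mean
    cov = centered_oracle.T @ centered_oracle / (n - 1)   # covariance from the oracle
    # Wiener-type gain C (C + sigma^2 I)^{-1}
    gain = cov @ np.linalg.inv(cov + sigma ** 2 * np.eye(d))
    # It is still the noisy patches that are filtered, not the oracle ones.
    return mean + (noisy - mean) @ gain.T
```

Note that the matrix being inverted here is C + σ²I, which is always positive definite for σ > 0, unlike the covariance inverted in the first step.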

Table 1
Mean PSNRs on seven noiseless images with an added noise of standard deviation σ = 2, 5, 10, 20, 30,
40, 60, 80, and 100. Only the first three digits are actually significant; the last one may vary with different
white noise realizations. “Classic” refers to the NL-Bayes algorithm with only one “oracle estimation,” and
“Iterate” refers to the NL-Bayes algorithm with two “oracle estimations.”
Method    σ=2     σ=5     σ=10    σ=20    σ=30    σ=40    σ=60    σ=80    σ=100
Classic   46.13   40.87   36.56   32.64   30.13   28.31   27.36   26.30   25.14
Iterate   46.18   41.13   37.26   33.06   29.76   27.81   27.15   26.42   25.27
The estimates (7) and (10) may appear equivalent, but they are not so in practice. $C_{\hat P_1}$,
obtained after a first denoising step, is a better estimation than $C_{\tilde P}$. Furthermore, $\overline{\tilde P}^1$ is a more
accurate mean than $\overline{\tilde P}$. Nevertheless, this second step cannot be considered as an iteration
of the first step algorithm. Indeed, it is still the original patch P̃ which is being denoised at
the second step. Since in Algorithm 2 the number of independent samples (the number of
patches) used for computing the covariance matrix is larger than the dimension of the vectors
(size of patches), the empirical covariance matrices of the form $C_P$ are always symmetric and
nonnegative. They are actually positive definite with high probability on natural images. Thus, it is
always possible in practice to invert them. Numerically, a Cholesky-based symmetric matrix
inversion algorithm is used. For the sake of completeness, if the matrix cannot be inverted
(this does not happen in practice, but could perhaps happen on a synthetic image), the
algorithm sets P̂1 = P̃ or P̂2 = P̂1.
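The inversion and its fallback can be sketched as follows. The helper names `invert_spd_or_none` and `first_step_with_fallback` are hypothetical, and the code relies on NumPy's `np.linalg.cholesky`, which raises `LinAlgError` when the matrix is not positive definite; the reference implementation [5] may proceed differently.

```python
import numpy as np

def invert_spd_or_none(cov):
    """Invert a symmetric matrix through its Cholesky factor.

    Returns None when the factorization fails, i.e. when the matrix is
    not (numerically) positive definite.
    """
    try:
        lower = np.linalg.cholesky(cov)          # cov = lower @ lower.T
    except np.linalg.LinAlgError:
        return None
    lower_inv = np.linalg.solve(lower, np.eye(cov.shape[0]))
    return lower_inv.T @ lower_inv               # cov^{-1} = L^{-T} L^{-1}

def first_step_with_fallback(patch, mean, cov, sigma):
    """First-step estimate for one patch, keeping the noisy patch if cov is singular."""
    cov_inv = invert_spd_or_none(cov)
    if cov_inv is None:
        return patch.copy()                      # fallback: estimate = noisy patch
    return patch - sigma ** 2 * cov_inv @ (patch - mean)
```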
Related works. The first algorithm proposing a global computation of a principal compo-
nent analysis (PCA) on sliding image patches seems to be due to Muresan and Parks [33].
A study on the compared performance of a local PCA versus global PCA for two-step image
denoising (TSID) is proposed in [7]. One can relate NL-Bayes to the algorithm proposed
in [47], called in the following TSID, and to BM3D-SAPCA [14]. Indeed, both algorithms
apply a PCA on the set of patches similar to P̃ . Both algorithms perform a threshold, or a
Wiener filter, on the coefficients of the patches on the PCA basis. For the BM3D-SAPCA, this
two-dimensional (2D) transform is followed by an additional one-dimensional (1D) transform
by using a 1D Haar wavelet transform. The TSID algorithm is iterated twice, but without
the use of any oracle. The same algorithm is simply applied again to the result of the first
estimate. BM3D-SAPCA instead applies an oracle strategy followed by a third iteration. This
third iteration does not always improve the performance of NL-Bayes as shown in Table 1.
Nevertheless, a slight improvement can be obtained for small and very large values of noise.
However, applying too many iterations may deteriorate the result, since more details can
disappear from one iteration to the other.
BM3D-SAPCA demonstrates state-of-the-art results on grey level images, but it is a far
more complex algorithm than NL-Bayes, as shown by its description by its authors. It is “an
image denoising method that exploits nonlocal image modeling, principal component analysis
(PCA), and local shape-adaptive anisotropic estimation. The nonlocal modeling is exploited
by grouping similar image patches in 3-D groups. The denoising is performed by shrinkage
of the spectrum of a 3-D transform applied on such groups. The effectiveness of the shrink-
age depends on the ability of the transform to sparsely represent the true-image data, thus

Algorithm 2 NL-Bayes image denoising for grey level images.

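Input: noisy image
Output: denoised image
Parameters:
Size of patches: $\kappa_1 = \begin{cases} 3 & \text{if } \sigma < 20, \\ 5 & \text{if } \sigma < 50, \\ 7 & \text{otherwise,} \end{cases} \qquad \kappa_2 = \begin{cases} 3 & \text{if } \sigma < 50, \\ 5 & \text{if } \sigma < 70, \\ 7 & \text{otherwise.} \end{cases}$
Number of similar patches retained: $N_1 = 3\kappa_1^2$, $N_2 = 3\kappa_2^2$.
Size of search zone: $\lambda_1 = N_1/2$, $\lambda_2 = N_2/2$.
Fixed threshold for step 2: $\tau_0 = 16\kappa_2^2$.
for all patches P̃ of the noisy image do
Find a set $\mathcal{P}(\tilde P)$ of patches Q̃ similar to P̃.
Compute the expectation $\overline{\tilde P}$ and covariance matrix $C_{\tilde P}$ of these patches by
$C_{\tilde P} \simeq \frac{1}{\#\mathcal{P}(\tilde P) - 1} \sum_{\tilde Q \in \mathcal{P}(\tilde P)} (\tilde Q - \overline{\tilde P})(\tilde Q - \overline{\tilde P})^t, \quad \overline{\tilde P} \simeq \frac{1}{\#\mathcal{P}(\tilde P)} \sum_{\tilde Q \in \mathcal{P}(\tilde P)} \tilde Q.$
Obtain the first step estimation:
$\hat P_1 = \overline{\tilde P} + \left[ C_{\tilde P} - \sigma^2 I \right] C_{\tilde P}^{-1} (\tilde P - \overline{\tilde P}).$
end for
Obtain the pixel value of the basic estimate image û1 at each pixel i as a uniform average of all values of
all denoised patches Q̂1 which contain i.
for all patches P̃ of the noisy image do
Find a new set $\mathcal{P}_1(\tilde P)$ of noisy patches Q̃ similar to P̃ by comparing their denoised
“oracular” versions Q̂1 to P̂1.
Compute the new expectation $\overline{\tilde P}^1$ and covariance matrix $C_{\hat P_1}$ of these patches:
$C_{\hat P_1} \simeq \frac{1}{\#\mathcal{P}(\hat P_1) - 1} \sum_{\hat Q_1 \in \mathcal{P}(\hat P_1)} (\hat Q_1 - \overline{\tilde P}^1)(\hat Q_1 - \overline{\tilde P}^1)^t, \quad \overline{\tilde P}^1 \simeq \frac{1}{\#\mathcal{P}(\hat P_1)} \sum_{\hat Q_1 \in \mathcal{P}(\hat P_1)} \tilde Q.$
Obtain the second step patch estimate
$\hat P_2 = \overline{\tilde P}^1 + C_{\hat P_1} \left[ C_{\hat P_1} + \sigma^2 I \right]^{-1} (\tilde P - \overline{\tilde P}^1)$
end for
Obtain the pixel value of the denoised image û(i) as a uniform average of all values of all
denoised patches Q̂2 which contain i.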

separating it from the noise. We propose to improve the sparsity in two aspects. First, we
employ image patches (neighborhoods) which can have data-adaptive shape. Second, we pro-
pose PCA on these adaptive-shape neighborhoods as part of the employed 3-D transform.
The PCA bases are obtained by eigenvalue decomposition of empirical second-moment matri-
ces that are estimated from groups of similar adaptive-shape neighborhoods” [14]. It would
be interesting to accelerate and simplify this promising algorithm and to extend it to color
images.
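2.2. Dealing with flat areas. For medium and large values of noise, some artifacts inherent
in the method may appear in flat/homogeneous areas. In order to reduce those artifacts, a
special treatment is added during the collaborative filtering of the first step. Once $\mathcal{P}(\tilde P)$
has been obtained, we start by detecting whether P̃ belongs to a homogeneous area by computing
the square of the standard deviation of the pixel values of $\mathcal{P}(\tilde P)$:
\[
\sigma_{\tilde P}^2 = \frac{N_{\tilde P}^{\kappa}}{N_{\tilde P}^{\kappa} - 1} \left( \frac{1}{N_{\tilde P}^{\kappa}} \sum_{\tilde Q \in \mathcal{P}(\tilde P)} \sum_{x \in \tilde Q} \tilde Q(x)^2 - \left( \frac{1}{N_{\tilde P}^{\kappa}} \sum_{\tilde Q \in \mathcal{P}(\tilde P)} \sum_{x \in \tilde Q} \tilde Q(x) \right)^2 \right).
\]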
Since a huge number ($N_{\tilde P}^{\kappa} = \#\mathcal{P}(\tilde P)\,\kappa^2$) of realizations of the variable u(i) is taken into account,
in a homogeneous area this random variable should be very concentrated around its mean.
Thus, fixing a threshold γ close to 1, if $\sigma_{\tilde P} \le \gamma\sigma$, we can assume with high probability that
P̃ belongs to a homogeneous area. In this case, the best result that can be obtained for the
group is the average. Therefore, the estimate of all patches in the set of similar patches $\mathcal{P}(\tilde P)$
is
\[
\forall \tilde Q \in \mathcal{P}(\tilde P),\ \forall x \in \hat Q_1, \quad \hat Q_1(x) = \frac{1}{N_{\tilde P}^{\kappa}} \sum_{\tilde Q \in \mathcal{P}(\tilde P)} \sum_{y \in \tilde Q} \tilde Q(y)
\]
instead of (7).
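The flat-area test and its constant estimate can be sketched as below. The function name `flat_area_estimate` and the default value γ = 1.1 are assumptions of this example (the text only states that γ is close to 1); returning `None` for non-flat groups is a convention of the sketch, the caller then falling back on (7).

```python
import numpy as np

def flat_area_estimate(patches, sigma, gamma=1.1):
    """Homogeneous-area treatment for a group of similar noisy patches.

    patches : (n, d) array, one vectorized noisy patch per row.
    Returns an (n, d) array filled with the global mean of the group when
    the group is detected as flat, and None otherwise.
    """
    values = patches.ravel()                 # all pixel values of the group
    count = values.size
    # Unbiased variance of all pixel values of the group.
    variance = count / (count - 1) * (np.mean(values ** 2) - np.mean(values) ** 2)
    if np.sqrt(variance) <= gamma * sigma:
        return np.full_like(patches, values.mean(), dtype=float)
    return None
```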
2.3. Considerations about the complexity of the algorithm. For a given reference patch
P , we need to look for similar patches, build the Gaussian model, and apply the Bayes
estimation. Let λ be the size of the search zone, κ the patch size, and N the fixed number of
similar patches. Then the complexity is as follows for each step.
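1. Search for similar patches.
Computing the L² distance between all patches in the neighborhood: $\lambda^2\kappa^2 = \left(\frac{3}{2}\right)^2 \kappa^4$.
Partial sorting of the N closest patches: $N^2 \log\left(\frac{\lambda^2}{2}\right) = 9\kappa^4 \log\left(\frac{9}{8}\kappa^4\right)$, where it is taken into
account that the size of the search zone depends on the size of the patch (see Algorithm 2).
2. Bayes estimation.
Computation of $C_{\tilde P}$: $N\kappa^4 = 3\kappa^6$.
Inversion of $C_{\tilde P}$: $\frac{1}{3}\kappa^6 + \kappa^4$.
Noticing that $\overline{\tilde P} + [C_{\tilde P} - \sigma^2 I]\, C_{\tilde P}^{-1} (\tilde P - \overline{\tilde P}) = \tilde P - \sigma^2 C_{\tilde P}^{-1} (\tilde P - \overline{\tilde P})$, only one matrix product is
needed. Thus it takes for all similar patches of the three-dimensional (3D) group $N\kappa^4 = 3\kappa^6$
operations.
In summary, the overall complexity for the first step patch restoration is
\[
\frac{19}{3}\kappa^6 + \kappa^4 \left( \frac{13}{4} + 9 \log\left( \frac{9}{8}\kappa^4 \right) \right)
\]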

operations, and it doubles for the complete algorithm (step 1 + step 2). Nevertheless, this
process can be considerably accelerated. Once a Gaussian model has been estimated, it can
be applied to denoise all the patches that have been used to estimate the model. In this way,
when denoising a patch all similar patches are also simultaneously denoised. This acceleration
is used in the accompanying online demo [5]. It divides the above computed complexity by a
κ2 factor.
2.4. Three tools beneficial for all methods. The performance of all denoising methods
depends on three generic tools: color transform, aggregation, and an oracle step. They increase
the performance of any denoising principle. They have been applied in the above Bayesian
method and are also applied in the methods we shall compare it to.
Dealing with color images. The straightforward strategy to extend denoising algorithms to
color or multivalued images is to apply the algorithm independently to each channel. The use
of this simple strategy often introduces perturbing color artifacts. Depending on the algorithm
formulation, a vector-valued version dealing at the same time with all color channels could be
proposed. This solution is, for example, well adapted to neighborhood filters.
The alternative option is to convert the usual RGB (red, green, blue) image to a different
color space where the independent denoising of each channel does not create noticeable color
artifacts. The classical transform used in the literature is the YUV or YCrCb color system,
which somewhat separates the geometric from the chromatic information. The Y component
is computed as an average of red, green, and blue giving more weight to the green channel,
which is supposed to contain more geometrical information. It is observed instead that the U
and V components mainly contain colored texture information and are simply zero in grey
parts of the image. NL-Bayes uses an analogous transform $Y_oU_oV_o$ introduced by [35], given
by the matrix
\[
Y_oU_oV_o = \begin{pmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{2} & 0 & -\frac{1}{2} \\ \frac{1}{4} & -\frac{1}{2} & \frac{1}{4} \end{pmatrix}.
\]
This color transform to the space Yo Uo Vo is an orthogonal transform. It has the advantage
of maximizing the noise reduction of the geometric component, since this component is a
uniform average of the three colors. It also permits a higher noise reduction on the chromatic
components Uo and Vo , due to their observable regularity. This latter strategy is adopted
by transform thresholding filters. NL-Bayes also adopts this strategy, denoising each channel
separately, but nevertheless finds the set of similar patches for each pixel by using the distances
in Yo , as originally proposed in [12].
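The properties of this color transform can be checked numerically with the short Python sketch below. The names `A`, `rgb_to_yuv_o` and `yuv_o_to_rgb` are hypothetical, and pixels are assumed to be stored as rows (R, G, B). The Gram matrix of the rows is diagonal, which is the sense in which the transform is orthogonal here (the rows are mutually orthogonal but not of unit norm).

```python
import numpy as np

# Rows give Yo, Uo and Vo as linear combinations of (R, G, B).
A = np.array([[1 / 3, 1 / 3, 1 / 3],
              [1 / 2, 0.0, -1 / 2],
              [1 / 4, -1 / 2, 1 / 4]])

def rgb_to_yuv_o(rgb):
    """Apply the transform to an array of pixels with shape (..., 3)."""
    return np.asarray(rgb, dtype=float) @ A.T

def yuv_o_to_rgb(yuv):
    """Inverse transform, obtained by inverting the 3 x 3 matrix."""
    return np.asarray(yuv, dtype=float) @ np.linalg.inv(A).T

# Off-diagonal entries vanish: the three rows are mutually orthogonal.
gram = A @ A.T
# For white noise of variance sigma^2 on each RGB channel, the noise variance
# on (Yo, Uo, Vo) is sigma^2 times the squared row norms: 1/3, 1/2 and 3/8.
noise_gain = np.diag(gram)
```

The factor 1/3 on the first channel is the noise reduction obtained from the uniform average of the three colors mentioned in the text.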
Aggregation of estimates. Aggregation techniques combine for any pixel a set of m possible
estimates. If these estimates were independent and had equal variance, then a uniform average
would reduce this estimator variance by a factor m. Such an aggregation strategy was the
main proposition of the translation invariant wavelet thresholding algorithm [10]. This method
denoises several translations of the image by a wavelet thresholding algorithm and averages
these different estimates once the inverse translation has been applied to the denoised images.
Statistical arguments lead to attributing to each estimator a weight inversely proportional
to its variance [34]. When applied without aggregation, the denoising methods leave visible
“halos” of residual noise near edges. For example, in the sliding window DCT method [44],

patches containing edges have many large DCT coefficients which are kept by thresholding. In
flat zones instead, most DCT coefficients are canceled, and the noise is completely removed.
Guleryuz [19] proposed approximating the variance of each estimated patch by the number of
DCT coefficients kept after thresholding to remove this “halo” effect. The same approach is
used by BM3D [13] or PLOW [9] and further studied by Salmon and Strozecki [40].
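The simplest aggregation, a uniform average of all patch estimates covering each pixel, can be sketched as follows. The function name `aggregate_uniform` and the convention that a patch position is its top-left corner are assumptions of this example; variance-based weighting as in [34] or [19] would replace the unit increments by per-patch weights.

```python
import numpy as np

def aggregate_uniform(patches, positions, shape, kappa):
    """Uniformly average overlapping kappa x kappa patch estimates.

    patches   : sequence of (kappa, kappa) arrays (denoised patches).
    positions : sequence of (row, col) top-left corners, one per patch.
    shape     : shape of the output image.
    """
    acc = np.zeros(shape, dtype=float)
    count = np.zeros(shape, dtype=float)
    for patch, (i, j) in zip(patches, positions):
        acc[i:i + kappa, j:j + kappa] += patch
        count[i:i + kappa, j:j + kappa] += 1.0
    # Pixels covered by no patch keep the value 0 instead of dividing by 0.
    return acc / np.maximum(count, 1.0)
```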
Iteration and oracle filters. A first step denoised image can be used to improve the reappli-
cation of the denoising method to the initial noisy image. In a second step application of a
denoising principle, the denoised DCT coefficients, or the patch distances, can be computed
in the first step denoised image. They are an approximation to the true measurements that
would be obtained from the noise-free image. Thus, the first step denoised image is used as
an oracle for the second step.
For averaging filters such as neighborhood filters or NL-means, the image u can be denoised
in a first step by the method under consideration. This first step denoised image denoted by
û1 is used for computing more accurate color distances between pixels or patches. This is
what is used in the second step of NL-Bayes.
Similarly, for linear transform Wiener-type methods, the image is first denoised by its
classical definition. In a second iteration, the coefficients of the denoised image approximate
the true coefficients of the noise-free image. These oracle strategies are conceptually different
from a simple iteration such as the one used in TSID [47]. It is still the noisy image which
is being denoised in the second step, but the first estimate is used in order to approximate
the true model or measurements which are not available. We refer the reader to [32] for a
comparison of the oracle strategy on patch-based techniques with other classical approaches
such as “twicing” [41] and Bregman iterations [36].
3. Alternative Bayesian algorithms. This section summarily describes the most recent
and influential patch-based methods. In the next section they will be compared from a the-
oretical viewpoint. Not all can be compared in practice, since some are not available or are
hard to reproduce. Nevertheless, a more practical comparison will be given in section 5 for
those which could be tried.
3.1. Patch-based near-optimal image denoising (PLOW). While in NL-Bayes as de-
scribed in section 2.1 a local model is estimated in a neighborhood of each patch, in the
PLOW [9] method the idea is to learn from the image a sufficient number of patch clus-
ters, actually 15, and to apply the LMMSE estimate to each patch after having assigned
it to one of the clusters obtained by clustering. Thus, this empirical Bayesian algorithm
starts by clustering the patches by the classic K-means clustering algorithm. To take into
account that similar patches can actually have varying contrast, the interpatch distance is
photometrically neutral, and the authors call it a “geometric distance.” The clustering phase
is accelerated by a dimension reduction obtained by applying a PCA to the patches. The
clustering is therefore a segmentation of the set of patches, and the denoising of each patch
is then performed within its cluster. Since each cluster contains geometrically similar, but
not necessarily photometrically similar, patches, the method identifies for each patch in the
cluster the photometrically similar patches as those whose quadratic distance to the reference
patch is within the bounds allowed by noise. Then an LMMSE [24] estimate is obtained for
the reference patch by a variant of (7). The algorithm uses a first phase, which performs a

first denoising before constituting the clusters. Thus the main phase is actually using the first
phase as oracle to get the covariance matrices of the sets of patches.
3.2. Obtaining the patch model from a very large database. Two recent methods learn
from a very large patch set extracted from tens of thousands of images (up to $10^{10}$ patches).
Each patch of a noisy image is denoised by deducing its likeliest estimate, given the patch
database. In the case of [49], the patch space is organized as a Gaussian mixture with about
200 components. The approach of [28] is to define the simplest universal “shotgun” method,
where a huge set of patches is used to estimate the upper limits a patch-based denoising method
will ever reach. The results suggest a “near optimality of state of the art denoising results,”
the results obtained by the BM3D algorithm being only 0.1 decibel away from optimality for
methods using small patches (typically 8 × 8) and moderate noise.
To evaluate the minimum mean square estimation (MMSE), this experiment uses a set
of 20,000 images from the LabelMe dataset [39]. The method, even if not practical, is very
simple. Given a clean patch P , the noisy patch P̃ with Gaussian noise of standard deviation
σ has probability distribution
(11) $\mathbb{P}(\tilde P \mid P) = \frac{1}{(2\pi\sigma^2)^{\kappa^2/2}}\, e^{-\frac{\|P - \tilde P\|^2}{2\sigma^2}}$,
where κ² is the number of pixels in the patch. Then, given a noisy patch P̃, its optimal
estimator for the Bayesian MMSE is by Bayes’s formula

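(12) $\hat P = \mathbb{E}[P \mid \tilde P] = \int \mathbb{P}(P \mid \tilde P)\, P\, dP = \int \frac{\mathbb{P}(\tilde P \mid P)}{\mathbb{P}(\tilde P)}\, \mathbb{P}(P)\, P\, dP.$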
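Using a huge set of M natural patches (with a distribution supposedly approximating
the real natural patch density), one can approximate the terms in (12) by $\mathbb{P}(P)\,dP \simeq \frac{1}{M}$ and
$\mathbb{P}(\tilde P) \simeq \frac{1}{M} \sum_i \mathbb{P}(\tilde P \mid P_i)$, which in view of (11) yields
\[
\hat P \simeq \frac{\frac{1}{M} \sum_i \mathbb{P}(\tilde P \mid P_i)\, P_i}{\frac{1}{M} \sum_i \mathbb{P}(\tilde P \mid P_i)}.
\]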
Thus the final MMSE estimator is nothing but the exact application of NL-means on the
huge patch database.
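The estimator above can be written in a few lines of Python. This is an illustrative sketch only: the function name `shotgun_mmse` is hypothetical, the database is a small in-memory array rather than the huge patch set used in [28], and the subtraction of the largest log-weight is a standard numerical safeguard added here, not a step described in the text. The normalization constant of (11) and the factor 1/M cancel between numerator and denominator.

```python
import numpy as np

def shotgun_mmse(noisy_patch, database, sigma):
    """Approximate MMSE estimate of a patch from a database of clean patches.

    noisy_patch : (d,) vectorized noisy patch.
    database    : (M, d) array of clean patches P_i.
    Returns sum_i w_i P_i / sum_i w_i with w_i = exp(-||P_i - noisy||^2 / (2 sigma^2)).
    """
    d2 = np.sum((database - noisy_patch) ** 2, axis=1)
    log_w = -d2 / (2.0 * sigma ** 2)
    log_w -= log_w.max()          # avoids underflow of all weights; cancels in the ratio
    w = np.exp(log_w)
    return (w[:, None] * database).sum(axis=0) / w.sum()
```

The weights have exactly the NL-means form, with the clean database patches playing the role of the neighboring noisy patches.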
3.3. The Portilla et al. [37] wavelet neighborhood denoising (BLS-GSM). This algo-
rithm models a noiseless “wavelet coefficient neighborhood,” P , by a Gaussian scale mixture
(GSM) which is defined as
\[ P = \sqrt{z}\, U, \]
where U is a zero-mean Gaussian random vector and z is an independent positive scalar
random variable. The wavelet coefficient neighborhood is a patch of an oriented channel
of the image at a given scale, complemented with a coefficient of the channel at the same
orientation and the next lower scale. Thus, we can adopt again the patch notation P . Using
a GSM model for P estimated from the image itself, the method makes a Bayes least square

(BLS) estimator. For this reason, the method is called here BLS-GSM (Bayes least square
estimate of Gaussian scale mixture; the authors called it simply BLS). To use the GSM model
for wavelet patch denoising, the noisy input image is first decomposed into a wavelet pyramid,
and each image of the pyramid is denoised separately. The resulting denoised image is obtained
by the reconstruction algorithm from the wavelet coefficients. Assume now that the image has
been corrupted by independent additive Gaussian noise. Therefore, a typical neighborhood
of wavelet coefficients can be represented as
$$\tilde P = P + N = \sqrt{z}\,U + N, \tag{13}$$
where the noise N and P are considered to be independent. Define p^s(i, j) as the sample at
position (i, j) of the subband P^s, the subbands being enumerated as, e.g., s = 1, . . . , 49.
The neighborhood of the wavelet coefficient ps (i, j) is composed of its spatial neighbors for
the same subband s, supplemented with one coefficient at the same location and at the next
coarser scale with the same orientation. Using the observed noisy vector, P̃ , an estimate of
P can be obtained by
$$\mathbb{E}(P \mid \tilde P) = \int_0^\infty \mathbb{P}(z \mid \tilde P)\, \mathbb{E}(P \mid \tilde P, z)\, dz.$$
This estimation is the Bayesian denoised value of the reference coefficient. The integral is
computed numerically on experimentally obtained sampled intervals of z. This method is
actually extremely similar to other patch-based Bayesian methods. It has received a more
recent extension, reaching state-of-the-art performance, in [29].
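The structure of this estimator can be sketched in the scalar case, where the neighborhood reduces to a single coefficient. The sketch below is only an illustration under stated assumptions: the integral is replaced by a sum over a few sampled values of z, and z is given equal prior probability on each sample (the prior actually used in [37] is not reproduced here). All names are hypothetical. For fixed z, E(P | P̃, z) is a Wiener-type shrinkage, and P(z | P̃) is obtained by the Bayes rule from the Gaussian likelihood of P̃ given z.

```python
import math

def bls_gsm_scalar(y, sigma_n, var_u, z_samples):
    """BLS estimate of x from y = sqrt(z) * u + n in the scalar case.

    u ~ N(0, var_u), n ~ N(0, sigma_n^2), and z takes the values in
    z_samples with equal prior probability (an assumption of this sketch).
    """
    posteriors = []
    wiener_estimates = []
    for z in z_samples:
        var_y = z * var_u + sigma_n ** 2            # variance of y given z
        likelihood = (math.exp(-y * y / (2.0 * var_y))
                      / math.sqrt(2.0 * math.pi * var_y))
        posteriors.append(likelihood)               # proportional to P(z | y)
        wiener_estimates.append(z * var_u / var_y * y)   # E(x | y, z)
    norm = sum(posteriors)
    return sum(p * e for p, e in zip(posteriors, wiener_estimates)) / norm

z_grid = [math.exp(t) for t in (-4, -2, 0, 2, 4)]   # log-spaced samples of z
small = bls_gsm_scalar(1.0, sigma_n=10.0, var_u=100.0, z_samples=z_grid)
large = bls_gsm_scalar(200.0, sigma_n=10.0, var_u=100.0, z_samples=z_grid)
```

A large observed coefficient makes large values of z more probable and is therefore shrunk proportionally less than a small one, which is the adaptive behavior that the mixture over z provides.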
3.4. K-SVD and sparse coding. Sparse coding algorithms learn a redundant set D of
vectors called a dictionary and choose the right atoms to describe the current patch.
For a fixed patch size, the dictionary is encoded as a matrix of size κ² × n_dic, where κ² is the
number of pixels in the patch and n_dic ≥ κ². The dictionary patches, which are columns of the
matrix, are normalized (in a Euclidean norm). This dictionary may collect usual orthogonal
bases (such as discrete cosine transform, wavelets, and curvelets) but also patches extracted
(or learned) from clean images or even from the noisy image itself.
The dictionary permits us to compute a sparse representation α of each patch P , where
α is a coefficient vector of size n_dic satisfying P ≈ Dα. This sparse representation α can be
obtained with an orthogonal recursive matching pursuit (ORMP) [11]. The ORMP gives an
approximate solution to the (NP-complete) problem
$$\arg\min_{\alpha} \|\alpha\|_0 \quad \text{such that} \quad \|P - D\alpha\|_2^2 \le \kappa^2 (C\sigma)^2, \tag{14}$$
where ‖α‖₀ refers to the l0 norm of α, i.e., the number of nonzero coefficients of α. This last
constraint brings in a new parameter C. This coefficient multiplying the standard deviation
σ guarantees that, with high probability, a white Gaussian noise of standard deviation σ
on κ² pixels has an L2 norm lower than κCσ. In K-SVD (see [27, 30]) and other current
sparse coding algorithms, the previous denoising strategy is used as the first step of a two-
step algorithm. The selection step is iteratively combined with an update of the dictionary,
taking into account the image and the sparse codifications already computed. At each step of
the algorithm, (14) is solved thanks to the ORMP algorithm, and coefficients of D and α are
updated.
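The sparse coding step under the constraint of (14) can be illustrated as follows. This sketch uses a plain matching pursuit, a simpler greedy relative of the ORMP and not the ORMP itself; it omits the dictionary update. The function name, the toy dictionary, and the iteration cap are assumptions made for the illustration. The atoms are supposed to have unit Euclidean norm, as stated above.

```python
def matching_pursuit(patch, atoms, sigma, c=1.0):
    """Greedy sparse coding of `patch` over unit-norm `atoms`.

    Stops as soon as ||patch - D alpha||^2 <= n (c sigma)^2, where n is the
    number of pixels, mirroring the constraint of (14).
    """
    n = len(patch)
    target = n * (c * sigma) ** 2
    alpha = [0.0] * len(atoms)
    residual = list(patch)
    for _ in range(10 * len(atoms)):                # hard cap on iterations
        if sum(r * r for r in residual) <= target:
            break
        # Pick the atom most correlated with the current residual.
        corrs = [sum(a * r for a, r in zip(atom, residual)) for atom in atoms]
        best = max(range(len(atoms)), key=lambda j: abs(corrs[j]))
        if corrs[best] == 0.0:                      # nothing left to explain
            break
        alpha[best] += corrs[best]
        residual = [r - corrs[best] * a for r, a in zip(residual, atoms[best])]
    return alpha

# Toy dictionary: the canonical basis of R^4 plus one normalized constant atom.
atoms = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0],
         [0.5, 0.5, 0.5, 0.5]]
patch = [10.2, 9.9, 10.1, 9.8]                      # nearly constant patch
alpha = matching_pursuit(patch, atoms, sigma=0.5)
```

On this nearly constant patch a single atom (the constant one) already brings the residual below the noise-dependent threshold, so the code stops with one nonzero coefficient; the small remaining fluctuations are treated as noise.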

3.5. BM3D. BM3D is a denoising method extending sliding window DCT denoising [43]
(available implementation of [43] in [44]) and NL-means [4]. Instead of locally adapting a
basis or choosing from a large dictionary, it uses a fixed basis. The main difference with
DCT denoising is that a set of similar patches is used to form a 3D block which is filtered
by using a 3D transform—hence the name collaborative filtering. The method has four steps:
(a) finding the image patches similar to a given image patch and grouping them in a 3D block;
(b) 3D linear transform of the 3D block; (c) shrinkage of the transform spectrum coefficients;
(d) inverse 3D transformation. This 3D filter therefore filters out simultaneously all 2D image
patches in the 3D block. The filtered patches are then returned to their original positions,
and an adaptive aggregation procedure is applied by taking into account the number of kept
coefficients per patch during the thresholding process.
The first collaborative filtering step is much improved in a second step using an oracle
Wiener filtering. This second step mimics the first step, with two differences. The first
difference is that it compares the filtered patches instead of the original patches as described in
section 2.4. The second difference is that the new 3D group (built with the unprocessed image
samples, but using the patch distances of the filtered image) is processed by an oracle Wiener
filter using coefficients from the denoised image obtained at the first step to approximate the
true coefficients given by the classic Wiener method. The final aggregation step is identical
to that of the first step.
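The two shrinkage rules can be compared on a list of transform coefficients of one 3D group. The following is a simplified sketch and not the BM3D implementation: it assumes an orthonormal 3D transform (so that the noise keeps standard deviation σ on each coefficient), reuses the same group in both steps instead of regrouping on the filtered image, and takes a threshold multiplier lam = 2.7 as an illustrative value. All function names are hypothetical.

```python
def hard_threshold(coeffs, sigma, lam=2.7):
    """First step: zero every transform coefficient not above lam * sigma."""
    return [c if abs(c) > lam * sigma else 0.0 for c in coeffs]

def oracle_wiener(noisy_coeffs, basic_coeffs, sigma):
    """Second step: Wiener shrinkage of the noisy coefficients, using the
    first-step (basic) coefficients as the oracle for the true spectrum."""
    out = []
    for c_noisy, c_basic in zip(noisy_coeffs, basic_coeffs):
        weight = c_basic ** 2 / (c_basic ** 2 + sigma ** 2)
        out.append(weight * c_noisy)
    return out

sigma = 10.0
noisy = [250.0, -40.0, 12.0, -5.0, 30.0]    # hypothetical group coefficients
basic = hard_threshold(noisy, sigma)
final = oracle_wiener(noisy, basic, sigma)
```

The hard threshold makes a binary keep-or-kill decision, whereas the oracle Wiener weights attenuate each kept coefficient according to its estimated signal-to-noise ratio.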
The algorithm is extended to color images through the Y_oU_oV_o color system. The previous
strategy is applied independently to each channel with the exception that similar patches are
always selected by computing distances in the channel Y_o.
As we shall argue later on, NL-Bayes is nothing but a more flexible version of BM3D,
where the fixed orthonormal basis is replaced by an adaptive basis made of the eigenvectors
of the local Gaussian model.
Here we have described the basic implementation given in the seminal paper, which will
also be used in the comparison section. Yet BM3D has several more recent variants that
improve its performance. There is a variant with shape-adaptive patches [14]. In this
algorithm, called BM3D-SAPCA, the sparsity of image representation is improved in
two respects. First, it employs image patches (neighborhoods) which can have data-adaptive
shape. Second, the PCA bases are obtained by eigenvalue decomposition of empirical second-
moment matrices that are estimated from groups of similar adaptive-shape neighborhoods.
This method improves BM3D especially in preserving image details and introducing very few
artifacts. The anisotropic shape-adaptive patches are obtained using the eight-directional
LPA-ICI techniques [23]. A very recent development of BM3D is presented in [22], [15], where
it is generalized to become a generic image restoration tool, including deblurring.
3.6. The piecewise linear estimation (PLE) method. The Bayesian restoration model
proposed in [45] and [46] is a general framework for restoration, including denoising, deblur-
ring, and inpainting.
The patch density law is modeled as a mixture of Gaussian distributions {N(μ_k, C_k)}, 1 ≤ k ≤ K,
parameterized by their means μ_k and covariance matrices C_k. Thus each patch P̃i is assumed
to be independently drawn from one of these Gaussian distributions with an unknown index
k and a density function
$$p(P_i) = \frac{1}{(2\pi)^{\kappa^2/2}\, |\mathbf{C}_{k_i}|^{1/2}}\, e^{-\frac{1}{2}(P_i - \mu_{k_i})^T \mathbf{C}_{k_i}^{-1} (P_i - \mu_{k_i})}.$$
Estimating all patches Pi from their noisy observations P̃i amounts to solving the following
problems:
1. Estimate the Gaussian parameters (μk , Ck )1≤k≤K from the degraded data P̃i .
2. Identify the index ki of the Gaussian distribution generating the patch Pi .
3. Estimate Pi from its corresponding Gaussian distribution (μki , Cki ) and from its noisy
version P̃i .
In consequence PLE [46] has two distinct steps in the estimation procedure. In an E-step
(E for “estimate”), the Gaussian parameters (μk , Ck )k are known, and for each patch the
MAP estimate P̂ik is computed with each Gaussian model. Then the best Gaussian model ki
is selected to obtain the estimate P̂i = P̂iki .
In the M-step (M for “model”), the Gaussian model selection ki and the signal estimates
P̂i are assumed to be known for all patches i and permit us to again estimate the Gaussian
models (μk , Ck )1≤k≤K . According to the terminology of section 2.4, this section gives the
oracle permitting us to estimate in the E-step the patches by a Wiener-type filter.
For each image patch with index i, the patch estimation and its model selection are
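obtained by maximizing the log a posteriori probability P(Pi | P̃i , k),
\begin{align}
(\hat P_i, k_i) &= \arg\max_{P_i, k}\, \log \mathbb{P}(P_i \mid \tilde P_i, \mathbf{C}_k) \tag{15}\\
&= \arg\max_{P_i, k}\, \bigl( \log \mathbb{P}(\tilde P_i \mid P_i, \mathbf{C}_k) + \log \mathbb{P}(P_i \mid \mathbf{C}_k) \bigr) \tag{16}\\
&= \arg\min_{P_i, k}\, \bigl( \|P_i - \tilde P_i\|^2 + \sigma^2 (P_i - \mu_k)^T \mathbf{C}_k^{-1} (P_i - \mu_k) + \sigma^2 \log |\mathbf{C}_k| \bigr), \tag{17}
\end{align}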
where the second equality follows from the Bayes rule, the third assumes a white Gaussian
noise with diagonal matrix σ²I (of the dimension of the patch), and Pi ∼ N(μk , Ck ). This
minimization can be made first over Pi , which amounts to a linear filter, and then over k,
which is a simple comparison of a small set of real values. The index k being fixed, the optimal
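P_i^k satisfies
$$P_i^k = \arg\min_{P_i} \bigl( \|P_i - \tilde P_i\|^2 + \sigma^2 (P_i - \mu_k)^T \mathbf{C}_k^{-1} (P_i - \mu_k) + \sigma^2 \log |\mathbf{C}_k| \bigr),$$
and therefore
$$P_i^k = \mu_k + (\mathbf{I} + \sigma^2 \mathbf{C}_k^{-1})^{-1} (\tilde P_i - \mu_k),$$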
which is the formula (10) already seen in section 2.1. Then the best Gaussian model ki is
selected as
$$k_i = \arg\min_{k} \bigl( \|P_i^k - \tilde P_i\|^2 + \sigma^2 (P_i^k - \mu_k)^T \mathbf{C}_k^{-1} (P_i^k - \mu_k) + \sigma^2 \log |\mathbf{C}_k| \bigr).$$
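The two-stage minimization (first over the patch, then over k) can be sketched as follows. To keep the example free of matrix inversions, it assumes that each covariance matrix is diagonal, which amounts to working in the eigenvector basis of each Gaussian model; the names and the toy models are hypothetical. The energy is the functional of (17) evaluated at the Wiener-type estimate of each model.

```python
import math

def map_estimate(noisy, mean, variances, sigma):
    """Wiener-type estimate mu + (I + sigma^2 C^-1)^-1 (noisy - mu) for a
    diagonal covariance C = diag(variances)."""
    return [m + v / (v + sigma ** 2) * (y - m)
            for y, m, v in zip(noisy, mean, variances)]

def energy(estimate, noisy, mean, variances, sigma):
    """Value of the functional (17) at `estimate` for one Gaussian model."""
    fidelity = sum((x - y) ** 2 for x, y in zip(estimate, noisy))
    prior = sum((x - m) ** 2 / v for x, m, v in zip(estimate, mean, variances))
    log_det = sum(math.log(v) for v in variances)
    return fidelity + sigma ** 2 * prior + sigma ** 2 * log_det

def e_step(noisy, models, sigma):
    """Return (best estimate, best model index) over all Gaussian models."""
    candidates = []
    for k, (mean, variances) in enumerate(models):
        est = map_estimate(noisy, mean, variances, sigma)
        candidates.append((energy(est, noisy, mean, variances, sigma), k, est))
    _, k_best, est_best = min(candidates)
    return est_best, k_best

models = [([0.0, 0.0], [100.0, 1.0]),   # model 0: variance mostly on axis 0
          ([0.0, 0.0], [1.0, 100.0])]   # model 1: variance mostly on axis 1
estimate, k = e_step([20.0, 1.0], models, sigma=2.0)
```

In this toy case the noisy patch is aligned with the high-variance axis of model 0, so model 0 is selected and the patch is only mildly shrunk along that axis, while the other coordinate is strongly attenuated.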

Assuming now that for each patch Pi its model ki and its estimate P̂i are known, the
next task is to give for each k the maximum likelihood estimate for (μk , Ck ) knowing all the
patches assigned to the kth cluster Ck , namely,
$$(\mu_k, \mathbf{C}_k) = \arg\max_{\mu_k, \mathbf{C}_k} \log \mathbb{P}(\{\hat P_i\}_{i \in \mathcal{C}_k} \mid \mu_k, \mathbf{C}_k).$$
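This yields the empirical estimates
$$\mu_k = \frac{1}{\#\mathcal{C}_k} \sum_{i \in \mathcal{C}_k} \hat P_i, \qquad \mathbf{C}_k = \frac{1}{\#\mathcal{C}_k - 1} \sum_{i \in \mathcal{C}_k} (\hat P_i - \mu_k)(\hat P_i - \mu_k)^T,$$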
which are the estimates (9) also used in section 2.1.
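These empirical estimates are straightforward to compute. A minimal sketch, assuming that the patches of one cluster are given as flat lists of equal length and that the cluster contains at least two patches (the function name is hypothetical):

```python
def empirical_gaussian(patches):
    """Empirical mean and covariance (with the 1/(n-1) normalization) of a
    list of equally sized patches given as flat lists."""
    n = len(patches)
    dim = len(patches[0])
    mean = [sum(p[j] for p in patches) / n for j in range(dim)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in patches) / (n - 1)
            for b in range(dim)] for a in range(dim)]
    return mean, cov

cluster = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]   # three toy 2-pixel patches
mean, cov = empirical_gaussian(cluster)
```

Here the three patches lie on a line, so the resulting covariance matrix has rank one; with fewer patches than pixels the empirical covariance is always singular, which is one reason why a sufficient number of patches per cluster matters.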

Finally the above MAP-EM (maximum a posteriori–expectation maximization) algorithm
is iterated, and the authors observe that the MAP probability of the observed signals P({P̂i }i |
{P̃i }i , {μk , Ck }k ) always increases. This algorithm requires a good initialization. The authors
created 19 orthogonal bases in the following way: one of them, say k = 0, is the classic DCT
basis and corresponds to the “texture cluster.” The others are obtained by fixing 18 uniformly
sampled directions in the plane. For each direction, the PCA is applied to a set of patches
extracted from a synthetic image containing an edge in that direction. The PCA yields an
oriented orthonormal basis.
3.7. The expected patch log likelihood (EPLL) method. Formally, this method [49] is
an almost literal application of the PLE method [46]; see section 3.6. We shall therefore not
repeat the same formulae. But it is “shotgun,” being applied to a huge set of patches instead of
the image itself. The Gaussian mixture model is learned from a set of 2·10^6 patches, sampled
from the Berkeley database with their mean removed. The 200 mixture components with zero
means and full covariance matrices are obtained using the expectation maximization (EM)
algorithm. Thus the following are learned: 200 means (actually they are all zero), 200 full
covariance matrices, and 200 mixing weights which constitute the Gaussian mixture model of
this set of patches.
Once the Gaussian mixture is learned, the denoising method maximizes the EPLL while
being close to the corrupted image in a way which is dependent on the (linear) corruption
model. The method can be used for denoising, and several experiments seem to indicate that
it equals the performance of BM3D and LLSC [30].
4. Algorithmic comparison. All considered methods are patch-based, including BLS-
GSM. Indeed, each “wavelet neighborhood” contains a 3×3 patch of a wavelet channel, com-
plemented with one more sample from the down-scale channel sharing the same orientation.
Thus, like the others, this algorithm builds Bayesian estimates of patches. The difference
is that the patches belong to the wavelet channels. Each one of these channels is denoised
separately, before the reconstruction of the image from its wavelet channels.
In short, even if the BLS-GSM formalization at first looks different from the other algo-
rithms, it relies on similar principles: it estimates patch models to denoise them and aggregates
the results. But, it also is the only multiscale algorithm among those considered here. Indeed,
it denoises the image at all scales. Furthermore, it introduces a scale interaction. These fea-
tures are neglected in the other algorithms and might make a significant difference in future
algorithms.

Table 2
Synoptic table of all considered methods.

Method       Denoising principle     Number of patches    Patch size
DCT          transform threshold     one                  8
NL-means     average                 neighborhood         3
NL-Bayes     Bayes                   neighborhood         3–7
PLOW         Bayes, 15 clusters      image                11
Shotgun-NL   Bayes                   10^10 patches        3–20
EPLL         Bayes, 200 clusters     2·10^10 patches      8
BLS-GSM      Bayes in GSM            image                3
K-SVD        sparse dictionary       image                8
BM3D         transform threshold     neighborhood         8–12
PLE          Bayes, 19 clusters      image                8

Table 2 shows a synopsis of the ten patch-based methods that have been discussed. The
classification criteria are as follows.
The denoising principle. The dominant principle is to compute an LMMSE after building
a Bayesian patch model. As a matter of fact, even if this is not always explicit, all methods
very closely follow the same LMMSE estimator principle. For example a DCT threshold [44]
is nothing but a Wiener thresholding version of the Bayesian LMMSE. This threshold is used
because the DCT of the underlying noiseless image is actually unknown. A close examination
of K-SVD shows that this algorithm is very close to PLE, PLOW, or EPLL, and conversely.
Indeed, the patch clustering performed in these three algorithms interprets the patch space
as a redundant dictionary. Each cluster is treated by a Bayesian estimator as a Gaussian
vector, for which an orthogonal eigenvector basis is computed. This basis is computed from
the cluster patches by PCA. Thus PLE, PLOW, and EPLL actually deliver a dictionary,
which is the union of several orthogonal bases of patches. PLE, PLOW, and EPLL select for
each noisy patch one of the bases, on which the patch will be sparse. In short, like K-SVD,
they compute for each patch a sparse representation in an overcomplete dictionary. In this
argument, we simply follow the interpretation proposed with the PLE method in [46], [45].
From the algorithmic viewpoint, EPLL is a variant of PLE but is used in a different setting.
The comparison of these two almost identical Gaussian mixture models is of particular interest.
EPLL is applied to a huge set of patches (of the order of 10^10) united in some 200 clusters.
PLE is applied with 19 clusters each learned from some 64 patches. Thus, the open question
is as follows: How many clusters and how many learning patches are actually necessary to
obtain the best peak signal to noise ratio (PSNR)? The disparity between these figures is
certainly too large to be realistic. The good performance of NL-Bayes permits us to argue
that a local model based on five dozen patches extracted from the image itself performs as
well as a huge patch database.
We must finally wonder if transform thresholding methods fit into the united view of all
algorithms. The Bayesian–Gaussian estimate used by most mentioned algorithms can be in-
terpreted as a Wiener filter on the eigenvector basis of the covariance matrix of the Gaussian
vector. It sometimes includes a threshold to avoid negative eigenvalues for the covariance ma-
trix of the Gaussian vector. Thus, the only difference between Bayesian–Gaussian methods
and the classic transform thresholding is that in the Bayesian methods the orthogonal basis is
adapted to each patch. Therefore, they appear to be a direct extension of transform thresholding
methods and have logically replaced them. BM3D combines several linear transform
thresholds (2D-bior 1.5, 2D-DCT, 1D-Walsh–Hadamard), applied to the 3D block obtained
by grouping similar patches. Clearly, it has found by a rather systematic exploration the right
2D orthogonal bases and therefore does not need to estimate them for each patch group.
We shall now reunite two groups of methods that are only superficially different. On the
one hand, NL-means, NL-Bayes, Shotgun-NL, and BM3D denoise a patch after comparing it
to a group of similar patches. The other five patch-based Bayesian methods do not perform a
search for similar patches.
These other patch methods, PLE, PLOW, EPLL, BLS-GSM, and K-SVD, globally pro-
cess the “patch space” and construct patch models. Nevertheless, this difference is easily
reduced. Indeed, PLE, PLOW, and EPLL segment the patch space into a sufficient number
of clusters, each one endowed with a rich structure (an orthonormal basis). Thus, the patches
contributing to the denoising of a given patch estimation are not compared to each other,
but they are compared to the clusters. Similarly, the dictionary-based methods like K-SVD
propose overcomplete dictionaries learned from the image or from a set of images. Finding
the best elements of the dictionary to decompose a given patch, as K-SVD does, amounts to
classifying this patch. This is what is suggested by the authors of PLE in [46]: the dictionary
is a list of orthogonal bases which are initiated by sets of oriented edges. Each basis is there-
fore associated with an orientation (plus one associated with the DCT). Thus PLE is very
similar to BLS-GSM, which directly applies a set of oriented filters. Another link between the
Bayesian method and sparse modeling is elaborated in [48].
The number of patches involved in the estimation. The second column in the classification
Table 2 indicates the number of patches used for the denoising method, and where they are
found. The trivial DCT uses only the current patch to denoise it; NL-means, NL-Bayes, and
BM3D compare the reference patch with a few hundred patches in its spatial neighborhood;
PLE, PLOW, BLS-GSM, and K-SVD compare each noisy patch to a learned model of all
image patches; finally Shotgun-NL and EPLL involve in the estimation a virtually infinite
number of patches. Surprisingly enough, the performance of all methods is relatively similar.
Thus, the huge numbers used to denoise in Shotgun-NL and EPLL clearly depend on the fact
that the patches were not learned from the image itself, and their number can arguably be
considerably reduced.
Size (of the patches). The third column in our synoptic table compares the patch sizes.
All methods without exception try to deduce the correct value of a given pixel i by using a
neighborhood of i called a patch. This patch size goes from 3 × 3 to 8 × 8, with a strong
dominance of 8 × 8 patches. Nevertheless, the size of the patches obviously depends on the
amount of noise and should be adapted to the noise standard deviation. For very large noises,
a size 8 × 8 can be insufficient, while for small noises small patches might be better. As a
matter of fact, all articles focus on noise standard deviations around 30 (most algorithms are
tested for σ between 20 and 80). There is little work on small noise (below 10). For large
noise, above 50, most algorithms do not deliver a satisfactory result and most papers show
denoising results for 20 ≤ σ ≤ 40. This may also explain the homogeneity of the patch size.
As we shall see in the next section, NL-Bayes attains a better performance on color images
with significantly smaller patches.

5. Experimental comparison. In this section we shall compare the following state-of-the-art
denoising algorithms: the sliding DCT filter as specified in [44], the wavelet neighborhood
Gaussian scale mixture (BLS-GSM) algorithm, the classical vector valued NL-means, the
BM3D algorithm, the K-SVD denoising method, and NL-Bayes. These algorithms have been
chosen for two reasons. First they have a public and completely transparent code available,
which is in agreement with their description. Second, they all represent distinct denoising
principles and therefore illustrate the methodological progress and the diversity of denoising
principles. At least two of them (K-SVD and BM3D) represent the state of the art in terms
of PSNR performance.
In the comparison, we use the standard BM3D, which works on color images. Nevertheless,
the variant with shape-adaptive patches BM3D-SAPCA [14] is sharper but works only on grey
level images. A comparison for grey level images can be found in the companion paper [5].
5.1. Comparing visual quality. Figures 1–2 display the noisy and denoised images for the
algorithms under comparison for noise standard deviations of 30 and 40.
Figures 1 and 2 illustrate the low frequency color artifacts of DCT, BLS-GSM, and K-
SVD. These are the most local algorithms and therefore have more trouble in removing the low
frequencies of the noise. As a consequence, the denoised images present many low frequency
color artifacts in flat and dark zones. These artifacts are noticeable for all these algorithms
even if they use different strategies to deal with color images. DCT and BLS-GSM are applied
independently to each Y_oU_oV_o component, while K-SVD is a vector valued algorithm. NL-means
does not suffer from these noise low frequency problems, but it leaves some isolated
noise points on nonrepetitive structures, mainly on corners. BM3D and NL-Bayes give similar
performances that are superior to the rest of the algorithms. In these figures, DCT and BLS-GSM
also suffer from a strong Gibbs effect near all image boundaries. This Gibbs effect is nearly not
noticeable in the denoised image by K-SVD, since the use of the whole dictionary permits us
to better reconstruct edges when the right atoms are present in the dictionary. The NL-means
denoised image has no visual artifacts but is more blurred than those given by BM3D and
NL-Bayes, which give performances clearly superior to those of the rest of the algorithms. The
BM3D denoised image still has some Gibbs effect near the edges, which sometimes degrades
the visual quality of the solution.
5.2. Comparing by PSNR. The mean square error (MSE) is the squared Euclidean distance
between the original image and its estimate, divided by the number of pixels. In the denoising
literature an equivalent measurement, up to a decreasing scale change, is the PSNR,
$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{MSE}}.$$
These numerical quality measurements are the most objective, since they do not rely on any
visual interpretation. Table 3 displays the PSNR of state-of-the-art denoising methods using
the images in Figure 3 and several values of σ from 2 to 40. When σ exceeds 40, all methods
fail by creating artifacts or excessive blur. The current state of the art is that some algorithms
(but not all) reach acceptable results in the interval σ ∈ [2, 40], which is therefore explored in
this comparison. Many of the compared methods are actually not designed or optimized for
lower SNR.
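For reference, the PSNR defined above can be computed as in the following minimal sketch. It assumes images given as flat lists of pixel values with a peak value of 255; the function name and the toy values are hypothetical.

```python
import math

def psnr(reference, estimate, peak=255.0):
    """PSNR in dB between two images given as flat lists of pixel values."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf            # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

reference = [10.0, 20.0, 30.0, 40.0]
estimate = [12.0, 18.0, 33.0, 39.0]
value = psnr(reference, estimate)  # MSE = 4.5 for these toy values
```

Since the PSNR is a decreasing function of the MSE, ranking methods by PSNR or by MSE gives the same order; a gain of 1 dB corresponds to dividing the MSE by about 1.26.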

Figure 1. Comparison of visual quality. The noisy image was obtained by adding a Gaussian white noise
of standard deviation 30. PSNRs are given in parentheses for each method. Panels: Original; Noisy;
DCT sliding window (27.64); BLS-GSM (28.20); NL-means (27.71); K-SVD (28.59); BM3D (28.85);
NL-Bayes (29.35).
The PSNR comparison cannot lead to a completely objective ranking of algorithms. In-
deed, all methods with some existence actually have variants. We are using here the basic
algorithms as they were announced in their seminal papers. For example, it is shown in [21]
that BM3D can be slightly improved for heavy noise (σ > 40) by changing the method parameters.
The following PSNR comparison thus only gives some hints depending on the particular
implementation of the denoising principles. We observe in the results that DCT denoising,
BLS-GSM, K-SVD, and NL-means have similar PSNRs. The relative performance of the
methods depends on the kind of image and on the noise level σ. On average, K-SVD and
BLS-GSM are slightly superior to the other two, even if this is not the case visually, where
K-SVD and BLS-GSM have a poor visual quality compared to NL-means. In all cases, BM3D
and NL-Bayes have a better PSNR performance than the others. Because NL-Bayes reduces
noise better in flat zones and produces fewer artifacts, the PSNR of BM3D
is slightly inferior to that of NL-Bayes. BM3D seems to retain the best conservation of detail.
Some ringing artifacts near boundaries can probably be eliminated by the same trick that
NL-Bayes uses, namely, detecting and giving special treatment to flat 3D groups, as described
in section 2.2.

Figure 2. Comparison of visual quality. The noisy image was obtained by adding a Gaussian white noise
of standard deviation 40. PSNRs are given in parentheses for each method. Panels: Original; Noisy;
DCT sliding window (24.88); BLS-GSM (25.50); NL-means (24.44); K-SVD (25.32); BM3D (25.80);
NL-Bayes (26.02).

Table 3
Mean PSNRs on six noiseless images with an added noise of standard deviation σ = 2, 5, 10, 20, 30,
and 40. Only the first three digits are actually significant; the last one may vary with different white noise
realizations.

          NL-Bayes   BM3D    BLS-GSM   K-SVD   NL-means   DCT denoising
σ = 2     46.79      46.34   46.02     44.98   45.48      45.96
σ = 5     42.11      41.64   41.18     41.37   40.54      40.99
σ = 10    38.57      38.10   37.54     37.53   37.09      37.12
σ = 20    35.10      34.60   34.00     33.45   33.28      33.11
σ = 30    33.06      32.59   32.04     31.78   31.25      30.52
σ = 40    31.55      31.09   30.62     30.17   29.66      28.64

Figure 3. A set of noiseless images used for the comparison tests.

While the present paper shows comparison results for color images, the companion paper
[5] shows a detailed comparison on grey level images, which are relevant for medical imaging,
among other applications. Its conclusion is that NL-Bayes is not the best denoising algorithm
for grey level images. Indeed, for small values of the noise, BM3D-SAPCA performs better
in terms of PSNR and also visually. Furthermore, using a homogeneous area criterion as
described in section 2.2 in grey level images (with three times fewer samples than in color
images) blurs the result of NL-Bayes. This criterion should therefore be refined for grey level
images.

Conclusion. The proposed algorithm improves (slightly) the state of the art on color im-
ages, while being rather simple. As we made clear, it can be viewed as an optimized reinter-
pretation of TSID in a Bayesian framework. It shows that the somewhat painful specification
of special 2D transforms in BM3D can be replaced by a more flexible Bayesian estimate of the
best basis at each image point. It therefore establishes a bridge between the several Bayesian
methods we examined and state-of-the-art BM3D. The best algorithm for grey level images
(and low level noise) still remains BM3D-SAPCA, which uses more clever geometric optimiza-
tion on the shape of patches. This also indicates that there is still room for improvement.
We encourage and invite readers to simply test this algorithm and others at Image Processing
On Line (denoising chapter), http://www.ipol.im, where all algorithms can be executed on
exactly the same noiseless images with the same white noise added. In particular, this can be
done with TV-denoising [18], NL-means [6], BM3D [26], K-SVD [27], and NL-Bayes [5].
REFERENCES
[1] F. J. Anscombe, The transformation of Poisson, binomial and negative-binomial data, Biometrika, 35
(1948), pp. 246–254.
[2] S. P. Awate and R. T. Whitaker, Unsupervised, information-theoretic, adaptive image filtering for
image restoration, IEEE Trans. Pattern Anal. Machine Intell., 28 (2006), pp. 364–376.
[3] P. Brémaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues, Texts Appl. Math.
31, Springer-Verlag, New York, 1999.
[4] A. Buades, B. Coll, and J. M. Morel, A review of image denoising algorithms, with a new one,
Multiscale Model. Simul., 4 (2005), pp. 490–530.
[5] A. Buades, M. Lebrun, and J. M. Morel, Implementation of the “non-local Bayes” image denoising
algorithm, preprint, Image Processing On Line, http://www.ipol.im/pub/art/2013/16/, 2012.
[6] A. Buades, B. Coll, and J.-M. Morel, Non-local means denoising, Image Processing On Line (2011),
http://www.ipol.im/pub/art/2011/bcm_nlm/.
[7] J. Salmon, C.-A. Deledalle, and A. Dalalyan, Image denoising with patch based PCA: Local versus
global, in Proceedings of the British Machine Vision Conference, BMVA Press, Cambridge, UK, 2011,
pp. 25.1–25.10.
[8] P. Chatterjee and P. Milanfar, Is denoising dead?, IEEE Trans. Image Process., 19 (2010), pp. 895–
911.
[9] P. Chatterjee and P. Milanfar, Patch-based near-optimal image denoising, IEEE Trans. Image Pro-
cess., 21 (2012), pp. 1635–1649.
[10] R. R. Coifman and D. L. Donoho, Translation-invariant de-noising, in Wavelets and Statistics, Lecture
Notes in Statistics, Springer, New York, 1995, pp. 125–125.
[11] S. F. Cotter, R. Adler, R. D. Rao, and K. Kreutz-Delgado, Forward sequential algorithms for
best basis selection, IEE Proc. Vision Image Signal Process., 146 (1999), pp. 235–244.
[12] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, Image denoising by sparse 3-D transform-
domain collaborative filtering, IEEE Trans. Image Process., 16 (2007), pp. 2080–2095.
[13] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, Image denoising by sparse 3D transform-
domain collaborative filtering, IEEE Trans. Image Process., 16 (2007), pp. 2080–2095.
[14] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, BM3D image denoising with shape-adaptive
principal component analysis, in Proceedings of the Workshop on Signal Processing with Adaptive
Sparse Structured Representations, Saint-Malo, France, 2009.
[15] A. Danielyan, V. Katkovnik, and K. Egiazarian, BM3D frames and variational image deblurring,
IEEE Trans. Image Process., 21 (2012), pp. 1715–1728.
[16] A. Efros and T. Leung, Texture synthesis by non parametric sampling, in Proceedings of the Interna-
tional Conference on Computer Vision, IEEE, Washington, DC, 1999, pp. 1033–1038.

[17] S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of
images, IEEE Trans. Pattern Anal. Machine Intell., 6 (1984), pp. 721–741.
[18] P. Getreuer, Rudin-Osher-Fatemi total variation denoising using split Bregman, Image Processing On
Line (2012), http://www.ipol.im/pub/art/2012/g-tvd.
[19] O. G. Guleryuz, Weighted averaging for denoising with overcomplete dictionaries, IEEE Trans. Image
Process., 16 (2007), pp. 3020–3034.

[20] J. L. Harris, Image evaluation and restoration, J. Optical Soc. Amer., 56 (1966), pp. 569–570.
[21] Y. Hou, C. Zhao, D. Yang, and Y. Cheng, Comments on “Image Denoising by Sparse 3-D Transform-
Domain Collaborative Filtering,” IEEE Trans. Image Process., 20 (2011), pp. 268–270.
[22] V. Katkovnik, A. Danielyan, and K. Egiazarian, Decoupled inverse and denoising for image deblur-
ring: Variational BM3D-frame technique, in Proceedings of IEEE International Conference on Image
Processing (ICIP), IEEE, Washington, DC, 2011. DOI: 10.1117/3.660178.
[23] V. I. A. Katkovnik, V. Katkovnik, K. Egiazarian, and J. Astola, Local Approximation Techniques
in Signal and Image Processing, SPIE Press Monogr. PM157, SPIE Press, Bellingham, WA, 2006.
[24] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice–Hall, Englewood
Cliffs, NJ, 1993.
[25] M. Lebrun, M. Colom, A. Buades, and J. M. Morel, Secrets of image denoising cuisine, Acta
Numer., 21 (2012), pp. 475–576.
[26] M. Lebrun, An analysis and implementation of the BM3D image denoising method, Image Processing
On Line (2012), http://www.ipol.im/pub/art/2012/l-bm3d.
[27] M. Lebrun and A. Leclaire, An implementation and detailed analysis of the K-SVD image denoising
algorithm, Image Processing On Line (2012), http://www.ipol.im/pub/art/2012/llm-ksvd.
[28] A. Levin and B. Nadler, Natural image denoising: Optimality and inherent bounds, in Proceedings of
the International Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Wash-
ington, DC, 2011, pp. 2833–2840.
[29] S. Lyu and E. P. Simoncelli, Modeling multiscale subbands of photographic images with fields of Gaus-
sian scale mixtures, IEEE Trans. Pattern Anal. Machine Intell., 31 (2009), pp. 693–706.
[30] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, Non-local sparse models for image
restoration, in Proceedings of the 12th International Conference on Computer Vision (ICCV), IEEE,
Washington, DC, 2009, pp. 2272–2279.
[31] Y. Meyer, Wavelets: Algorithms and Applications, SIAM, Philadelphia, 1993.
[32] P. Milanfar, A tour of modern image filtering, IEEE Signal Process. Mag., 2 (2011).
[33] D. D. Muresan and T. W. Parks, Adaptive principal components and image denoising, in Proceedings
of the International Conference on Image Processing (ICIP), Vol. 1, IEEE, Washington, DC, 2003,
pp. I–101.
[34] A. Nemirovski, Topics in non-parametric statistics, in Lectures on Probability Theory and Statistics
(Saint-Flour, 1998), Lecture Notes in Math. 1738, Springer, Berlin, 2000, pp. 85–277.
[35] Y.-I. Ohta, T. Kanade, and T. Sakai, Color information for region segmentation, Computer Graphics
Image Process., 13 (1980), pp. 222–241.
[36] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, Using Geometry and Iterated Refinement for
Inverse Problems 1: Total Variation Based Image Restoration, preprint, Department of Mathematics,
UCLA, Los Angeles, CA, 2004, pp. 4–13.
[37] J. Portilla, V. Strela, M. J. Wainwright, and E. P. Simoncelli, Image denoising using scale
mixtures of Gaussians in the wavelet domain, IEEE Trans. Image Process., 12 (2003), pp. 1338–1351.
[38] W. H. Richardson, Bayesian-based iterative method of image restoration, J. Opt. Soc. Amer., 62 (1972),
pp. 55–59.
[39] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, LabelMe: A database and
web-based tool for image annotation, Internat. J. Comput. Vis., 77 (2008), pp. 157–173.
[40] J. Salmon and Y. Strozecki, Patch reprojections for non-local methods, Signal Process., 92 (2012),
pp. 477–489.
[41] J. W. Tukey, Exploratory Data Analysis, Addison–Wesley, Reading, MA, 1976.
[42] L. Yaroslavsky and M. Eden, Fundamentals of Digital Optics, Birkhäuser Boston, Boston, 2003.
[43] L. P. Yaroslavsky, K. O. Egiazarian, and J. T. Astola, Transform domain image restoration
methods: Review, comparison, and interpretation, Proc. SPIE, 4304 (2001), p. 155.

[44] G. Yu and G. Sapiro, DCT image denoising: A simple and effective image denoising algorithm, Image
Processing On Line (2011), http://www.ipol.im/pub/art/2011/ys-dct/.
[45] G. Yu, G. Sapiro, and S. Mallat, Image modeling and enhancement via structured sparse model
selection, in Proceedings of the International Conference on Image Processing (ICIP), IEEE,
Washington, DC, 2010, pp. 1641–1644.
[46] G. Yu, G. Sapiro, and S. Mallat, Solving inverse problems with piecewise linear estimators: From
Gaussian mixture models to structured sparsity, IEEE Trans. Image Process., 21 (2012),
pp. 2481–2499.
[47] L. Zhang, W. Dong, D. Zhang, and G. Shi, Two-stage image denoising by principal component
analysis with local pixel grouping, Pattern Recognition, 43 (2010), pp. 1531–1549.
[48] M. Zhou, H. Yang, G. Sapiro, D. Dunson, and L. Carin, Dependent hierarchical beta process for
image interpolation and denoising, in International Conference on Artificial Intelligence and Statistics,
Ft. Lauderdale, FL, 2011, pp. 883–891.
[49] D. Zoran and Y. Weiss, From learning models of natural image patches to whole image restoration, in
International Conference on Computer Vision, IEEE, Washington, DC, 2011, pp. 479–486.
