
On Deep Learning for Inverse Problems

Jaweria Amjad (Electronics & Electrical Engineering Dept, University College London, London, UK; jaweria.amjad.16@ucl.ac.uk)
Jure Sokolić (Biomedical Engineering Dept, King's College London, London, UK; jure.sokolic@kcl.ac.uk)
Miguel R.D. Rodrigues (Electronics & Electrical Engineering Dept, University College London, London, UK; m.rodrigues@ucl.ac.uk)

Abstract—This paper analyses the generalization behaviour of deep neural networks with a focus on their use in inverse problems. In particular, by leveraging the robustness framework by Xu and Mannor, we provide deep neural network based regression generalization bounds that are also specialized to sparse approximation problems. The proposed bounds show that the sparse approximation performance of deep neural networks can be potentially different from that of classical sparse reconstruction algorithms, with reconstruction errors limited only by the noise level.

I. INTRODUCTION

A large number of phenomena arising in science and engineering – including problems in medical imaging, remote sensing, chemometrics, and more – can be approximated using the linear observation model given by:

y = Ax + e    (1)

where y ∈ Y ⊆ R^Ny corresponds to a vector of observations, x ∈ X ⊆ R^Nx corresponds to a vector of underlying causes, e ∈ R^Ny is a vector modelling noise or other perturbations, and A ∈ R^(Ny×Nx) is a usually known linear operator modelling the relationship between the observations and the causes.

A very common problem – known as an inverse problem – then involves inferring the vector x from the vector y given knowledge of the linear operator A. However, for Ny < Nx, this problem is severely ill-posed, so – without resorting to additional assumptions – a unique solution does not exist (even in the absence of noise).

A number of approaches to solve inverse problems have therefore been proposed over the past years leveraging the fact that many phenomena in nature admit some form of structure – such as sparsity, group sparsity, manifold structures, and more [1], [2] – that is key to restricting the space of possible solutions. In particular, the use of sparsity – exploiting the fact that the vector to be inferred from observations admits a sparse representation in some basis or frame – has led to a number of methods to approximate the solution of a linear inverse problem using greedy algorithms [3] or convex optimization based algorithms [4]. For example, under the assumption that the desired vector contains at most k ≪ Nx non-zero entries, the well-known Basis Pursuit Denoise (BPDN) algorithm delivers an estimate of the desired vector x from the observation vector y given knowledge of the linear operator A as follows:

x̂ = arg min_x ‖y − Ax‖₂²  s.t.  ‖x‖₁ ≤ k    (2)

where ‖·‖₂ and ‖·‖₁ are the ℓ2 and ℓ1 norms of a vector. Moreover, the BPDN estimate of the desired vector can also be shown to approximate the true vector very well provided that the linear operator A obeys various conditions [5]. Other state-of-the-art approaches exploiting sparsity to solve this class of linear inverse problems – such as iteratively reweighted least squares and iterative soft-thresholding methods – are reported in [6], [7]. However, these various approaches often require the linear operator to satisfy certain conditions to guarantee exact inference (in the absence of noise) or stable inference (in the presence of noise) of the desired vector from the observation vector [5], [8], failing drastically when these conditions are not met.
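For concreteness, the following is a minimal sketch of the iterative soft-thresholding idea referenced above, applied to the model in (1); the step size, regularization weight, and toy problem sizes are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Entry-wise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam=0.1, n_iter=500):
    """Iterative soft-thresholding for min_x 0.5*||y - Ax||_2^2 + lam*||x||_1.

    A minimal sketch; lam and n_iter are illustrative choices.
    """
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                # gradient of the quadratic data-fit term
        x = soft_threshold(x - step * grad, step * lam)
    return x

# toy usage: a k-sparse x observed through a random A with additive noise
rng = np.random.default_rng(0)
Nx, Ny, k = 100, 40, 5
A = rng.standard_normal((Ny, Nx)) / np.sqrt(Ny)
x_true = np.zeros(Nx)
x_true[rng.choice(Nx, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true + 0.01 * rng.standard_normal(Ny)
x_hat = ista(A, y)
```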
Another class of approaches to solve linear inverse problems has also recently emerged in view of advances in deep learning. In particular, the use of deep learning approaches to solve inverse problems involves two phases: (i) in the training phase, a number of pairs of training vectors x and y corresponding to one another are used to tune the set of parameters of a deep neural network (DNN) architecture in order to implement a mapping from y to x;¹ (ii) in the testing phase, a test vector y is mapped onto the vector x via the network. Interestingly, this procedure has been shown to perform exceedingly well in a wide variety of inverse problems such as compressed sensing [9], image denoising [10], image deblurring [11], image super-resolution [12], and many more [13]. However, a justification for such outstanding performance is currently unknown, because recent frameworks attempting to provide a rationale for the efficacy of DNNs primarily focus on classification tasks rather than the regression tasks arising in inverse problems [14], [15], [16].

¹ Note that the operational principle associated with deep learning networks is different from that of classical approaches. Classical approaches to solve inverse problems attempt to directly invert the mapping from x to y. In contrast, deep learning approaches attempt to learn a mapping from y to x.

This paper – which aims to fill in this gap – is motivated by two overarching questions:

• How can we quantify the performance of DNN approaches in solving inverse problems?
• How does the performance of DNN approaches compare to the performance of other classical approaches for solving inverse problems?

In particular, in our attempt to answer these questions, we build upon the robustness framework introduced by Xu and Mannor in [17]: (i) we introduce new DNN based regression generalization bounds; (ii) we show how these bounds can be used to quantify the performance of DNNs in solving inverse problems; and (iii) we also show how the performance of a DNN compares with the performance of other classical approaches, notably BPDN, for solving inverse problems.

The remainder of the paper is organized as follows: we start by introducing our problem set-up in Section II. We then provide DNN generalization bounds applicable to general regression problems in Section III. We also provide specializations of these generalization bounds applicable to typical inverse problems in Section IV. This opens up the possibility of comparing DNN based approaches to classical approaches to solving inverse problems. Finally, concluding remarks are drawn in Section V.

[Due to space limitations, the proofs appear in an upcoming preprint [18].]

II. SETUP

We consider the problem of estimating a vector x ∈ X from another vector y ∈ Y, where the pair of vectors s = (x, y) is drawn from the sample space D = X × Y according to some distribution µ, using a supervised learning setup. We also consider that we have access to a set of m training samples S = {(xi, yi)}_{i≤m}, drawn independently and identically distributed (i.i.d.) according to µ, to learn a regressor

ΞS(·) : Y → X    (3)

that can then be used to deliver an estimate of the desired vector x ∈ X given the observation vector y ∈ Y.

Our focus is on the use of DNN based regressors, corresponding to multi-layered architectures consisting of a series of linear and non-linear transformations that can learn increasingly abstract concepts with layer depth [19]. See Fig. 1.

Fig. 1. A d-layer deep neural network (input layer, hidden layers, output layer).

In particular, we can express the i-th layer output x̃i ∈ R^Ni in terms of the i-th layer input x̃_{i−1} ∈ R^{Ni−1} as follows:

x̃i = [Wi x̃_{i−1} + bi]σ

where Wi ∈ R^(Ni×Ni−1) is the i-th layer weight matrix, bi ∈ R^Ni is the i-th layer bias vector, and [·]σ represents an element-wise nonlinear activation function. The network input is x̃1 = y and the network output is ΞS(y) = x̃.

Activation functions such as the hyperbolic tangent, rectified linear units (ReLU), and the sigmoid are normally used in hidden layers, and the softmax is typically preferred in the output layer. See Table I.

TABLE I
A LIST OF POINT-WISE ACTIVATION FUNCTIONS [z]σ = {σ(zi)}_{i≤Ni}

Name                 Function σ(zi)             Derivative σ′(zi)
Hyperbolic tangent   tanh(zi)                   1 − σ(zi)²
ReLU                 max(zi, 0)                 1 if zi > 0; 0 otherwise
Sigmoid              1/(1 + exp(−zi))           σ(zi)(1 − σ(zi))
Softmax              exp(zi) / Σ_j exp(zj)      σ(zi)(1 − σ(zi)) if i = j; −σ(zi)σ(zj) otherwise
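As an illustration of the layer map x̃i = [Wi x̃_{i−1} + bi]σ and of the activations in Table I, the sketch below implements a d-layer forward pass ΞS(y); the layer widths and the random (untrained) weights are placeholder assumptions, not a network from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())        # shifted for numerical stability
    return e / e.sum()

def forward(y, weights, biases, hidden_act=np.tanh, out_act=None):
    """Compute the network output from x_tilde_1 = y via x_tilde_i = sigma(W_i x_tilde_{i-1} + b_i)."""
    x = y
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W @ x + b
        is_last = (i == len(weights) - 1)
        x = out_act(z) if (is_last and out_act is not None) else hidden_act(z)
    return x

# toy usage: a 3-layer regressor mapping y in R^40 to x in R^100 (widths are illustrative)
rng = np.random.default_rng(0)
dims = [40, 128, 128, 100]
weights = [0.1 * rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
biases = [np.zeros(dims[i + 1]) for i in range(3)]
x_hat = forward(rng.standard_normal(40), weights, biases, hidden_act=relu)
```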
The various hyper-parameters associated with a deep neural network can be learnt using optimization techniques based on training data [24]. State-of-the-art approaches include [20], [21].

We then measure the quality of the DNN estimate of the vector x given the vector y – where the DNN has been learnt using the training set S – via the loss function:

l(x̃, x) = l(ΞS(y), x) = ‖ΞS(y) − x‖₂

In particular, we are interested in characterising the generalization error (GE) associated with DNN regressors, given by:

GE(ΞS) = |lexp(ΞS) − lemp(ΞS)|    (4)

where

lexp(ΞS) = E[l(ΞS(y), x)]

corresponds to the expected error associated with a pair of vectors (x, y) and

lemp(ΞS) = (1/m) Σ_{i≤m} l(ΞS(yi), xi)

corresponds to the empirical error associated with the training vectors {(xi, yi)}_{i≤m}.

We deliver in the sequel a characterization of the generalization error of DNN based regressors where, for technical reasons, we will be assuming that both the input space X and the output space Y are compact with respect to the ℓ2-metric and that the sample space D = X × Y is compact with respect to the sup-metric.

We want to characterize the performance of a DNN regressor and compare it with traditional methods for solving inverse problems. This section introduced the notation and framework underlying our approach.
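The quantities l, lemp, and GE(ΞS) defined above can be evaluated numerically for any candidate regressor; in the sketch below the expectation in lexp is replaced by an average over held-out pairs, which is an assumption of this illustration rather than part of the paper's analysis.

```python
import numpy as np

def loss(x_hat, x):
    """l(x_hat, x) = ||x_hat - x||_2, the reconstruction loss used above."""
    return np.linalg.norm(x_hat - x)

def empirical_error(regressor, pairs):
    """l_emp: average loss over a finite set of (x, y) pairs."""
    return np.mean([loss(regressor(y), x) for (x, y) in pairs])

def ge_estimate(regressor, train_pairs, test_pairs):
    """Proxy for |l_exp - l_emp|: the expectation is approximated by a held-out average."""
    return abs(empirical_error(regressor, test_pairs) - empirical_error(regressor, train_pairs))
```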
III. GENERALIZATION ERROR BOUNDS: GENERAL CASE

We now derive performance guarantees for DNN based regression by capitalizing on the robustness framework [17]. A very important element of the robustness framework is the notion of algorithmic robustness.

Definition 1 (Algorithmic Robustness [17]). Let S denote the training set and D denote the sample space. A learning algorithm is said to be (K, ε(S))-robust if the sample space D can be partitioned into K disjoint sets Kk, k = 1, ..., K, such that for all (xi, yi) ∈ S and all (x, y) ∈ D

(xi, yi), (x, y) ∈ Kk  ⟹  |l(ΞS(yi), xi) − l(ΞS(y), x)| ≤ ε(S)    (5)

In other words, a learning algorithm is robust provided that the losses of a training sample and a test sample belonging to the same partition are close.

The relevance of this definition is associated with the fact that it provides a route to study the generalization ability of various learning algorithms, including deep neural networks [14]. However, Sokolić et al. [14] have provided generalization bounds for DNN based classifiers in lieu of DNN based regressors, so those results cannot be used to cast insight on the performance of deep neural networks in solving inverse problems.

We will therefore generalize the results in [14] from the classification to the regression setting. We first show that a d-layer DNN based regressor satisfies a Lipschitz continuity condition.

Theorem 1 (Adapted from Theorem 2 and Lemma 1 in [14]). Consider a d-layer DNN based regressor ΞS(·) : Y → X. Then, for any y1, y2 ∈ Y, it follows that

‖ΞS(y1) − ΞS(y2)‖₂ ≤ ( ∏_{i=1}^{d} ‖Wi‖F ) ‖y1 − y2‖₂

where ‖·‖F denotes the Frobenius norm of a matrix.

Proof: We only outline the proof. The result follows from Theorem 2 in [14], which proves that the ratio between the Euclidean distance at the output and at the input of a d-layer DNN is bounded by the ℓ2-norm of the Jacobian matrix, which is in turn upper bounded by the product of the Frobenius norms of the weight matrices [14]. A full version of the proof will appear in an upcoming manuscript [18].

We can now show the main results. The following theorem characterizes the robustness of a d-layer neural network.

Theorem 2 (Robustness). Consider that X and Y are compact spaces with respect to the ℓ2 metric. Consider also the sample space D = X × Y equipped with a sup-metric ρ. It follows that a d-layer DNN based regressor ΞS(·) : Y → X trained on the training set S is

( N(ψ/2; D, ρ), (1 + ∏_{i=1}^{d} ‖Wi‖F) ψ )-robust

for any ψ > 0, where N(ψ/2; D, ρ) < ∞ represents the covering number of the metric space (D, ρ) using metric balls of radius ψ/2.

Proof: We provide a sketch of the proof only. A full version will appear in an upcoming manuscript [18]. The loss function of a Lipschitz continuous DNN can be shown to be Lipschitz continuous itself using the triangle and reverse Minkowski inequalities. Thus the difference of the losses between two samples is upper bounded by the product of the Lipschitz constant (1 + ∏_{i=1}^{d} ‖Wi‖F) and the distance ψ between the samples, and so the claim follows.

The following theorem – building upon the previous one – now characterizes a bound on the generalization error of a d-layer neural network.

Theorem 3 (GE Bound). Consider again that X and Y are compact spaces with respect to the ℓ2 metric. Consider also the sample space D = X × Y equipped with a sup-metric ρ. It follows that a d-layer DNN based regressor ΞS(·) : Y → X trained on a training set S consisting of m i.i.d. training samples obeys, with probability 1 − ζ, for any ζ > 0, the generalization error bound given by:

GE(ΞS) ≤ (1 + ∏_{i=1}^{d} ‖Wi‖F) ψ + √( [2 N(ψ/2; D, ρ) log(2) + 2 log(1/ζ)] / m ) · M(S)    (6)

for any ψ > 0, where M(S) < ∞.

Proof: This result follows from the generalization error bound provided in [17]. For a (N(ψ/2; D, ρ), (1 + ∏_{i=1}^{d} ‖Wi‖F) ψ)-robust DNN, the proof is straightforward. A full version of the proof will appear in an upcoming manuscript [18].
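The central quantity in Theorems 1–3 is the product of the Frobenius norms of the weight matrices. The following sketch computes this quantity and numerically checks the Lipschitz property of Theorem 1, assuming random untrained weights, omitted biases, and a 1-Lipschitz activation (tanh); it is an illustration, not a proof.

```python
import numpy as np

def lipschitz_product(weights):
    """Product of Frobenius norms, the constant appearing in Theorems 1-3."""
    return np.prod([np.linalg.norm(W, 'fro') for W in weights])

def forward(y, weights, act=np.tanh):
    """d-layer map with a 1-Lipschitz point-wise activation; biases are omitted
    since they do not affect the Lipschitz constant."""
    x = y
    for W in weights:
        x = act(W @ x)
    return x

rng = np.random.default_rng(0)
dims = [40, 64, 64, 100]            # illustrative widths
weights = [0.2 * rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]

y1, y2 = rng.standard_normal(40), rng.standard_normal(40)
lhs = np.linalg.norm(forward(y1, weights) - forward(y2, weights))
rhs = lipschitz_product(weights) * np.linalg.norm(y1 - y2)
assert lhs <= rhs + 1e-9            # Theorem 1: output distance bounded by the norm product
```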
Theorems 2 and 3 provide various insights that are also aligned with previous results in the literature. In particular, these theorems suggest that the robustness and generalization properties of a d-layer neural network are not associated with the number of network parameters per layer but rather with appropriate norms of the weight matrices. Bartlett [22] had also shown that the size of the network has no effect on the generalization error of a neural network by bounding the fat-shattering dimension as a function of the ℓ1 norm of the weights, so implying independence of the number of hidden units. Xu and Mannor [17] have also shown that the robustness of a neural network does not depend on its size. Similarly, in [23], it is argued that norm based regularization can improve the generalization ability of a deep neural network.

These theorems also suggest that a deeper network may generalize better than a shallower one, provided that the Frobenius norm of the weight matrices is guaranteed to be less than one. This result is aligned with similar claims by Neyshabur [23] resulting from matrix factorization approaches. In fact, it is possible to explicitly bound the norm of the weight matrices via reprojection using gradient descent [24], and regularization of the weight matrices has been empirically shown to result in better generalization [25].

Finally, Theorem 3 also suggests that – beyond the dependence on the number of training samples – the generalization ability of a d-layer neural network also depends directly on the complexity of the data space D captured via its covering number. In particular, the generalization error of more complex data spaces will tend to be higher than the generalization error of simpler data spaces.

IV. GENERALIZATION ERROR BOUNDS FOR INVERSE PROBLEMS

We now specialize the performance guarantees from general regression problems to inverse problems, with a focus on sparse approximation tasks.

We consider specifically the linear observation model in (1), with some additional assumptions:

• First, the space X consists of unit ℓ2-norm k-sparse vectors, i.e.

  X = {x ∈ R^Nx : ‖x‖₀ ≤ k, ‖x‖₂ ≤ 1}    (7)

• Second, the space Y consists of a linear projection of the input space induced by the observation matrix plus a perturbation associated with bounded ℓ2-norm noise, i.e.

  Y := {y = Ax + e ∈ R^Ny : x ∈ X, ‖e‖₂ ≤ η}    (8)

• Third, we assume that the linear mapping represented by the matrix A is Lipschitz continuous with Lipschitz constant L, i.e.

  ‖Ax1 − Ax2‖₂ ≤ L ‖x1 − x2‖₂    (9)

  for any x1, x2 ∈ X. Note that this condition is in practice obeyed by linear mappings that conform to the Restricted Isometry Property (RIP) [26].

We also consider that an appropriately trained d-layer network – using a training set S – is employed to deliver an estimate of the sparse vector x given the measurement vector y.

We can now immediately specialize the results appearing in Theorems 2 and 3 to this particular setting. The following upper bound on the covering number of the input space will be very useful [15]:

N(δ/2; X, ‖·‖₂) ≤ (Nx e / k)^k (1 + 4/δ)^k    (10)

Corollary 1. Consider the spaces X and Y in (7) and (8) equipped with the ℓ2 metric, the space D = X × Y equipped with the sup-metric ρ, and the Lipschitz continuous mapping in (9). It follows that a d-layer DNN based regressor ΞS(·) : Y → X trained on the training set S is

( (Nx e / k)^k (1 + 4/δ)^k , (1 + ∏_{i=1}^{d} ‖Wi‖F)(Lδ + 2η) )-robust

Sketch of Proof: For the system model given by eqs. (7), (8) and (9), the (Lδ + 2η)/2-covering number of the metric space (D, ρ) is upper bounded by the δ/2-covering number of X. This result together with Theorem 2 proves the corollary. A full version of the proof will appear in an upcoming manuscript [18].

Corollary 2. Consider again the spaces X and Y in (7) and (8) equipped with the ℓ2 metric, the space D = X × Y equipped with the sup-metric ρ, and the Lipschitz continuous mapping in (9). It follows that a d-layer DNN based regressor ΞS(·) : Y → X trained on a training set S consisting of m i.i.d. training samples obeys, with probability 1 − ζ, for any ζ > 0, the generalization error bound given by:

GE(ΞS) ≤ (1 + ∏_{i=1}^{d} ‖Wi‖F)(Lδ + 2η) + √( [2 (Nx e / k)^k (1 + 4/δ)^k log(2) + 2 log(1/ζ)] / m ) · M(S)    (11)

for any δ > 0, for some M(S) < ∞.

Proof: The result follows directly from Theorem 3 and Corollary 1. A full version of the proof will appear in an upcoming manuscript [18].

The results embodied in these two corollaries can be used to illuminate further the performance of sparse approximation based on deep learning networks. In particular, let us assume we employ a regularization strategy during the training phase constraining the Frobenius norm of the weight matrices to be less than one, such as reprojection using gradient descent [24].
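A reprojection step of the kind mentioned above can be realized by projecting each weight matrix back onto the unit Frobenius ball after every gradient update; the sketch below is a minimal rendering of this idea (with the gradient computation left abstract), not the specific procedure of [24].

```python
import numpy as np

def project_frobenius_ball(W, radius=1.0):
    """Scale W back onto {W : ||W||_F <= radius} if it has drifted outside."""
    norm = np.linalg.norm(W, 'fro')
    return W if norm <= radius else W * (radius / norm)

def sgd_step_with_reprojection(weights, grads, lr=1e-2, radius=1.0):
    """One gradient step followed by reprojection of every layer, so that
    ||W_i||_F <= 1 (and hence the product of the norms <= 1) is maintained."""
    return [project_frobenius_ball(W - lr * G, radius) for W, G in zip(weights, grads)]
```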
This leads immediately to another generalization error bound, holding with probability 1 − ζ:

GE(ΞS) ≤ 2(Lδ + 2η) + √( [2 (Nx e / k)^k (1 + 4/δ)^k log(2) + 2 log(1/ζ)] / m ) · M(S)    (12)

for any ζ > 0 and any δ > 0, and – by setting δ = o(m^(−1/k)) and by trivially setting ζ to be a function of m such that log(1/ζ)/m = o(1) – to another generalization bound behaving as follows:

GE(ΞS) ≤ 4·η + o(1)    (13)

This suggests that – with the increase of the number of training samples m – the generalization ability of a deep neural network is limited only by the level of the noise, independently of the parameters of the linear observation model, namely Ny, Nx, k, and L. Instead, these parameters mainly influence the speed at which the generalization error asymptotics kick in.

In turn, in view of the fact that the expected error is upper bounded by the sum of the empirical error and the generalization error, it is also possible to upper bound the expected sparse approximation error associated with a deep neural network as follows:

lexp(ΞS) ≤ lemp(ΞS) + GE(ΞS) ≤ lemp(ΞS) + 4·η + o(1)    (14)

Recent results suggest that deep neural networks – with a sufficient number of parameters – tend to memorize the training dataset [16], suggesting that the empirical error can be driven to zero and hence that

lexp(ΞS) ≤ GE(ΞS) ≤ 4·η + o(1)    (15)

We conclude by comparing the performance of a deep neural network to the performance of a well-known algorithm – BPDN – in sparse approximation problems.

Theorem 4 ([27]). Consider the linear observation model in (1) where x ∈ X = {x ∈ R^Nx : ‖x‖₀ ≤ k} and y ∈ Y = {y = Ax + e ∈ R^Ny : ‖x‖₀ ≤ k, ‖e‖₂ ≤ η}. Consider also the sparse approximation algorithm delivering an estimate of x from y given knowledge of A:

x̃ = arg min_{x ∈ R^Nx} ‖x‖₁  subject to  ‖y − Ax‖₂ ≤ ε

where ε ≥ η. It follows – under the assumption that k ≤ (1 + µ)/(4µ) – that the error of the approximation delivered by this algorithm can be bounded as follows:

‖x̃ − x‖₂ ≤ (η + ε) / √(1 − µ(4k − 1))

where µ corresponds to the mutual coherence of the matrix A.

This sparse approximation algorithm – along with other sparse approximation algorithms based on convex optimization approaches or greedy approaches (see [8] and references within) – is known to exhibit a phase transition. Here, when the data sparsity k ≤ (1 + µ)/(4µ), the algorithm provides a reconstruction whose error scales with the amount of noise η; this is akin to the behaviour of the sparse approximation delivered by a deep neural network.

On the other hand, when the data sparsity k > (1 + µ)/(4µ), the algorithm does not give any reconstruction guarantees, but the deep neural network may still be able to deliver an appropriate reconstruction of the sparse vector given its under-sampled linear observation. The authors of [9] have empirically demonstrated that the performance of a DNN degrades gradually as the number of measurements Ny is decreased.

V. CONCLUSIONS

This paper attempts to provide a rationale for the recently reported superb performance of deep learning approaches in a wide range of inverse problems.

In particular, by drawing on the robustness framework introduced by Xu and Mannor, this paper puts forth a generalization bound for deep neural network based reconstruction that can be specialized for a wide range of settings.

The specialization of this bound to sparse approximation problems – occurring in various signal and image processing tasks – has shown that deep neural networks can lead to generalization errors that depend on the noise level only. This – together with recently established results suggesting that deep neural networks can potentially memorize datasets – also suggests that the sparse approximation error incurred via the use of deep neural networks depends on the noise level only. This behaviour can be in sharp contrast with the behaviour of classical sparse approximation algorithms.

Future work will specialize these results to a wide range of inverse problems, including compressive sensing, image denoising, image deblurring, image super-resolution, and more.

ACKNOWLEDGEMENTS

This research is supported by the Commonwealth Scholarship Commission in the UK.

REFERENCES

[1] G. Peyré and J. Fadili, "Group sparsity with overlapping partition functions," in Signal Processing Conference, 2011 19th European. IEEE, 2011, pp. 303–307.
[2] A. Tarantola, Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, 2005, vol. 89.
[3] J. A. Tropp, "Greed is good: Algorithmic results for sparse approximation," IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.
[4] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
[5] M. F. Duarte and Y. C. Eldar, "Structured compressed sensing: From theory to applications," IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4053–4085, 2011.
[6] R. Chartrand and W. Yin, "Iteratively reweighted algorithms for compressive sensing," in Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on. IEEE, 2008, pp. 3869–3872.
[7] Y. Liu, Z. Zhan, J.-F. Cai, D. Guo, Z. Chen, and X. Qu, "Projected iterative soft-thresholding algorithm for tight frames in compressed sensing magnetic resonance imaging," IEEE Transactions on Medical Imaging, vol. 35, no. 9, pp. 2130–2140, 2016.
[8] J. A. Tropp and S. J. Wright, "Computational methods for sparse solution of linear inverse problems," Proceedings of the IEEE, vol. 98, no. 6, pp. 948–958, 2010.
[9] A. Mousavi, A. B. Patel, and R. G. Baraniuk, "A deep learning approach to structured signal recovery," in Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on. IEEE, 2015, pp. 1336–1343.
[10] J. Xie, L. Xu, and E. Chen, "Image denoising and inpainting with deep neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 341–349.
[11] Y. Yang, S. Cheng, Z. Xiong, and W. Zhao, "Wyner-Ziv coding based on TCQ and LDPC codes," in Signals, Systems and Computers, 2003. Conference Record of the Thirty-Seventh Asilomar Conference on, vol. 1. IEEE, 2003, pp. 825–829.
