This document summarizes a paper that analyzes the generalization behavior of deep neural networks for solving inverse problems. It introduces the problem of using deep learning to infer an underlying cause vector x from observation vector y given a linear operator A. While deep learning has shown success in applications like compressed sensing, its justification for inverse problems is unknown. The paper aims to quantify deep learning performance for inverse problems and compare it to other classical sparse reconstruction algorithms. It discusses training and testing deep networks to learn the mapping from observations to causes in inverse problems.
Electronics & Electrical Engineering Dept, University College London, London, UK — jaweria.amjad.16@ucl.ac.uk
Biomedical Engineering Dept, King's College London, London, UK — jure.sokolic@kcl.ac.uk
Electronics & Electrical Engineering Dept, University College London, London, UK — m.rodrigues@ucl.ac.uk
Abstract—This paper analyses the generalization behaviour of deep neural networks with a focus on their use in inverse problems. In particular, by leveraging the robustness framework by Xu and Mannor, we provide deep neural network based regression generalization bounds that are also specialized to sparse approximation problems. The proposed bounds show that the sparse approximation performance of deep neural networks can be potentially different from that of classical sparse reconstruction algorithms, with reconstruction errors limited only by the noise level.

I. INTRODUCTION

A large number of phenomena arising in science and engineering – including problems in medical imaging, remote sensing, chemometrics, and more – can be approximated using the linear observation model given by:

y = Ax + e   (1)

where y ∈ Y ⊆ R^{N_y} corresponds to a vector of observations, x ∈ X ⊆ R^{N_x} corresponds to a vector of underlying causes, e ∈ R^{N_y} is a vector modelling noise or other perturbations, and A ∈ R^{N_y × N_x} is a usually known linear operator modelling the relationship between the observations and the causes.

A very common problem – known as an inverse problem – then involves inferring the vector x from the vector y given knowledge of the linear operator A. However, for N_y < N_x, this problem is severely ill-posed so – without resorting to additional assumptions – a unique solution does not exist (even in the absence of noise).

A number of approaches to solve inverse problems have therefore been proposed over the past years leveraging the fact that many phenomena in nature admit some form of structure – such as sparsity, group sparsity, manifold structures, and more [1], [2] – that is key to restricting the space of possible solutions. In particular, the use of sparsity – exploiting the fact that the vector to be inferred from observations admits a sparse representation in some basis or frame – has led to a number of methods to approximate the solution of a linear inverse problem using greedy algorithms [3] or convex optimization based algorithms [4]. For example, under the assumption that the desired vector contains at most k ≪ N_x non-zero entries, the well-known Basis Pursuit Denoise (BPDN) algorithm delivers an estimate of the desired vector x from the observation vector y given knowledge of the linear operator A as follows:

x̂ = arg min_x ‖y − Ax‖₂²  s.t.  ‖x‖₁ ≤ k   (2)

where ‖·‖₂ and ‖·‖₁ are the ℓ₂ and ℓ₁ norms of a vector. Moreover, the BPDN estimate of the desired vector can also be shown to approximate very well the true vector provided that the linear operator A obeys various conditions [5]. Other state-of-the-art approaches exploiting sparsity to solve this class of linear inverse problems – such as iteratively reweighted least squares and iterative soft-thresholding methods – are reported in [6], [7]. However, these various approaches often require the linear operator to satisfy certain conditions to guarantee exact inference (in the absence of noise) or stable inference (in the presence of noise) of the desired vector from the observation vector [5], [8], failing drastically when these conditions are not met.

Another class of approaches to solve linear inverse problems has also recently emerged in view of advances in deep learning. In particular, the use of deep learning approaches to solve inverse problems involves two phases: (i) in the training phase, a number of pairs of training vectors x and y corresponding to one another are used to tune the set of parameters of a deep neural network (DNN) architecture in order to implement a mapping from y to x;¹ (ii) in the testing phase, a test vector y is mapped onto the vector x via the network. Interestingly, this procedure has been shown to perform exceedingly well in a wide variety of inverse problems such as compressed sensing [9], image denoising [10], image deblurring [11], image super-resolution [12], and many more [13]. However, a justification for such outstanding performance is currently unknown, because recent frameworks attempting to provide a rationale for the efficacy of DNNs primarily focus on classification tasks rather than the regression tasks arising in inverse problems [14], [15], [16].

This paper – which aims to fill in this gap – is motivated by two overarching questions:
• How can we quantify the performance of DNN approaches in solving inverse problems?
• How does the performance of DNN approaches compare to the performance of other classical approaches for solving inverse problems?

¹ Note that the operational principle associated with deep learning networks is different from that of classical approaches. Classical approaches to solve inverse problems attempt to directly invert the mapping from x to y. In contrast, deep learning approaches attempt to learn a mapping from y to x.

TABLE I
A LIST OF POINT-WISE ACTIVATION FUNCTIONS, [z]_σ = {σ(z_i)}_{i≤N_i}

  Name               | Function σ(z_i)         | Derivative σ'(z_i)
  Hyperbolic tangent | tanh(z_i)               | 1 − σ(z_i)²
  ReLU               | max(z_i, 0)             | 1 if z_i > 0; 0 otherwise
  Sigmoid            | 1/(1 + exp(−z_i))       | σ(z_i)(1 − σ(z_i))
  Softmax            | exp(z_i)/Σ_j exp(z_j)   | σ(z_i)(1 − σ(z_i)) if i = j; −σ(z_i)σ(z_j) otherwise
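As a small NumPy illustration (ours, not the paper's code), the point-wise activations of Table I and their derivatives can be written as follows; the softmax Jacobian encodes both derivative cases of the table:

```python
import numpy as np

# Point-wise activations from Table I and their derivatives,
# written for NumPy arrays z. An illustrative sketch only.

def tanh(z):
    return np.tanh(z)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2          # 1 - sigma(z_i)^2

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0).astype(float)          # 1 if z_i > 0; 0 otherwise

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)                  # sigma(z_i)(1 - sigma(z_i))

def softmax(z):
    e = np.exp(z - np.max(z))             # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    # d softmax_i / d z_j = sigma_i (delta_ij - sigma_j):
    # sigma_i(1 - sigma_i) on the diagonal, -sigma_i sigma_j off it.
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)
```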
In particular, in our attempt to answer these questions, we build upon the robustness framework introduced by Xu and Mannor in [17]: (i) we introduce new DNN based regression generalization bounds; (ii) we show how these bounds can be used to quantify the performance of DNNs in solving inverse problems; and (iii) we also show how the performance of a DNN compares with the performance of other classical approaches, notably BPDN, for solving inverse problems.

The remainder of the paper is organized as follows: We start by introducing our problem set-up in Section II. We then provide DNN generalization bounds applicable to general regression problems in Section III. We also provide specializations of these generalization bounds applicable to typical inverse problems in Section IV. This opens up the possibility of comparing DNN based approaches to classical approaches to solving inverse problems. Finally, concluding remarks are drawn in Section V.

[Due to space limitations, the proofs appear in an upcoming preprint [18].]

II. SETUP

We consider the problem of estimating a vector x ∈ X from another vector y ∈ Y, where the pair of vectors s = (x, y) is drawn from the sample space D = X × Y according to some distribution µ, using a supervised learning setup. We also consider we have access to a set of m training samples S = {(x_i, y_i)}_{i≤m}, drawn independently and identically distributed (i.i.d.) according to µ, to learn a regressor

Ξ_S(·) : Y → X   (3)

that can then be used to deliver an estimate of the desired vector x ∈ X given the observation vector y ∈ Y.

Our focus is on the use of DNN based regressors, corresponding to multi-layered architectures consisting of a series of linear and non-linear transformations that can learn increasingly abstract concepts with layer depth [19]; see Fig. 1 (a d-layer deep neural network, with an input layer, several hidden layers, and an output layer). In particular, we can express the i-th layer output x̃_i ∈ R^{N_i} in terms of the i-th layer input x̃_{i−1} ∈ R^{N_{i−1}} as follows:

x̃_i = [W_i x̃_{i−1} + b_i]_σ

where W_i ∈ R^{N_i × N_{i−1}} is the i-th layer weight matrix, b_i ∈ R^{N_i} is the i-th layer bias vector, and [·]_σ represents an element-wise nonlinear activation function. The network input is x̃_0 = y and the network output is Ξ_S(y) = x̃_d. Activation functions such as hyperbolic tangent, rectified linear units (ReLU), and sigmoid are normally used in hidden layers, and softmax is typically preferred in the output layer. See Table I.

The various hyper-parameters associated with a deep neural network can be learnt using optimization techniques based on training data [24]. State-of-the-art approaches include [20], [21].

We will then measure the quality of the DNN estimate of the vector x given the vector y, which has been learnt using the training set S, using the loss function:

ℓ(x̃, x) = ℓ(Ξ_S(y), x) = ‖Ξ_S(y) − x‖₂

In particular, we are interested in characterising the generalization error (GE) associated with DNN regressors, given by:

GE(Ξ_S) = |ℓ_exp(Ξ_S) − ℓ_emp(Ξ_S)|   (4)

where

ℓ_exp(Ξ_S) = E[ℓ(Ξ_S(y), x)]

corresponds to the expected error associated with a pair of vectors (x, y) and

ℓ_emp(Ξ_S) = (1/m) Σ_{i≤m} ℓ(Ξ_S(y_i), x_i)

corresponds to the empirical error associated with the training vectors {(x_i, y_i)}_{i≤m}.

We deliver in the sequel a characterization of the generalization error of DNN based regressors where, for technical reasons, we will be assuming that both the input space X and the output space Y are compact with respect to the ℓ₂-metric and that the sample space D = X × Y is compact with respect to the sup-metric. We want to characterize the performance of a DNN regressor and compare it with traditional methods for solving inverse problems; this section has introduced the notation and framework underlying our approach.
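The layer recursion and the two error measures of Section II can be sketched in a few lines of NumPy. This is our illustration under assumed shapes, not the authors' implementation; the Frobenius-norm product computed at the end is the quantity that drives the bounds of the next section:

```python
import numpy as np

def forward(y, weights, biases, act=np.tanh):
    """DNN regressor: x_i = act(W_i x_{i-1} + b_i) for i = 1..d, with x_0 = y."""
    x = y
    for W, b in zip(weights, biases):
        x = act(W @ x + b)
    return x

def empirical_error(samples, weights, biases):
    """l_emp = (1/m) * sum_i ||Xi_S(y_i) - x_i||_2 over training pairs (x_i, y_i)."""
    return float(np.mean([np.linalg.norm(forward(y, weights, biases) - x)
                          for x, y in samples]))

def frobenius_product(weights):
    """Product of the layer-wise Frobenius norms, prod_i ||W_i||_F."""
    return float(np.prod([np.linalg.norm(W, 'fro') for W in weights]))
```

Note that `empirical_error` is only an estimate of the expected error; the gap between the two is precisely the generalization error GE in (4).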
III. GENERALIZATION ERROR BOUNDS: GENERAL CASE

We now derive performance guarantees for DNN based regression by capitalizing on the robustness framework [17]. A very important element of the robustness framework is the notion of algorithmic robustness.

Definition 1 (Algorithmic Robustness [17]). Let S denote the training set and D denote the sample space. A learning algorithm is said to be (K, ε(S))-robust if the sample space D can be partitioned into K disjoint sets K_k, k = 1, …, K, such that for all (x_i, y_i) ∈ S and all (x, y) ∈ D,

(x_i, y_i), (x, y) ∈ K_k ⟹ |ℓ(Ξ_S(y_i), x_i) − ℓ(Ξ_S(y), x)| ≤ ε(S)   (5)

In other words, a learning algorithm is robust provided that the losses of a training sample and a test sample belonging to the same partition are close.

The relevance of this definition is associated with the fact that it provides a route to study the generalization ability of various learning algorithms, including deep neural networks [14]. However, Sokolić et al. [14] have provided generalization bounds for DNN based classifiers in lieu of DNN based regressors, so the results cannot be used to cast insight on the performance of deep neural networks in solving inverse problems.

We will therefore generalize the results in [14] from the classification to the regression setting. We first show that a d-layer DNN based regressor satisfies a Lipschitz continuity condition.

Theorem 1 (Adapted from Theorem 2 and Lemma 1 in [14]). Consider a d-layer DNN based regressor Ξ_S(·) : Y → X. Then, for any y₁, y₂ ∈ Y, it follows that

‖Ξ_S(y₁) − Ξ_S(y₂)‖₂ ≤ ∏_{i=1}^{d} ‖W_i‖_F · ‖y₁ − y₂‖₂

where ‖·‖_F denotes the Frobenius norm of a matrix.

Proof: We only outline the proof. The result follows from Theorem 2 in [14], which proves that the ratio between the Euclidean distances at the output and at the input of a d-layer DNN is bounded by the ℓ₂-norm of the Jacobian matrix, which is in turn upper bounded by the product of the Frobenius norms of the weight matrices [14]. A full version of the proof will appear in an upcoming manuscript [18].

We can now show the main results. The following theorem characterizes the robustness of a d-layer neural network.

Theorem 2 (Robustness). Consider that X and Y are compact spaces with respect to the ℓ₂ metric. Consider also the sample space D = X × Y equipped with a sup metric ρ. It follows that a d-layer DNN based regressor Ξ_S(·) : Y → X trained on the training set S is

( N(ψ/2; D, ρ), (1 + ∏_{i=1}^{d} ‖W_i‖_F) ψ )-robust

for any ψ > 0, where N(ψ/2; D, ρ) < ∞ represents the covering number of the metric space (D, ρ) using metric balls of radius ψ/2.

Proof: We provide a sketch of the proof only; a full version will appear in an upcoming manuscript [18]. The loss function of a Lipschitz continuous DNN can be shown to be Lipschitz continuous itself using the triangle and reverse Minkowski inequalities. Thus the difference of the losses between two samples is upper bounded by the product of the Lipschitz constant (1 + ∏_{i=1}^{d} ‖W_i‖_F) and the distance ψ between the samples, and so the claim follows.

The following theorem – building upon the previous one – now characterizes a bound on the generalization error of a d-layer neural network.

Theorem 3 (GE Bound). Consider again that X and Y are compact spaces with respect to the ℓ₂ metric. Consider also the sample space D = X × Y equipped with a sup metric ρ. It follows that a d-layer DNN based regressor Ξ_S(·) : Y → X trained on a training set S consisting of m i.i.d. training samples obeys, with probability 1 − ζ, for any ζ > 0, the generalization error bound given by:

GE(Ξ_S) ≤ (1 + ∏_{i=1}^{d} ‖W_i‖_F) ψ + M(S) √( (2 N(ψ/2; D, ρ) log 2 + 2 log(1/ζ)) / m )   (6)

for any ψ > 0, where M(S) < ∞.

Proof: This result follows from the generalization error bound provided in [17]: for a (N(ψ/2; D, ρ), (1 + ∏_{i=1}^{d} ‖W_i‖_F) ψ)-robust DNN, the proof is straightforward. A full version of the proof will appear in an upcoming manuscript [18].

Theorems 2 and 3 provide various insights that are also aligned with previous results in the literature. In particular, these theorems suggest that the robustness and generalization properties of a d-layer neural network are not associated with the number of network parameters per layer but rather with appropriate norms of the weight matrices. Bartlett [22] had also shown that the size of the network has no effect on the generalization error of a neural network, by bounding the fat-shattering dimension as a function of the ℓ₁ norm of the weights, thereby implying independence of the number of hidden units. Xu and Mannor [17] have also shown that the robustness of a neural network does not depend on its size. Similarly, in [23] it is argued that norm based regularization can improve the generalization ability of a deep neural network.

These theorems also suggest that a deeper network may generalize better than a shallower one, provided the Frobenius norms of the weight matrices are guaranteed to be less than one. This result is aligned with similar claims by Neyshabur [23] resulting from matrix factorization approaches. In fact, it is possible to explicitly bound the norms of the weight matrices via reprojection using gradient descent [24], and regularization of the weight matrices has been empirically shown to result in better generalization [25].

Finally, Theorem 3 also suggests that – beyond the dependence on the number of training samples – the generalization ability of a d-layer neural network also depends directly on the complexity of the data space D, captured via its covering number. In particular, the generalization error on more complex data spaces will tend to be higher than the generalization error on simpler data spaces.

IV. GENERALIZATION ERROR BOUNDS FOR INVERSE PROBLEMS

We now specialize the performance guarantees from general regression problems to inverse problems, with a focus on sparse approximation tasks. We consider specifically the linear observation model in (1), with some additional assumptions:
• First, the space X consists of unit ℓ₂-norm k-sparse vectors, i.e.

X = {x ∈ R^{N_x} : ‖x‖₀ ≤ k, ‖x‖₂ ≤ 1}   (7)

• Second, the space Y corresponds to a linear projection of the input space induced by the observation matrix, plus a perturbation associated with bounded ℓ₂-norm noise, i.e.

Y := {y = Ax + e ∈ R^{N_y} : x ∈ X, ‖e‖₂ ≤ η}   (8)

• Third, we assume that the linear mapping represented by the matrix A is Lipschitz continuous with Lipschitz constant L, i.e.

‖Ax₁ − Ax₂‖₂ ≤ L ‖x₁ − x₂‖₂   (9)

for any x₁, x₂ ∈ X. Note that this condition is in practice obeyed by linear mappings that conform to the Restricted Isometry Property (RIP) [26].

We also consider that an appropriately trained d-layer network – using a training set S – is employed to deliver an estimate of the sparse vector x given the measurement vector y.

We can now immediately specialize the results appearing in Theorems 2 and 3 to this particular setting. The following upper bound on the covering number of the input space will be very useful [15]:

N(δ/2; X, ‖·‖₂) ≤ (N_x e / k)^k (1 + 4/δ)^k   (10)

Corollary 1. Consider the spaces X and Y in (7) and (8) equipped with an ℓ₂ metric, the space D = X × Y equipped with the sup-metric ρ, and the Lipschitz continuous mapping in (9). It follows that a d-layer DNN based regressor Ξ_S(·) : Y → X trained on the training set S is

( (N_x e / k)^k (1 + 4/δ)^k, (1 + ∏_{i=1}^{d} ‖W_i‖_F)(Lδ + 2η) )-robust

Sketch of Proof: For the system model given by eqs. (7), (8) and (9), the (Lδ + 2η)/2-covering number of the metric space (D, ρ) is upper bounded by the δ/2-covering number of X. This result together with Theorem 2 proves the corollary. A full version of the proof will appear in an upcoming manuscript [18].

Corollary 2. Consider again the spaces X and Y in (7) and (8) equipped with an ℓ₂ metric, the space D = X × Y equipped with the sup-metric ρ, and the Lipschitz continuous mapping in (9). It follows that a d-layer DNN based regressor Ξ_S(·) : Y → X trained on a training set S consisting of m i.i.d. training samples obeys, with probability 1 − ζ, for any ζ > 0, the generalization error bound given by:

GE(Ξ_S) ≤ (1 + ∏_{i=1}^{d} ‖W_i‖_F)(Lδ + 2η) + M(S) √( (2 (N_x e / k)^k (1 + 4/δ)^k log 2 + 2 log(1/ζ)) / m )   (11)

for any δ > 0, for some M(S) < ∞.

Proof: The result follows directly from Theorem 3 and Corollary 1. A full version of the proof will appear in an upcoming manuscript [18].

The results embodied in these two corollaries can be used to illuminate further the performance of sparse approximation based on deep learning networks. In particular, let us assume we employ a regularization strategy during the training phase constraining the Frobenius norms of the weight matrices to be less than one, such as reprojection using gradient descent [24]. This leads immediately to another generalization error bound, holding with probability 1 − ζ:

GE(Ξ_S) ≤ 2(Lδ + 2η) + M(S) √( (2 (N_x e / k)^k (1 + 4/δ)^k log 2 + 2 log(1/ζ)) / m )   (12)

for any ζ > 0 and any δ > 0, and – by letting δ → 0 as a function of m slowly enough that δ^{−k}/m = o(1), and by setting ζ to be a function of m such that log(1/ζ)/m = o(1) – to another generalization bound behaving as follows:

GE(Ξ_S) ≤ 4·η + o(1)   (13)

This suggests that – with the increase of the number of training samples m – the generalization ability of a deep neural network is limited only by the level of the noise, independently of the parameters of the linear observation model, namely N_y, N_x, k, and L. Instead, these parameters mainly influence the speed at which the generalization error asymptotics kick in.

In turn, in view of the fact that the expected error is upper bounded by the sum of the empirical error and the generalization error, it is also possible to upper bound the expected sparse approximation error associated with a deep neural network as follows:

ℓ_exp(Ξ_S) ≤ ℓ_emp(Ξ_S) + GE(Ξ_S) ≤ ℓ_emp(Ξ_S) + 4·η + o(1)   (14)
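As a rough numerical illustration (ours, not from the paper), the right-hand side of the bound (12) can be evaluated directly, assuming M(S) = 1, weight matrices with Frobenius norms below one (so the Lipschitz factor is at most 2), and toy values for N_x, k, L, η, and ζ. Shrinking δ like m^{−1/(2k)} keeps δ^{−k}/m = m^{−1/2} = o(1), and the bound approaches the noise floor 4η of (13):

```python
import math

def ge_bound(m, Nx=100, k=5, L=1.5, eta=0.05, zeta=0.01):
    """Evaluate the right-hand side of (12) with M(S) = 1 (toy parameters)."""
    delta = m ** (-1.0 / (2 * k))          # delta^-k / m = m^-1/2 -> 0
    # Covering-number bound (10): (Nx e / k)^k (1 + 4/delta)^k
    cover = (Nx * math.e / k) ** k * (1 + 4 / delta) ** k
    tail = math.sqrt((2 * cover * math.log(2) + 2 * math.log(1 / zeta)) / m)
    return 2 * (L * delta + 2 * eta) + tail

# For these toy parameters the covering term is astronomically large at
# practical sample sizes, so the decay toward 4*eta = 0.2 only becomes
# visible at (unrealistically) large m -- the model parameters govern the
# speed of the asymptotics, not the limit itself.
```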
Recent results suggest that deep neural networks – with a sufficient number of parameters – tend to memorize the training dataset [16], i.e. ℓ_emp(Ξ_S) ≈ 0, suggesting that

ℓ_exp(Ξ_S) ≤ GE(Ξ_S) ≤ 4·η + o(1)   (15)

We conclude by comparing the performance of a deep neural network to the performance of a well-known algorithm – BPDN – in sparse approximation problems.

Theorem 4 ([27]). Consider the linear observation model in (1) where x ∈ X = {x ∈ R^{N_x} : ‖x‖₀ ≤ k} and y ∈ Y = {y = Ax + e ∈ R^{N_y} : ‖x‖₀ ≤ k, ‖e‖₂ ≤ η}. Consider also the sparse approximation algorithm delivering an estimate of x from y given knowledge of A:

x̃ = arg min_{x ∈ R^{N_x}} ‖x‖₁  subject to  ‖y − Ax‖₂ ≤ ε

where ε ≥ η. It follows – under the assumption that k ≤ (1 + µ)/(4µ) – that the error of the approximation delivered by this algorithm can be bounded as follows:

‖x̃ − x‖₂ ≤ (η + ε) / √(1 − µ(4k − 1))

where µ corresponds to the mutual coherence of the matrix A.

This sparse approximation algorithm – along with other sparse approximation algorithms based on convex optimization approaches or greedy approaches (see [8] and references therein) – is known to exhibit a phase transition. Here, when the data sparsity k ≤ (1 + µ)/(4µ), the algorithm provides a reconstruction error that scales with the amount of noise η; this is akin to the behaviour of the sparse approximation delivered by a deep neural network. On the other hand, when the data sparsity k > (1 + µ)/(4µ) the algorithm does not give any reconstruction guarantees, but the deep neural network may still be able to deliver an appropriate reconstruction of the sparse vector given its under-sampled linear observation. The authors of [9] have empirically demonstrated that the performance of a DNN degrades gradually as the number of measurements N_y is decreased.

V. CONCLUSIONS

This paper attempts to provide a rationale for the recently reported superb performance of deep learning approaches in a wide range of inverse problems. In particular, by drawing on the robustness framework introduced by Xu and Mannor, this paper puts forth a generalization bound for deep neural network based reconstruction that can be specialized to a wide range of settings.

The specialization of this bound to sparse approximation problems – occurring in various signal and image processing tasks – has shown that deep neural networks can lead to generalization errors that depend on the noise level only. This – together with recently established results suggesting that deep neural networks can potentially memorize datasets – also suggests that the sparse approximation error incurred via the use of deep neural networks depends on the noise level only. This behaviour can be in sharp contrast with the behaviour of classical sparse approximation algorithms.

Future work will specialise these results to a wide range of inverse problems, including compressive sensing, image denoising, image deblurring, image super-resolution, and more.

ACKNOWLEDGEMENTS

This research is supported by the Commonwealth Scholarship Commission in the UK.

REFERENCES

[1] G. Peyré and J. Fadili, "Group sparsity with overlapping partition functions," in Proc. 19th European Signal Processing Conference (EUSIPCO). IEEE, 2011, pp. 303–307.
[2] A. Tarantola, Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, 2005, vol. 89.
[3] J. A. Tropp, "Greed is good: Algorithmic results for sparse approximation," IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, 2004.
[4] E. J. Candès, J. Romberg, and T. Tao, "Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
[5] M. F. Duarte and Y. C. Eldar, "Structured compressed sensing: From theory to applications," IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4053–4085, 2011.
[6] R. Chartrand and W. Yin, "Iteratively reweighted algorithms for compressive sensing," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2008, pp. 3869–3872.
[7] Y. Liu, Z. Zhan, J.-F. Cai, D. Guo, Z. Chen, and X. Qu, "Projected iterative soft-thresholding algorithm for tight frames in compressed sensing magnetic resonance imaging," IEEE Transactions on Medical Imaging, vol. 35, no. 9, pp. 2130–2140, 2016.
[8] J. A. Tropp and S. J. Wright, "Computational methods for sparse solution of linear inverse problems," Proceedings of the IEEE, vol. 98, no. 6, pp. 948–958, 2010.
[9] A. Mousavi, A. B. Patel, and R. G. Baraniuk, "A deep learning approach to structured signal recovery," in Proc. 53rd Annual Allerton Conference on Communication, Control, and Computing. IEEE, 2015, pp. 1336–1343.
[10] J. Xie, L. Xu, and E. Chen, "Image denoising and inpainting with deep neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 341–349.
[11] Y. Yang, S. Cheng, Z. Xiong, and W. Zhao, "Wyner-Ziv coding based on TCQ and LDPC codes," in Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 1. IEEE, 2003, pp. 825–829.