Training-Free, Single-Image Super-Resolution Using a Dynamic Convolutional Network
Abstract—The typical approach for solving the problem of single-image super-resolution (SR) is to learn a nonlinear mapping between the low-resolution (LR) and high-resolution (HR) representations of images in a training set. Training-based approaches can be tuned to give high accuracy on a given class of images, but they call for retraining if the HR → LR generative model deviates or if the test images belong to a different class, which limits their applicability. In contrast, we propose a solution that does not require a training dataset. Our method relies on constructing a dynamic convolutional network (DCN) to learn the relation between consecutive scales of the Gaussian and Laplacian pyramids. The relation is in turn used to predict the detail at a finer scale, thus leading to SR. Comparisons with state-of-the-art techniques on standard datasets show that the proposed DCN approach results in about 0.8 and 0.3 dB gain in peak signal-to-noise ratio for 2× and 3× SR, respectively. The structural similarity index is on par with that of the competing techniques.

Index Terms—Convolutional neural network (CNN), deep learning, dynamic convolutional network (DCN), Gaussian/Laplacian pyramids, super-resolution (SR).

I. INTRODUCTION

SINGLE-IMAGE super-resolution (SR) is an important tool in applications such as biomedical imaging [1] and face hallucination [2]. In single-image SR [3], [4], one either infers local image properties from the low-resolution (LR) image or learns them over a collection of given high-resolution (HR)–LR pairs. The learning approaches employ dictionaries or neural networks, which capture the LR–HR association. In this letter, we develop a technique that infers HR image features starting from the LR image without going through the standard process of training, obviating the need for a training dataset. Before proceeding further, we review recent techniques that specifically tackle the single-image SR problem and highlight their strengths and weaknesses. Some of these techniques will be used for making performance comparisons. A thorough review of single-image SR techniques is available in [5].

A. Related Literature

A landmark contribution was made recently by Yang et al. [6], [7], who trained dictionaries for LR and HR image patches. The key assumption is that the LR and HR patches have the same sparse representation in their respective dictionaries. The sparse representation corresponding to an LR patch from an unseen image is used to synthesize the corresponding HR patch, thus leading to SR. With this as the central idea, Yang et al. [8] also developed a coupled LR/HR dictionary model optimized using a joint cost function. Kim and Kwon used kernel ridge regression and incorporated image priors to suppress ringing artifacts [9]. Timofte et al. proposed anchored neighborhood regression [10], in which they solve a ridge regression problem with neighborhood constraints, leading to a closed-form solution for the regression coefficients. This approach is fast and gives qualitatively the same results as the competing techniques. In [11], they developed an advanced version of the algorithm, which learns from the patches in the local neighborhood of the anchor patch from the training dataset and not from the dictionary. These were by far the best performing techniques before the advent of neural-network SR approaches.

Within the learning paradigm, the SR problem is essentially posed as one of discovering the nonlinear association between the LR and HR patches. Dong et al. used a convolutional neural network (CNN) [12], [13] to learn the end-to-end mapping between LR and HR pairs [14], [15]. The training is data-intensive and time-consuming, whereas the run-time complexity is low, leading to fast SR. Recently, Dong et al. proposed a threefold acceleration strategy:
1) introducing a deconvolution layer at the end of the CNN;
2) reducing the dimensionality of the input feature; and
3) employing smaller filter sizes,
all of which resulted in a 40× speed-up. This method is referred to as fast SRCNN [16]. Shi et al. developed an efficient subpixel CNN to perform SR [17]. The early layers operate on the LR image, whereas the final layer relies on subpixel convolution to upscale the image. Kim et al. proposed a very deep CNN architecture that learns the residual images instead of the HR image [18]. They showed that the very deep SR technique overcomes the limitations of SRCNN. They also proposed a method called deep recursive CNN, which incorporates skip connections between each hidden layer and the output layer.
$$d_0 = x_0 - \hat{x}_0 = (I_1 - \hat{G}_1 U_1 L_1 G_1)\, x_0, \qquad (3)$$

where $I_1$ is the identity matrix. Similarly,

$$d_1 = (I_2 - \hat{G}_2 U_2 L_2 G_2)\, x_1. \qquad (4)$$

From (1), (3), and (4), we get an expression for the coarser-level difference image in terms of the finer-level one as

$$d_1 = \underbrace{(I_2 - \hat{G}_2 U_2 L_2 G_2)\, L_1 G_1\, (I_1 - \hat{G}_1 U_1 L_1 G_1)^{-1}}_{P_1}\, d_0.$$

$P_1$ is rank-deficient and hence not invertible, making the problem of recovering $d_0$ given $P_1$ and $d_1$ ill-posed. Moreover, $P_1$ is a huge matrix, of size $mn \times mn$ for an $m \times n$ image. For instance, for a $256 \times 256$ image, storing $P_1$ alone would require about 16 GB of memory in double-precision representation. The high demand on memory makes it impractical to handle the image all at once. The mapping between $d_1$ and $d_0$ is therefore modeled using a DCN, and the filter-sets are learnt based on the input LR image.
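To make the pyramid relation concrete, the following is a minimal Python sketch (ours, not the authors' code) that computes the Gaussian levels and the detail images of (3) and (4), with OpenCV's pyrDown/pyrUp standing in for the smoothing, decimation, upsampling, and interpolating operators $G$, $L$, $U$, $\hat{G}$; the function name and the file name are placeholders.

```python
import cv2
import numpy as np

def gauss_laplace_levels(x0: np.ndarray):
    """Return Gaussian levels (x0, x1, x2) and Laplacian details (d0, d1)."""
    x0 = x0.astype(np.float64)
    x1 = cv2.pyrDown(x0)   # coarser Gaussian level: smooth (G1), decimate (L1)
    x2 = cv2.pyrDown(x1)   # next level: G2, L2
    # Detail images, cf. eqs. (3) and (4): subtract the upsampled-and-
    # interpolated coarser level from the finer one.
    d0 = x0 - cv2.pyrUp(x1, dstsize=(x0.shape[1], x0.shape[0]))
    d1 = x1 - cv2.pyrUp(x2, dstsize=(x1.shape[1], x1.shape[0]))
    return (x0, x1, x2), (d0, d1)

# Usage on a grayscale LR input (placeholder file name):
img = cv2.imread("input_lr.png", cv2.IMREAD_GRAYSCALE)
(_, _, _), (d0, d1) = gauss_laplace_levels(img)
```

The DCN described next is fit to map the coarser detail d1 to the finer detail d0 of this same image; no external training set is involved.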
A. Network Architecture

We use an $L$-layer CNN with the rectified linear unit (ReLU) activation function $\eta$ for modeling the nonlinearity. For moderate network depths, the ReLU activation is known to overcome the problem of vanishing gradients [30]. Since the images in the Laplacian pyramid, unlike those in the Gaussian pyramid, contain both positive and negative values, we separate the detail images into positive ($D_1^+$) and negative ($-D_1^-$) components. For instance, $D_1$ is divided into $D_1^+$ and $D_1^-$, so that $D_1 = D_1^+ - D_1^-$, where $D_1^+, D_1^- \succeq 0$. Next, we upsample $D_1^+$ and $D_1^-$ to get $\hat{D}_0^+$ and $\hat{D}_0^-$, respectively, and pass them through two CNNs. Each stage of the CNN is a tensor filter. Correspondingly, for the positive-part and negative-part images, we have the filters $K_i^+$ and $K_i^-$, respectively, for $i = 1, 2, \ldots, L$. These filters take the upsampled images $\hat{D}_0^+$ and $\hat{D}_0^-$ as input and predict the detail images at the immediately finer scale, resulting in $\tilde{D}_0^+$ and $\tilde{D}_0^-$, respectively. The prediction equations are as follows:

$$\tilde{D}_0^+ = \eta(K_L^+ \ast \eta(\cdots \ast \eta(K_1^+ \ast \hat{D}_0^+))),$$
$$\tilde{D}_0^- = \eta(K_L^- \ast \eta(\cdots \ast \eta(K_1^- \ast \hat{D}_0^-))),$$

where $\eta$ is the ReLU activation and $\ast$ denotes convolution. The positive-part prediction $\tilde{D}_0^+$ and the negative-part prediction $\tilde{D}_0^-$ are combined to give $\tilde{D}_0 = \tilde{D}_0^+ - \tilde{D}_0^-$. The cost function measures the fidelity between the predicted detail and the true one, and is expressed using the Frobenius norm as

$$\mathcal{C}_{\mathrm{CNN}} = \mathcal{C}_{\mathrm{CNN}}^{+} + \mathcal{C}_{\mathrm{CNN}}^{-} = \|D_0 - \tilde{D}_0\|_F^2. \qquad (5)$$
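The two-branch prediction and the cost of (5) can be sketched in PyTorch as follows. This is our illustration, not the authors' released code: the channel widths, the bicubic stand-in for the upsampling step, and the toy tensors are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    """One of the two CNNs (filters K_i^+ or K_i^-), here with L = 3 layers."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 9, padding=4), nn.ReLU(),   # eta after every stage
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 5, padding=2), nn.ReLU(),   # output stays nonnegative
        )
    def forward(self, d_hat):
        return self.net(d_hat)

def split_pos_neg(d):
    # D = D+ - D-, with D+ and D- elementwise nonnegative
    return d.clamp(min=0), (-d).clamp(min=0)

def upsample2(t):
    # stand-in for the upsampling-and-interpolation step (U, G-hat)
    return F.interpolate(t, scale_factor=2, mode="bicubic", align_corners=False)

pos_net, neg_net = Branch(), Branch()
d1 = torch.randn(1, 1, 64, 64)           # coarse detail D1 (toy stand-in)
d0_true = torch.randn(1, 1, 128, 128)    # finer detail D0 from the pyramid
d1_pos, d1_neg = split_pos_neg(d1)
d0_tilde = pos_net(upsample2(d1_pos)) - neg_net(upsample2(d1_neg))
loss_cnn = torch.sum((d0_true - d0_tilde) ** 2)  # eq. (5), squared Frobenius norm
```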
B. Incorporating Regularization

Since the detail images are structured and sparse, we incorporate appropriate regularizers in the optimization.

2) Contrast-Enhancing Regularization: The Laplacian pyramid contains high-frequency details largely comprising texture and edges. In order to preserve texture and edge content, local contrast-enhancing loss functions have been suggested by Liu et al. [31]. We incorporate this as an additional regularizer acting on the predicted HR image $\tilde{X}_{-1} = \hat{X}_{-1} + \tilde{D}_{-1}$, in a $3 \times 3$ patch-based fashion. Since not all patches contain edges or texture, we use a weight $w_k$ for the $k$th patch $P_k$ that indicates whether the patch contains an edge/texture ($w_k = 1$) or not ($w_k = 0$). The resulting contrast regularizer is given by

$$\mathcal{C}_{\mathrm{contrast}} = -\sum_{k} w_k \sum_{p_i, p_j \in P_k} \left(\tilde{X}_{-1}(p_i) - \tilde{X}_{-1}(p_j)\right)^2, \qquad (6)$$

where $p_i$ and $p_j$ denote the pixel coordinates in the patch. In order to determine the 1/0 weight assignment for $w_k$, we use a Canny edge-map of the upsampled image $\hat{X}_{-1}$.

3) Smoothness-Preserving Regularization: To enforce local smoothness in regions lacking edges/texture, we add a complementary regularization functional:

$$\mathcal{C}_{\mathrm{smoothness}} = \sum_{k} \bar{w}_k \sum_{p_i, p_j \in P_k} \left(\tilde{X}_{-1}(p_i) - \tilde{X}_{-1}(p_j)\right)^2,$$

where the complementary weights $\bar{w}_k$ are set to 1 for smooth patches and 0 otherwise. The contrast regularizer carries a negative sign since it has to be maximized, whereas the smoothness regularizer does not, since it has to be minimized to favor a smooth reconstruction.
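The patch weights and the two regularizers can be computed as in the following NumPy sketch (ours); the Canny thresholds and the non-overlapping 3 × 3 tiling are assumptions.

```python
import cv2
import numpy as np

def patch_regularizers(x_tilde, x_hat, patch=3):
    """Return (C_contrast, C_smoothness) for a predicted HR image x_tilde.

    x_hat is the upsampled image, used only to build the Canny edge map
    that sets w_k = 1 on edge/texture patches. Thresholds are assumed.
    """
    edges = cv2.Canny(x_hat.astype(np.uint8), 100, 200)
    c_contrast, c_smooth = 0.0, 0.0
    h, w = x_tilde.shape
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            p = x_tilde[r:r + patch, c:c + patch].ravel()
            # sum of squared pairwise intensity differences within the patch
            spread = np.sum((p[:, None] - p[None, :]) ** 2) / 2.0
            if edges[r:r + patch, c:c + patch].any():   # w_k = 1: edge/texture
                c_contrast -= spread                    # eq. (6), to be maximized
            else:                                       # w_k-bar = 1: smooth
                c_smooth += spread                      # to be minimized
    return c_contrast, c_smooth
```

For use inside the optimization loop below, the same computation would be written with differentiable tensor operations so that gradients reach the CNN filters.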
The total objective for learning the CNN filters is given by

$$\mathcal{C}_{\mathrm{total}} = \mathcal{C}_{\mathrm{CNN}} + \lambda_1 \mathcal{C}_{\mathrm{sparsity}} + \lambda_2 \mathcal{C}_{\mathrm{contrast}} + \lambda_3 \mathcal{C}_{\mathrm{smoothness}},$$

where $\{\lambda_1, \lambda_2, \lambda_3\}$ are the regularization weights. We optimize the cost $\mathcal{C}_{\mathrm{total}}$ with respect to the CNN filters using the ADAM optimization technique [32]. Denote the optimized filter-set by $\{\hat{K}_i^+, \hat{K}_i^-\}$. Our SR algorithm is based on the assumption that the relationship that exists between $D_1$ and $D_0$ also holds between $D_0$ and $D_{-1}$. The assumption is justified given the interscale correlation and the recurrence of image patches [33], [34]. Therefore, we predict $\tilde{D}_{-1}^+$ and $\tilde{D}_{-1}^-$ as follows:

$$\tilde{D}_{-1}^+ = \eta(\hat{K}_L^+ \ast \eta(\cdots \ast \eta(\hat{K}_1^+ \ast \hat{D}_0^+))), \qquad (7)$$
$$\tilde{D}_{-1}^- = \eta(\hat{K}_L^- \ast \eta(\cdots \ast \eta(\hat{K}_1^- \ast \hat{D}_0^-))), \qquad (8)$$

and construct $\tilde{D}_{-1} = \tilde{D}_{-1}^+ - \tilde{D}_{-1}^-$. Finally, the super-resolved image is obtained as $\tilde{X}_{-1} = \hat{X}_{-1} + \tilde{D}_{-1}$.
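Continuing the PyTorch sketch above (reusing pos_net, neg_net, split_pos_neg, upsample2, d1_pos, d1_neg, and d0_true), the single-image fitting loop and the finer-scale prediction of (7) and (8) might look as follows; the learning rate, iteration count, regularization weight, and the stand-in L1 sparsity term are illustrative assumptions, and differentiable versions of the contrast/smoothness terms would be added where indicated.

```python
import torch

opt = torch.optim.Adam(
    list(pos_net.parameters()) + list(neg_net.parameters()), lr=1e-3)  # ADAM [32]
lam1 = 1e-3  # assumed weight; lam2, lam3 would scale the contrast/smoothness terms

for _ in range(200):   # filters are learnt from this single image, no dataset
    opt.zero_grad()
    d0_tilde = pos_net(upsample2(d1_pos)) - neg_net(upsample2(d1_neg))
    c_cnn = torch.sum((d0_true - d0_tilde) ** 2)   # C_CNN, eq. (5)
    c_sparsity = d0_tilde.abs().sum()              # stand-in sparsity regularizer
    c_total = c_cnn + lam1 * c_sparsity            # + lam2*C_contrast + lam3*C_smoothness
    c_total.backward()
    opt.step()

# Eqs. (7)-(8): apply the learnt filters one scale finer to predict D_{-1}.
with torch.no_grad():
    d0_pos, d0_neg = split_pos_neg(d0_true)
    d_minus1 = pos_net(upsample2(d0_pos)) - neg_net(upsample2(d0_neg))
# The SR image is then x_hat_minus1 + d_minus1, where x_hat_minus1 is the
# upsampled input image.
```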
IV. EXPERIMENTAL RESULTS

We consider a three-layer CNN ($L = 3$). The first layer has sixteen $9 \times 9$ filters, the second layer consists of one hundred and twenty-eight $3 \times 3$ filters, and the third one has eight $5 \times 5$ filters. We apply the SR method to the Y channel of the YCbCr decomposition and upsample the other channels using bicubic interpolation. We validate on two standard databases, Set5 and Set14, which cover a variety of image classes.
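The luminance-only processing can be sketched as follows (our illustration); super_resolve_y is a hypothetical handle to the DCN pipeline above, and the OpenCV color conversion is an assumption.

```python
import cv2
import numpy as np

def super_resolve_color(bgr, super_resolve_y, scale=2):
    """SR on the luminance channel; chrominance is bicubically upsampled."""
    y, cr, cb = cv2.split(cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb))
    h, w = y.shape
    y_sr = super_resolve_y(y, scale)   # DCN applied to the Y channel
    size = (w * scale, h * scale)
    cr_up = cv2.resize(cr, size, interpolation=cv2.INTER_CUBIC)
    cb_up = cv2.resize(cb, size, interpolation=cv2.INTER_CUBIC)
    sr = cv2.merge([np.clip(y_sr, 0, 255).astype(np.uint8), cr_up, cb_up])
    return cv2.cvtColor(sr, cv2.COLOR_YCrCb2BGR)
```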
TABLE I
COMPARISON OF THE DCN METHOD WITH THE STATE-OF-THE-ART TECHNIQUES

                               Set5                                      Set14
Metric    Scale  Bicubic   SC     SRCNN  SelfExSR   DCN    Bicubic   SC     SRCNN  SelfExSR   DCN
PSNR       2×     27.93   28.17   28.37    28.49   29.56    24.46   24.86   24.78    24.85   25.40
           3×     25.78   26.74   26.35    26.45   26.50    23.00   23.60   23.38    23.47   23.62
mw-PSNR    2×     27.10   27.36   27.55    27.62   28.91    24.59   24.77   24.93    25.01   25.69
           3×     25.59   26.02   26.10    26.17   26.42    23.46   23.99   23.82    23.87   24.24
SSIM       2×     0.929   0.932   0.935    0.936   0.946    0.823   0.830   0.836    0.838   0.852
           3×     0.889   0.908   0.900    0.901   0.899    0.767   0.791   0.783    0.786   0.796
ms-SSIM    2×     0.973   0.979   0.976    0.976   0.983    0.948   0.951   0.953    0.953   0.963
           3×     0.945   0.950   0.955    0.955   0.957    0.907   0.927   0.921    0.921   0.931