PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models
Sachit Menon*, Alexandru Damian*, Shijia Hu, Nikhil Ravi, Cynthia Rudin
Duke University
Durham, NC
{sachit.menon,alexandru.damian,shijia.hu,nikhil.ravi,cynthia.rudin}@duke.edu
1. Introduction
…analysis methods (such as image segmentation, action recognition, or disease diagnosis) which depend on having high-resolution images [17] [19]. In addition, as consumer laptop, phone, and television screen resolution has increased over recent years, popular demand for sharp images and video has surged. This has motivated recent interest in the computer vision task of image super-resolution: the creation of realistic high-resolution (henceforth HR) images that a given low-resolution (LR) input image could correspond to.

While the benefits of methods for image super-resolution are clear, the difference in information content between HR and LR images (especially at high scale factors) hampers efforts to develop such techniques. In particular, LR images inherently possess less high-variance information; details can be blurred to the point of being visually indistinguishable. The problem of recovering the true HR image depicted by an LR input, as opposed to generating a set of potential such HR images, is inherently ill-posed, as the size of the total set of these images grows exponentially with the scale factor [2]. That is to say, many high-resolution images can correspond to the exact same low-resolution image.

Traditional supervised super-resolution algorithms train a model (usually a convolutional neural network, or CNN) to minimize the pixel-wise mean-squared error (MSE) between the generated super-resolved (SR) images and the corresponding ground-truth HR images [14] [7]. However, this approach has been noted to neglect perceptually relevant details critical to photorealism in HR images, such as texture [15]. Optimizing on an average difference in pixel space between HR and SR images has a blurring effect, encouraging detailed areas of the SR image to be smoothed out to be, on average, more (pixelwise) correct. In fact, in the case of mean squared error (MSE), the ideal solution is the (weighted) pixel-wise average of the set of realistic images that downscale properly to the LR input (as detailed later). The inevitable result is smoothing in areas of high variance, such as areas of the image with intricate patterns or textures. As a result, MSE should not be used alone as a measure of image quality for super-resolution.

Some researchers have attempted to extend these MSE-based methods to additionally optimize on metrics intended to encourage realism, serving as a force opposing the smoothing pull of the MSE term [15, 7]. This essentially drags the MSE-based solution in the direction of the natural image manifold (the subset of R^{M×N} that represents the set of high-resolution images). This compromise, while improving perceptual quality over pure MSE-based solutions, makes no guarantee that the generated images are realistic. Images generated with these techniques still show signs of blurring in high-variance areas of the images, just as in the pure MSE-based solutions.

To avoid these issues, we propose a new paradigm for super-resolution. The goal should be to generate realistic images within the set of feasible solutions; that is, to find points which actually lie on the natural image manifold and also downscale correctly. The (weighted) pixel-wise average of possible solutions yielded by the MSE does not generally meet this goal for the reasons previously described. We provide an illustration of this in Figure 2.

Figure 2. FSRNet tends towards an average of the images that downscale properly. The discriminator loss in FSRGAN pulls it in the direction of the natural image manifold, whereas PULSE always moves along this manifold.

Our method generates images using a (pretrained) generative model approximating the distribution of natural images under consideration. For a given input LR image, we traverse the manifold, parameterized by the latent space of the generative model, to find regions that downscale correctly. In doing so, we find examples of realistic images that downscale properly, as shown in Figure 1.

Such an approach also eschews the need for supervised training, being entirely self-supervised with no 'training' needed at the time of super-resolution inference (except for the unsupervised generative model). This framework presents multiple substantial benefits. First, it allows the same network to be used on images with differing degradation operators even in the absence of a database of corresponding LR-HR pairs (as no training on such databases takes place). Furthermore, unlike previous methods, it does not require super-resolution task-specific network architectures, which take substantial time on the part of the researcher to develop without providing real insight into the problem; instead, it proceeds alongside the state of the art in generative modeling, with zero retraining needed.

Our approach works with any type of generative model with a differentiable generator, including flow-based models, variational autoencoders (VAEs), and generative adversarial networks (GANs); the particular choice is dictated by the tradeoffs each makes in approximating the data manifold. For this work, we elected to use GANs due to recent advances yielding high-resolution, sharp images [12, 11].
Using generative adversarial networks, we explore the latent space to find regions that map to realistic images and downscale correctly. No retraining is required. Our particular implementation, using StyleGAN [12], allows for the creation of any number of realistic SR samples that correctly map to the LR input.
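To make this procedure concrete, the sketch below shows one way such latent-space exploration could be implemented with a generic pretrained generator. It is a minimal illustration under stated assumptions, not our released implementation: the `generator` object, the latent dimension, the plain MSE downscaling loss, and the simple spherical projection are placeholders standing in for the StyleGAN-specific details described in the supplementary material.

```python
import torch
import torch.nn.functional as F

def latent_space_search(lr_image, generator, latent_dim=512, scale_factor=8,
                        steps=500, step_size=0.1):
    """Sketch of PULSE-style super-resolution: search the latent space of a
    pretrained, differentiable `generator` (placeholder) for an HR image whose
    bicubic downscaling matches the LR input (a (1, 3, h, w) tensor in [0, 1])."""
    # Random starting latent. We keep the latent on a sphere (here of radius
    # sqrt(latent_dim), where an i.i.d. Gaussian concentrates) as an
    # illustrative stand-in for the spherical prior discussed in the paper.
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=step_size)

    for _ in range(steps):
        opt.zero_grad()
        sr = generator(z)                                   # candidate HR image
        ds = F.interpolate(sr, scale_factor=1.0 / scale_factor,
                           mode="bicubic", align_corners=False)
        loss = F.mse_loss(ds, lr_image)                     # downscaling consistency
        loss.backward()
        opt.step()
        # Re-project the latent onto the sphere after each gradient step.
        z.data = z.data * (latent_dim ** 0.5) / z.data.norm()

    return generator(z).detach()
```

In the actual setup, the generator would be StyleGAN's synthesis network and the objective would also include constraints such as the geodesic cross loss described in the supplementary material; the bicubic downscaling here corresponds to the known-operator setting.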
5. Robustness

The main aim of our algorithm is to perform perceptually realistic super-resolution with a known downscaling operator. However, we find that even for a variety of unknown downscaling operators, we can apply our method using bicubic downscaling as a stand-in for the more substantial degradations applied; see Figure 6. In this case, we provide only the degraded low-resolution image as input. We find that the output downscales approximately to the true, non-noisy LR image (that is, the bicubically downscaled HR) rather than to the degraded LR given as input. This is desired behavior, as we would not want to create an image that reproduces the applied degradation.

7. Conclusions

We have established a novel methodology for image super-resolution as well as a new problem formulation. This opens up a new avenue for super-resolution methods along different tracks than traditional, supervised work with CNNs. The approach is not limited to a particular degradation operator seen during training, and it always maintains high perceptual quality.

Acknowledgments: Funding was provided by the Lord Foundation of North Carolina and the Duke Department of Computer Science. Thank you to the Google Cloud Platform research credits program.
References

[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the International Conference on Computer Vision (ICCV), 2019.
[2] Simon Baker and Takeo Kanade. Limits on super-resolution and how to break them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 372–379. IEEE, 2000.
[3] Yijie Bei, Alexandru Damian, Shijia Hu, Sachit Menon, Nikhil Ravi, and Cynthia Rudin. New techniques for preserving global structure and denoising with low information loss in single-image super-resolution. In 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), Salt Lake City, UT, USA, June 18-22, 2018, pages 874–881. IEEE Computer Society, 2018.
[4] Yochai Blau, Roey Mechrez, Radu Timofte, Tomer Michaeli, and Lihi Zelnik-Manor. The 2018 PIRM challenge on perceptual image super-resolution. In The European Conference on Computer Vision (ECCV) Workshops, September 2018.
[5] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G. Dimakis. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.
[6] Adrian Bulat, Jing Yang, and Georgios Tzimiropoulos. To learn image super-resolution, use a GAN to learn how to do image degradation first. In Proceedings of the European Conference on Computer Vision (ECCV), pages 185–200, 2018.
[7] Yu Chen, Ying Tai, Xiaoming Liu, Chunhua Shen, and Jian Yang. FSRNet: End-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2492–2501, 2018.
[8] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), pages 184–199, 2014.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[10] Seokhwa Jeong, Inhye Yoon, and Joonki Paik. Multi-frame example-based super-resolution using locally directional self-similarity. IEEE Transactions on Consumer Electronics, 61(3):353–358, 2015.
[11] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. CoRR, abs/1710.10196, 2018. Appeared at the 6th International Conference on Learning Representations.
[12] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4401–4410, 2019.
[13] Deokyun Kim, Minseon Kim, Gihyun Kwon, and Dae-Shik Kim. Progressive face super-resolution via attention to facial landmark. In Proceedings of the 30th British Machine Vision Conference (BMVC), 2019.
[14] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4681–4690, 2017.
[16] A. Mittal, R. Soundararajan, and A. C. Bovik. Making a completely blind image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, March 2013.
[17] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[18] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[19] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.
[20] Amanjot Singh and Jagroop Singh Sidhu. Super resolution applications in modern digital image processing. International Journal of Computer Applications, 150(2), 2016.
[21] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[22] Roman Vershynin. Random Vectors in High Dimensions, pages 38–69. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
[23] Nannan Wang, Dacheng Tao, Xinbo Gao, Xuelong Li, and Jie Li. A comprehensive survey to face hallucination. International Journal of Computer Vision, 106(1):9–30, 2014.
[24] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, September 2018.
[25] Tianyu Zhao, Changqing Zhang, Wenqi Ren, Dongwei Ren, and Qinghua Hu. Unsupervised degradation learning for single image super-resolution. arXiv preprint arXiv:1812.04240, 2018.
PULSE: Supplementary Material
Figure 1. Further comparison of PULSE with bicubic upscaling, FSRNet, and FSRGAN.
distance between every pair of vectors input to S. This can be considered a simple form of relaxation on the original constraint that the 18 input vectors be exactly equal.

When v_1, ..., v_k are sampled from a sphere, it makes more sense to compare geodesic distances along the sphere. This is the approach we used in generating our results. Let θ(v, w) denote the angle between the vectors v and w. We then define the geodesic cross loss to be

GEOCROSS(v_1, \ldots, v_k) = \sum_{i<j} \theta(v_i, v_j)^2 .

Empirically, by allowing the 18 input vectors to S to vary while applying the soft constraint of the (geo)cross loss, we can increase the expressive potential of the network without large deviations from the natural image manifold.

2.4. Approximating the input distribution of S

StyleGAN begins with a uniform distribution on S^{511} ⊂ R^{512}, which is pushed forward by the mapping network to a transformed probability distribution over R^{512}. Therefore, another requirement to ensure that S([v_1, ..., v_{18}], η) is a realistic image is that each v_i is sampled from this pushforward distribution. While analyzing this distribution, we found that we could transform it back to a distribution on the unit sphere without the mapping network by simply applying a single linear layer with a leaky-ReLU activation, an entirely invertible transformation. We therefore inverted this function to obtain a sampling procedure for this distribution: first, we generate a latent w uniformly from S^{511}, and then apply the inverse of our transformation.

2.5. Noise Inputs

The second parameter of S controls the stochastic variation that StyleGAN adds to an image. When the noise is set to 0, StyleGAN generates smoothed-out, detail-free images. The synthesis network takes 18 noise vectors at varying scales, one at each layer. The earlier noise vectors influence more global features, for example the shape of the face, while the later noise vectors add finer details, such as hair definition. Our first approach was to sample the noise vectors before we began traversing the natural image manifold, keeping them fixed throughout the process. In an attempt to increase the expressive power of the synthesis network, we also tried to perform gradient descent on both the latent input and the noise input to S simultaneously, but this tended to take the noise vectors out of the spherical shell from which they were sampled and produced unrealistic images. Using a standard Gaussian prior forced the noise vectors towards 0, as mentioned in the main paper. We therefore experimented with two approaches for the noise input:

1. Fixed noise: Especially when upsampling from 16×16 to 1024×1024, StyleGAN was already expressive enough to upsample our images correctly, and so we did not need to resort to more complicated techniques.

2. Partially trainable noise: In order to slightly increase the expressive power of the network, we optimized on the latent and the first 5-7 noise vectors, allowing us to slightly modify the facial structure of the images we generated to better match the LR images.
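For reference, below is a minimal sketch of how the geodesic cross loss defined earlier in this section could be computed. The function name `geocross` and the tensor layout are our own choices for illustration, not the released code.

```python
import torch

def geocross(latents, eps=1e-7):
    """Geodesic cross loss: sum of squared angles theta(v_i, v_j) over all
    pairs i < j of the k latent vectors fed to the synthesis network S.
    `latents` has shape (k, d), e.g. (18, 512) for StyleGAN."""
    unit = latents / latents.norm(dim=1, keepdim=True)        # normalize each vector
    cos = (unit @ unit.t()).clamp(-1 + eps, 1 - eps)          # pairwise cosines
    theta = torch.acos(cos)                                   # pairwise angles
    i, j = torch.triu_indices(latents.shape[0], latents.shape[0], offset=1)
    return (theta[i, j] ** 2).sum()                           # sum over i < j
```

Penalizing the squared angle rather than the Euclidean distance respects the spherical geometry of the space from which the latent vectors are sampled.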
Figure 2. (x32) Additional robustness results for PULSE under additional degradation operators (downscaling followed by Gaussian noise (std = 25, 50); motion blur in random directions with length 100 followed by downscaling; and downscaling followed by salt-and-pepper noise with a density of 0.05).
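For completeness, the sketch below shows one way the degraded inputs described in this caption could be synthesized. Only the parameters (noise standard deviations 25 and 50, blur length 100, salt-and-pepper density 0.05) come from the caption; the kernels, value ranges, and helper names are assumptions for illustration.

```python
import numpy as np

def gaussian_noise(img, std):
    """Additive Gaussian noise on a 0-255 image (std = 25 or 50 in Figure 2)."""
    return np.clip(img + np.random.normal(0.0, std, img.shape), 0, 255)

def salt_and_pepper(img, density=0.05):
    """Set a `density` fraction of pixels to 0 or 255 (density = 0.05 in Figure 2)."""
    out = img.copy()
    mask = np.random.rand(*img.shape[:2])
    out[mask < density / 2] = 0.0
    out[(mask >= density / 2) & (mask < density)] = 255.0
    return out

def motion_blur(img, length=100, angle=None):
    """Blur with a line kernel of the given length in a random direction."""
    from scipy.ndimage import convolve, rotate
    if angle is None:
        angle = np.random.uniform(0, 180)
    kernel = np.zeros((length, length))
    kernel[length // 2, :] = 1.0
    kernel = rotate(kernel, angle, reshape=False)
    kernel /= kernel.sum()
    blurred = [convolve(img[..., c], kernel, mode="reflect") for c in range(img.shape[-1])]
    return np.stack(blurred, axis=-1)

# Example pipelines, assuming `hr` is an HxWx3 float array in [0, 255] and
# `downscale(img, factor)` is a bicubic downscaling helper (not shown):
# degraded_1 = gaussian_noise(downscale(hr, 32), std=25)          # downscale, then noise
# degraded_2 = downscale(motion_blur(hr, length=100), 32)         # blur, then downscale
# degraded_3 = salt_and_pepper(downscale(hr, 32), density=0.05)   # downscale, then S&P
```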
Figure 3. (64x) Sample 1
Figure 4. (64x) Sample 2
Figure 5. (64x) Sample 3
Figure 6. (64x) Sample 4