
Individual Recognition from Low Quality and Occluded Images and Videos Using GAN

SOUMYA SHUBHRA GHOSH


School of Electronics, Electrical Engineering and Computer Science

Supervisors: Prof. NEIL ROBERTSON & Dr. YANG HUA

A thesis submitted for the degree of

Doctor of Philosophy

July 19, 2021


Abstract

Face recognition in the wild is one of the most interesting problems in the
present world. Many algorithms with high performance have already been
proposed and applied in real-world applications. However, the problem of
recognising degraded faces from low-quality images and videos mostly re-
mains unsolved. A possible intuitive solution can be the recovery of finer
texture details and sharpness of degraded images with low resolution. In
recent times, we have seen a breakthrough in the perceptual quality of im-
age enhancement with the recovery of the fine texture details and sharpness
of degraded images. Several state-of-the-art algorithms, like ARCNN, IRCNN, and SRGAN, provide excellent results in the field of image enhancement. However, regarding face recognition, these algorithms fail miserably. This leaves two possibilities: either the perceptual quality enhancement achieved to date is not good enough, or perceptual quality enhancement is not the key to better face recognition performance. In this
thesis, we have explored both possibilities. Existing image enhancement
approaches often focus on minimising the pixel-wise reconstruction error,
which results in a high peak signal-to-noise ratio. However, they often fail
to match the quality expected in a photorealistic image. Furthermore, an
end-to-end solution for simultaneous superresolution of finer texture de-
tails and sharpness for degraded images with low resolution has not been
explored yet. Therefore, we proposed a versatile framework capable of
inferring photorealistic natural images with both artifact removal and su-
perresolution simultaneously. This includes a new loss function consisting
of a combination of reconstruction loss, feature loss, and an edge loss coun-
terpart. The feature loss helps to push the output image to the natural image
manifold and the edge loss preserves the sharpness of the output image. The
reconstruction loss provides low-level semantic information to the genera-
tor regarding the quality of the generated images. Our approach has been
experimentally proven to recover photorealistic textures from heavily com-
pressed low-resolution images on public benchmarks. However, even with
this algorithm, the recognition performance could not be improved. This
motivated us in designing an algorithm which is specific to the enhance-
ment of faces for better recognition performance. Thus, we present a ver-
satile GAN capable of recovering facial features from degraded videos and
images. Specifically, we use an effective method involving metric learn-
ing and different loss function components operating on different parts of
the generator. The proposed method helps to push the facial features of
the output image to the cluster containing faces of the same identity, and
at the same time, increases the angular margin between different identities.
In other words, it enhances the degraded faces by restoring their lost facial
features rather than their perceptual quality, which ultimately leads to better
performance of any existing face recognition algorithm. Our approach has
been experimentally proven to enhance face detection and recognition, e.g.,
the face detection rate is improved by 3.08% for S3FD [1] and the area un-
der the ROC curve for recognition is improved by 2.55% for ArcFace [2],
evaluated on the SCFace dataset. The limitation of using a single image is that
there is an upper bound to which we can enhance the images. Any informa-
tion that is not present cannot be generated out of thin air. This motivated
us to develop the idea of synthesising the frontal face pose from multiple
occluded images of the same identity, taken at different times and from dif-
ferent angles. This helps to further push the upper bound. Our network
crowdsources useful information from all images and rejects information
that is not useful for recognition. We build our generator on top of TPGAN
and use the concept of a U-net discriminator, which evaluates the output at both a global and a local level, thus encouraging the generated image to be consistent at both scales.
Acknowledgements

First of all, I would like to thank my supervisor, Prof. Neil Robertson, for his guidance and support during my PhD research. I am pleased to have such a unique mentor. I would also like to express my sincere gratitude to my second supervisor, Dr. Yang Hua, for his constant support, encouragement and guidance. He helped me a lot to make my publications stronger. I would also like to thank my friend Dr. Sankha Subhra Mukherjee, who supported me in all possible ways during my PhD and definitely made my life easier. I am
also very grateful to Anyvision for offering me the opportunity to conduct
my research with an exposure to its cutting-edge technology, state-of-the-
art hardware, good food, drinks, parties and an amazing group. Thanks to
John, Jerry and Dimitrios, who evaluated my work every year and helped
me progress through my research by providing useful feedback. A special thanks goes to my colleagues and friends at Anyvision, Alessandro,
David, Piotr, Tomaš, Romain, Elyor, Guosheng, Steven, Tanya and the en-
tire group of colleagues from Queen’s University Belfast, particularly my
dearest friends Rachael Abbott, Andrew Moyes, Stuart Millar, Dr. Xinshao
Wang, Dr. Waqar Saadat, Dr. Umair Naeem and Dr. Sumit Raurale, who
have all contributed significantly to make this journey much more fun. Last
but certainly not least, I am grateful to my beloved family, my friends and my beloved wife, Suparna Chowdhury, for their endless support and trust. Without their support, life during these few years would not have been as easy.
Table of Contents

Table of Contents iv

List of Tables ix

List of Figures xi

1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation, Approach, Importance and Impact . . . . . . . . . . . . . . 2
1.2.1 Basic Design Principles and Pros of our Proposed Method . . . 5
1.3 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Thesis Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Background 11
2.1 A short history of neural networks . . . . . . . . . . . . . . . . . . . . 11
2.2 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Convolutional Neural Networks (CNN) . . . . . . . . . . . . . . . . . 14
2.4 Different CNN Architectures . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 LeNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.3 VGGNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.4 GoogLeNet/Inception v1 . . . . . . . . . . . . . . . . . . . . . 17
2.4.5 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.6 Generative Adversarial Networks (GAN) . . . . . . . . . . . . 19
2.4.7 Conditional GANs . . . . . . . . . . . . . . . . . . . . . . . . 20


2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25


2.5.1 Intelligent Object Detection . . . . . . . . . . . . . . . . . . . 25
2.5.2 GANs for Face Generation . . . . . . . . . . . . . . . . . . . . 26
2.5.3 Artifact Removal from Images and Videos . . . . . . . . . . . . 26
2.5.4 Image Noise Removal . . . . . . . . . . . . . . . . . . . . . . 28
2.5.5 Image Super-resolution . . . . . . . . . . . . . . . . . . . . . . 28
2.5.6 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5.7 Perceptual Image Quality Evaluation . . . . . . . . . . . . . . 30
2.5.8 Deep Metric Learning . . . . . . . . . . . . . . . . . . . . . . 31
2.5.9 Face Recognition and Metric Learning . . . . . . . . . . . . . . 32
2.5.10 Facial Occlusion and Synthesis . . . . . . . . . . . . . . . . . 34
2.6 Critical Thinking and Analysis: Summary . . . . . . . . . . . . . . . . 35

3 Multi-purpose Perceptual Quality Image Enhancement Using Generative Adversarial Network 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.1 Image Artifact Removal . . . . . . . . . . . . . . . . . . . . . 39
3.2.2 Image Super-resolution . . . . . . . . . . . . . . . . . . . . . . 40
3.2.3 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.4 Perceptual Image Quality Evaluation . . . . . . . . . . . . . . 41
3.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.2 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Feature Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Edge Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Reconstruction Loss . . . . . . . . . . . . . . . . . . . . . . . 45
Learning Parameters . . . . . . . . . . . . . . . . . . . . . . . 46
Final Loss Function . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.1 Data, Evaluation Metrics and Implementation Details . . . . . . 47
3.4.2 Training Details . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.4.3 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 49


3.4.4 Comparison to State of the Art . . . . . . . . . . . . . . . . . . 54


3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4 GAN Based Simultaneous Super-resolution and Noise Removal of Images Using Edge Information 56
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.1 Image Artifact Removal . . . . . . . . . . . . . . . . . . . . . 59
4.2.2 Image Noise Removal . . . . . . . . . . . . . . . . . . . . . . 59
4.2.3 Image Super-resolution . . . . . . . . . . . . . . . . . . . . . . 60
4.2.4 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.2.5 Perceptual Image Quality Evaluation . . . . . . . . . . . . . . 61
4.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.1 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Feature Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Edge Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
How Edge Loss Works . . . . . . . . . . . . . . . . . . . . . . 64
Reconstruction Loss . . . . . . . . . . . . . . . . . . . . . . . 66
Final Loss Function . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4.1 Data, Evaluation Metrics and Implementation Details . . . . . . 68
4.4.2 Analysis of Experimental Results . . . . . . . . . . . . . . . . 75
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5 Improving Recognition of Degraded Faces by Discriminative Feature Restoration Using Generative Adversarial Network 79
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2 Facial Feature Restoration Using GAN . . . . . . . . . . . . . . . . . . 81
5.2.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2.2 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . 82
5.2.3 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Constrained Angular Metric Learning of Hard and Semi-Hard
Identity Features . . . . . . . . . . . . . . . . . . . . 85


Identity Feature Loss . . . . . . . . . . . . . . . . . . . . . . . 86


Pixel-wise Identity Loss . . . . . . . . . . . . . . . . . . . . . 87
Perceptual Loss . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Adversarial Loss . . . . . . . . . . . . . . . . . . . . . . . . . 88
Final Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3.1 Data & Experimental Setup . . . . . . . . . . . . . . . . . . . 90
5.3.2 Results and Observations . . . . . . . . . . . . . . . . . . . . . 90
Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . 93
Qualitative Results . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

6 Synthesis of Frontal Face Pose by Incremental Addition of Information from Multiple Occluded Images using GANs 98
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.3.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.3.2 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Part Rotation Network . . . . . . . . . . . . . . . . . . . . . . 107
6.3.3 Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Generator Global Network . . . . . . . . . . . . . . . . . . . . 108
Generator Local Network . . . . . . . . . . . . . . . . . . . . 111
Fusion Network . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3.4 Discriminator . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Encoder Part . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Decoder Part . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.3.5 Generator Loss Functions . . . . . . . . . . . . . . . . . . . . 112
Pixel-wise Loss . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Symmetry Loss . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Identity Feature Loss . . . . . . . . . . . . . . . . . . . . . . . 113
Total Variation Regularisation . . . . . . . . . . . . . . . . . . 114
Adversarial Loss . . . . . . . . . . . . . . . . . . . . . . . . . 114


6.3.6 Discriminator Loss Functions . . . . . . . . . . . . . . . . . . 115


Global Decision Loss . . . . . . . . . . . . . . . . . . . . . . . 115
Local Decision Loss . . . . . . . . . . . . . . . . . . . . . . . 116
Consistency Regularisation . . . . . . . . . . . . . . . . . . . . 116
6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.4.1 Data & Experimental Setup . . . . . . . . . . . . . . . . . . . 118
6.4.2 Ablation Studies . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.4.3 Final Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

7 Conclusion and Future Work 126


7.1 Future Work and Other Areas of Application . . . . . . . . . . . . . . . 128
7.1.1 Astronomical Photography . . . . . . . . . . . . . . . . . . . . 128
7.1.2 Automatic License Plate Recognition . . . . . . . . . . . . . . 128
7.1.3 Unwanted Object Removal . . . . . . . . . . . . . . . . . . . . 129

A Author’s Publications 130

References 131

List of Tables

3.1 Performance of SR and AR for different edge detectors . . . . . . . . . 49


3.2 Performance of AR with different discriminators and loss functions . . 50
3.3 Performance of IEGAN for JPEG artifact removal compared to other
state-of-the-art algorithms . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Super-resolution performance of state-of-the-art algorithms . . . . . . . 51
3.5 Performance of IEGAN for simultaneous artifact removal and super-
resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.1 Performance of our algorithm for simultaneous Gaussian noise removal


and 4x super-resolution . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 Performance of our algorithm for simultaneous Poisson noise removal
and 4x super-resolution . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Performance of our algorithm for simultaneous salt and pepper noise
removal and 4x super-resolution . . . . . . . . . . . . . . . . . . . . . 72
4.4 Performance of our algorithm for simultaneous speckle noise removal
and 4x super-resolution . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 Performance of our algorithm for simultaneous JPEG artifact removal
and 4x super-resolution . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.6 Performance of our algorithm for simultaneous 4x super-resolution, Salt
and pepper noise removal and JPEG artifact removal on BSD100, LIVE1
and Set14 dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.1 Cosine similarity scores w.r.t. gallery for input images and enhanced
images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Extension of Table 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 93


5.3 Fréchet Inception Distance and Inception Scores for different algorithms
for the SCFace dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4 True Accept Rate (TAR) at different False Accept Rate (FAR) with dif-
ferent settings for the network . . . . . . . . . . . . . . . . . . . . . . 94
5.5 Detection rates for different algorithms for different image enhancement
algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.6 Comparison of the recognition performance of our method with the
state-of-the-art artifact and noise removal algorithms . . . . . . . . . . 94

6.1 Recognition performance for 15 input images . . . . . . . . . . . . . . 119


6.2 Recognition performance for 15 input images . . . . . . . . . . . . . . 120
6.3 Recognition performance for 35 input images . . . . . . . . . . . . . . 120
6.4 Recognition accuracy evaluated on different face recognition networks
for our output image . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

List of Figures

1.1 A typical surveillance room scenario . . . . . . . . . . . . . . . . . . . 2


1.2 Output images for image enhancement pipeline . . . . . . . . . . . . . 3
1.3 Face recognition in action . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Synthesis of frontal face pose from multiple occluded images . . . . . . 6

2.1 The history of neural networks . . . . . . . . . . . . . . . . . . . . . . 12


2.2 The structure of an artificial neuron. . . . . . . . . . . . . . . . . . . . 14
2.3 Block diagram of LeNet architecture. Source: https://medium.com/
@pechyonkin/key-deep-learning-architectures-lenet-5-6fc3c59e6f4 16
2.4 Block diagram of AlexNet architecture. Source: https://medium.
com/coinmonks/paper-review-of-alexnet-caffenet-winner-in-
ilsvrc-2012-image-classification-b93598314160 . . . . . . . 17
2.5 Block diagram of VGG-16 architecture. Source: https://neurohive.
io/en/popular-networks/vgg16/ . . . . . . . . . . . . . . . . . . 18
2.6 Block diagram of a simple GAN . . . . . . . . . . . . . . . . . . . . . 20
2.7 Hand written digits generated by a simple GAN . . . . . . . . . . . . . 20
2.8 Faces generated by a simple GAN . . . . . . . . . . . . . . . . . . . . 21
2.9 Cityscapes generated by a conditional GAN . . . . . . . . . . . . . . . 21
2.10 Block diagram of GoogLeNet architecture. Source: https://developer.
ridgerun.com/wiki/index.php?title=GstInference/Supported_
architectures/InceptionV2 . . . . . . . . . . . . . . . . . . . . . 23
2.11 Block diagram of VGG-19 and ResNet architecture. Source: https://
mc.ai/what-are-deep-residual-networks-or-why-resnets-are-
important/ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24


3.1 Architecture of our proposed IEGAN network . . . . . . . . . . . . . . 43


3.2 Comparison of edge detection for Canny and HED . . . . . . . . . . . 49
3.3 Results of JPEG artifact removal for different algorithms . . . . . . . . 50
3.4 Super-resolution results for different algorithms . . . . . . . . . . . . . 52
3.5 Results for simultaneous super-resolution and artifact removal of RGB
images using various algorithms . . . . . . . . . . . . . . . . . . . . . 53

4.1 Before and after image enhancement . . . . . . . . . . . . . . . . . . . 57


4.2 Canny edges for input, output and ground truth image . . . . . . . . . . 65
4.3 Visual results for 4x super-resolution + Salt and Pepper noise removal
+ JPEG artifact removal of RGB images combining various algorithms . 69
4.4 Results for 4x super-resolution and Gaussian noise removal of RGB
images combining various algorithms . . . . . . . . . . . . . . . . . . 76

5.1 The generator architecture of the proposed network . . . . . . . . . . . 82


5.2 Visual examples of the outputs from different network components . . . 84
5.3 Illustration of our proposed metric learning . . . . . . . . . . . . . . . 87
5.4 Change of the value of different loss function components with itera-
tions during training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.5 A qualitative comparison of the image level recovery compared to arti-
fact and noise removal algorithms . . . . . . . . . . . . . . . . . . . . 92
5.6 ROC for ArcFace on SCFace dataset . . . . . . . . . . . . . . . . . . . 92
5.7 ROC for CosFace on SCFace dataset . . . . . . . . . . . . . . . . . . . 95
5.8 ROC for FaceNet on SCFace dataset . . . . . . . . . . . . . . . . . . . 96
5.9 A closer look at the ROC curve for different enhancement algorithms . . 97

6.1 Synthesis of frontal face pose from multiple occluded images . . . . . . 99


6.2 Block diagram of the proposed network . . . . . . . . . . . . . . . . . 106
6.3 Alignment, cropping and segmentation . . . . . . . . . . . . . . . . . 107
6.4 Network architecture of the facial landmarks part rotation network . . . 109
6.5 Network architecture of the generator . . . . . . . . . . . . . . . . . . 110
6.6 CutMix augmentation with real and fake images . . . . . . . . . . . . . 117
6.7 Improvement of output image with number of input images . . . . . . . 121
6.8 Change of accuracy with the number of input images . . . . . . . . . . 123

Chapter 1

Introduction

1.1 Introduction
Face recognition in real world scenarios is not an easy task to solve as various factors
like occlusion, extreme head pose, illumination, etc. come into play. In this thesis, we
have solved the problem to a significant extent by gathering relevant facial information
from multiple different images of the same identity. This research takes us a step further
towards autonomous security and surveillance where the recognition performance can
be improved significantly. Additionally, this method works in conjunction with any face
detection or recognition algorithm to boost their performance.
Living in an era with an ever-increasing population in ever-increasing urban areas,
security and the preservation of law and order have become among the most important social concerns. Over the last decade, the number of surveillance cameras has grown to unprecedented levels, both in public and private spaces. Thus, face recognition has become an
important biometric trait used in identifying human individuals. Face recognition is one
of the most critical activities forensic examiners conduct when there is a video or pic-
ture available from the crime scene during their investigation. Although automatic face
recognition and verification systems are used, the examiners still manually examine facial images or videos for a match against a large suspect identification photograph
database. In airports, shopping malls, and other important indoor environments, the task


Figure 1.1: A typical surveillance room scenario. Source: https://www.americanguardservices.com/security-services/video-surveillance-support

of surveillance is mostly carried out manually by security personnel on duty. A typical surveillance and monitoring room scenario is shown in Figure 1.1, which gives us
an idea of the complexity of the job. The task of monitoring the environment not only
consists of face recognition and verification, but also includes person reidentification
and object detection. This method of manual monitoring is expensive and it is neither
practical nor scalable. Thus, automation in this field of application is highly desired.
This led to a plethora of algorithms involving face recognition, person reidentification,
object detection, and so on.

1.2 Motivation, Approach, Importance and Impact


Face or object detection and recognition from low-quality images are challenging but
highly demanded in real-world applications. These problems are particularly prominent
for highly compressed images and videos, where details are typically absent. This is
most common in surveillance videos where the camera sensors are not of the best qual-
ity. Good detection and recognition performance is highly desirable in these scenarios.
The problem of face detection and recognition becomes essential in high security areas
such as, but not limited to, government offices, military bases, banks, or border posts.
Normally, there are security personnel monitoring the live CCTV footage of multiple
cameras. As we see in Figure 1.1, the job of the person monitoring the videos is not


Figure 1.2: Example of output images when a pipeline of image enhancement algorithms is used (panels, left to right: Input, Bicubic, ARCNN+SRGAN, SRGAN+ARCNN, Ground Truth). Details in Chapter 3. Best viewed in color.

easy and is prone to human error, especially when such a task must be performed for
long hours. Any mistake in this situation can be expensive. Thus, it is important to
determine a solution to close this gap in the field of security, and this research plays a
significant role in doing so.
In the last few years, many robust face detection and recognition algorithms have
been proposed. They have very competitive results on benchmark datasets. However, it
has been observed that the performance of these algorithms falls with degrading image
quality. The biggest challenge lies in a real world scenario, where, unlike a benchmark
dataset, the input images are uncontrolled, the quality of the images varies, and faces often
have occlusions or nonfrontal poses. Unfortunately, those algorithms perform poorly in
these scenarios. This motivates us to find a solution to boost the performance of these
algorithms in the real world scenario. A naive approach to solving this problem is to
use image enhancement algorithms and enhance the quality of these images in the hope
of better recognition performance. Hence, our first step to approach this problem is by
exploring photorealistic image enhancement.
Photo-realistic image enhancement can be broadly classified into three domains:
(1) Super-resolution (SR) [3, 4] focuses on the task of estimating a high-resolution im-
age from its low-resolution counterpart; (2) Noise removal (NR) [5, 6] aims at cleaning
different noises present in the image, e.g., Gaussian noise, Poisson noise, etc.; (3) Ar-
tifact removal (AR) [7, 8] concentrates on estimating an artifact-free sharp image from
its corrupted counterpart, e.g., JPEG artifacts.
A possible intuitive solution can be the recovery of finer texture details and sharp-
ness of degraded images with low resolution. In recent times, we have seen break-
throughs in the perceptual quality of image enhancement with the recovery of the fine


Figure 1.3: Face recognition in action. Source: https://spectrum.ieee.org/news-from-around-ieee/the-institute/ieee-member-news/why-san-francisco-banned-the-use-of-facial-recognition-technology

texture details and sharpness of degraded images. Several state-of-the-art algorithms, like ARCNN [7], IRCNN [8], SRGAN [3], etc., have proven highly effective in the field of image enhancement. These methods can of course be used in real world scenarios
and they help in better object detection since object detection relies on the structural
and geometrical feature consistency at a macro level. In real world scenarios, it is of utmost importance that these algorithms are fast enough to cope with the speed at which the scene within the camera's field of view is changing. Since most surveillance cameras are not designed in a
way to produce pin-sharp image quality, an object which is far away from the camera
typically needs two-fold processing - super-resolution and artifact/noise removal. With
the current state-of-the-art, this means passing the image through two models, one for
superresolving the image and another to clean the noise and artifacts.
This scenario is not only limited to surveillance applications but also to the images
found on the internet, for example, in social media sites like Twitter or in different
websites, where images are downsampled and compressed before storing them to save
memory footprint. Although it may not sound intuitive, passing images through two different image enhancement models can lead to disastrous results in terms of quality


and we have shown this in Figure 1.2 (details can be found in Chapters 3 and 4). This
is in addition to a higher total computation time which is a direct summation of the
computation time of each algorithm in the pipeline.
One of the biggest questions is how effective these methods are for face recog-
nition. Although macro-level features can be useful in object detection, in the case of
face recognition, the detailed features of faces are essential and the features involved in
the high perceptual quality of the images may not directly map to the micro-level features
required for successful face recognition or verification. Our research shows that percep-
tual image enhancement actually removes a lot of information from the face which is
important for face recognition. Although the images look better to the human eye after
enhancement, the recognition performance falls.

1.2.1 Basic Design Principles and Pros of our Proposed Method


Considering the above issues, our primary requirement is to enhance the image in such a way that the recognition performance increases. Thus, we
designed a generative adversarial network which is capable of capturing the essential
facial features required for good recognition performance. First, we tried general (per-
ceptual) image enhancement techniques, but they did not perform well for face recogni-
tion. Therefore, we took a different route and tried to solve the problem using feature
enhancement techniques. In our initial method, we use a single image as an input to
our network. However, that limits the input information, especially when the input im-
age has occlusion or extreme head pose. Since faces are typically captured and recognised from video, we can use multiple frames to overcome that problem. Therefore, next, we
designed another generative adversarial network which can take in multiple images as
input. The network then captures the relevant information from each image and aggre-
gates them to generate a single frontal facial image of the identity.
Compared to existing approaches, the primary and biggest advantage of our method
is its flexibility. This algorithm can work in conjunction with any other face recognition
algorithm to boost their performance. This includes all existing algorithms and algo-
rithms yet to be proposed. Since our method improves the input image, the enhanced
image can be used as an input to the face recognition algorithms to improve the overall
recognition performance. Secondly, it solves the problem of recognising occluded faces


Figure 1.4: Example of synthesis of frontal face pose from multiple occluded images
using GANs where partial information from multiple faces is aggregated to create the
final output. Details in Chapter 6. Best viewed in color.

or extreme head poses to a significant extent by taking advantage of the multiple input
images from the video feed. Thirdly, the concept of using aggregated information from
multiple images to generate a single image can be used in other fields of application,
which we have described in detail in Chapter 7.

1.3 Goals
The primary goal of this research is twofold. First, and most important, is to enhance face recognition performance for faces on which the current state-of-the-art is limited due to the noise, occlusions, and pose limitations present in the facial images. It is important to understand in which scenario an algorithm works and
where it fails. As discussed in the last section, the state-of-the-art algorithms perform
well in benchmark datasets, but fail to keep up in real world scenarios. Instead of de-
signing a new face recognition algorithm, our goal is to enhance the image so that it can
work with any existing algorithm, or even with future algorithms which are yet to be
invented, in real world scenarios. It is very important for these algorithms to perform
well in practical scenarios so that we can have a better and safer world.
Second, we want to gain a low-level understanding of how face
recognition works and what is important for recognition algorithms to perform suc-
cessfully. Understanding this information is important so that it can facilitate future
researchers to understand what features and properties of faces are important for a face
recognition algorithm to perform well in environments which are beyond our control.
While keeping our primary goals in mind, we also aim to design algorithms that


are robust to various kinds of noise, illumination conditions and ethnicities, and fast enough to find practical use in daily-life scenarios.

1.4 Thesis Roadmap


With the investigation boundaries drawn so far, we have identified that there is a gap in
face recognition research in regard to recognition from low-quality images, particularly
for images with a high amount of noise and artifacts due to compression or inherent
low-quality sensors from cheap cameras. This gap also extends to images with extreme
poses and occlusions where parts of the face are covered with sunglasses, scarves, hats,
or any other external interference.
In Chapter 2, we initially review a short but relevant background for understanding
general deep learning which includes artificial neural networks, convolutional neural
networks and certain terms related to deep learning. We also provide an overview of
Generative Adversarial Networks (GAN) which we will be using as a base through-
out our research. Next, we discuss the research scope of general image enhancement,
which includes image superresolution, noise removal and artifact removal, including a
short overview of different innovative loss functions used for fast and effective conver-
gence of the algorithms. Since perceptual quality evaluation of images from GAN is a
debatable issue, we have reviewed some of the works related to this topic. Deep metric learning has proven to be successful in recent years in the field of face recognition. We have provided a short overview of this and the research conducted
using metric learning in the field of face recognition.
Chapter 3 deals with super-resolution and artifact removal of images. We present
Image Enhancement Generative Adversarial Network (IEGAN), a framework capable
of inferring photo-realistic natural images for both artifact removal and superresolution
simultaneously. Moreover, we propose a new loss function consisting of a combination
of reconstruction loss, feature loss, and an edge loss counterpart. The feature loss helps
to push the output image to the natural image manifold and the edge loss preserves
the sharpness of the output image. The reconstruction loss provides low-level semantic
information to the generator regarding the quality of the generated images compared
to the original. Our approach has been experimentally proven to recover photorealistic
textures from heavily compressed low-resolution images on public benchmarks and our


proposed high-resolution World100 dataset. Our contributions are as follows:

• We propose the first end-to-end network called Image Enhancement Generative Adversarial Network (IEGAN) which can solve the problem of SR and AR simul-
taneously. Our proposed network is able to generate photorealistic images from
low-resolution images corrupted with artifacts, i.e., it acts as a unified framework
which simultaneously super-resolves the image and recovers it from the compres-
sion artifacts.

• We propose a new and improved perceptual loss function which is the sum of the
reconstruction loss of the discriminator, the feature loss from the VGG network
[9] and the edge loss from the edge detector. This novel loss function preserves
the sharpness of the enhanced image, which is often lost during enhancement.
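To make the composition of this objective concrete, the following is a minimal PyTorch sketch of how the three terms could be combined. It is an assumption-laden illustration, not the IEGAN implementation: the weights lambda_feat and lambda_edge are placeholders, the reconstruction term is taken here as a simple pixel-wise L1 distance, and a finite-difference gradient stands in for the edge detector discussed in Chapter 3.

```python
# Illustrative sketch only: one plausible way to combine reconstruction, feature
# and edge terms. The weights and the exact reconstruction term are assumptions,
# not the settings used in IEGAN (see Chapter 3 for the actual loss).
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG feature extractor for the feature (perceptual) loss.
vgg_features = vgg19(pretrained=True).features[:36].eval()
for p in vgg_features.parameters():
    p.requires_grad = False

def feature_loss(fake, real):
    # Pushes the output towards the natural image manifold in VGG feature space.
    return F.mse_loss(vgg_features(fake), vgg_features(real))

def edge_loss(fake, real):
    # Finite-difference gradients as a simple stand-in for an edge detector.
    def grads(img):
        return img[..., :, 1:] - img[..., :, :-1], img[..., 1:, :] - img[..., :-1, :]
    (fdx, fdy), (rdx, rdy) = grads(fake), grads(real)
    return F.l1_loss(fdx, rdx) + F.l1_loss(fdy, rdy)

def generator_loss(fake, real, lambda_feat=1.0, lambda_edge=1.0):
    reconstruction = F.l1_loss(fake, real)  # pixel-wise term (assumed)
    return reconstruction + lambda_feat * feature_loss(fake, real) \
                          + lambda_edge * edge_loss(fake, real)
```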

However, we found that a drawback of this method is that it fails to clean the image
in the presence of random noise. This led to our next investigation.
In Chapter 4, we present a further versatile framework capable of superresolving and
removing various kinds of noises and artifacts from images simultaneously in a single
end-to-end framework, inferring photorealistic natural images. This offers a substantial
advantage since it is computationally much less expensive and faster compared to a cas-
cade of several networks solving each problem separately. Our algorithm contains an
effective method involving edge reconstruction using a generative adversarial approach.
Our approach has been experimentally proven to recover photorealistic textures from
low-resolution, heavily compressed, and noisy images on several public benchmark
datasets. For example, with 4x superresolution and removal of speckle noise on the
LIVE1 dataset [10], the HaarPSI score [11] for the output images from our framework
is 12.8% better than the next best performing pipeline which is a cascade of SRGAN [3]
and IRCNN [8]. To sum up, our main contribution is threefold:

• To the best of our knowledge, we are the first to propose an end-to-end multipurpose image enhancement framework based on GAN, which is able to generate photo-
realistic images from low-resolution images corrupted with noise and artifacts.

• We introduce an edge detection counterpart in our objective function. It helps preserve the sharpness and details of the enhanced images, which plays an important role in generating photorealistic images (see the sketch after this list).


• Extensive ablation studies and experiments show the effectiveness and superiority
of our proposed framework on several datasets.
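As a concrete illustration of the edge-detection counterpart, the sketch below penalises the difference between the edge maps of the output and the ground truth. The choice of a Sobel operator here is an assumption made purely so the example stays short and differentiable; Chapters 3 and 4 evaluate dedicated edge detectors such as Canny and HED.

```python
# Sketch of an edge term in the objective. A Sobel operator is used here purely
# for illustration; the thesis evaluates dedicated edge detectors (e.g. Canny, HED).
import torch
import torch.nn.functional as F

_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)

def sobel_edges(img):
    # img: (B, C, H, W); per-channel gradient magnitude via depthwise convolution.
    c = img.shape[1]
    gx = F.conv2d(img, _SOBEL_X.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, _SOBEL_Y.repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def edge_loss(generated, ground_truth):
    # Penalising edge-map differences encourages the generator to keep sharp details.
    return F.l1_loss(sobel_edges(generated), sobel_edges(ground_truth))
```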

However, when we applied this algorithm to enhance faces for better face recognition,
we found that the performance actually became worse. This leads to an important conclusion: human visual perception can be tricked, and perceptual quality is not the answer to solving low-level computer vision problems. This motivated our next
research.
In Chapter 5, we present a deep generative adversarial network for facial feature
enhancement that sets a new state-of-the-art for the detection and recognition of low-
quality images by restoring their facial features. We have highlighted some limitations
of the existing algorithms and overcame those by reconstructing the facial features and
increasing the separation between different classes. Specifically, we have explored the
possibility of improving face recognition performance on those images for which the
current state-of-the-art methods fail to deliver good results. Our proposed loss function
works very well on real-world images, which are inherently of low quality. We
use metric learning to train the network to generate facial images preserving their iden-
tity, which is more important for improving face detection and recognition performance
rather than perceptual quality, and our proposed algorithm successfully demonstrates
that. Our approach has been experimentally proven to enhance face recognition perfor-
mance on several public benchmarks, e.g., the area under the ROC curve for face recog-
nition is improved by 4.2% on the SCFace [12] dataset when used in conjunction with
ArcFace [2] compared to the recognition performance of ArcFace alone. The detection
performance has been increased by 3.08% when used in conjunction with S3FD [1] on
the SCFace dataset compared to detection using S3FD alone. In a nutshell, our novel
framework offers

• enhancement of facial features for very low quality images,

• a constrained angular metric learning method that learns to reconstruct discriminative facial features (see Section 5.2.3; a generic sketch follows this list),

• and a weighted combination of different losses operating at various stages of the


network to recover discriminative facial features leading to better performance.
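The following is a generic illustration of the kind of identity-preserving feature loss this chapter builds on, namely pulling the embedding of the enhanced face towards the embedding of its ground-truth identity. It is not the constrained angular metric learning of Section 5.2.3, and face_embedder is a placeholder for any pretrained face recognition network.

```python
# Generic illustration only, not the thesis's constrained angular metric learning
# (Section 5.2.3). `face_embedder` is a placeholder for a pretrained recognition
# network (e.g. an ArcFace-style embedder) that maps faces to feature vectors.
import torch
import torch.nn.functional as F

def identity_feature_loss(face_embedder, enhanced, ground_truth):
    e = F.normalize(face_embedder(enhanced), dim=1)      # unit-length embeddings
    g = F.normalize(face_embedder(ground_truth), dim=1)
    # 1 - cosine similarity: small when the enhanced face preserves its identity.
    return (1.0 - (e * g).sum(dim=1)).mean()
```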


Although this research provided an answer to the question of facial feature enhance-
ment, the conclusion was that the recognition performance was bounded by an upper
limit when a single image is used. This motivated us to conduct further research and led to
our next work.
Chapter 6 deals with the synthesis of a frontal face pose from multiple occluded im-
ages using GANs where partial information from multiple faces is aggregated to create
the final output. In this chapter, we present a deep generative adversarial network which
takes multiple faces as input, each having different poses and occlusion, and combines
the useful information from each of the images and synthesises an output which closely
resembles a suspect identification photograph, i.e. a facial image having a frontal pose
with no occlusions or facial expressions. We use the architecture of TPGAN [13] as
our base and build on that to create our final generator network. We also use a U-net
discriminator which evaluates the output of the generator from a local and global per-
spective. To sum up, the main contributions in this chapter are as follows.

• To the best of our knowledge, we are the first to propose a GAN based solution
to remove facial occlusion by reconstructing the occluded area. Our method syn-
thesises the frontal face pose by extraction of useful information from multiple
occluded images.

• We have shown the working principles, behaviour, and limitations of our proposal
by drawing on concepts from information theory.

• We have verified and shown through images and experiments that an increased
number of input images improves the quality of the output.

Finally, in Chapter 7, we draw the conclusion and the future possibilities and exten-
sion of this research.

Chapter 2

Background

This chapter introduces some of the previous work that is most relevant to my research.
I have provided some preliminary background of artificial neural networks and how they
led to deep neural networks, including some well-known deep network architectures.
I review the contributions in the field of generative adversarial networks, particularly in
the field of noise removal, artifact removal, and superresolution and how they affect face
recognition. I have also provided a critique of existing methods by covering as much
relevant material as possible without too much historical context to keep the review
concise. Finally, I have made a comprehensive discussion on the gap between the state-
of-the-art and the proposed area of research.

2.1 A short history of neural networks


The very first step towards the mathematical model of the artificial neuron we use today
was taken in 1943 by McCulloch and Pitts by mimicking the functionality of a bio-
logical neuron. In the 1960s, an American psychologist called Frank Rosenblatt came
up with the idea of perceptron, which was inspired by the Hebbian theory of synaptic
plasticity, which is the adaptation of brain neurons during the learning process. In the
1980s, the idea of artificial neural networks was further developed, thanks to the work
of Hopfield and the development of the error backpropagation algorithm for training
multilayer perceptrons. The key work on backpropagation was presented by Rumelhart
et al. in their paper [14].

Figure 2.1: Timeline of the history of neural networks. Source: https://towardsdatascience.com/rosenblatts-perceptron-the-very-first-neural-network-37a3ec09038a

This was an important discovery and then in 1989, Yann LeCun came up with the
idea of convolutional neural networks. Convolutional Neural Network is a class of deep
neural networks, most commonly applied to analyse visual imagery. This approach
became the foundation of modern computer vision.
In 2014, Ian Goodfellow came up with the idea of generative adversarial networks,
which is one of the most, if not the most, sophisticated generative algorithms in the field
of computer vision. Figure 2.1 shows a timeline of the history of neural networks.

2.2 Artificial Neural Networks


Artificial Neural Networks are nonlinear systems inspired by biological neural net-
works. The very first step towards the mathematical model of the artificial neuron we
use today was taken in 1943 by McCulloch and Pitts by mimicking the functionality of
a biological neuron. Their model, called the Threshold Logic Unit, is defined by the following rules:

• It has a binary output y ∈ {0, 1}, where y = 1 indicates that the neuron fires and
y = 0 that it is at rest.

• It has $N$ binary inputs $x_k \in \{0, 1\}$ and a single inhibitory input $i$. If the inhibitory input is on, the neuron cannot fire.

• It has a threshold value Θ. The neuron fires only if the sum of its inputs is larger
than this critical value.

Given the input $x = [x_1, x_2, x_3, \ldots, x_n]^T$, the output $y$ is computed as follows:

$$
y = \sigma(x) =
\begin{cases}
1 & \text{if } \sum_{k=1}^{n} x_k > \Theta \text{ and } i = 0, \quad (2.1) \\
0 & \text{otherwise.} \quad (2.2)
\end{cases}
$$
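As a quick illustration, Eq. (2.1) can be transcribed directly into a few lines of code; the threshold of 1 in the example below is an arbitrary choice that makes the two-input unit behave like a logical AND.

```python
# A direct transcription of the Threshold Logic Unit defined by Eq. (2.1).
import numpy as np

def threshold_logic_unit(x, theta, inhibitory=0):
    # Fires (returns 1) only if the sum of the binary inputs exceeds the
    # threshold and the inhibitory input is off.
    return 1 if (np.sum(x) > theta and inhibitory == 0) else 0

# With a threshold of 1, a two-input unit behaves like a logical AND.
assert threshold_logic_unit([1, 1], theta=1) == 1
assert threshold_logic_unit([1, 0], theta=1) == 0
assert threshold_logic_unit([1, 1], theta=1, inhibitory=1) == 0
```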

In the 1960s, an American psychologist called Frank Rosenblatt came up with the idea
of perceptron, which was inspired by the Hebbian theory of synaptic plasticity, which
is the adaptation of brain neurons during the learning process. In the 1980s, the idea of
artificial neural networks was further developed, thanks to the work of Hopfield and the
development of the error backpropagation algorithm for training multilayer perceptrons.


Figure 2.2: The structure of an artificial neuron.

Artificial neural networks allow learning mappings from a given input space to the desired output space. Their use consists of a training phase, which can be either supervised or unsupervised, or a combination of both, and a prediction phase. Figure 2.2 shows the
structure of a simple artificial neuron.

2.3 Convolutional Neural Networks (CNN)


Convolutional Neural Network is a class of deep neural networks, most commonly ap-
plied to analyse visual imagery. The speciality of convolutional neural networks is the
way the connections between neurons are structured. In a feedforward neural network,
units are organised into layers and the units at a given layer only get input from units in
the layer below. CNNs are feedforward networks. However, unlike standard artificial
neural networks, units in a convolutional neural network have a spatial arrangement. At
each layer, units are organised into sublayers called feature maps. Each of these feature
maps is the result of a convolution performed on the layer below. Some of the important
and common terms used in convolutional neural networks are:

• Kernel: It is the filter used for convolution with the images.


• Strides: Stride is the number of pixels shifted over the input matrix.

• Padding: Sometimes the filter does not fit the input image perfectly. Therefore,
we pad the picture with zeros (zero-padding) so that it fits.

• Batch-normalization: It is used to normalise the inputs of a layer by re-centering and rescaling. It is a technique for improving the speed, performance, and stability of
the neural network.

• Pooling: Pooling is used to reduce the spatial dimension of the feature maps. The
most commonly used pooling methods are max-pooling and average pooling.

– Max-pooling: The feature maps are downsampled by taking the maximum value of the patches of the feature map.
– Average pooling: The feature maps are downsampled by taking the mean of
the patches of the feature map.

The mathematical concept of pooling is shown in Figure ?? .

• Fully Connected Layer: A fully connected layer is a feed-forward layer whose input is the flattened output of a convolutional layer.
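A small PyTorch block makes these terms concrete; the layer sizes below are arbitrary and chosen only so that the tensor shapes are easy to follow.

```python
# Illustration of the terms above: kernel, stride, padding, batch normalisation,
# pooling and a fully connected layer. Layer sizes are arbitrary.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # 3x3 kernel, zero-padding keeps H and W
    nn.BatchNorm2d(16),                                     # re-centre and re-scale activations
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                            # max-pooling halves H and W
    nn.AdaptiveAvgPool2d((4, 4)),                           # average pooling down to a fixed 4x4 map
    nn.Flatten(),                                           # flatten the feature maps ...
    nn.Linear(16 * 4 * 4, 10),                              # ... for the fully connected layer
)

x = torch.randn(1, 3, 32, 32)   # a dummy 32x32 RGB image
print(block(x).shape)           # torch.Size([1, 10])
```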

2.4 Different CNN Architectures


The state-of-the-art in person re-id is dictated by a few popular powerful CNNs like
AlexNet [15], VGGNet [9], LeNet [16], GoogLeNet/Inception v1 Network [17], other
version of the Inception network [18, 19], ResNet [20, 21] and DenseNet [22]. The
following subsections contain a short description of some of these networks.

2.4.1 LeNet
This architecture was proposed by LeCun et al. [16] in 1998. However, due to lim-
ited computation capabilities and memory availability, it was successfully implemented
in 2010 using the back-propagation algorithm. It was applied to the handwritten digit


Figure 2.3: Block diagram of LeNet architecture. Source: https://medium.com/@pechyonkin/key-deep-learning-architectures-lenet-5-6fc3c59e6f4

recognition task which defined the state-of-the-art. The overall number of weights is
431K. Its architecture is shown in Figure 2.3 and includes a couple of convo-
lutional layers, two subsampling layers, two fully connected layers, and the final output
layer.

2.4.2 AlexNet
Designed by the group of Alex Krizhevsky, this CNN won the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC) in 2012, defining a new state-of-the-art in
visual object recognition. Dropout and Local Response Normalisation (LRN) were in-
troduced in this network. A dropout layer was applied at the fully connected (FC)
layer. LRN is applied either intrachannel or across feature maps. The input size is
224 × 224 × 3. The first convolutional layer produces 96 output feature maps of size
55 × 55 by using filters with size 11 × 11 with a stride of 4. The filter size for the
max-pooling layer is 3 × 3 with a stride of 2. The subsequent convolutional layers use
filters with size 3 × 3 or 5 × 5. This network contains 61 million learnable parameters.
AlexNet architecture is shown in Figure 2.4.
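The quoted first-layer output size can be checked directly; the snippet below reproduces only the first convolution and max-pooling layers, and assumes a padding of 2 (as in common implementations) to obtain 55 × 55 maps from a 224 × 224 input, rather than the full network.

```python
# First convolution and pooling layers of AlexNet only, to verify the quoted
# 96 feature maps of size 55x55. Padding of 2 is an assumption made here;
# torchvision.models.alexnet provides a complete network.
import torch
import torch.nn as nn

first_conv = nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2)
first_pool = nn.MaxPool2d(kernel_size=3, stride=2)

x = torch.randn(1, 3, 224, 224)
y = first_conv(x)
print(y.shape)               # torch.Size([1, 96, 55, 55])
print(first_pool(y).shape)   # torch.Size([1, 96, 27, 27])
```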

2.4.3 VGGNet
The VGGNet, shown in Figure 2.5, was proposed in 2014 by the Visual Geometry
Group (VGG) at Oxford University. It is a very heavy network, but it helped to show that network depth is an essential factor for performing well in visual tasks. Three


Figure 2.4: Block diagram of AlexNet architecture. Source: https://medium.com/coinmonks/paper-review-of-alexnet-caffenet-winner-in-ilsvrc-2012-image-classification-b93598314160

models have been proposed, VGG-11, VGG-16, and VGG-19, having, respectively, 11,
16, and 19 layers overall. While the number of convolutional layers varies for the three
models (8, 13, and 16, respectively), they all end with three fully connected layers and
use the ReLU activation function. The convolutional filters are 3 × 3 with stride 1, and max-pooling uses 2 × 2 windows with stride 2. The overall number of weights is about 138M for VGG-16 and 144M for VGG-19.

2.4.4 GoogLeNet/Inception v1
GoogLeNet [17] was introduced by Google, and it won the ILSVRC competition in
2014. It represented a breakthrough in the effort for reducing the computation com-
plexity of CNNs. Indeed, the overall number of parameters of this architecture is 7M,
much lower than for AlexNet and VGG-19. Moreover, it had the highest number of layers of all these CNNs at the time, achieved by stacking inception modules. The
network architecture of GoogLeNet is shown in Figure 2.10. Although the details in the
image are not very clear due to page size limitations, the image gives an overall idea of
the architecture of the network. This network exploits the concept of allowing variable


Figure 2.5: Block diagram of VGG-16 architecture. Source: https://neurohive.io/en/popular-networks/vgg16/

receptive fields through kernels of different sizes, and performs dimensionality reduction before the more computationally intense layers.

2.4.5 ResNet
ResNet [20] was introduced by He et al. and won the ILSVRC 2015 competition.
It is robust to the vanishing gradient problem, thanks to the idea of residual learning. A
residual block is where the activation of a layer is fast-forwarded to a deeper layer in the
neural network via a skip connection. The network architecture of ResNet is shown in
Figure 2.11. Different versions of ResNet with different depths have been proposed.
ResNet50 is one of them which gained popularity. It contains 49 convolution layers and
1 fully connected layer, with 25.5 million learnable parameters. Variants of residual
models also exist, and some of them combine residual blocks with inception units.
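A minimal residual block, written out in code, makes the skip connection explicit; this is a sketch of the basic (non-bottleneck) block rather than the exact ResNet50 building block.

```python
# Minimal residual block: the input is forwarded past two convolutions through
# a skip connection and added back. A sketch of the basic block, not the exact
# bottleneck block used in ResNet50.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)  # skip connection

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```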


2.4.6 Generative Adversarial Networks (GAN)


A generative adversarial network is a class of artificial neural networks invented by
Goodfellow et al. [23]. Generative Adversarial Networks consist of

• a generative network G that captures the data distribution

• a discriminative network D that estimates the probability that a sample comes from the training data rather than G.

Given a training set, this technique learns to generate new data with the same statistics as
the training set. The main concept of GAN is that two neural networks contest with each
other in a zero-sum game. The two-player min-max game of GAN has been described
as follows:

• The generative model can be thought of as analogous to a team of counterfeiters,


trying to produce fake currency and use it without detection.

• The discriminative model is analogous to the police, trying to detect the counter-
feit currency.

• Competition in this game drives both teams to improve their methods until the
counterfeits are indistinguishable from the genuine articles.

For example, a GAN trained on photographs can generate new photographs that look
at least superficially authentic to human observers, having many realistic characteris-
tics. Figure 2.8 shows some of those examples, where I trained a GAN on faces and it
generated those outputs. Although originally proposed as a form of a generative model
for unsupervised learning, GANs have also proven useful for semisupervised learning,
fully supervised learning, and reinforcement learning. GANs have gone a long way
since their inception and now can generate images that look like real images. Let the
generator be denoted by G and the discriminator be denoted by D. D is trained to max-
imise the probability of assigning the correct label and G is simultaneously trained to
maximise the probability of D making a mistake. The objective function of a GAN is

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{2.3}
$$
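A bare-bones training step for this two-player game is sketched below on a toy two-dimensional distribution; the tiny fully connected networks, learning rates and the non-saturating generator update are illustrative choices, not a prescription from this thesis.

```python
# Bare-bones alternating updates for the min-max game in Eq. (2.3), on a toy
# 2-D distribution. Network sizes and hyperparameters are illustrative only.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))                 # z -> fake sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # sample -> P(real)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def training_step(real):
    n = real.size(0)
    z = torch.randn(n, 8)

    # Discriminator step: maximise log D(x) + log(1 - D(G(z))).
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(n, 1)) + bce(D(G(z).detach()), torch.zeros(n, 1))
    d_loss.backward()
    opt_d.step()

    # Generator step: the non-saturating form, minimise -log D(G(z)).
    opt_g.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(n, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

print(training_step(torch.randn(64, 2)))  # one update on dummy "real" samples
```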


Figure 2.6: Block diagram of a simple GAN. Source: https://sigmoidal.io/beginners-review-of-gan-architectures/

Figure 2.7: Hand written digits generated by a simple GAN. The network was trained
on the MNIST dataset.

2.4.7 Conditional GANs


The concept of GAN was later extended by Mirza et al. [24] by conditioning the model
with additional information such that it is possible to direct the data generation process.
Since conditional GANs have the ability to take some prior information to design the
posterior data, this concept can be utilised to generate faces from the available data. The
objective function of a conditional GAN [24] is
$$
\min_G \max_D L(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)))] \tag{2.4}
$$

where D is the discriminator, G is the generator, x is the observed image, y is the conditioning information, and z is noise sampled from a normal distribution. This means the best model can minimise the generator error
and maximise the discriminator error.
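One common way to realise this conditioning, used here purely as an illustration, is to concatenate the condition y to the inputs of both networks; the layer and class sizes below are arbitrary.

```python
# Illustrative conditional GAN: the condition y is one-hot encoded and
# concatenated to the inputs of both generator and discriminator, giving
# G(z | y) and D(x | y) as in Eq. (2.4). Sizes are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes, z_dim, x_dim = 10, 64, 784

class ConditionalGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim + n_classes, 128), nn.ReLU(),
                                 nn.Linear(128, x_dim), nn.Tanh())

    def forward(self, z, y):
        return self.net(torch.cat([z, F.one_hot(y, n_classes).float()], dim=1))

class ConditionalDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim + n_classes, 128), nn.ReLU(),
                                 nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, x, y):
        return self.net(torch.cat([x, F.one_hot(y, n_classes).float()], dim=1))

z, y = torch.randn(4, z_dim), torch.randint(0, n_classes, (4,))
print(ConditionalDiscriminator()(ConditionalGenerator()(z, y), y).shape)  # torch.Size([4, 1])
```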
Figure 2.9 shows some examples. The top row contains the condition images, the middle row the ground truth, and the bottom row the images generated by a conditional GAN.
The above described networks are some of the most representative and powerful
models in deep learning. The main advantages of neural networks are


Figure 2.8: Faces generated by different algorithms by tweaking the parameters and loss
functions.

Figure 2.9: Cityscapes generated by a conditional GAN. The first row of segmented
images are the input to the generator and the last row is the output. The middle row
shows the ground truth


• multilevel representation from pixel to high-level semantic features. This is due


to the number of filters present in the convolution layer.

• deep CNNs can jointly optimise several related tasks together, for example, Fast
RCNN [25] jointly performs classification and bounding box regression.

• accuracy provided by deep learning methods outperforms classical algorithms and


shallow networks by a considerable margin.

CNNs have been widely used in many research areas due to these benefits. These fields
include, but are not limited to, superresolution, general image enhancement, classification,
object detection, face recognition, person reidentification, etc. The primary focus of our
work is face detection and recognition. In the following sections, we will go through
and analyse some related works which led to this research.

Figure 2.10: Block diagram of the GoogLeNet architecture. Source: https://developer.ridgerun.com/wiki/index.php?title=GstInference/Supported_architectures/InceptionV2
Figure 2.11: Block diagram of the VGG-19 and ResNet architectures. Source: https://mc.ai/what-are-deep-residual-networks-or-why-resnets-are-important/

2.5 Related Work

2.5.1 Intelligent Object Detection


Generic object detection aims to locate and identify existing objects in any image, and to
mark them with rectangular bounding boxes to display the degree of certainty that they
exist. The systems of generic object detection methods can be divided into two cate-
gories. The first one is the standard object detection pipeline, which involves generating
region proposals and then classifying each proposal into different object categories. Re-
gion proposal-based framework, which is a two-step technique, closely resembles the
visual attention system of the human brain in that it first performs a coarse scan of
the entire scene before focusing on the regions of interest. Region proposal-based
methods include R-CNN [26], Fast R-CNN [26], Mask R-CNN [27], etc. These frameworks
are mainly made up of multiple correlated stages, including region proposal generation,
feature extraction with a CNN, classification, and bounding box regression, which are
typically trained separately. Even in the more recent end-to-end Faster R-CNN, an
alternating training scheme is still needed to obtain the shared convolution parameters
between the Region Proposal Network (which is the backbone of the Faster R-CNN model)
and the detection network. The drawback is that, for real-time applications, the time
spent processing the various components becomes a bottleneck of the framework. The
second approach treats object detection as a regression or classification problem,
adopting a unified framework that directly produces class labels and positions. The
regression or classification-based methods include AttentionNet [28], SSD [29],
YOLO [30], YOLOv2 [31], etc. To predict the confidence for multi-
ple categories and bounding boxes, YOLO makes use of the entire topmost feature map.
Because of the strict spatial constraints put on bounding box predictions, YOLO has
difficulty dealing with small objects in groups. SSD was proposed to overcome these
problems. Instead of using fixed grids like YOLO, the SSD uses, for a given feature map,
a collection of default anchor boxes with different aspect ratios and scales to discretise
the space of possible bounding boxes. The network combines predictions from several feature
maps with different resolutions to handle objects of various sizes. Face detection is
object detection applied to a very specific category, namely the human face. There are
several methods, like MTCNN [32], S3FD [1], etc., that perform


well on clean images. However, the gap in the current state-of-the-art lies in the fact
that detection performance degrades on degraded images.

2.5.2 GANs for Face Generation


Our research mainly deals with conditional GANs for the generation of faces and in this
section we will discuss a bit about the works that have been carried out on face gener-
ation using GANs. The earliest face generation using GANs was done by the inventor
of GAN himself, Ian Goodfellow. Goodfellow et al., in their paper [23], have shown that
faces can be generated using GANs. A year later, Radford et al. [33] introduced the
concept of vector arithmetic for GAN generated faces. Let us suppose that we define
a vector Z1 representing a man with glasses, a vector Z2 representing a man without
glasses and a vector Z3 representing a woman without glasses. Radford et al. showed
that it is possible to generate an image of a woman with glasses by using Z1 − Z2 + Z3 .
Karras et al. took the performance a step further in their research [34]. They proposed a
new training methodology for GANs where the key idea is to grow the generator and the
discriminator progressively. It helps in speeding up the training and stabilises the GAN.
The quality of faces generated by this GAN is almost indistinguishable from the real
ones. The most recent work is by Karras et al. [35] where they analyze a style-based
GAN architecture. They redesigned the generator normalisation, modified the way of
progressive growth, and regularised the generator to encourage good conditioning in the
mapping from latent codes to images. This improves the images to an extent where the
generated faces are indistinguishable from real faces. The results of this GAN can be
seen in the website https://www.thispersondoesnotexist.com/.
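As a toy illustration of the vector arithmetic of Radford et al. described above, the sketch below applies the Z1 − Z2 + Z3 operation to latent codes; the generator G and the latent codes are placeholders here and would come from a trained model in practice.

```python
import numpy as np

def attribute_arithmetic(G, z_man_glasses, z_man, z_woman):
    """Z1 - Z2 + Z3: transfer the 'glasses' direction onto the 'woman' code."""
    z_new = z_man_glasses - z_man + z_woman
    return G(z_new)

# Toy demo with random 100-d codes and a placeholder generator.
rng = np.random.default_rng(0)
z1, z2, z3 = rng.normal(size=(3, 100))
img = attribute_arithmetic(lambda z: z, z1, z2, z3)  # stand-in for a real G
```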
As our primary goal is to improve the quality of the facial images for better detection
and recognition performance, our first approach is removing the artifacts and noise by
using GANs. However, before that, we will be reviewing the previous works involving
artifact and noise removal to find the gap between our research and the state-of-the-art.

2.5.3 Artifact Removal from Images and Videos


Artifact removal from images has been extensively treated. To cope with the problem
of generating high perceptual quality images from low-quality ones, different methods


have been proposed. Before the advent of deep learning, filter-based approaches were
quite popular. In the spatial domain, different kinds of filters [36–38] have been pro-
posed to adaptively deal with blocking artifacts in specific regions. In the frequency
domain, wavelet transform has been utilised to derive thresholds at different wavelet
scales for deblocking and denoising [39, 40]. However, the problems with these meth-
ods are that they could not reproduce sharp edges and tend to have overly smooth tex-
ture regions. In the recent past, different algorithms for image enhancement using deep
learning have been proposed. Dong et al. [7] showed that directly applying the SRCNN
architecture for JPEG AR resulted in undesired noisy patterns in the reconstructed im-
age, and thus proposed a new improved model. Svoboda et al. [41] proposed a novel
method of image restoration using convolutional networks that had a significant quality
advancement compared to the state-of-the-art methods. The drawback, however, is that
it works on monochromatic images. Moreover, pixel-wise loss functions like MSE or
L1 loss were quite common in those methods, making the images softer. Pixel-wise
loss functions are unable to recover the lost high-frequency details in an image and en-
courage finding pixel-wise averages of possible solutions which are generally smooth
having poor perceptual quality [3, 42–44]. Johnson et al. [44] and Bruna et al. [42]
proposed extracting the features from a pretrained VGG network instead of using pixel-
wise error. They proposed a perceptual loss function based on the Euclidean distance
between feature maps extracted from the VGG19 [9] network. This loss proved to be
quite efficient in capturing the details of an image as it uses the intermediate activations
in the backbone of the VGG network, which represents the deep features. Zhang et al.
proposed an algorithm [8] to solve different problems related to image artifacts. This
method aims to train a set of fast and effective deep learning-based denoisers and inte-
grate them into model-based optimisation methods to solve inverse problems. A recent
paper by Ghosh et al. [45] uses edge loss on top of perceptual loss to simultaneously
remove artifacts and superresolve images. This seemed to be an interesting approach as
including the edge loss in the loss function acted as a booster in preserving the sharp-
ness of the images, thus enhancing the overall quality of the output. Lu et al. [46]
model the video artifact reduction task as a Kalman filtering procedure and restore
the decoded frames through a deep Kalman filtering network. Soh et al. [47] exploit
a simple CNN structure and introduce a new training strategy that captures the tempo-
ral correlation of consecutive frames in a video. They aggregate similar patches from


neighbouring frames and use them to reduce artifacts from the videos. GANs proved to
be quite effective in generating clean images, however, GAN based artifact removal still
remains relatively unexplored, which includes artifact removal for colour images. This
is a significant gap in the state-of-the-art which needs immediate attention.

2.5.4 Image Noise Removal


There have been several attempts in the past to deal with image denoising. Dabov et
al. [48] proposed an image deblurring algorithm using Nash equilibrium to derive an it-
erative decoupled deblurring BM3D method. To solve different inverse problems, there
are several existing denoising algorithms that were adopted in model-based optimisa-
tion methods. Chambolle presented an algorithm for minimising the total variation of
an image [49], Zoran et al. designed a framework [50] that restores images using patch-
based prior information. K-SVD [51] and non-local means [52] have also been used
to solve image denoising problems. However, these algorithms have their respective
drawbacks. For example, total variation is known to create watercolour-like artifacts,
while non-local means and BM3D denoiser priors often over-smooth fine texture details.
The problem with K-SVD is that it suffers from a high computational burden. Thus, a strong
denoiser prior which can be implemented efficiently is highly desirable. Furthermore,
most existing methods work independently on every channel or mainly focus on mod-
elling the grey image prior. Since the colour channels are highly correlated, working on
all channels simultaneously produces better performance than independently handling
each colour channel [53]. The problem of noise in images is often accompanied by
other problems, such as lower resolution or the presence of com-
pression artifacts. Although there exist separate solutions to each of these problems,
the drawback is that they introduce considerable computational and memory overhead,
thus making the methods practically unusable in real time for images with multiple
problems, for example in a noisy compressed video.

2.5.5 Image Super-resolution


Initially, filtering approaches were used for SR. They are usually very fast but with
overly smooth textures, thus losing a lot of detail. Methods focusing on edge preser-


vation [54, 55] also fail to produce photorealistic images. Recently, convolutional neu-
ral network (CNN) based SR algorithms have shown excellent performance. Wang et
al. [56] showed that the sparse coding model for SR can be represented as a neural net-
work and improved results can be achieved. Dong et al. [4] used bicubic interpolation
to upscale an input image and trained a three-layer deep fully convolutional network
end-to-end achieving state-of-the-art SR performance. Their method SRCNN learns an
end-to-end mapping between low-resolution and high-resolution images directly. They
achieved enough speed for practical online usage. Later, Dong et al. [57] and Shi et
al. [58] demonstrated that upscaling filters can be learnt for SR to obtain an increased
performance. To upscale the low resolution feature maps into the high resolution out-
put, they added an effective subpixel convolution layer that learnt an array of upscaling
filters. This effectively replaced the handcrafted bicubic filters in the SR pipeline with
more complex upscaling filters that can be explicitly trained for each feature map. Si-
multaneously, the overall computational complexity of the superresolution operation
was also reduced. The studies of Johnson et al. [44] and Bruna et al. [42] relied on loss
functions that focus on perceptual similarity to recover HR images that are more pho-
torealistic. A recent work by Ledig et al. [3] presented the first framework capable of
generating photorealistic images for 4× upscaling factors. In this method, a perceptual
loss feature with adversarial loss and content loss is proposed. Using a discriminator
network that is trained to distinguish between superresolved images and original photo-
realistic images, the adversarial loss moves the solution to the natural image manifold.
Furthermore, rather than pixel space similarity, they used a content loss inspired by
perceptual similarity. This was a very sophisticated method for photorealistic image
superresolution and set a new standard for SR performance. Face super-resolution is
a domain-specific super-resolution problem. Significant advancement was also done
in this field. Chen et al. proposed a novel face superresolution algorithm [59] that
uses facial landmark heat maps and parsing maps to superresolve very low-resolution
face images. Although significant research has been done in the field of superresolution,
it fails to answer the most important question for face superresolution, i.e., the
recognition performance. Facial image enhancement without improving the recognition
performance is practically useless. A high PSNR or SSIM does not guarantee higher
recognition performance. This is a critical shortcoming in the current state-of-the-art
that needs to be addressed.


2.5.6 Loss Functions


To generate high perceptual quality images, designing a proper loss function plays an
important role in our proposed algorithm. Pixel-wise loss functions like MSE or L1
loss are unable to recover the lost high-frequency details in an image. A loss function
that minimises MSE encourages finding pixel averages of possible solutions that are
softened or blurred and although minimising the total loss, the achieved images will
have low perceptual quality for human vision [3, 42–44]. Ledig et al. [3] illustrated
the problem of minimising MSE where multiple plausible solutions with high texture
details are averaged creating a smooth reconstruction. Johnson et al. [44] and Bruna
et al. [42] proposed extracting the features from a pretrained VGG network instead of
using pixel-wise error. They proposed a perceptual loss function based on the Euclidean
distance between feature maps extracted from the VGG19 [9] network. Ledig et al. [3]
proposed a GAN-based network optimised for perceptual loss which is more invariant
to changes in pixel space, obtaining better visual results. As we see from the recent
works, the loss functions are more geared towards improving the perceptual quality of
the image. Our main area of interest is faces. How well these loss functions can improve
the recognition performance of degraded face images remains an unanswered question
that needs further research. Improving perceptual quality is not always correlated with
better recognition performance. For example, a Gaussian filter can make a noisy image
look relatively pleasing, but it removes specific facial features that are essential for good
recognition performance. This ultimately leads to a better perceptual quality, but worse
recognition performance.

2.5.7 Perceptual Image Quality Evaluation


Evaluating the perceptual quality of an image is tricky because most of the statistical
measures do not well reflect human perception. Since we are working on the perceptual
enhancement of images, a proper image perceptual quality metric is essential for a fair
evaluation of our results. Ledig et al. [3] have shown in their work that images with
high PSNR do not necessarily mean a perceptually better image. The same applies to
Structural Similarity (SSIM) as well. NIQE [60] measures the distance between the nat-
ural scene statistics (NSS) based features calculated from a given image to the features
obtained from an image database used to train the model. The features are modelled


as multidimensional Gaussian distributions. A smaller score indicates better perceptual


quality. Xue et al. [61] presented an effective and efficient image quality assessment
model called Gradient Magnitude Similarity Deviation (GMSD) which they claimed to
have favourable performance in terms of both perceptual quality and efficiency. This
method shows that the pixel-wise gradient magnitude similarity (GMS) between the
reference and distorted images combined with the standard deviation of the GMS map
can predict the perceptual quality of an image quite accurately. A statistical analysis of
image quality measures conducted by Kundu et al. reported that GMSD [62] showed
a high correlation with the human visual system. A very recent work by Reisenhofer
et al. presents a similarity measure for images called Haar wavelet-based Perceptual
Similarity Index (HaarPSI) [11] that aims to correctly assess the perceptual similarity
between two images with respect to a human viewer. The HaarPSI evaluates the local
similarities between two images as well as the relative importance of image areas using
the coefficients obtained from a Haar wavelet decomposition. It achieves higher corre-
lations with human opinion scores on large benchmark databases in almost every case
and has been proven to be the best perceptual similarity metric available in the literature.
Thus, we will focus our evaluation mainly on GMSD and HaarPSI.
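For reference, GMSD can be sketched in a few lines: gradient magnitudes of the reference and distorted images are compared through a pixel-wise similarity map, and the standard deviation of that map is the score. The implementation below is a simplified sketch following Xue et al.; the initial 2× downsampling step of the original method is omitted, and the constant c assumes a [0, 255] intensity range.

```python
import numpy as np
from scipy.signal import convolve2d

def gmsd(ref, dist, c=170.0):
    """Gradient Magnitude Similarity Deviation (simplified sketch).

    ref, dist: 2-D greyscale arrays in [0, 255]. Lower scores are better.
    """
    hx = np.array([[1, 0, -1]] * 3, dtype=float) / 3.0   # Prewitt kernel, x
    hy = hx.T                                             # Prewitt kernel, y
    def grad_mag(img):
        gx = convolve2d(img, hx, mode="same", boundary="symm")
        gy = convolve2d(img, hy, mode="same", boundary="symm")
        return np.sqrt(gx ** 2 + gy ** 2)
    gr, gd = grad_mag(ref.astype(float)), grad_mag(dist.astype(float))
    gms = (2 * gr * gd + c) / (gr ** 2 + gd ** 2 + c)     # similarity map
    return gms.std()                                       # deviation = score

# Example: identical images give a GMSD of 0.
img = np.random.rand(64, 64) * 255
print(gmsd(img, img))
```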

2.5.8 Deep Metric Learning


Deep metric learning learns deep representations of the samples that capture their se-
mantic similarity in the feature space. Deep metric learning can be divided into two
categories. The first one consists of algorithms where traditional metric learning is
applied to the extracted deep features. It essentially has two stages where first the fea-
ture embeddings are learned and then the metric learning is performed, i.e., the metrics
are learned after the weights of the neural network are finalised. For example, Hu et
al. in [63] applies deep transfer metric learning to learn a set of hierarchical nonlinear
transformations for cross-domain recognition. The discriminative knowledge gets trans-
ferred from the training domain to the target domain. They achieve this by minimising
the intraclass variation and maximising the interclass variation. Yang et al. in [64] pro-
posed a novel similarity metric and an efficient strategy to learn it. They consider both
the difference and the common factors of an image pair to discriminate them as much
as possible. Under a pair-constrained Gaussian assumption, they obtain the covariance


matrices of dissimilar pairs from the similar pairs. They use log-likelihood ratio during
training to aid a faster and scalable learning process.
The second one is formed by methods that learn a metric jointly with the embedding
in a deep architecture. Usually, the methods belonging to the second category are better
optimised, which is typically achieved by Siamese-based architectures. They exploit
the contrastive loss (or triplet loss) that allows a distance relation between pairs (or
triplets) of feature points to be learned at training time [65, 66]. A second approach to
the joint metric learning is to integrate the metric learning scheme is to integrate the
metric learning scheme into the feature extraction network, and an end-to-end training
is performed by the optimisation process. This technique tends to overfit to the training
data and thus needs weight constraints for better generalisation.
However, it is worth mentioning that the deep metric learning techniques suffer from
slow convergence caused by a subset of trivial samples that has negligible contribution
to the training. To counteract this problem, most of these techniques apply sample
mining strategies by selecting nontrivial samples, commonly known as hard or semihard
pairs. This accelerates the learning process and reduces overfitting.

2.5.9 Face Recognition and Metric Learning


Face recognition using deep learning is a very active area of research [2,65–68]. Before
the era of deep learning, handcrafted features were used for face recognition [69–71]
whose performance was not very high and had limited use in the real world. A substan-
tial leap in modern face recognition started with the algorithm proposed by Taigman et
al. (Deepface [72]) which outperformed the contemporary state-of-the-art by more than
27%. Hu et al. [67] presented a powerful nonlinear tensor-based fusion method which is
a synergistic combination of both hand-crafted and deep conventional features. Schroff
et al. used a deep convolutional network (Facenet [65]) which consistently outper-
formed previous methods on various benchmarks. The network was trained to optimise
the face embedding using triplet loss. The concept of triplet loss revolves around a point
in space, known as the anchor, which represents a certain class. In this loss, the distance
of the anchor is compared to a positive input, which is a point belonging to the same
class as the anchor, and a negative input, which belongs to a different class. The dis-
tance from the anchor to the positive input is minimised, and the distance to the negative


input is maximised. Wen et al. developed an algorithm [66] that used the centre loss
to enhance the discrimination of deep features. The centre loss is a loss function that
efficiently pulls the deep features of the same class to their class centre.
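A minimal sketch of the triplet loss described above is shown below; the margin value and the toy embeddings are illustrative, and in practice the embeddings would come from the face recognition network being trained.

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on L2-normalised embeddings (sketch).

    Pulls the anchor towards the positive and pushes it away from the
    negative until their gap exceeds `margin`.
    """
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))

# Toy embeddings (batch of 4, 128-d as in FaceNet).
a, p, n = (tf.math.l2_normalize(tf.random.normal([4, 128]), axis=-1)
           for _ in range(3))
loss = triplet_loss(a, p, n)
```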
The deep metric learning techniques suffer from drawbacks both in the end-to-end
integration-based approaches and the Siamese network-based approaches. That is, the
slow convergence which is caused by some trivial samples that do not have much con-
tribution during the training phase. The possible solution to most of these techniques is
to apply sample mining strategies by selecting nontrivial samples which can accelerate
convergence and reduce overfitting. These nontrivial samples are called semi-hard nega-
tives/positives or hard negatives/positives. Different sample strategies target either hard
negatives [73–76], or semi-hard negatives [65], or hard positives [74, 76] or semi-hard
positives.
Some recent methods in face recognition using deep metric learning proved to be
very effective. A popular approach of research is to incorporate margins in well-established
loss functions to maximise face class separability. Liu et al. presented an algorithm
(SphereFace [68]) which assumes that the deep features can be used as a representation
of the centre of the given class/label of faces in a polar coordinate system. They intro-
duced a loss function called angular softmax loss that helps networks to learn angular
discriminative features. Wang et al. [77] proposed a novel algorithm, called Large Mar-
gin Cosine Loss (LMCL), which takes the normalised features as input to learn highly
discriminative features by maximising the interclass cosine margin. Recently, Deng et
al. [2] proposed an algorithm called Additive Angular Margin Loss (ArcFace in short)
to obtain highly discriminative features for face recognition. The proposed algorithm
has a very clear interpretation of geometry. The main concept revolves around the use of
the exact correspondence to the geodesic distance on a hypersphere. This has a substan-
tial advantage in face recognition compared to vanilla softmax. Although the softmax
loss provides decent separable feature embedding, it produces noticeable ambiguity in
decision boundaries. However, the ArcFace loss can enforce a better separation between
the nearest classes.
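A simplified sketch of the additive angular margin idea is given below: the embedding and the class-centre weights are L2-normalised, the margin m is added to the angle of the ground-truth class only, and the result is rescaled before the usual softmax cross-entropy. The layer sizes are illustrative and this is not the exact ArcFace implementation.

```python
import tensorflow as tf

def arcface_logits(embeddings, weights, labels, s=64.0, m=0.50):
    """Additive angular margin (ArcFace-style) logits, simplified sketch.

    embeddings: [B, d] face features, weights: [d, C] class centres,
    labels: [B] integer identities. s and m are the commonly used scale
    and margin hyper-parameters.
    """
    x = tf.math.l2_normalize(embeddings, axis=1)
    w = tf.math.l2_normalize(weights, axis=0)
    cos = tf.clip_by_value(x @ w, -1.0 + 1e-7, 1.0 - 1e-7)   # cos(theta)
    theta = tf.acos(cos)
    # Add the margin m only to the angle of the ground-truth class.
    onehot = tf.one_hot(labels, tf.shape(weights)[1])
    return s * (onehot * tf.cos(theta + m) + (1.0 - onehot) * cos)

# The result feeds a standard softmax cross-entropy loss.
logits = arcface_logits(tf.random.normal([4, 512]),
                        tf.Variable(tf.random.normal([512, 10])),
                        tf.constant([0, 3, 7, 9]))
```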


2.5.10 Facial Occlusion and Synthesis


Several solutions to the problem of facial occlusion have been proposed over the years,
ranging from dividing the face image into a set of local regions to sophisticated statisti-
cal methods. The first reconstruction-based solution from a single image was proposed
by Jia et al. [78] in 2008. Although it performed fairly for very small occlusions, it was
unsuitable for larger occlusions. The nonfrontal facial view is also a kind of occlusion,
as a part of the face remains hidden. Huang et al. [13] proposed a generative adversar-
ial network called TPGAN [13] for photorealistic frontal view synthesis by perceiving
global structures and local details simultaneously. Ekenel et al. [79] have shown that
the main reason for the low performance is erroneous facial feature localisation, which
was an important discovery. This discovery led to a deeper understanding of the fa-
cial features important for accurate recognition and helped to create a well-defined face
recognition guideline. Andres et al. [80] tackled the problem using compressed sensing
theory which was robust only to certain types and levels of occlusion. This is mainly
due to the fact that the identification of the individual is based on the detected non-
occluded portion. Su et al. [81] also proposed a solution which used the information
from the non-occluded part of the face for recognition. The work of Wen et al. [82]
involves structured occlusion coding, whose main idea also revolves around using the
non-occluded information to recognise the face correctly. Wan et al. offered a solution
called MaskNet [83], a masking-based approach in which the occluded area is masked
out and assigned lower importance. Song et al. [84] proposed a similar kind of solu-
tion which is based on a mask learning strategy to find and discard corrupted feature
elements for recognition. The latest research done in this area was by Xiao et al. [85]
which was based on occlusion area detection and recovery. As the previous works sug-
gest, the idea of recognition of faces with occlusion is mainly geared towards using the
available facial information, filtering out the occluded region. A big drawback of this
method is that the probability of recognition of the individual is considerably reduced
with the amount of occlusion. For surveillance videos, this is not desirable and it is im-
portant to maximise the correct recognition rate, as it often becomes a question of safety
and security. A video is a collection of images which are temporally related. Current
state-of-the-art fails to exploit this temporal relation between the images, which has the
potential to improve face detection and recognition.


2.6 Critical Thinking and Analysis: Summary


From this literature review, it is clear that the current state of the art does not completely
solve the problems which we address in our research. The current state-of-the-art fails to
maintain recognition performance as image quality degrades. This is due to the
fact that noise and artifacts destroy essential information in the images that contributes
towards high recognition accuracy. Although significant research has been carried out
in the field of face superresolution, improving facial recognition for degraded images
remains relatively unexplored. In recent years, GANs have made a significant develop-
ment, especially in the field of face generation. The state-of-the-art GANs can generate
human faces with such detail that it is almost indistinguishable from real faces. Cur-
rent application of GAN is limited to generating high-quality human faces, which has
limited or no practical use. GANs have immense potential in improving face recogni-
tion which has not been extensively explored yet. Similarly, GAN based noise removal
and artifact removal for colour images also remain relatively unexplored. Therefore, we
propose to fill the gap in this area with an expectation to improve the state-of-the-art.
The adversarial learning process in GANs helps the generative network to create bet-
ter images with high perceptual quality, thus it remains a considerable area of interest
for us. To generate images with high perceptual quality, the loss function plays an im-
portant role. Although pixel-wise mean squared error loss helps generate images with
higher PSNR, the images are not typically sharp, thus making the perceptual quality of
the images quite low. This led to the discovery of novel loss functions which include
perceptual loss. Deep metric learning also plays an important role in the field of clas-
sification. For example, we have seen in the previous sections that metric learning can
boost the face recognition performance due to better separation. However, the usage of
deep metric learning in GANs and its effect on the generated image quality has not been
extensively explored yet. Finally, face recognition with occlusion is one of the most
sought after areas in this field. Our background study reveals that the direction of re-
search in this area fails to explore the temporal relation of images in surveillance videos.
This results in an inefficient use of the available information about the individual whose
face is occluded. This is a considerable gap in this field and we propose to combine the
information obtained from multiple images and help boost the recognition performance.
Having explored the research motivation of this thesis in the context of the current


state-of-the-art, we will have a detailed discussion in the following chapters about the
research performed towards achieving our goal and addressing the gaps present in the
current development.

Chapter 3

Multi-purpose Perceptual Quality


Image Enhancement Using Generative
Adversarial Network

3.1 Introduction
Photo-Realistic image enhancement is challenging but highly demanded in real-world
applications. Image enhancement can be broadly classified into two domains: super-
resolution (SR) and artifact removal (AR). The task of estimating a high-resolution im-
age from its low-resolution (LR) counterpart is the SR and estimating an artifact-free
sharp image from its corrupted counterpart is the AR. The AR problem is particularly
prominent for highly compressed images and videos, for which texture detail in the re-
constructed images is typically absent. The same problem persists for SR as well. One
major problem with the current state-of-the-art is that there does not exist any end-to-
end network which can solve the problem of AR and SR simultaneously, thus requiring
two different algorithms to be applied on the image if both AR and SR are desirable.
This is an important problem which needs to be addressed. In today's data-driven
world, images play an important role. A Forbes report from mid-2018 mentions
that more than 300 million photos get uploaded per day, of which most are uploaded
from mobile phones. The number of videos uploaded every day is also comparable.
Current mobile phones are capable of capturing images which are more than 15


megapixel on average, and can also produce HD to 4k video. Thus, saving all images
and videos in full resolution can be costly. Therefore, most images and videos are down-
scaled and compressed by several factors before saving them, thus losing a lot of detail.
This also introduces compression artifacts like JPEG and MPEG artifacts. This is the
most common problem in images on the Internet (for instance, in Twitter, Instagram,
etc.). This problem also persists for object recognition and classification in surveillance
videos where some people/objects are typically far away from the camera and appear
small in the image. Simultaneous superresolution and artifact removal is highly useful
in these scenarios, and we have explored this possibility in this chapter. To cope with the
problem of generating high perceptual quality images, different approaches have been
proposed [42–44]. These approaches deal either with SR or with AR, but not both.
Supervised image enhancement algorithms [4, 7, 41] generally try to minimise the
mean squared error (MSE) between the estimated high-resolution (HR) image and the
ground truth, thus maximising the PSNR. However, the ability of MSE to capture
perceptually relevant differences, such as high texture detail, is insufficient, as it is
defined based on pixel-wise image differences. This leads to an image having an inferior
perceptual quality. Recently, deep learning has shown impressive results. In particular,
the Super Resolution Convolutional Neural Network (SRCNN) proposed by Dong et
al. [4] shows the potential of an end-to-end deep convolutional network in SR. Ledig et
al. [3] presented a framework called SRGAN which is capable of generating photoreal-
istic images for 4× up-scaling factors, but there are several problems with this framework
when it is used for SR in conjunction with an AR framework. Dong et al. [7] discovered that
SRCNN directly applied for compression artifact reduction leads to undesirable noisy
patterns, thus proposing a new improved model called Artifact Reduction Convolutional
Neural Networks (ARCNN), which showed better performance. Svoboda et al. [41] pro-
posed the L4 and L8 architecture which has better results compared to ARCNN but still
failed to completely remove all artifacts for highly compressed JPEG images. A major
drawback of all successful methods to date is that they work on the luma channel
(channel Y in the YCbCr colour space, which is monochrome), but none of them reports
performance on colour images, although AR in colour images is more relevant. To the
best of our knowledge, a versatile and robust algorithm which solves all kinds of image
enhancement problems is yet to be proposed. Thus, we see
that there is a significant gap in the state of the art for general image enhancement prob-


lems, particularly, an end-to-end solution for the superresolution and artifact removal
problems.
In this chapter, we propose a novel Image Enhancing Generative Adversarial Net-
work (IEGAN) using U-net like generator with skip connections and an autoencoder-
like discriminator. This is a multi-purpose image enhancement network which is ca-
pable of removing artifacts and superresolving with high sharpness and detail in an
end-to-end manner, simultaneously, within a single network. Our main contributions
are summarised as follows:

• We propose the first end-to-end network called Image Enhancement Generative


Adversarial Network (IEGAN) which can solve the problem of SR and AR simul-
taneously. Our proposed network is able to generate photorealistic images from
low-resolution images corrupted with artifacts, i.e., it acts as a unified framework
which simultaneously super-resolves the image and recovers it from the compres-
sion artifacts.

• We propose a new and improved perceptual loss function which is the sum of the
reconstruction loss of the discriminator, the feature loss from the VGG network
[9] and the edge loss from the edge detector. This novel loss function preserves
the sharpness of the enhanced image, which is often lost during enhancement.

• We also create a benchmark dataset named World100 for testing the performance
of our algorithm on high-resolution images.

3.2 Related Work

3.2.1 Image Artifact Removal


AR of compressed images has been extensively dealt with in the past. In the spatial
domain, different kinds of filters [36–38] have been proposed to adaptively deal with
blocking artifacts in specific regions. In the frequency domain, wavelet transform has
been utilised to derive thresholds at different wavelet scales for deblocking and de-
noising [39, 40]. However, the problems with these methods are that they could not
reproduce sharp edges and tend to have overly smooth texture regions. In the recent


past, JPEG compression AR algorithm involving deep learning has been proposed. De-
signing a deep model for AR requires a deep understanding of the different artifacts.
Dong et al. [7] showed that directly applying the SRCNN architecture for JPEG AR re-
sulted in undesired noisy patterns in the reconstructed image, and thus proposed a new
improved model. Svoboda et al. [41] proposed a novel method of image restoration us-
ing convolutional networks that had a significant quality advancement compared to the
state-of-the-art methods. They trained a network with eight layers in a single step and in
a relatively short time by combining residual learning, skip architecture, and symmetric
weight initialisation.

3.2.2 Image Super-resolution


Filtering approaches were commonly used for superresolution in the past. Although
they were fast, they had overly smooth textures. Some methods [54, 55] also focused
on edge preservation, however, the results were far from being photorealistic. How-
ever, recently, deep learning based superresolution algorithms have shown excellent
performance. SRCNN, proposed by Dong et al. [4], used bicubic interpolation to up-
scale an input image and achieved state-of-the-art SR performance. Dong et al. [57]
and Shi et al. [58] have shown that upscaling filters can be learnt for SR for increased
performance. The studies of Johnson et al. [44] and Bruna et al. [42] relied on loss
functions which focus on perceptual similarity to recover HR images which are more
photorealistic. Recently, Ledig et al. [3] presented a framework capable of generating
photorealistic images for 4× upscaling factors. It had shown excellent results and set a
new state-of-the-art for super-resolution.

3.2.3 Loss Functions


Pixel-wise loss functions like MSE or L1 loss are unable to recover the lost high-
frequency details in an image. These loss functions encourage finding pixel-wise av-
erages of possible solutions, which are generally smooth but have poor perceptual qual-
ity [3, 42–44]. Ledig et al. [3] illustrated the problem of minimizing MSE where mul-
tiple plausible solutions with high texture details are averaged creating a smooth recon-
struction. Johnson et al. [44] and Bruna et al. [42] proposed extracting the features from


a pretrained VGG network instead of using pixel-wise error. They proposed a percep-
tual loss function based on the Euclidean distance between feature maps extracted from
the VGG19 [9] network. Ledig et al. [3] proposed a GAN-based network optimized for
perceptual losses which are more invariant to changes in pixel space, obtaining better
visual results.

3.2.4 Perceptual Image Quality Evaluation


Evaluating the perceptual quality of an image is tricky because most of the statistical
measures do not well reflect the human perception. Ledig et al. [3] has shown in their
work that images with high PSNR do not necessarily mean a perceptually better image.
The same applies to Structural Similarity (SSIM) as well. Xue et al. [61] presented an
effective and efficient image quality assessment model called Gradient Magnitude Sim-
ilarity Deviation (GMSD) which they claimed to have favorable performance in terms
of both perceptual quality and efficiency. A statistical analysis on image quality mea-
sures conducted by Kundu et al. reported that GMSD [62] showed a high correlation
with the human visual system. A very recent work by Reisenhofer et al. presents a
similarity measure for images called Haar wavelet-based Perceptual Similarity Index
(HaarPSI) [11] that aims to correctly assess the perceptual similarity between two im-
ages with respect to a human viewer. It achieves higher correlations with human opinion
scores on large benchmark databases in almost every case and is probably the best per-
ceptual similarity metric available in the literature.
Taking these into account, the similarity metrics we have selected for evaluating the
performance are GMSD [61] and HaarPSI [11]. We have also calculated the PSNR and
SSIM [86] for a fair comparison with other algorithms.

3.3 Our Approach


In this chapter, we aim to estimate a sharp and artifact-free image I HR from an image
I LR which is either low-resolution or corrupted with artifacts or both. Here I HR is the
enhanced version of its degraded counterpart I^LR. For an image with C channels, we
describe I^LR by a real-valued tensor of size W × H × C, and I^HR and I^GT by
ρW × ρH × C respectively, where I^GT is the ground truth image and ρ = 2^p with p ∈ {0, 1, 2, ...}.


To estimate the enhanced image for a given low-quality image, we train the generator
network as a feed-forward CNN G_θG parameterised by θ_G. Here θ_G = {W_1:L; b_1:L}
denotes the weights and biases of an L-layer deep network and is obtained by optimising
a loss function F_loss. The training is done using two sets of n images {I_i^GT : i = 1, 2, ..., n}
and {I_j^LR : j = 1, 2, ..., n} such that I_i^GT = G_θG(I_j^LR) (where I_i^GT and I_j^LR are
corresponding pairs) and by solving

\[
\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{\substack{i,j=1 \\ i=j}}^{n} F_{loss}\big(I_i^{GT}, G_{\theta_G}(I_j^{LR})\big) \tag{3.1}
\]

Following the work of Goodfellow et al. [23] and Isola et al. [87], we also add a
discriminator network DθD to assess the quality of images generated by the generator
network GθG .
The generative network is trained to generate the target images such that the differ-
ence between the generated images and the ground truth is minimised. While training
the generator, the discriminator is trained in an alternating manner such that the proba-
bility of error of the discriminator (between the ground truth and the generated images)
is minimised. With this adversarial min-max game, the generator can learn to create so-
lutions that are highly similar to real images. This also encourages perceptually superior
solutions residing in the manifold of natural images.

3.3.1 Network Architecture


We follow the architectural guidelines of GAN proposed by Radford et al. [33]. For the
generator we use a convolutional layer with small 3 × 3 kernels and stride=1 followed
by a batch normalisation layer [18] and Leaky ReLU [88] as the activation function.
The number of filters per convolution layer is indicated in Figure 3.1.
For image enhancement problems, even though the input and output differ in appear-
ance, both are actually renderings of the same underlying structure. Therefore, the input
is more or less aligned with the output. We design the generator architecture keeping
these in mind. For many image translation problems, there is a lot of low-level informa-
tion shared between the input and output, and it will be helpful to pass this information
directly across the network. Ledig et al. had used residual blocks and a skip connection
in their SRGAN [3] framework to help the generator carry this information. However,


Figure 3.1: The overall architecture of our proposed network. The convolution layers of
the generator have a kernel size 3 × 3 and stride is 1. Number of filters for each layer is
indicated in the illustration, e.g., n32 refers to 32 filters. For the discriminator network,
the stride is 1 except for the layers which indicates that the stride is 2, e.g., n64s2 refers
to 64 filters and stride=2

we found that it is more useful to add skip connections following the general shape of a
U-Net [89]. As the name implies, in deep architectures, skip connections skip some lay-
ers in the neural network and feed the result of one layer as the input to the next level.
Thus, we give an alternative path for the gradient by using a skip connection during
backpropagation. This additional approach is generally useful for model convergence,
which has been proven experimentally. Specifically, we add skip connections between
each layer n and layer L − n, where L is the total number of layers. Each skip connec-
tion simply concatenates all channels at layer n with those at layer L − n. This type of
skip connection is motivated by the fact that it has an uninterrupted gradient flow from
the first to the last layer, thereby addressing the vanishing gradient problem. Concate-
native skip connections provide an alternative method for ensuring feature reusability
of the same dimension from previous layers. The proposed deep generator network GθG
is illustrated in Figure 3.1. The generator has an additional block containing two sub-
pixel convolution layers immediately before the last layer for cases where p > 0, i.e.,
where the size of the output is greater than the input. These layers are called pixel-
shuffle layers, as proposed by Shi et al. [58]. Pixel shuffle transformation reorganises


the low-resolution image channels to produce a larger image with fewer channels. We
can reorganise the set of pixels into a single larger image by introducing a pixel shuffle
layer after a series of small processing steps. The main advantage of using this trans-
formation is that it increases the computational efficiency of the neural network model.
Each pixel shuffle layer in our network increases the resolution of the image by 2×. In
Figure 3.1, we show two such layers, which super-resolve the image by 4×. If p = 0,
we do not need any such block since the size of the output image is equal to the input
image.
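To make the two generator ingredients just described concrete, the sketch below builds a toy network with one concatenative (U-Net-style) skip connection and two pixel-shuffle upscaling steps; the filter counts are illustrative and do not reproduce the exact IEGAN configuration of Figure 3.1.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # 3x3 convolution, stride 1, batch normalisation, Leaky ReLU.
    x = layers.Conv2D(filters, 3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.LeakyReLU()(x)

inp = layers.Input(shape=(None, None, 3))
e1 = conv_block(inp, 32)                      # encoder feature at layer n
e2 = conv_block(e1, 64)
d1 = conv_block(e2, 64)
d1 = layers.Concatenate()([d1, e1])           # skip: concat layer n with L - n
d1 = conv_block(d1, 32)

# Pixel-shuffle upscaling: each step doubles the spatial resolution.
x = layers.Conv2D(3 * 4, 3, padding="same")(d1)              # 4 = 2 * 2 factor
x = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)   # 2x
x = layers.Conv2D(3 * 4, 3, padding="same")(x)
x = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)   # 4x overall
out = layers.Conv2D(3, 3, padding="same", activation="tanh")(x)

toy_generator = tf.keras.Model(inp, out)
```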
The discriminator in our framework is very crucial for the performance and consists
of an encoder and a decoder, which is a typical autoencoder architecture. The output
of the autoencoder is the reconstructed image of its input which is either the ground
truth or the generator output. The discriminator output can then be used to differentiate
between the loss distribution of the reconstructed real image and the reconstructed gen-
erator image. This helps the discriminator to pass back a lot of semantic information to
the generator regarding the quality of the generated images, which is not possible with
a binary discriminator. Detailed analysis of the advantage of this type of discriminator
has been shown in Section 3.3.2. Our proposed discriminator contains eighteen convo-
lutional layers with an increasing number of 3 × 3 filter kernels. The specific number of
filters are indicated in Figure 3.1. Strided convolutions with stride=2 are used to reduce
the feature map size, and pixel-shuffle layers [58] are used to increase them. The overall
architecture of the proposed framework is shown in Figure 3.1 in detail.

3.3.2 Loss Function


The performance of our network highly varies with different loss functions. Thus, a
proper loss function is critical for the performance of our generator network. We im-
prove on Johnson et al. [44], Bruna et al. [42] and Ledig et al. [3] by adding an edge
loss counterpart and the discriminator reconstruction loss to design a loss function that
can assess an image with respect to perceptual features instead of minimising pixel-
wise difference. The absence of the edge loss and the reconstruction loss counterpart
in SRGAN is an important reason why it fails to produce sharp images during AR+SR.
Adding these helps to produce sharp output images even after the removal of artifacts
and 4× up-scaling.


Feature Loss

We choose the feature loss based on the ReLU activation layers of the pretrained 19
layer VGG network described in Simonyan and Zisserman [9]. This loss is described as
VGG loss by Ledig et al. [3] and is mathematically expressed as
\[
C_{loss}^{VGG_{i,j}}\big(I^{GT}, G_{\theta_G}(I^{LR})\big) = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \Big(\phi_{i,j}(I^{GT})_{x,y} - \phi_{i,j}\big(G_{\theta_G}(I^{LR})\big)_{x,y}\Big)^2 \tag{3.2}
\]

where φ_{i,j} is the feature map obtained by the j-th convolution (after activation) before
the i-th max-pooling layer within the pretrained VGG19 network, and W_{i,j} and H_{i,j} represent
the width and height of the respective feature map.
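A sketch of how such a feature loss can be computed with a pretrained VGG19 is shown below; the choice of "block5_conv4" as φ_{i,j} and the averaging over all feature-map elements are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import tensorflow as tf

# Feature (VGG) loss sketch: Euclidean distance between VGG19 feature maps.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
feat = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)
feat.trainable = False

def feature_loss(gt, gen):
    # Inputs in [0, 1]; VGG19 expects its own preprocessing on a [0, 255] range.
    pre = lambda x: tf.keras.applications.vgg19.preprocess_input(x * 255.0)
    f_gt, f_gen = feat(pre(gt)), feat(pre(gen))
    return tf.reduce_mean(tf.square(f_gt - f_gen))
```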

Edge Loss

Preservation of edge information is essential for the generation of sharp and clear im-
ages. Thus, we add an edge loss to the feature loss counterpart. There are several edge
detectors available in the literature, and we have chosen to design our edge loss func-
tion using the state of the art edge detection algorithm proposed by Xie and Tu called
Holistically-nested Edge Detection (HED) [90] and the classical Canny edge detection
algorithm [91] due to its effectiveness and simplicity. Experimental results prove that
the Canny algorithm provides similar results for the preservation of sharpness but with
greater speed and fewer resource requirements compared to HED. The detailed com-
parison results have been further discussed in Section 3.4.3. For the Canny algorithm,
a Gaussian filter of size 3 × 3 with σ = 0.3 was chosen as the kernel. This loss is
mathematically expressed as

\[
E_{loss}^{edge}\big(I^{GT}, G_{\theta_G}(I^{LR})\big) = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \Big|\, \Theta(I^{GT})_{x,y} - \Theta\big(G_{\theta_G}(I^{LR})\big)_{x,y} \Big| \tag{3.3}
\]

where Θ is the edge detection function.
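The quantity in Equation 3.3 can be sketched with OpenCV as below. This is an evaluation-only illustration of the edge loss (Canny itself is not differentiable, so the training implementation handles gradients differently); the Canny thresholds are assumed values, while the 3 × 3 Gaussian kernel with σ = 0.3 follows the choice stated above.

```python
import cv2
import numpy as np

def edge_loss(gt, gen, low=100, high=200):
    """Edge loss sketch: mean absolute difference between Canny edge maps.

    gt, gen: uint8 RGB images of equal size.
    """
    def edges(img):
        grey = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
        grey = cv2.GaussianBlur(grey, (3, 3), 0.3)        # 3x3, sigma = 0.3
        return cv2.Canny(grey, low, high).astype(np.float32) / 255.0
    return np.mean(np.abs(edges(gt) - edges(gen)))
```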

Reconstruction Loss

Unlike most other algorithms, our discriminator provides a reconstructed image of the
discriminator input. Modifying the idea of Berthelot et al. [92], we design the discrim-
inator to differentiate between the loss distribution of the reconstructed real image and


the reconstructed fake image. Thus, we have the reconstruction loss function as
\[
L_D = \big|\, L_D^{real} - k_t \times L_D^{fake} \big| \tag{3.4}
\]

where L_D^real is the loss distribution between the input ground truth image and the
reconstructed output of the ground truth image, mathematically expanded as

\[
L_D^{real} = r \times E_{loss}^{edge}\big(I^{GT}, D_{\theta_D}(I^{GT})\big) + (1 - r) \times C_{loss}^{VGG_{i,j}}\big(I^{GT}, D_{\theta_D}(I^{GT})\big) \tag{3.5}
\]

L_D^fake is the loss distribution between the generator output image and the reconstructed
output of the same, expanded as

\[
L_D^{fake} = r \times E_{loss}^{edge}\big(G_{\theta_G}(I^{LR}), D_{\theta_D}(G_{\theta_G}(I^{LR}))\big) + (1 - r) \times C_{loss}^{VGG_{i,j}}\big(G_{\theta_G}(I^{LR}), D_{\theta_D}(G_{\theta_G}(I^{LR}))\big) \tag{3.6}
\]

and k_t is a balancing parameter at the t-th iteration which controls the amount of emphasis
put on L_D^fake:

\[
k_{t+1} = k_t + \lambda\,\big(\gamma L_D^{real} - L_D^{fake}\big) \quad \forall\ \text{step } t \tag{3.7}
\]

λ is the learning rate of k, which is set to 10^-3 in our experiments. Details about this
can be found in [92].
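The interaction between Equations 3.4 and 3.7 can be summarised in a few lines; the clamping of k to [0, 1] follows common BEGAN practice [92] and is an assumption here, while γ = 0.7 and λ = 10^-3 are the values stated in this section.

```python
def reconstruction_loss_and_k(k, loss_real, loss_fake, gamma=0.7, lam=1e-3):
    """Reconstruction loss L_D (Eq. 3.4) and balance update (Eq. 3.7)."""
    recon_loss = abs(loss_real - k * loss_fake)          # Equation 3.4
    k_next = k + lam * (gamma * loss_real - loss_fake)   # Equation 3.7
    k_next = min(max(k_next, 0.0), 1.0)                  # clamp, as in BEGAN
    return recon_loss, k_next

k = 0.0                                                  # initial value of k
l_d, k = reconstruction_loss_and_k(k, loss_real=0.8, loss_fake=0.5)
```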

Learning Parameters

The reconstruction loss involves several learning parameters. In this section, we explain
in more detail the function of each of these parameters and how they affect the learning
process. In the reconstruction loss, we encounter the parameters k, λ and γ. It has been
mentioned earlier that k is a balancing parameter which controls the amount of emphasis
put on L_D^fake; its value is not fixed, as at every step it needs to balance the real and
the fake reconstruction losses. γ maintains the equilibrium between the real image
reconstruction loss and the fake image reconstruction loss, i.e., γ L_D^real = L_D^fake.
A high value of γ puts more emphasis on learning to reconstruct the real images better,
because it pushes L_D^real towards lower values. Similarly, a low value of γ helps
learning to reconstruct the fake images. In our experiments, we have used the value
γ = 0.7. Ideally, for a trained discriminator, γ L_D^real − L_D^fake = 0. While the


network is being trained, this is not the case, and there exists a significant difference.
With learning, this difference becomes smaller, and it decides the value of k. We start
with k = 0, and then at every step this difference is added to the previous value of k.
To control the learning of k, the difference is multiplied by λ, which is set to 10^-3.

Final Loss Function

We formulate the final perceptual loss Floss as the weighted sum of the feature loss Closs
and the edge loss Eloss component added to the reconstruction loss such that

\[
F_{loss} = r \times E_{loss} + (1 - r) \times C_{loss} + L_D \tag{3.8}
\]

Substituting the values from Equations 3.2, 3.3 and 3.4, we have

\[
F_{loss} = r \times \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \Big|\, \Theta(I^{GT})_{x,y} - \Theta\big(G_{\theta_G}(I^{LR})\big)_{x,y} \Big| + (1 - r) \times \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \Big(\phi_{i,j}(I^{GT})_{x,y} - \phi_{i,j}\big(G_{\theta_G}(I^{LR})\big)_{x,y}\Big)^2 + \big|\, L_D^{real} - k_t \times L_D^{fake} \big| \tag{3.9}
\]

The value of r has been decided experimentally.
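For clarity, Equation 3.8 amounts to a single weighted sum; the one-line sketch below uses r = 0.4, the value reported in the training details of Section 3.4.

```python
def final_loss(edge_loss, feature_loss, recon_loss, r=0.4):
    # Equation 3.8: edge + feature terms weighted by r, plus L_D (Eq. 3.4).
    return r * edge_loss + (1.0 - r) * feature_loss + recon_loss
```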

3.4 Experiments

3.4.1 Data, Evaluation Metrics and Implementation Details


To validate the performance of AR, we test our framework on the LIVE1 [10] dataset
(29 images), which is the most popular benchmark for AR. For SR, we evaluate the
performance using the benchmark datasets Set14 [93] (14 images) and BSD100 (100
images), which is a testing set of BSD300 [94]. For simultaneous AR+SR, we conduct
the evaluation using the LIVE1 [10] and the World100 dataset. Our proposed World100
dataset contains 100 high-resolution photos representative of commonly encountered
photographs. The photographs have all the characteristics, e.g., texture, colour, gradient, sharpness, etc.,
which are required to test any image enhancement algorithm. All results reported for


AR experiments for all the datasets are performed by degrading JPEG images to a qual-
ity factor of 10% (i.e., 90% degradation), the SR experiments are performed with an
upscaling factor of 4, and for AR+SR, the dataset is degraded to a quality factor of 10%
and the resolution is reduced by a factor of 4, which corresponds to a 16 times reduction
in image pixels.
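The degraded AR+SR inputs described above can be generated, for instance, as follows; bicubic resampling is an assumption about the downscaling kernel, while the JPEG quality factor of 10 and the 4× reduction follow the protocol stated above.

```python
from PIL import Image

def degrade(path, out_path, scale=4, quality=10):
    """Create a degraded AR+SR input: 4x downscale plus JPEG quality factor 10."""
    img = Image.open(path).convert("RGB")
    small = img.resize((img.width // scale, img.height // scale),
                       Image.BICUBIC)
    small.save(out_path, "JPEG", quality=quality)
```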

3.4.2 Training Details


We trained all networks on an NVIDIA DGX-1 using a random sample of 60,000 images
from the ImageNet dataset [95]. For each minibatch, we cropped random 96 × 96 HR
sub-images of distinct training images for SR, 256 × 256 for AR, and 128 × 128 for
AR+SR. Our generator model can be applied to images of arbitrary size as it is a fully
convolutional network. We scaled the range of the image pixel values to [-1, 1]. When
feeding the outputs to the VGG network for calculating the loss function, we scale them
back to [0, 1], since the VGG network inherently handles image pixels in the range [0, 1].
For optimisation, we use Adam [96] with β1 = 0.9. The value of r in Equation 3.8 is
selected as 0.4. The network was trained with a learning rate of 10−4 and with 5 × 104
update iterations. Our implementation is based on TensorFlow.
We will describe the training methodology in detail for the simultaneous AR+SR as
that is the main point of interest of this research. We train the model in an end-to-end
manner. For this, we cut out random 128 × 128 patches from the training images and
the ground truth. The cropping coordinates of the input image and the corresponding
ground truth are the same so that the training pairs match with each other. The network
is initialised using Xavier initialisation. Every iteration consists of two phases, the
generator training and then the discriminator training. While training the generator, the
discriminator parameters are fixed. The input image is then fed to the generator, and
after the forward propagation, the output image is obtained. The loss is then calculated
between the output image and the ground truth, and this loss is backpropagated to update
the generator parameters. While training the discriminator, the generator parameters are
fixed. The generator output and the ground truth are then fed to the discriminator. The
loss is calculated at the output of the discriminator and is backpropagated to update the
discriminator weights. In this way, an alternating generator and discriminator training
is carried out. Once the network has been trained, we do not need the discriminator


Figure 3.2: Comparison of edge detection for Canny and HED. Left to Right - Image,
edge output of Canny, edge output of HED. Best viewed in pdf.

AR
Loss PSNR SSIM GMSD ↓ HaarPSI
VGG+Canny 27.31 0.8124 0.0685 0.7533
VGG+HED 27.27 0.8180 0.0705 0.7481
SR-4x
Loss PSNR SSIM GMSD ↓ HaarPSI
VGG+Canny 25.03 0.7346 0.0850 0.7297
VGG+HED 25.03 0.7457 0.0861 0.7279

Table 3.1: Performance of SR and AR for different edge detectors. AR is evaluated on LIVE1 and SR on Set14. For GMSD, a lower value is better.

anymore and it can be discarded.
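The alternating procedure described above can be summarised by the sketch below; generator, discriminator, total_loss and recon_loss stand for the components defined in Section 3.3 and are placeholders here rather than the exact IEGAN implementation.

```python
import tensorflow as tf

g_opt = tf.keras.optimizers.Adam(1e-4, beta_1=0.9)
d_opt = tf.keras.optimizers.Adam(1e-4, beta_1=0.9)

def train_iteration(generator, discriminator, total_loss, recon_loss,
                    lr_patch, gt_patch):
    # Phase 1: update the generator while the discriminator is kept fixed.
    with tf.GradientTape() as tape:
        fake = generator(lr_patch, training=True)
        g_loss = total_loss(gt_patch, fake, discriminator)
    g_opt.apply_gradients(zip(tape.gradient(g_loss,
                                            generator.trainable_variables),
                              generator.trainable_variables))
    # Phase 2: update the discriminator while the generator is kept fixed.
    with tf.GradientTape() as tape:
        fake = generator(lr_patch, training=False)
        d_loss = recon_loss(gt_patch, fake, discriminator)
    d_opt.apply_gradients(zip(tape.gradient(d_loss,
                                            discriminator.trainable_variables),
                              discriminator.trainable_variables))
    return g_loss, d_loss
```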

3.4.3 Ablation Study


We investigate the effect of different discriminator architectures and loss functions on
the performance of our network.
Discriminator: We use two different discriminators in our experiments. The first
discriminator (Dv1) evaluates the image in the pixel space as described in Section 3.3
and the other one (Dv2) in the feature space, which gives a binary output of 0 or 1
for fake and real images respectively. The architecture of Dv1 is shown in Figure 3.1.
For Dv2, we use a similar discriminator introduced by Ledig et al. [3] in SRGAN. We
train all the models by converting all images to the YCbCr colour space. Since the human visual system has a poorer frequency response to the colour components (Cb, Cr) than to luminance (Y), we try to minimise most of the artifacts in the Y channel for the best perceptual quality.
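As an aside, the luminance-only evaluation reported in Table 3.2 can be sketched as follows. The rgb_to_yuv call is a stand-in for the YCbCr conversion (the luma plane is computed with the same coefficients), and the function names are ours.

```python
import tensorflow as tf

def y_channel(rgb_01):
    # Keep only the luma plane of the YUV/YCbCr representation.
    return tf.image.rgb_to_yuv(rgb_01)[..., :1]

def luma_psnr_ssim(reference, restored):
    # PSNR and SSIM computed on the Y channel only, for images scaled to [0, 1].
    y_ref, y_out = y_channel(reference), y_channel(restored)
    return tf.image.psnr(y_ref, y_out, max_val=1.0), tf.image.ssim(y_ref, y_out, max_val=1.0)
```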


G + Dv1
Loss PSNR SSIM GMSD ↓ HaarPSI
VGG 27.12 0.801 0.074 0.737
L1 27.45 0.803 0.079 0.725
Canny+VGG 27.31 0.803 0.073 0.739
Canny+L1 27.74 0.811 0.075 0.738
G + Dv2
Loss PSNR SSIM GMSD ↓ HaarPSI
VGG 27.26 0.806 0.0740 0.7358
L1 27.68 0.810 0.0753 0.7374
Canny+VGG 27.41 0.810 0.0739 0.7384
Canny+L1 27.73 0.810 0.0750 0.7380

Table 3.2: Performance of AR with different discriminators and loss functions, evaluated on the Y channel (luminance) of the LIVE1 dataset. The numbers in bold signify the best performance. For GMSD, a lower value is better.

Figure 3.3: Results of JPEG AR for different algorithms (left to right: ARCNN, L4, IEGAN (BW), IEGAN (RGB), Ground Truth). The Ground Truth was degraded to 10% of its original quality. Note that for IEGAN, the image is sharper. The IEGAN (BW) (black and white) image is provided for a fair comparison with the rest of the images. Best viewed in pdf.

Loss Function: We use a weighted combination of the VGG feature maps with
the Canny edge detector as a loss function. We also study the performance using VGG
and L1 separately combined with Canny to validate our claim that combining Canny


Algorithm PSNR SSIM GMSD ↓ HaarPSI


ARCNN [7] 29.13 0.8232 0.0721 0.7363
L4 [41] 29.08 0.8240 0.0711 0.7358
IEGAN (Ours) 27.31 0.8124 0.0685 0.7533

Table 3.3: Performance of IEGAN for JPEG AR compared to other state-of-the-art algorithms for the LIVE1 dataset. For GMSD, a lower value is better.

Algorithm PSNR SSIM GMSD ↓ HaarPSI


SRCNN [4] 27.04 0.784 0.088 0.713
SRGAN [3] 26.02 0.740 0.086 0.728
ENet-PAT [97] 25.77 0.718 0.088 0.719
IEGAN (Ours) 25.03 0.735 0.085 0.730

Table 3.4: Performance of state-of-the-art algorithms for SR on the Set14 dataset for RGB images. For GMSD, a lower value is better.

LIVE1
Algorithm PSNR SSIM GMSD ↓ HaarPSI
ARCNN+SRGAN 21.61 0.5284 0.1980 0.4112
SRGAN+ARCNN 22.70 0.6417 0.1457 0.5302
IEGAN 22.57 0.6319 0.1404 0.5504
World 100
Algorithm PSNR SSIM GMSD ↓ HaarPSI
ARCNN+SRGAN 25.51 0.6809 0.1668 0.4792
SRGAN+ARCNN 27.16 0.7861 0.1059 0.6320
IEGAN 25.62 0.7651 0.1009 0.6429

Table 3.5: Performance of IEGAN for simultaneous AR+SR compared to other state-of-the-art algorithms on the benchmark LIVE1 and World100 datasets. For GMSD, a lower value is better.


Figure 3.4: Results of SR for different algorithms. The perceptual quality of SRGAN
and IEGAN outputs are visually comparable. Left to Right - SRCNN, SRGAN, IEGAN,
Ground Truth. Best viewed in pdf.

enhances the perceptual quality of the images. Table 3.2 shows the quantitative performance of the algorithm with various discriminators and loss functions. We also experiment with Holistically-Nested Edge Detection (HED) [90] by replacing the Canny counterpart. HED is a state-of-the-art edge detection algorithm with better edge detection capability than Canny. However, from Figure 3.2, we can observe that both HED and Canny successfully produce the required edge information, which is perceptually indistinguishable to the human eye. In other words, the choice of edge method is not critical to the overall performance, which is also confirmed by Table 3.1. Thus, we choose the simple yet fast Canny method in our final framework.
The results in Table 3.1 and Table 3.2 confirm that the GAN with discriminator Dv1 using a weighted combination of the VGG and Canny loss functions gives the best GMSD and HaarPSI scores. The majority of the AR algorithms proposed to date work on black and white images; our proposed algorithm works with colour images as well. The highest PSNR and SSIM values are obtained from the framework with discriminator Dv1 and the Canny+L1 loss function. PSNR is a common measure used to evaluate AR and SR algorithms. However, its ability to capture perceptually relevant differences is very limited, as it is defined based on pixel-wise image differences [3, 98, 99]. We use Dv1 with the VGG + Canny loss for the rest of the


Figure 3.5: Results for simultaneous SR+AR of RGB images using various algorithms (left to right: ARCNN+SRGAN, SRGAN+ARCNN, IEGAN, Ground Truth). Rows 1, 3 and 4 are from World100 and row 2 from the LIVE dataset. Note that the textures and details in the output images from IEGAN are much superior to those of the other methods. Best viewed in pdf.


experiments. We name this framework IEGAN.

3.4.4 Comparison to State of the Art


From Table 3.3, we see that for JPEG artifact removal, IEGAN performs significantly better than the other algorithms. According to the results in Table 3.4, IEGAN gives the best GMSD and HaarPSI scores for Set14. For SR, IEGAN produces results comparable to SRGAN, while the perceptual quality of the images generated by SRCNN is much inferior. This is demonstrated in Figure 3.4 with two images from the BSD100 dataset. Furthermore, Ledig et al. [3] have also shown, through mean opinion score (MOS) testing, that the perceptual quality of the images generated by SRCNN is poor compared to SRGAN.
The results for end-to-end AR+SR are shown in Table 3.5, where we can see that IEGAN outperforms the other state-of-the-art pipelines. Figure 3.5 shows the visual results on parts of the World100 dataset, where the algorithms were used to perform both AR and SR on the same image simultaneously. ARCNN+SRGAN implies that ARCNN was first used to recover the image from artifacts and then SRGAN was used to superresolve it, while SRGAN+ARCNN implies that SRGAN was applied first and then ARCNN. In contrast, IEGAN provides a one-shot end-to-end solution for both AR and SR in the same network. However, all algorithms fail to produce photorealistic results for large areas with a very gentle colour gradient, e.g., clear sky, aurora, etc. The images in the World100 dataset are more than 2000 pixels on at least one side, so the area of the colour gradient becomes enormously large compared to the receptive field of the network. The network is trained with 128×128 images, and the training data hardly contains any image with such a colour gradient, resulting in a lack of training for such images. Thus, the algorithm fails to learn how to recreate a smooth colour gradient in the output images. This is true for the other algorithms as well.

3.5 Conclusion
We have described a deep generative adversarial network with skip connections that sets
a new state of the art on public benchmark datasets when evaluated with respect to per-
ceptual quality. We have introduced an edge detector function in the loss, which results in sharper outputs at the edges. This network is the first framework which successfully recovers images from artifacts and super-resolves them at the same time, thus performing two different tasks in a single-shot operation. We have also shown that the network is capable of solving the problem of artifact removal or superresolution individually, with results comparable to the state of the art, if not better. We have highlighted some limitations of the existing loss functions used for training image enhancement networks and introduced IEGAN, which augments the feature loss function with an edge loss during the training of the GAN. Our ablation study shows that a critic-based discriminator, which is typical of Wasserstein GANs, can provide better results than a binary discriminator. Using different combinations of loss functions and by using the discriminator in both the feature and the pixel space, we confirm that IEGAN reconstructions of corrupted images are superior by a considerable margin and more photorealistic than the reconstructions obtained by the current state-of-the-art methods.

Chapter 4

GAN Based Simultaneous


Super-resolution and Noise Removal of
Images Using Edge Information

4.1 Introduction
Photo-Realistic image enhancement on low-quality images is challenging but highly
demanded in real-world applications. It can be broadly classified into three domains:
(1) Super-resolution (SR) [3, 4] focuses on the task of estimating a high-resolution im-
age from its low-resolution counterpart; (2) Noise removal (NR) [5, 6] aims at cleaning
different noises present in the image, e.g., Gaussian noise, Poisson noise, etc.; (3) Ar-
tifact removal (AR) [7, 8] concentrates on estimating an artifact-free sharp image from
its corrupted counterpart, e.g., JPEG artifacts.
To cope with the problem of generating high perceptual quality images from low-
quality ones, different methods have been proposed. In the spatial domain, different
kinds of filters [36–38] have been proposed to adaptively deal with blocking artifacts in
specific regions. In the frequency domain, wavelet transform has been utilized to derive
thresholds at different wavelet scales for deblocking and denoising [39, 40]. Recently,
deep learning-based approaches [3–5, 7, 8, 42–44] have shown impressive results. For
image superresolution, the seminal work, Super Resolution Convolutional Neural Net-
work (SRCNN) [4], shows the potential of an end-to-end deep convolutional network in


Figure 4.1: Illustration of inputs (i.e., low-resolution images with noises and artifacts)
and outputs (i.e., images after super-resolving, removing different noises and artifacts
simultaneously) of our proposed framework. Images in the first row are original low-
resolution noisy images. Images in the second row are a 4x zoomed version of original
images for proper visualisation of the noise. Images in the third row are the correspond-
ing super-resolved and clean output images obtained from our proposed algorithm. Best
viewed in color.

SR. Ledig et al. [3] presented a framework called SRGAN, which is capable of gener-
ating photorealistic images for 4× up-scaling factors. In the field of noise and artifact
removal from images, several attempts have been made. Zhang et al. [8] proposed a CNN denoiser integrated with a model-based optimization method to solve different inverse problems, showing significant improvement compared to existing methods.
Dong et al. [7] discovered that SRCNN directly applied for compression artifact re-
duction leads to undesirable noisy patterns, thus proposing an improved model called
Artifact Reduction Convolutional Neural Network (ARCNN), which shows better per-
formance. Svoboda et al. [41] proposed the L4 and L8 architectures which have better
results compared to ARCNN. Deep Image Prior [5] by Ulyanov et al. shows promising
results for various image enhancement tasks.
The above-mentioned image enhancement approaches provide state-of-the-art results while only solving a single problem, e.g., either SR, NR or AR, but not all of them simultaneously. However, in real-world scenarios, it is common for low-quality images to arise from different sources, e.g., out-of-date camera sensors, poor-
lighting environments, etc., with several mixed issues. Therefore, conducting multiple
types of image enhancement to generate photorealistic images in a single framework is
highly demanded in practical applications. A naive approach to this practical problem is to build a pipeline consisting of a cascade of superresolution and image cleaning algorithms, where each component incrementally performs its own task. However, the run-time of such a pipeline would be extremely slow, since it combines several different backbones. Furthermore, some components in the pipeline can also have unexpected effects on the following ones, e.g., undesired artifacts from the superresolution module [7], which degrade the perceptual quality of the output images (see Section 4.4 for detailed illustrations).
To address these issues, in this chapter, we propose a novel and effective end-to-end
image enhancement framework using a generative adversarial network (GAN). Specif-
ically, an encoder-decoder generator model is utilized to superresolve and clean the
images from noise and artifacts simultaneously, while a discriminator is designed to
help the generator further learn to generate outputs in the natural image manifold. Fur-
thermore, an edge detection module is also introduced to keep the sharpness and texture
details in the enhanced outputs, which is often lost during the enhancement procedure
in other methods. Figure 4.1 shows examples of inputs (i.e., low-resolution images with
noises and artifacts) and outputs (i.e., images after superresolving, removing different
noises and artifacts simultaneously) of our proposed framework.
To sum up, in this chapter, our main contribution is threefold:

• To the best of our knowledge, we are the first to propose an end-to-end multipurpose image enhancement framework based on GAN, which is able to generate photorealistic images from low-resolution images corrupted with noise and artifacts.

• We introduce an edge detection counterpart in our objective function. It helps preserve the sharpness and details of the enhanced images, which plays an important role in generating photorealistic images.

• Extensive ablation studies and experiments show the effectiveness and superiority of our proposed framework on several datasets.


4.2 Related Work

4.2.1 Image Artifact Removal


AR of compressed images has been extensively dealt with in the past. In the spatial
domain, different kinds of filters [36–38] have been proposed to adaptively deal with
blocking artifacts in specific regions. In the frequency domain, wavelet transform has
been utilized to derive thresholds at different wavelet scales for deblocking and de-
noising [39, 40]. However, the problem with these methods is that they cannot reproduce sharp edges and tend to produce overly smooth texture regions. In the recent
past, JPEG compression AR algorithms involving deep learning have been proposed.
Designing a deep model for AR requires a deep understanding of the different artifacts.
Dong et al. [7] showed that directly applying the SRCNN architecture for JPEG AR re-
sulted in undesired noisy patterns in the reconstructed image, and thus proposed a new
improved model. Svoboda et al. [41] proposed a novel method of image restoration us-
ing convolutional networks that had a significant quality advancement compared to the
state-of-the-art methods. They trained a network with eight layers in a single step and in
a relatively short time by combining residual learning, skip architecture, and symmetric
weight initialization.

4.2.2 Image Noise Removal


There have been several attempts in the past to deal with image denoising. Dabov et
al. [48] proposed an image deblurring algorithm using Nash equilibrium to derive an it-
erative decoupled deblurring BM3D method. To solve different inverse problems, there
are several existing denoiser priors which were adopted in model-based optimization
methods. Chambolle presented an algorithm for minimizing the total variation of an image [49], and Zoran et al. designed a framework [50] which restores images using a patch-based prior. K-SVD [51] and non-local means [52] have also been used to solve image denoising problems. However, these algorithms have their respective drawbacks. For example, total variation is known to create watercolor-like artifacts, while non-local means and BM3D denoiser priors often over-smooth fine texture details. The problem with K-SVD is that it suffers from a high computational burden. Thus, a strong denoiser prior which can be implemented efficiently is highly desirable. Moreover, most existing methods work independently on every channel or mainly focus on modeling a grayscale image prior. Since the color channels are highly correlated, working on all channels simultaneously produces better performance than handling each color channel independently [53].

4.2.3 Image Super-resolution


Initially, filtering approaches were used for SR. They are usually very fast but produce overly smooth textures, thus losing a lot of detail. Methods focusing on edge preser-
vation [54, 55] also fail to produce photorealistic images. Recently, convolutional neu-
ral network (CNN) based SR algorithms have shown excellent performance. Wang et
al. [56] showed that the sparse coding model for SR can be represented as a neural net-
work and improved results can be achieved. Dong et al. [4] used bicubic interpolation to
upscale an input image and trained a three-layer deep fully convolutional network end-
to-end achieving state-of-the-art SR performance. Dong et al. [57] and Shi et al. [58]
demonstrated that upscaling filters can be learnt for SR for increased performance. The
studies of Johnson et al. [44] and Bruna et al. [42] relied on loss functions which fo-
cus on perceptual similarity to recover HR images which are more photorealistic. A
recent work by Ledig et al. [3] presented the first framework capable of generating pho-
torealistic images for 4× upscaling factors. Sajjadi et al. proposed a novel application
of automated texture synthesis in combination with perceptual loss, which focuses on
creating realistic textures.

4.2.4 Loss Functions


To generate high perceptual quality images, designing a proper loss function plays an
important role in our proposed algorithm. Pixel-wise loss functions like MSE or L1
loss are unable to recover the lost high-frequency details in an image. These loss func-
tions encourage finding pixel-wise averages of possible solutions, which are generally
smooth but have poor perceptual quality [3, 42–44]. Ledig et al. [3] illustrated the prob-
lem of minimizing MSE where multiple plausible solutions with high texture details are
averaged creating a smooth reconstruction. Johnson et al. [44] and Bruna et al. [42]
proposed extracting the features from a pretrained VGG network instead of using pixel-
wise error. They proposed a perceptual loss function based on the Euclidean distance between feature maps extracted from the VGG19 [9] network. Ledig et al. [3] pro-
posed a GAN-based network optimized for perceptual losses which are more invariant
to changes in pixel space, obtaining better visual results.

4.2.5 Perceptual Image Quality Evaluation


Evaluating the perceptual quality of an image is tricky because most of the statistical
measures do not well reflect the human perception. Since we are working on percep-
tual enhancement of images, a proper image perceptual quality metric is essential for a
fair evaluation of our results. Ledig et al. [3] has shown in their work that images with
high PSNR do not necessarily mean a perceptually better image. The same applies to
Structural Similarity (SSIM) as well. Xue et al. [61] presented an effective and effi-
cient image quality assessment model called Gradient Magnitude Similarity Deviation
(GMSD), which they claimed to have favorable performance in terms of both percep-
tual quality and efficiency. A statistical analysis on image quality measures conducted
by Kundu et al. reported that GMSD [62] showed a high correlation with the human
visual system. A very recent work by Reisenhofer et al. presents a similarity measure
for images called Haar wavelet-based Perceptual Similarity Index (HaarPSI) [11] that
aims to correctly assess the perceptual similarity between two images with respect to
a human viewer. It achieves higher correlations with human opinion scores on large
benchmark databases in almost every case and is probably the best perceptual similarity
metric available in the literature. Thus, we will focus our evaluation mainly on GMSD
and HaarPSI.

4.3 Our Approach


In this chapter, we aim to estimate a sharp and artifact-free image I^HR from an image I^LR which is either low-resolution, corrupted with artifacts, or both. Here I^HR is the enhanced version of its degraded counterpart I^LR. For an image with C channels, we describe I^LR by a real-valued tensor of size W × H × C, and I^HR and I^GT by ρW × ρH × ρC respectively, where I^GT is the ground truth image, ρ = 2^p, and p ∈ {0, 1, 2, ...}.
To estimate the enhanced image for a given low-quality image, we train a generator network as a feed-forward CNN G_{θ_G} parametrised by θ_G. Here, θ_G = W_{1:L}; b_{1:L}
denotes the weights and biases of an L-layer deep network and is obtained by optimising a loss function F_{loss}. The training is done using two sets of n images {I_i^{GT} : i = 1, 2, ..., n} and {I_j^{LR} : j = 1, 2, ..., n} such that I_i^{GT} = G_{\theta_G}(I_j^{LR}) (where I_i^{GT} and I_j^{LR} are corresponding pairs) and by solving

\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{\substack{i,j=1 \\ i=j}}^{n} F_{loss}\big(I_i^{GT}, G_{\theta_G}(I_j^{LR})\big) \qquad (4.1)

Following the work of Goodfellow et al. [23] and Isola et al. [87], we also add a
discriminator network DθD to assess the quality of images generated by the generator
network GθG .
The generative network is trained to generate the target images such that the differ-
ences between the generated images and the ground truth are minimised. While training
the generator, the discriminator is trained in an alternating manner such that the proba-
bility of error of the discriminator (between the ground truth and the generated images)
is minimised. With this adversarial min-max game, the generator can learn to create so-
lutions that are highly similar to real images. This also encourages perceptually superior
solutions residing in the manifold of natural images.
We follow the architectural guidelines of GAN proposed by Radford et al. [33].
For the generator we use a convolutional layer with small 3 × 3 kernels and stride=1
followed by a batch normalisation layer [18] and Leaky ReLU [88] as the activation
function. The number of filters per convolution layer is indicated in Figure 3.1.
For image enhancement problems, even though the input and output differ in appear-
ance, both are actually renderings of the same underlying structure. Therefore, the input
is more or less aligned with the output. We design the generator architecture keeping
these in mind. For many image translation problems, there is a lot of low-level informa-
tion shared between the input and output, and it will be helpful to pass this information
directly across the network. Ledig et al. had used residual blocks and a skip connection
in their SRGAN [3] framework to help the generator carry this information. However,
we found that it is more useful to add skip connections following the general shape of
a U-Net [89]. Specifically, we add skip connections between each layer n and layer
L − n, where L is the total number of layers. Each skip connection simply concatenates
all channels at layer n with those at layer L − n. The proposed deep generator network GθG is illustrated in Figure 3.1. The generator has an additional block containing two
subpixel convolution layers immediately before the last layer for cases where p > 0,
i.e., where the size of the output is greater than the input. These layers are called pixel-
shuffle layers, as proposed by Shi et al. [58]. Each pixel-shuffle layer increases the resolution of the image by 2×. In Figure 3.1, we show two such layers, which superresolve the image by 4×. If p = 0, we do not need any such block, since the size of the output image is equal to that of the input image.
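A rough sketch of this design, i.e., U-Net-style skip connections between mirrored layers followed by pixel-shuffle upsampling, is given below. The layer counts and filter numbers are illustrative only and do not reproduce the exact architecture of Figure 3.1; the input size is assumed to be divisible by four.

```python
import tensorflow as tf

def conv_block(x, filters, stride=1):
    x = tf.keras.layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.LeakyReLU(0.2)(x)

def build_generator():
    inp = tf.keras.Input(shape=(None, None, 3))
    e1 = conv_block(inp, 64)
    e2 = conv_block(e1, 128, stride=2)             # encoder: downsample
    e3 = conv_block(e2, 256, stride=2)
    d2 = tf.keras.layers.Conv2DTranspose(128, 3, strides=2, padding="same")(e3)
    d2 = tf.keras.layers.Concatenate()([d2, e2])   # skip: concatenate layer n with layer L - n
    d1 = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding="same")(d2)
    x = tf.keras.layers.Concatenate()([d1, e1])
    for _ in range(2):                             # two pixel-shuffle steps -> 4x upscaling (p = 2)
        x = tf.keras.layers.Conv2D(256, 3, padding="same")(x)
        x = tf.keras.layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)
        x = tf.keras.layers.LeakyReLU(0.2)(x)
    out = tf.keras.layers.Conv2D(3, 3, padding="same", activation="tanh")(x)
    return tf.keras.Model(inp, out)
```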
The discriminator in our framework is crucial to the performance and is designed in the form of an autoencoder. Thus, the output of the autoencoder is the recon-
structed image of its input which is the ground truth or the generator output. This helps
the discriminator to pass back a lot of semantic information to the generator regarding
the quality of the generated images, which is not possible with a binary discriminator.
Our proposed discriminator contains eighteen convolutional layers with an increasing
number of 3 × 3 filter kernels. The specific number of filters are indicated in Figure
3.1. Convolutions with stride=2 are used for feature map reduction, and pixel-shuffle
layers [58] are used to increase them. The architecture of the proposed framework is
shown in detail in Figure 3.1.

4.3.1 Loss Function


The performance of our network varies greatly with different loss functions. Thus, a proper loss function is critical to the performance of our generator network. We im-
prove on Johnson et al. [44], Bruna et al. [42] and Ledig et al. [3] by adding an edge
loss counterpart and the discriminator reconstruction loss to design a loss function that
can assess an image with respect to perceptual features instead of minimising pixel-wise
difference. The absence of the edge loss and the reconstruction loss counterpart in SR-
GAN is an important reason why it fails to produce sharp images during simultaneous
noise/artifact removal and superresolution. Adding these helps to produce sharp output
images even after the removal of artifacts and 4× up-scaling.

Feature Loss

We choose the feature loss based on the ReLU activation layers of the pretrained 19
layer VGG network described in Simonyan and Zisserman [9]. This loss is described as VGG loss by Ledig et al. [3] and is mathematically expressed as


C_{loss}^{VGG_{i,j}}(I^{GT}, G_{\theta_G}(I^{LR})) = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \big( \phi_{i,j}(I^{GT})_{x,y} - \phi_{i,j}(G_{\theta_G}(I^{LR}))_{x,y} \big)^2 \qquad (4.2)

where φ_{i,j} is the feature map obtained by the j-th convolution (after activation) before the i-th max-pooling layer within the pretrained VGG19 network, and W and H represent the width and height of the input image, respectively.
It is an alternative to pixel-wise loss, in that it attempts to be closer to perceptual similarity. This feature loss assists in determining which features are present in the ground truth image and evaluating how well the model’s predicted features match them. This enables a model trained with this loss function to generate much finer detail in the generated features and thus in the overall output.
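A hedged sketch of this loss is shown below: the mean squared difference between VGG19 feature maps of the ground truth and the generated image. The choice of the layer name "block5_conv4" for φ_{i,j}, and the use of the standard Keras preprocessing, are illustrative assumptions rather than details fixed in this work.

```python
import tensorflow as tf

# Frozen VGG19 feature extractor up to an illustrative phi_{i,j} layer.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)
feature_extractor.trainable = False

def vgg_feature_loss(ground_truth, generated):
    # Both inputs are batches of RGB images scaled to [0, 1]; they are rescaled to the
    # range expected by the pretrained Keras VGG19 weights before feature extraction.
    gt = tf.keras.applications.vgg19.preprocess_input(ground_truth * 255.0)
    gen = tf.keras.applications.vgg19.preprocess_input(generated * 255.0)
    return tf.reduce_mean(tf.square(feature_extractor(gt) - feature_extractor(gen)))
```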

Edge Loss

Preservation of edge information is essential for the generation of sharp and clear im-
ages. Thus, we add an edge loss to the feature loss counterpart. There are several edge
detectors available in the literature, and we have chosen to design our edge loss func-
tion using the state of the art edge detection algorithm proposed by Xie and Tu, called
Holistically-nested Edge Detection (HED) [90] and the classical Canny edge detection
algorithm [91] due to its effectiveness and simplicity. Experimental results prove that
the Canny algorithm provides similar results for the preservation of sharpness but with
greater speed and fewer resource requirements compared to HED. For the Canny al-
gorithm, a Gaussian filter of size 3 × 3 with σ = 0.3 was chosen as the kernel. For a
very high resolution image, a bigger kernel will produce better results. We used a lower
threshold of 0.050 and an upper threshold of 0.40. This loss is mathematically expressed
as
E_{loss}^{edge}(I^{GT}, G_{\theta_G}(I^{LR})) = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| \Theta(I^{GT})_{x,y} - \Theta(G_{\theta_G}(I^{LR}))_{x,y} \right| \qquad (4.3)

where Θ is the edge detection function.
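For illustration, the edge loss can be computed as follows, using the Gaussian kernel, σ, and thresholds quoted above (3 × 3, σ = 0.3, lower/upper thresholds of 0.05 and 0.40 on a [0, 1] scale). The OpenCV-based helpers operate on NumPy arrays and sketch the loss evaluation only, not its integration into the training graph.

```python
import cv2
import numpy as np

def canny_edge_map(img_01):
    # img_01: H x W x 3 float RGB image in [0, 1]. Returns a binary edge map in {0, 1}.
    gray = cv2.cvtColor((img_01 * 255).astype(np.uint8), cv2.COLOR_RGB2GRAY)
    blurred = cv2.GaussianBlur(gray, (3, 3), sigmaX=0.3)
    edges = cv2.Canny(blurred, threshold1=0.05 * 255, threshold2=0.40 * 255)
    return (edges > 0).astype(np.float32)

def edge_loss(ground_truth, generated):
    # Equation 4.3: mean absolute difference between the two binary edge maps.
    return float(np.mean(np.abs(canny_edge_map(ground_truth) - canny_edge_map(generated))))
```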

How Edge Loss Works

It is common for SR algorithms to produce images with soft edges, or if the pictures
contain noise, the noise is enhanced as well, i.e., any imperfection in the image is su-


Figure 4.2: Canny edges for the input (4x scaled for better visualisation), the ground truth, the output of SRGAN+IRCNN, and our proposed algorithm. We have highlighted the areas of significant difference with red boxes. Best viewed in color.

perresolved. This is quite evident in the images shown in Figure 4.4, where the edges of the images are either very soft (e.g., SRCNN + DIP, Bicubic + IRCNN, etc.) or contain unwanted distortions like random streaks (e.g., SRGAN + IRCNN, E-Net + ARCNN). This obser-
vation motivated us to use the edge loss counterpart to solve this problem. Since the
edge loss map used by us is binary (i.e., consists of 0 and 1 only), the neurons get acti-
vated when there is a difference of edges. Thus, it can minimise the unwanted artifacts
and maximise the sharpness of the output image. Figure 4.2 illustrates this. The edge loss counterpart helps to reduce the difference between the edge map of the generated output and that of the ground truth, and as a result, the edge map of our output is quite similar to that of the ground truth. Other algorithms, such as the cascade of SRGAN+IRCNN, cannot use the edge information, thus resulting in inferior output images. This is clearly visible in
the edge information, thus resulting in inferior output images. This is clearly visible in
the areas highlighted within the red boxes, where our output image does not deviate much from the ground truth. Compared to the other methods, ours offers the best results.
The Canny edge detector works best at region boundaries in the image, i.e., those
areas which correspond to Canny edges. As Canny edges are sparse, this loss counter-
part does not contribute much to the fine internal texture of the image. However, when the feature loss is combined with the edge loss, it is able to reproduce the image with
high sharpness in the region boundaries as well as the fine internal texture.

Reconstruction Loss

Unlike most other algorithms, our discriminator provides a reconstructed image of the discriminator input. This is similar to a Wasserstein GAN [100], where the discriminator acts as a critic of how well the input matches the ground truth image. Modifying the idea of Berthelot et al. [92], we design the discriminator to differentiate between the loss distributions of the reconstructed real image and the reconstructed fake image. Thus, we
have the reconstruction loss function as
L_D = \left| L_D^{real} - k_t \times L_D^{fake} \right| \qquad (4.4)

where L_D^{real} is the loss distribution between the input ground truth image and the reconstructed output of the ground truth image, mathematically expanded as

L_D^{real} = r \times E_{loss}^{edge}\big(D_{\theta_D}(I^{GT}), D_{\theta_D}(G_{\theta_G}(I^{GT}))\big) + (1 - r) \times C_{loss}^{VGG_{i,j}}\big(D_{\theta_D}(I^{GT}), D_{\theta_D}(G_{\theta_G}(I^{GT}))\big) \qquad (4.5)

L_D^{fake} is the loss distribution between the generator output image and the reconstructed output of the same, expanded as

L_D^{fake} = r \times E_{loss}^{edge}\big(D_{\theta_D}(I^{LR}), D_{\theta_D}(G_{\theta_G}(I^{LR}))\big) + (1 - r) \times C_{loss}^{VGG_{i,j}}\big(D_{\theta_D}(I^{LR}), D_{\theta_D}(G_{\theta_G}(I^{LR}))\big) \qquad (4.6)

and k_t is a balancing parameter at the t-th iteration which controls the amount of emphasis put on L_D^{fake}. The value of k is not fixed at every step, as it needs to balance the real and fake reconstruction losses. We start with k = 0, and at every step the weighted difference of the two losses is added to the previous value of k. It is updated according to the following equation:

k_{t+1} = k_t + \lambda \big( \gamma L_D^{real} - L_D^{fake} \big) \quad \forall \text{ step } t \qquad (4.7)

where γ maintains the equilibrium between the real image reconstruction loss and the fake image reconstruction loss, i.e., \gamma L_D^{real} = L_D^{fake}. A high value of γ will put more emphasis on learning to reconstruct the real images better, because it will push L_D^{real} to have lower values. Similarly, a low value of γ will help the network learn to reconstruct the fake images. In our experiments, we use γ = 0.7. λ is the learning rate of k, which is set to 10^-3 in our experiments. Further details about this formulation can be found in [92].
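A small sketch of this balancing mechanism, using the values quoted above (γ = 0.7, λ = 10^-3, k initialised to 0), is given below; the real and fake reconstruction losses are assumed to be computed elsewhere according to Equations 4.5 and 4.6.

```python
gamma, lam, k = 0.7, 1e-3, 0.0

def discriminator_objective(l_real, l_fake):
    # Equation 4.4: |L_real - k_t * L_fake| given the two reconstruction losses.
    return abs(l_real - k * l_fake)

def update_k(l_real, l_fake):
    # Equation 4.7: move k so that gamma * L_real and L_fake balance at equilibrium.
    global k
    k = k + lam * (gamma * l_real - l_fake)
    return k
```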

Final Loss Function

We formulate the final perceptual loss F_{loss} as the weighted sum of the feature loss C_{loss} and the edge loss E_{loss}, added to the reconstruction loss, such that

F_{loss} = r \times E_{loss} + (1 - r) \times C_{loss} + L_D \qquad (4.8)

Substituting the values from Equation 4.2, 4.3 and 4.4, we have

F_{loss} = r \times \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| \Theta(I^{GT})_{x,y} - \Theta(G_{\theta_G}(I^{LR}))_{x,y} \right| + (1 - r) \times \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \big( \phi_{i,j}(I^{GT})_{x,y} - \phi_{i,j}(G_{\theta_G}(I^{LR}))_{x,y} \big)^2 + \left| L_D^{real} - k_t \times L_D^{fake} \right| \qquad (4.9)

The value of r has been decided experimentally.
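Putting the components together, Equation 4.8 reduces to the following one-liner (with r = 0.4 as chosen experimentally in Section 4.4.1); the individual loss terms are assumed to be computed by the sketches given earlier in this section.

```python
def final_loss(edge_l, feature_l, recon_l, r=0.4):
    # Equation 4.8: weighted edge and feature losses plus the reconstruction loss.
    return r * edge_l + (1 - r) * feature_l + recon_l
```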

4.4 Experiments
To validate the performance of our proposed algorithm, we have performed extensive
experiments on public benchmark datasets for various combinations of degradation.
This includes noise removal with superresolution and noise with artifact removal and
superresolution simultaneously. The main tasks of this empirical study are to verify
how well our proposed networks perform for each of the noise cases combined with
superresolution. We will also verify how our method performs when images contain
three forms of degradation together, i.e, noise, compression artifacts, and downsam-
pling. Although our method can be tested on any set of images, we have selected some
particular benchmark image sets which have been used extensively in the research com-
munity to test image enhancement. We discuss further details about the datasets and the
implementation in the following section.


4.4.1 Data, Evaluation Metrics and Implementation Details


We have used the LIVE1 [10] dataset (29 images) for our experiments, which is the most popular benchmark for AR. We also evaluate the performance using the benchmark datasets Set14 [93] (14 images) and BSD100 (100 images), the testing set of BSD300 [94]. The images in these testing sets exhibit all the characteristics, e.g., texture, colour gradient, sharpness, etc., required to test any image enhancement algorithm. All datasets have been preprocessed using the following degradation factors (a small code sketch of this preprocessing follows the list):

• For JPEG degradation, images were degraded with a JPEG quality factor of 10%
(i.e., 90% degradation).

• For Gaussian noise, a Gaussian noise distribution with mean=0 and variance=0.01
was added to the image.

• For Poisson noise, a Poisson noise distribution with mean=0 and variance=0.01
was added to the image.

• For speckle noise, a uniform noise distribution with mean=0 and variance=0.01 was added to the image using out = image + n × image, where n is the uniform noise with the specified mean and variance.

• For salt and pepper noise, random pixels were selected and the value was replaced
with either 0 or 1. The proportion of image pixels replaced with noise is 0.5.
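The degradation pipeline sketched below illustrates these factors for a float RGB image in [0, 1], together with the clipping and 4× downscaling described next. The exact random number generators and the interpolation used for downscaling are our assumptions, since the text only specifies means, variances and factors; Poisson noise is omitted from the sketch for brevity.

```python
import cv2
import numpy as np

def jpeg_degrade(img, quality=10):
    # Round-trip through JPEG at quality factor 10 (i.e., 90% degradation).
    ok, buf = cv2.imencode(".jpg", (img * 255).astype(np.uint8),
                           [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR).astype(np.float32) / 255.0

def add_gaussian(img, var=0.01):
    return np.clip(img + np.random.normal(0.0, np.sqrt(var), img.shape), 0.0, 1.0)

def add_speckle(img, var=0.01):
    # out = image + n * image, with n uniform; a uniform on [-a, a] has variance a^2 / 3.
    a = np.sqrt(3.0 * var)
    return np.clip(img + np.random.uniform(-a, a, img.shape) * img, 0.0, 1.0)

def add_salt_pepper(img, amount=0.5):
    out = img.copy()
    mask = np.random.rand(*img.shape[:2]) < amount      # proportion of corrupted pixels
    out[mask] = np.random.choice([0.0, 1.0], size=int(mask.sum()))[:, None]
    return out

def downscale(img, factor=4):
    h, w = img.shape[:2]
    return cv2.resize(img, (w // factor, h // factor), interpolation=cv2.INTER_CUBIC)
```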

For Gaussian, Poisson, and speckle noise, all outputs were clipped to the range [0, 1]. After adding the noise, the resolution of the images was downscaled by a factor of 4, which corresponds to a 16-times reduction in image pixels. For comparison, we used
bicubic, E-Net [97], SRCNN [4] and SRGAN [3] for super-resolution and bilateral [6],
ARCNN [7], IRCNN [8] and Deep Image Prior [5] for noise and artifact removal. Since
there are no end-to-end networks available for solving these problems simultaneously,
we have used a series of networks for solving each problem separately. For each com-
bination of the superresolution algorithm and artifact/noise removal, we used a cascade
of the superresolution algorithm first and then the artifact/noise removal algorithm. We
have also experimented with the reverse combination, i.e., the artifact/noise removal al-
gorithm first and then the superresolution algorithm, but the results were much worse.


Input, Input magnified, Ground Truth
SRGAN+ARCNN+IRCNN (20.178/0.673/0.185/0.492), SRGAN+IRCNN+ARCNN (19.998/0.663/0.192/0.481), Ours (18.921/0.641/0.158/0.497)

Figure 4.3: Visual results for 4xSR + Salt and Pepper NR + JPEG AR of RGB images
combining various algorithms. The image is taken from Set14 dataset and cropped for
better visualisation. For each image, from left to right, the scores are PSNR, SSIM,
GMSD and HaarPSI. Best viewed in color.

Thus, we have skipped those due to space limitations. This has also been shown and verified by Ghosh et al. in [45]. In the case of simultaneous noise removal, artifact removal, and superresolution, the experiments have been compared with a cascade of three different networks consisting of the best available methods for superresolution (SRGAN), noise removal (IRCNN) and artifact removal (ARCNN). We have shown the results for both combinations, i.e., SRGAN-ARCNN-IRCNN and SRGAN-IRCNN-ARCNN, whose results are analysed in the following sections.
We trained all networks on an NVIDIA DGX-Station using a random sample of
60,000 images from the ImageNet dataset [95]. For each minibatch, we cropped random
128 × 128 HR sub-images of distinct training images. Our generator model can be
applied to images of arbitrary size as it is a fully convolutional network. We scaled
the range of the image pixel values to [-1, 1]. During feeding the outputs to the VGG
network for calculating the loss function, we scale it back to [0, 1] since VGG network
inherently handles image pixels in the range [0, 1]. For optimisation, we use Adam [96]
with β1 = 0.9. The value of r in Equation 4.8 is selected as 0.4. The network was trained
with a learning rate of 10−4 and with 105 update iterations. Our implementation is based
on TensorFlow.

Table 4.1: Performance of our proposed algorithm for simultaneous Gaussian noise removal and 4x superresolution com-
pared to other state of the art algorithms for the benchmark LIVE1 dataset, Set14 Dataset and the BSD100 dataset. For
GMSD, lower value is better.

Gaussian Noise + 4x Super-resolution


LIVE1 Set14 BSD100
Algorithm PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑ PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑ PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑
Bicubic + Bilateral 22.806 0.573 0.218 0.345 22.826 0.571 0.202 0.389 23.332 0.555 0.210 0.384
SRGAN + Bilateral 22.587 0.580 0.209 0.347 22.684 0.576 0.192 0.392 22.001 0.550 0.196 0.381
SRCNN + Bilateral 23.068 0.595 0.210 0.361 23.351 0.597 0.189 0.409 23.563 0.581 0.197 0.403
ENet + Bilateral 22.388 0.573 0.207 0.362 22.571 0.571 0.192 0.404 22.136 0.539 0.196 0.403

Bicubic + ARCNN 22.851 0.580 0.210 0.346 22.824 0.574 0.194 0.391 23.621 0.572 0.202 0.398
SRGAN + ARCNN 22.349 0.571 0.200 0.348 22.389 0.563 0.188 0.391 22.001 0.550 0.196 0.381
SRCNN + ARCNN 22.942 0.596 0.200 0.363 23.135 0.593 0.183 0.409 23.258 0.578 0.187 0.405
ENet + ARCNN 22.233 0.564 0.201 0.359 22.369 0.560 0.189 0.401 21.884 0.524 0.194 0.398
Bicubic + IRCNN 22.834 0.581 0.198 0.359 22.851 0.578 0.187 0.398 23.723 0.589 0.189 0.412
SRGAN + IRCNN 22.069 0.546 0.189 0.360 22.073 0.537 0.186 0.397 21.066 0.495 0.189 0.385
SRCNN + IRCNN 22.869 0.586 0.187 0.371 23.130 0.586 0.174 0.413 21.066 0.495 0.189 0.385
ENet + IRCNN 21.622 0.525 0.193 0.369 21.776 0.522 0.188 0.406 20.924 0.475 0.195 0.399
Bicubic + DIP 22.466 0.554 0.212 0.331 22.217 0.545 0.203 0.352 22.823 0.503 0.214 0.359
SRGAN + DIP 20.379 0.448 0.249 0.241 20.234 0.436 0.228 0.279 20.960 0.425 0.232 0.290
SRCNN + DIP 21.755 0.513 0.225 0.293 21.515 0.503 0.214 0.322 22.135 0.472 0.210 0.338
ENet + DIP 20.807 0.462 0.243 0.255 20.515 0.447 0.225 0.290 21.721 0.443 0.224 0.314
Ours 23.206 0.662 0.177 0.424 22.386 0.614 0.164 0.419 22.945 0.595 0.180 0.440
Table 4.2: Performance of our proposed algorithm for simultaneous Poisson noise removal and 4x superresolution compared
to other state of the art algorithms for the benchmark LIVE1 dataset, Set14 Dataset and the BSD100 dataset. For GMSD,
lower value is better.

Poisson Noise + 4x Super-resolution


LIVE1 Set14 BSD100
Algorithm PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑ PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑ PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑
Bicubic + Bilateral 22.949 0.582 0.218 0.348 23.038 0.586 0.200 0.393 23.495 0.566 0.210 0.388
SRGAN + Bilateral 22.699 0.588 0.209 0.348 22.820 0.591 0.191 0.392 22.156 0.566 0.198 0.383
SRCNN + Bilateral 23.226 0.605 0.210 0.365 23.577 0.613 0.188 0.416 23.767 0.594 0.198 0.409
ENet + Bilateral 22.526 0.581 0.208 0.365 22.767 0.586 0.191 0.410 22.439 0.557 0.197 0.407

Bicubic + ARCNN 22.984 0.585 0.211 0.347 23.007 0.581 0.194 0.395 23.836 0.582 0.203 0.401
SRGAN + ARCNN 22.463 0.578 0.201 0.348 22.502 0.571 0.187 0.390 21.843 0.549 0.192 0.383
SRCNN + ARCNN 23.101 0.604 0.201 0.364 23.325 0.602 0.182 0.414 23.490 0.590 0.189 0.409
ENet + ARCNN 22.381 0.571 0.202 0.361 22.542 0.569 0.188 0.407 22.223 0.542 0.194 0.402
Bicubic + IRCNN 22.991 0.590 0.199 0.361 23.058 0.594 0.186 0.403 23.994 0.609 0.190 0.419
SRGAN + IRCNN 22.194 0.558 0.190 0.360 22.201 0.553 0.185 0.399 21.340 0.524 0.187 0.389
SRCNN + IRCNN 22.194 0.558 0.190 0.360 23.362 0.605 0.173 0.420 23.444 0.588 0.176 0.417
ENet + IRCNN 21.796 0.539 0.194 0.372 21.970 0.540 0.186 0.412 21.373 0.507 0.192 0.407
Bicubic + DIP 22.672 0.568 0.210 0.339 22.475 0.562 0.198 0.363 23.048 0.520 0.212 0.365
SRGAN + DIP 20.466 0.454 0.249 0.240 20.205 0.441 0.232 0.274 20.996 0.432 0.232 0.291
SRCNN + DIP 21.732 0.513 0.225 0.290 21.664 0.515 0.212 0.324 22.211 0.476 0.211 0.339
ENet + DIP 20.899 0.468 0.242 0.257 20.649 0.459 0.225 0.289 21.817 0.451 0.223 0.316
Ours 22.944 0.638 0.178 0.417 22.672 0.631 0.163 0.423 23.229 0.617 0.179 0.448
Table 4.3: Performance of our proposed algorithm for simultaneous salt and pepper noise removal and 4x superresolution
compared to other state-of-the-art algorithms for the benchmark LIVE1 dataset, Set14 Dataset and the BSD100 dataset. For
GMSD, lower value is better.

Salt and Pepper Noise + 4x Super-resolution


LIVE1 Set14 BSD100
Algorithm PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑ PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑ PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑
Bicubic + Bilateral 22.607 0.564 0.219 0.340 22.527 0.559 0.202 0.383 23.141 0.544 0.210 0.381
SRGAN + Bilateral 22.464 0.570 0.210 0.341 22.450 0.565 0.193 0.386 21.928 0.537 0.196 0.376
SRCNN + Bilateral 22.860 0.584 0.210 0.356 23.024 0.583 0.189 0.403 23.352 0.569 0.198 0.399
ENet + Bilateral 22.202 0.561 0.208 0.357 22.292 0.559 0.193 0.397 21.861 0.522 0.197 0.397

Bicubic + ARCNN 22.681 0.574 0.211 0.343 22.566 0.566 0.194 0.387 23.373 0.560 0.202 0.394
SRGAN + ARCNN 22.239 0.561 0.200 0.343 22.176 0.552 0.188 0.386 21.540 0.514 0.192 0.377
SRCNN + ARCNN 22.760 0.587 0.201 0.358 22.838 0.580 0.183 0.405 23.042 0.563 0.188 0.401
ENet + ARCNN 22.059 0.553 0.202 0.356 22.106 0.548 0.190 0.396 21.586 0.502 0.197 0.391
Bicubic + IRCNN 22.642 0.569 0.199 0.352 22.561 0.564 0.187 0.393 23.418 0.566 0.190 0.405
SRGAN + IRCNN 21.953 0.530 0.190 0.352 21.849 0.520 0.187 0.391 20.913 0.468 0.193 0.377
SRCNN + IRCNN 22.655 0.569 0.188 0.363 22.800 0.568 0.175 0.406 22.905 0.546 0.176 0.401
ENet + IRCNN 21.434 0.507 0.193 0.362 21.490 0.503 0.189 0.398 20.557 0.443 0.201 0.388
Bicubic + DIP 22.277 0.545 0.215 0.325 22.027 0.536 0.203 0.354 22.534 0.488 0.218 0.350
SRGAN + DIP 20.349 0.442 0.250 0.237 20.214 0.435 0.230 0.279 21.013 0.420 0.236 0.285
SRCNN + DIP 21.529 0.499 0.232 0.283 21.312 0.496 0.212 0.320 21.981 0.464 0.213 0.332
ENet + DIP 20.766 0.459 0.244 0.254 20.493 0.449 0.226 0.288 21.588 0.435 0.226 0.308
Ours 22.800 0.619 0.179 0.410 22.205 0.601 0.166 0.415 22.775 0.578 0.181 0.435
Table 4.4: Performance of our proposed algorithm for simultaneous speckle noise removal and 4x superresolution compared to other state-of-the-art algorithms for the benchmark LIVE1 dataset, Set14 dataset and the BSD100 dataset. For GMSD, lower value is better.

Speckle Noise + 4x Super-resolution


LIVE1 Set14 BSD100
Algorithm PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑ PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑ PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑
Bicubic + Bilateral 22.894 0.580 0.219 0.347 22.985 0.584 0.201 0.394 23.442 0.563 0.210 0.387
SRGAN + Bilateral 22.684 0.586 0.210 0.348 22.830 0.590 0.191 0.393 22.052 0.565 0.197 0.382
SRCNN + Bilateral 23.172 0.603 0.210 0.365 23.524 0.612 0.188 0.415 23.722 0.593 0.198 0.407
ENet + Bilateral 22.498 0.580 0.208 0.364 22.755 0.587 0.191 0.410 22.407 0.555 0.197 0.406

Bicubic + ARCNN 22.927 0.584 0.211 0.347 23.420 0.569 0.190 0.393 23.763 0.579 0.203 0.400
SRGAN + ARCNN 22.445 0.577 0.201 0.348 22.510 0.572 0.187 0.391 21.728 0.547 0.191 0.382
SRCNN + ARCNN 23.046 0.602 0.202 0.364 23.274 0.601 0.182 0.414 23.443 0.588 0.189 0.408
ENet + ARCNN 22.347 0.571 0.202 0.360 22.534 0.569 0.188 0.408 22.178 0.540 0.194 0.401
Bicubic + IRCNN 22.939 0.589 0.200 0.360 23.014 0.592 0.186 0.403 23.926 0.607 0.191 0.419
SRGAN + IRCNN 22.188 0.557 0.190 0.360 22.237 0.553 0.185 0.399 21.238 0.521 0.187 0.388
SRCNN + IRCNN 23.008 0.597 0.188 0.375 23.321 0.603 0.173 0.421 23.404 0.586 0.177 0.416
ENet + IRCNN 21.777 0.539 0.193 0.372 21.994 0.541 0.186 0.414 21.334 0.504 0.192 0.407
Bicubic + DIP 22.726 0.574 0.206 0.347 22.418 0.558 0.199 0.361 23.143 0.529 0.209 0.375
SRGAN + DIP 20.510 0.453 0.247 0.243 20.248 0.439 0.231 0.277 20.876 0.431 0.232 0.291
SRCNN + DIP 21.700 0.511 0.228 0.291 21.679 0.514 0.212 0.325 22.328 0.485 0.209 0.344
ENet + DIP 20.948 0.468 0.243 0.257 20.596 0.454 0.224 0.289 21.764 0.447 0.221 0.316
Ours 23.159 0.661 0.177 0.423 22.632 0.630 0.163 0.423 23.177 0.615 0.179 0.448
Table 4.5: Performance of our proposed algorithm for simultaneous JPEG Artifact removal and 4x superresolution compared
to other state of the art algorithms for the benchmark LIVE1 dataset, Set14 Dataset and the BSD100 dataset. For GMSD,
the lower value is better.

JPEG Artifacts + 4x Super-resolution


LIVE1 Set14 BSD100
Algorithm PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑ PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑ PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑
Bicubic + Bilateral 22.707 0.574 0.218 0.328 22.827 0.578 0.200 0.385 23.240 0.558 0.208 0.375
SRGAN + Bilateral 22.496 0.579 0.209 0.335 22.686 0.583 0.191 0.387 21.916 0.553 0.196 0.378
SRCNN + Bilateral 22.946 0.595 0.210 0.343 23.331 0.604 0.188 0.404 23.426 0.584 0.197 0.394
ENet + Bilateral 22.288 0.572 0.207 0.343 22.578 0.579 0.191 0.399 22.004 0.542 0.196 0.393

Bicubic + ARCNN 22.769 0.579 0.210 0.330 22.818 0.575 0.193 0.387 23.400 0.570 0.202 0.376
SRGAN + ARCNN 22.287 0.570 0.200 0.336 22.389 0.567 0.187 0.386 21.594 0.533 0.191 0.378
SRCNN + ARCNN 22.850 0.595 0.201 0.344 23.115 0.594 0.182 0.403 23.153 0.579 0.187 0.396
ENet + ARCNN 22.170 0.564 0.201 0.342 22.383 0.564 0.187 0.398 21.799 0.527 0.194 0.388
Bicubic + IRCNN 22.736 0.580 0.199 0.341 22.835 0.585 0.186 0.393 23.464 0.587 0.191 0.390
SRGAN + IRCNN 22.003 0.548 0.189 0.347 22.087 0.547 0.184 0.394 21.010 0.501 0.189 0.382
SRCNN + IRCNN 22.769 0.587 0.188 0.353 23.107 0.595 0.173 0.408 23.153 0.579 0.187 0.396
ENet + IRCNN 21.584 0.529 0.192 0.352 21.802 0.534 0.185 0.403 20.870 0.483 0.194 0.391
Bicubic + DIP 22.619 0.575 0.202 0.338 22.198 0.548 0.202 0.351 22.580 0.499 0.216 0.347
SRGAN + DIP 20.460 0.451 0.248 0.240 20.185 0.440 0.230 0.273 20.991 0.427 0.233 0.289
SRCNN + DIP 21.437 0.502 0.231 0.277 21.516 0.511 0.214 0.320 22.030 0.472 0.210 0.334
ENet + DIP 20.734 0.464 0.243 0.251 20.482 0.452 0.222 0.289 21.645 0.444 0.225 0.308
Ours 22.659 0.635 0.181 0.387 22.329 0.612 0.167 0.407 22.747 0.592 0.182 0.423

4.4.2 Analysis of Experimental Results


To the best of our knowledge, algorithms that superresolve and clean images simultaneously do not exist. Therefore, for comparison, we use state-of-the-art superresolution algorithms and noise/artifact removal algorithms in a cascade as baselines. For
superresolution, we have selected bicubic, SRCNN [4], SRGAN [3] and ENet-PAT [97].
For noise/artifact removal, we have selected bilateral filter [6], ARCNN [7], Deep Im-
age Prior [5] and IRCNN [8]. For each combination of the super-resolution algorithm
and artifact/noise removal, we use a cascade of the super-resolution algorithm first and
then the artifact/noise removal algorithm. We have also experimented with the reverse
combination, i.e., the noise/artifact removal algorithm first and then the superresolution
algorithm, but the results are much worse. This has also been shown and verified by
Ghosh et al. in [45]. Therefore, we have skipped that due to space limitations.
The results for the end-to-end and cascaded superresolution and artifact/noise removal experiments are shown in Table 4.1, Table 4.2, Table 4.3, Table 4.4 and Table 4.5, where we can see that our algorithm outperforms the other state-of-the-art pipelines in almost all cases. The PSNR scores of our algorithm are the best only for Gaussian noise on the LIVE1 dataset, but that is what we expected. As discussed in Section 4.2.5, PSNR is not a good practical perceptual image quality evaluation metric.
We can obtain very high PSNR using pixel-wise mean square reconstruction losses
as shown in [3], but our ultimate goal is to improve the perceptual quality of the image.
The resulting outputs can be observed clearly in Figure 4.4, which shows the visual
result where algorithms are used to perform both NR (Gaussian) and SR on the same
image, simultaneously. The highest PSNR of 19.47 is obtained from the outputs of the
cascaded E-Net+IRCNN and SRGAN+IRCNN, but undesirable streaks in the images
make the overall perceptual quality quite low.
The GMSD and HaarPSI scores of our outputs are the best for most types of noise and for all datasets used in our experiments, which is quite intuitive. The GMSD scores of the images produced by our proposed algorithm for Poisson, salt and pepper, and speckle noise on the BSD100 dataset are marginally worse than the scores of SRCNN+IRCNN. Referring to Figure 4.4, we get an idea of the perceptual quality of the images produced by SRCNN+IRCNN, which appear neither sharp nor clean. For the particular image shown in the comparison figures, the lowest GMSD


Input, Input magnified, Bicubic+Bilateral (18.78/0.57/0.27/0.39), Bicubic+ARCNN (18.56/0.59/0.28/0.41)
Bicubic+DIP (19.21/0.47/0.28/0.39), Bicubic+IRCNN (18.63/0.59/0.26/0.4), SRCNN+Bilateral (18.78/0.60/0.27/0.41), SRCNN+ARCNN (18.90/0.58/0.27/0.41)
SRCNN+IRCNN (19.08/0.56/0.27/0.41), SRCNN+DIP (19.21/0.47/0.28/0.39), E-Net+Bilateral (19.09/0.49/0.27/0.39), E-Net+ARCNN (19.17/0.48/0.28/0.40)
E-Net+IRCNN (19.47/0.40/0.29/0.39), E-Net+DIP (19.21/0.47/0.28/0.39), SRGAN+Bilateral (19.10/0.49/0.28/0.38), SRGAN+ARCNN (19.21/0.47/0.28/0.39)
SRGAN+IRCNN (19.47/0.41/0.29/0.38), SRGAN+DIP (19.21/0.47/0.28/0.39), Ours (19.25/0.64/0.28/0.45), Ground Truth

Figure 4.4: Results for 4xSR + Gaussian NR of RGB images combining various algorithms.
The image is taken from BSD100 dataset. For each image, from left to right, the scores are
PSNR, SSIM, GMSD and HaarPSI. The scores for perceptual quality and structural consistency
of the image using our algorithm is superior to the rest. Best viewed in color.

Table 4.6: Performance of our proposed algorithm for simultaneous 4x super-resolution, salt and pepper noise removal, and JPEG artifact removal compared to other state-of-the-art algorithms on different benchmark datasets. For GMSD, a lower value is better.

4x Super-resolution + Salt and Pepper noise + JPEG Artifacts


BSD100
Algorithm PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑
SRGAN + ARCNN + IRCNN 21.988 0.537 0.201 0.358
SRGAN + IRCNN + ARCNN 21.954 0.535 0.202 0.356
Ours 21.775 0.493 0.193 0.391
LIVE1
Algorithm PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑
SRGAN + ARCNN + IRCNN 22.381 0.580 0.205 0.332
SRGAN + IRCNN + ARCNN 22.346 0.578 0.207 0.331
Ours 21.677 0.539 0.196 0.359
Set14
Algorithm PSNR ↑ SSIM ↑ GMSD ↓ HaarPSI ↑
SRGAN + ARCNN + IRCNN 22.084 0.567 0.190 0.379
SRGAN + IRCNN + ARCNN 22.035 0.563 0.192 0.377
Ours 21.318 0.531 0.183 0.389

is given by the image obtained from Bicubic+IRCNN, which appears quite blurry, whereas our image is free from all kinds of noise and artifacts and is reasonably sharp.
Note that the output is obtained from a noisy image and is superresolved 4 times each
side (i.e., 16x increase in overall pixels) simultaneously using a single network. Some
of the cascade networks, for example, SRGAN+IRCNN, E-Net+IRCNN, etc., produce
sharp results, but unwanted streaks of different colors throughout the image degrade the
perceptual quality of the image.
Furthermore, in Table 4.6 we show the results for images degraded with salt and pepper noise and JPEG compression artifacts along with a 4x size reduction. We choose the best performing algorithm for each problem (i.e., SRGAN for superresolution, IRCNN for noise removal, and ARCNN for artifact removal) and add them to the cascaded pipeline. Even so, our GMSD and HaarPSI scores still outperform those of the counterparts. The perceptual quality of the images is shown in Figure 4.3. Our output image is free of undesirable marks and is the closest to the ground truth in terms of human perceptual quality.

4.5 Conclusion
We have described a deep generative adversarial network with skip connections that sets a new state of the art on public benchmark datasets when evaluated with respect to perceptual quality. This is the first framework that is able to successfully recover images from artifacts and/or various noises and super-resolve them at the same time, thus performing two to three different tasks in a single-shot operation. We have highlighted
some limitations of the existing loss functions used for training any image enhancement
network and introduced our network, which augments the feature loss function with an
edge loss during the training of the GAN. Using different combinations of loss functions
and by using the discriminator both in feature and pixel space, we confirm that our
proposed network’s reconstructions for corrupted images are superior by a considerable
margin and more photorealistic than the reconstructions obtained by the current state-
of-the-art methods.

Chapter 5

Improving Recognition of Degraded


Faces by Discriminative Feature
Restoration Using Generative
Adversarial Network

5.1 Introduction
Face detection and recognition from low-quality images is challenging but highly de-
manded in real-world applications. These problems are particularly prominent for highly
compressed images and videos, where details are typically absent. This is most common
in surveillance videos where the camera sensors are not of very high quality. Good de-
tection and recognition performance is highly desirable in these scenarios and we have
explored this possibility in this chapter.
To solve the problem of detecting and recognising faces with high accuracy, different approaches have been proposed [1, 2, 32, 65, 68, 72]. Most recognition algorithms attempt to minimise the distance between similar identities and maximise the distance between dissimilar identities, thus maximising the overall face recognition performance. The limitation of these methods is that their performance drops significantly on degraded images.
Pixel-wise loss functions like MSE or L1 loss are unable to recover the high-frequency details in an image. These loss functions encourage finding pixel-wise averages of possible solutions. This alone neither leads to the reconstruction of discriminative facial features nor to better perceptual quality [3, 42–44]. Johnson et al. [44] and Bruna et
al. [42] proposed a perceptual loss function based on the Euclidean distance between
feature maps extracted from the VGG19 [9] network. Zhang et al. proposed an algo-
rithm [8] to solve different problems related to image artifacts. A recent paper by Ghosh
et al. [45] uses edge loss on top of perceptual loss to simultaneously remove artifacts
and superresolve images. A major drawback of these perceptual image enhancement al-
gorithms is that they are not suitable for applications regarding face detection or recog-
nition. Improving perceptual quality does not necessarily lead to better face detection or recognition; rather, performance can even degrade, as we show in Section 5.3.
To date, several algorithms related to face recognition have been proposed, most of which do not specifically target facial images of degraded quality, such as images with noise or compression artifacts. There are a few pre-deep-learning-era algorithms which deal with face recognition for low-quality images, but their performance is not adequate for practical requirements. Some proposed algorithms deal with face hallucination or superresolution, for example FSRNet [59]. Thus, an end-to-end method solving the
problem of low quality face recognition is highly desirable. In this chapter, we propose
a novel discriminative facial feature restoring GAN, which enhances the discriminative
features of degraded face images, leading to better detection and recognition perfor-
mance. Our major contribution lies in the fact that we propose a generative method,
where our network enhances the degraded face. This is of utmost importance, as the enhanced image can be fed into any existing algorithm, or into algorithms that are yet to be proposed. Thus, this method can work in conjunction with multiple algorithms and is not limited to a single one. To the best of our knowledge, the enhancement of face detection and recognition by restoring the lost features of the input images using GANs has not been explored yet. In a nutshell, our novel framework offers
• enhancement of facial features for very low quality images,

• a constrained angular metric learning method that learns to reconstruct discriminative
facial features (see Section 5.2.3),

• and a weighted combination of different losses operating at various stages of the
network to recover discriminative facial features, leading to better performance.

We achieve outstanding results on benchmark datasets of degraded face images
(see Section 5.3).

5.2 Facial Feature Restoration Using GAN

5.2.1 Method
In our work, we aim to estimate a clean facial image f (I G ) from an image I G which
is corrupted with various artifacts, including but not limited to MPEG compression
artifacts, interlacing artifacts, and artifacts due to the inherent quality of camera sensors.
Furthermore, I GT is the ground truth image which is the clean and artifact-free version
of I G and I R,S is the mugshot of the same identity in I G . Here we define a mugshot as a
clean, artifact-free, and properly illuminated frontal face pose of the person in question
with neutral facial expression and without any kind of facial occlusion. For an image
with c channels, we describe I G and I GT by a real-valued tensor such that I ∈ R^{w×h×c}.
To estimate the enhanced image for a given corrupted image, we train a generator
network as a feed-forward CNN GθG parameterised by θG . Here θG = W1:N ; b1:N denotes
the weights and biases of an N-layer deep network and is obtained by optimising a loss
function L. The training is done using two sets of n images {IiGT : i = 1, 2, ..., n} and
{IiG : i = 1, 2, ..., n} such that IiGT = GθG (IiG ) (where IiGT and IiG are corresponding pairs)
and by solving

\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{n} \sum_{i=1}^{n} L\big(I_i^{GT},\, G_{\theta_G}(I_i^G),\, I_i^{R,S}\big). \qquad (5.1)

We also have a function Φ which is a facial feature extraction model, such that
Φ(I) = vID where vID is the facial feature vector and vID ∈ Rd where d is its dimension.
Following the work of Goodfellow et al. [23], we also add a discriminator network DθD
to evaluate the images which are generated by the generator network GθG . The gener-
ative network is trained to generate the target image GθG (I G ) such that it is similar to
the ground truth. Ideally, we want the cosine distance between Φ(GθG (I G )) and Φ(I GT )
to be 0. This is done with the help of a novel framework which helps in enhancing
the quality and preserving the identity of the image. While training the generator, the
discriminator is trained in an alternating manner such that the probability of error of the
discriminator (between the ground truth and the generated images) is minimised. With
this adversarial min-max game, the generator can learn to generate images with restored
facial features. This also encourages the generation of perceptually superior solutions
residing in the manifold of clean facial images.

Figure 5.1: The generator architecture of the proposed network. Best viewed in color.
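To make this training target concrete, the following minimal Python/NumPy sketch (illustrative only, not the thesis implementation; the 512-dimensional vectors are placeholders standing in for the output of the facial feature extractor Φ) computes the cosine distance that the generator is trained to drive towards 0 between Φ(GθG(I G)) and Φ(I GT):

import numpy as np

def cosine_distance(a, b, eps=1e-12):
    """Cosine distance between two d-dimensional feature vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

# Placeholders for Phi(G(I_G)) and Phi(I_GT) from a face feature extractor.
phi_generated = np.random.randn(512)
phi_ground_truth = np.random.randn(512)
print(cosine_distance(phi_generated, phi_ground_truth))  # ideally -> 0 after training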

5.2.2 Network Architecture


Figure 5.1 gives a clear overview of the whole network structure. Residual blocks are the
building blocks of our network. The architectural guidelines of our generator are inspired
by FSRNet [59], but our network is substantially different from FSRNet [59]. The goal
of FSRNet is to superresolve the images whereas we are restoring discriminative facial
features from degraded and corrupted images.
The generator block consists of four main parts, namely, the coarse network, the
prior network, the encoder, and the decoder. We will discuss each of these blocks in
detail. The first block in the generator network is the coarse network. The function
of the coarse network is to partially recover the degraded image. The output of the
coarse network is used to calculate the coarse loss with respect to the ground truth. This
helps the network learn to produce finer images in the initial stages, which makes the
task easy for the succeeding blocks. Once the coarse network partially recovers the
degraded image, it is then sent to the encoder and the prior net for finer recovery.


The prior net carries prior information from the coarse network to the encoded im-
age. High-quality prior information can boost the performance of the network. Thus,
we minimise the facial feature loss between the ground truth and the coarse image.
The prior network helps to preserve the facial landmarks which are obtained from the
coarse image. The hourglass block acts as an attention network and creates an informa-
tion bottleneck where the most important facial features are allowed to pass and other
unimportant or redundant features are filtered out. Figure 5.2 shows a couple of visual
examples of the prior network output. It is clearly visible in the images that the most
important facial landmarks get highlighted. Facial landmarks are key points that are
used to localise and represent salient regions of the face, such as eyes, eyebrows, nose,
mouth, etc. The output of the prior network is a single channel array which is then
added to all channels of the encoded image.
The encoder encodes the images obtained from the coarse network to extract some
latent features from the degraded faces. Figure 5.2 shows a couple of visual exam-
ples of the encoder network output. The encoder network consists of a convolution-
batchnorm-relu followed by thirteen residual blocks. Each residual block is a series
of two convolution-batchnorm-relu layers. The final layer of the encoder block is a
convolution layer with 3 channel output.
The final block of the network consists of a decoder. The output of the prior network
and the encoder are added, which is then fed to the decoder to obtain the final output.
This addition of the output from the two different blocks contains important facial fea-
tures and landmarks which are essential for a human face to get recognised with high
confidence. The decoder consists of a convolution-batchnorm-relu followed by another
convolution layer. The output from this layer is then passed to three consecutive resid-
ual layers, followed by a final convolution layer. The output of this decoder is the facial
image with enhanced features, which will help us with better recognition performance.
Our network does not use any kind of down-sampling or up-sampling layers (apart
from the hourglass block). Moreover, we use a single hourglass block in the prior net
compared to 2 blocks in FSRNet. The most important differentiating factor is that our
prior net is based on an unsupervised method to create a facial landmark distribution,
whereas FSRNet uses a supervised method. The number of filters per convolution layer
in the coarse encoder and decoder network is 64 with 3 × 3 kernels and stride 1. For
the prior network, the number of filters is 128 with 3 × 3 kernels and stride 1, except for
the first convolution layer, where the kernel is 7 × 7. The hourglass block consists of a
U-shaped network (with skip connections) with 4 blocks of convolution and 2 × 2 max-pool
layers followed by 4 blocks of convolution and 2 × 2 nearest-neighbour upsampling layers.
A convolution is also applied to the output of each skip connection. The number of
filters is 128 for each convolution layer, with kernel size 1 × 1 and stride 1.

Figure 5.2: Visual examples of the outputs from different network components - (a) Facial
landmark distribution from the prior network. (b) Latent features from the encoder
network. Best viewed in color.
ReLU activation is used throughout the network. To encourage the generation of
realistic images, a discriminator is added to the network. The design of our discriminator
is in line with the architectural guidelines of the discriminator proposed in [33]. The
network is then trained in an adversarial manner so that the features of the generated
images are driven towards the natural facial feature space.
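As a rough structural sketch of the building blocks described above, the following hedged TensorFlow/Keras snippet builds a residual block of two convolution-batchnorm-relu layers and stacks thirteen of them into an encoder with a 3-channel output, as stated in the text; all other details (layer arrangement, training configuration) are illustrative assumptions rather than the thesis code:

import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters=64, kernel=3):
    x = layers.Conv2D(filters, kernel, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def residual_block(x, filters=64):
    # Each residual block is a series of two convolution-batchnorm-relu layers.
    y = conv_bn_relu(x, filters)
    y = conv_bn_relu(y, filters)
    return layers.Add()([x, y])

def encoder(x, n_blocks=13, filters=64):
    x = conv_bn_relu(x, filters)
    for _ in range(n_blocks):
        x = residual_block(x, filters)
    return layers.Conv2D(3, 3, padding="same")(x)  # 3-channel latent output

inputs = layers.Input(shape=(128, 128, 3))
model = tf.keras.Model(inputs, encoder(inputs))
model.summary()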


5.2.3 Loss Function


The main idea of this research is to enhance face recognition performance using the out-
put images of the GAN. Therefore, designing a proper loss function is essential. This
loss function is different from other image enhancement tasks because here we are not
concerned about the perceptual quality of the image. The loss functions proposed in
GANs [3, 33, 45, 87, 92] are targeted towards generating an output image having high
PSNR, SSIM, or other metrics that evaluate the perceptual quality of an image. How-
ever, we need to design a loss function which is targeted towards generating an image
with restored facial features. Ideally, our loss function should fulfil the following criteria:
it should help the network learn to generate the lost discriminative facial features, it
should help preserve the identity of the person in question, the facial features of images
of different identities generated by the generator should be well separated, and the facial
features of images of the same identity generated by the generator should be as close as
possible.

Constrained Angular Metric Learning of Hard and Semi-Hard Identity Features

Although inspired by [65], this method is very different from triplet loss. The con-
strained metric learning relies on two fixed anchors that guide the feature generation in
the correct direction. The feature vector Φ(I) (see Section 5.2.1) embeds an image I in a
d-dimensional Euclidean space. Let us define Φ(I_i^{R,S}) as the reference point, where I_i^{R,S}
is the unique and clean mugshot for identity i, with i ∈ {1, 2, ..., C} (considering there
are C classes). We aim to ensure that the features of all generated images of any given
class i are as close as possible to Φ(I_i^{R,S}). These pairs are called similar pairs. Let us
denote a generated similar counterpart as G_{θ_G}(I_i^G), where I_i^G is the input image
containing artifacts. At the same time, the generated similar counterpart must be as far as
possible from the mugshot features of any class other than i; these are called dissimilar
pairs. Let us denote a dissimilar reference as Φ(I_j^{R,D}), where I_j^{R,D} is the unique and
clean mugshot for identity j, with j ∈ {1, 2, ..., C} and i ≠ j. The distance considered here
is the cosine distance, and the distance between similar pairs D_sim can be formulated as
D(I_i^{R,S}, I_i^G, G_{\theta_G}) = 1 - \frac{\sum_{k=1}^{d} \Phi(I_i^{R,S})_k \, \Phi\{G_{\theta_G}(I_i^G)\}_k}{\sqrt{\sum_{k=1}^{d} \Phi^2(I_i^{R,S})_k} \, \sqrt{\sum_{k=1}^{d} \Phi^2\{G_{\theta_G}(I_i^G)\}_k}}. \qquad (5.2)

Similarly, the distance between dissimilar pairs D_{dsim} can be formulated as

D(I_j^{R,D}, I_i^G, G_{\theta_G}) = 1 - \frac{\sum_{k=1}^{d} \Phi(I_j^{R,D})_k \, \Phi\{G_{\theta_G}(I_i^G)\}_k}{\sqrt{\sum_{k=1}^{d} \Phi^2(I_j^{R,D})_k} \, \sqrt{\sum_{k=1}^{d} \Phi^2\{G_{\theta_G}(I_i^G)\}_k}}. \qquad (5.3)

This is visualised in Figure 5.3. We also define a parameter α which is considered as


the angular margin enforced between the similar pairs and the dissimilar pairs. For our
training, we do not need all possible triplets in the space, and thus we randomly select a
subset of triplets.
T_{i,j} = \{I_i^{R,S},\, G_{\theta_G}(I_i^G),\, I_j^{R,D}\}, \qquad (5.4)

where the feature Φ{GθG (IiG )} is constrained in space by similar and dissimilar pairs
and the loss is given by

L_{MET} = \alpha + D(I_i^{R,S}, I_i^G, G_{\theta_G}) - D(I_j^{R,D}, I_i^G, G_{\theta_G}). \qquad (5.5)

Ideally, for a trained model, each of these triplets will satisfy

\alpha + D(I_i^{R,S}, I_i^G, G_{\theta_G}) < D(I_j^{R,D}, I_i^G, G_{\theta_G}). \qquad (5.6)

Since one dissimilar counterpart is randomly selected for each similar counterpart,
the total number of triplets is \sum_{i=1}^{C} \sum_{j=1}^{N_i} 1 = \sum_{i=1}^{C} N_i and is equal to the total number of training
images, where N_i is the number of images in class i.
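A minimal NumPy sketch of this constrained angular metric learning term follows (illustrative only; phi_gen, phi_same and phi_other stand for Φ{GθG(Ii G)}, Φ(Ii R,S) and Φ(I j R,D), and the margin value is a placeholder):

import numpy as np

def cosine_distance(a, b, eps=1e-12):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def metric_loss(phi_gen, phi_same, phi_other, alpha=0.3):
    """L_MET: pull the generated features towards the mugshot of the same
    identity (first fixed anchor) and push them away from a randomly chosen
    mugshot of a different identity (second fixed anchor)."""
    d_sim = cosine_distance(phi_same, phi_gen)    # Eq. (5.2)
    d_dsim = cosine_distance(phi_other, phi_gen)  # Eq. (5.3)
    return alpha + d_sim - d_dsim                 # Eq. (5.5)

# Placeholder 512-d embeddings for one triplet.
phi_gen = np.random.randn(512)
phi_same = np.random.randn(512)
phi_other = np.random.randn(512)
print(metric_loss(phi_gen, phi_same, phi_other))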

Identity Feature Loss

As the preservation of identity is important for the network, the objective function needs
to evaluate the identity vector of the generated images with respect to the ground truth
image. If the ground truth image is denoted by IiGT , then the identity feature loss is
given by the equation

L_{ID} = 1 - \frac{\sum_{k=1}^{d} \Phi(I_i^{GT})_k \, \Phi\{G_{\theta_G}(I_i^G)\}_k}{\sqrt{\sum_{k=1}^{d} \Phi^2(I_i^{GT})_k} \, \sqrt{\sum_{k=1}^{d} \Phi^2\{G_{\theta_G}(I_i^G)\}_k}}. \qquad (5.7)


Figure 5.3: The proposed metric learning is different from triplet loss. It has two fixed
anchors minimising the cosine distance between the similar pairs and maximising the
cosine distance between dissimilar pairs. Images are taken from SCFace Dataset for the
purpose of explanation. The model was not trained using SCFace Dataset. Best viewed
in color.

Pixel-wise Identity Loss

Since the images are corrupted with artifacts, it is important for the network to learn to
reconstruct the corrupted images and drive the reconstruction towards the natural image
manifold. This can be done by introducing a pixel-wise identity loss in the cost function,
which is the Euclidean norm of the difference between the generated image and the
ground truth image (and, correspondingly, the mugshot):

L_{px} = \left\| I_i^{GT} - G_{\theta_G}(I_i^G) \right\|_2 + \left\| I_i^{R,S} - G_{\theta_G}(I_i^G) \right\|_2 \qquad (5.8)

Perceptual Loss

To ensure a high-quality output image, we choose a feature loss based on the ReLU
activation layers of the pretrained 19 layer VGG network described in Simonyan and
Zisserman [9]. This loss was referred to as VGG loss by Ledig et al. [3].
L_{VGG_{m,n}}(I_i^{GT}, G_{\theta_G}(I_i^G)) = \frac{1}{W_{m,n} H_{m,n}} \sum_{x=1}^{W_{m,n}} \sum_{y=1}^{H_{m,n}} \big( \phi_{m,n}(I_i^{GT})_{x,y} - \phi_{m,n}(G_{\theta_G}(I_i^G))_{x,y} \big)^2 \qquad (5.9)

where \phi_{m,n} is the feature map obtained by the nth convolution (after activation) before
the mth max-pooling layer within the pretrained VGG19 network, and W_{m,n} and H_{m,n}
represent the width and height of the respective feature map.
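A hedged Keras sketch of this feature loss is given below; the choice of the block5_conv4 activation and of ImageNet weights are assumptions made for illustration and may differ from the configuration actually used:

import tensorflow as tf

# Feature extractor: an intermediate activation of a pretrained VGG19.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet",
                                  input_shape=(128, 128, 3))
feature_extractor = tf.keras.Model(vgg.input,
                                   vgg.get_layer("block5_conv4").output)
feature_extractor.trainable = False

def vgg_loss(ground_truth, generated):
    """Mean squared error between VGG19 feature maps (cf. Eq. 5.9).
    Inputs are batches of 128x128 RGB images, assumed to be preprocessed
    with tf.keras.applications.vgg19.preprocess_input."""
    f_gt = feature_extractor(ground_truth)
    f_gen = feature_extractor(generated)
    return tf.reduce_mean(tf.square(f_gt - f_gen))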


Adversarial Loss

For training the discriminator, we use a binary cross-entropy loss, which predicts if the
image is a real one or a generated one. A sigmoid function is applied to the output of
the final layer of the discriminator and the resulting output is used to calculate the loss,
which is represented by

L_{adv} = -t_1 \log(s_1) - (1 - t_1) \log(1 - s_1), \qquad (5.10)

where t_1 and s_1 are the ground truth label and the predicted score for class 1 (real), respectively.

Final Loss

We formulate the final loss as the weighted sum of all described cost function compo-
nents. To enhance the prior information, we also minimise the errors after the coarse
network. Let the coarse network be denoted by GCθG . Pixel-wise feature loss is cal-
culated for the output from the coarse network and the angular metric learning of the
features of the coarse image is also carried out at this stage. This ensures a better coarse
image which in turn helps the final image to be enhanced further. The cost function for
the coarse image can be formulated as

L_C = \beta \big( \alpha + D(I_i^{R,S}, I_i^G, G^C_{\theta_G}) - D(I_j^{R,D}, I_i^G, G^C_{\theta_G}) \big)
      + \left\| I_i^{GT} - G^C_{\theta_G}(I_i^G) \right\|_2 + \left\| I_i^{R,S} - G^C_{\theta_G}(I_i^G) \right\|_2, \qquad (5.11)

where G^C_{\theta_G} is the coarse network and G^C_{\theta_G}(I_i^G) is the coarse image it produces.
At the output of the network, all losses are combined, and the final loss function is

L = L_C + L_{px} + \gamma_{ID} L_{ID} + \gamma_P L_{VGG_{m,n}}(I_i^{GT}, G_{\theta_G}(I_i^G)) + \beta L_{MET} - \gamma_C L_{adv}, \qquad (5.12)

where \beta, \gamma_{ID}, \gamma_P and \gamma_C are the weighting parameters.
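Schematically, the combination in Eq. (5.12) can be written as the following Python sketch, where the individual loss terms are assumed to be computed elsewhere and the default weights are the values reported later in Section 5.3.1:

def total_generator_loss(l_coarse, l_px, l_id, l_vgg, l_met, l_adv,
                         beta=250.0, gamma_id=20.0, gamma_p=1e-3, gamma_c=1.0):
    """Weighted sum of the loss components in Eq. (5.12).

    l_coarse : coarse-network loss L_C (Eq. 5.11)
    l_px     : pixel-wise identity loss (Eq. 5.8)
    l_id     : identity feature loss (Eq. 5.7)
    l_vgg    : perceptual VGG loss (Eq. 5.9)
    l_met    : constrained angular metric loss (Eq. 5.5)
    l_adv    : adversarial loss (Eq. 5.10)
    """
    return (l_coarse + l_px + gamma_id * l_id + gamma_p * l_vgg
            + beta * l_met - gamma_c * l_adv)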


Figure 5.4: Change of the value of different loss function components with iterations
during training: (a) total loss, (b) adversarial loss, (c) L_MET, (d) identity feature loss,
(e) L_MET at the coarse network, (f) identity feature loss at the coarse network,
(g) perceptual loss, (h) pixel-wise ID loss.


5.3 Experiments

5.3.1 Data & Experimental Setup


To validate the face recognition performance enhancement, we test our framework on
various benchmark datasets. To successfully validate our proposal, we need to test
on a dataset having low-quality face images with degradation and artifacts. These re-
quirements are satisfied by the SCFace dataset [12], a public benchmark dataset having
low-quality face images. The network was trained using around 1.3 million images of
14,528 identities which were obtained from the internet. The faces were aligned and
cropped using MTCNN [32] detector. Images in the training dataset were randomly
selected and random degradation was applied to each image. The applied degradation
includes MPEG and JPEG compression, changes in brightness and contrast, interlacing
artifacts, and blurring. Some images were kept untouched to add variation to the train-
ing dataset. We trained all networks on NVIDIA DGX-Station. Each training image is
128 × 128 pixels. The values of β, γ_ID, γ_P, and γ_C are 250, 20, 10^{-3}, and 1, respectively.
For optimisation, we use Adam [96] with β_1 = 0.9. The network was trained with a
learning rate of 10^{-4}. Our implementation is based on TensorFlow. Testing is carried out on the SCFace [12] dataset. Since
most of the available face datasets are not low-quality and are not suitable for testing
our proposal, we create another dataset from the Webface [101] dataset called Webface
MPEG. We create the Webface MPEG dataset by applying MPEG degradation to the
Webface dataset. The degradation is done using FFMPEG and OpenCV [102]. Since
these degradations can only be done on video, we simulated the same using images. For
each image, we created a video of 1-second duration with 10 frames per second where
the background is black and the image of the face shifts by a few pixels on every frame.
We then apply the degradation and retrieve the image back from the degraded video.
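The sketch below approximates this degradation step using OpenCV only (a simplified, hedged illustration; the thesis pipeline uses FFMPEG together with OpenCV, and the codec, frame shift, and video length used here are assumptions that depend on the local OpenCV build):

import cv2
import numpy as np

def mpeg_degrade(image, shift=2, fps=10, n_frames=10, path="tmp_degraded.mp4"):
    """Write a short video of the slightly shifted face on a black canvas,
    encode it with a lossy codec, and read one frame back.
    `image` is an 8-bit BGR array of shape (H, W, 3)."""
    h, w = image.shape[:2]
    canvas_h, canvas_w = h + n_frames * shift, w + n_frames * shift
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(path, fourcc, fps, (canvas_w, canvas_h))
    for k in range(n_frames):
        canvas = np.zeros((canvas_h, canvas_w, 3), dtype=np.uint8)
        canvas[k * shift:k * shift + h, k * shift:k * shift + w] = image
        writer.write(canvas)
    writer.release()

    cap = cv2.VideoCapture(path)
    ok, frame = cap.read()  # retrieve the first (compressed) frame
    cap.release()
    return frame[0:h, 0:w] if ok else None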

5.3.2 Results and Observations


Ablation Study. Table 5.4 shows the results of recognition score improvement for our
framework trained with different loss functions and network components. This table
explains the importance of the discriminator and the prior network. The performance
drops if any of these components are absent. Moreover, it shows the importance of
different loss function components and the performance drops when they are absent.


Gallery              Positive pair   Enh. pos. pair   Negative pair   Enh. neg. pair
ID-1 Sim. Score:     0.4505          0.5402           -0.0622         0.0012
ID-2 Sim. Score:     0.2779          0.3444           -0.0404         0.0028
ID-3 Sim. Score:     0.2205          0.2932           0.0848          0.0732

Table 5.1: Cosine similarity score w.r.t. gallery for input image and enhanced images
for positive pairs. The score for a random negative pair is also shown for reference.
These enhanced images are obtained from the network trained using our proposed loss
function. Shown images are taken from SCFace dataset. Best viewed in color.


Left to Right - ARCNN, IEGAN, IRCNN, Ours

Figure 5.5: A qualitative comparison of the image level recovery compared to artifact
and noise removal algorithms. Since our algorithm is not targeted towards perceptual
recovery of image, our output is not very pleasing to the eyes. Best viewed in color.

Figure 5.6: ROC for ArcFace on SCFace dataset. XYZ→ Φ denotes that the image was
enhanced using algorithm XYZ and then recognition performance was calculated using
algorithm Φ (ArcFace). There is an improvement of 2.55% for ArcFace [2] for the AUC
after feature restoration. Best viewed in color.


Similarity and P-N margin scores from Table 5.1

ID   Enhanced Score Difference   P-N margin (non-enhanced)   P-N margin (enhanced)
A    0.0897                      0.5127                      0.5390
B    0.0665                      0.3183                      0.3416
C    0.0727                      0.1357                      0.2200

Table 5.2: This table is a supplement to Table 5.1. The rows in the first column are the
corresponding IDs. The second column is the score enhancement for the positive pair
(Col. 2 and Col. 3 in Table 5.1). The third column is the positive-negative pair margin
for non-enhanced images (Col. 2 and Col. 4 in Table 5.1), and the fourth column is the
positive-negative pair margin for enhanced images (Col. 3 and Col. 5 in Table 5.1).

Algorithm Fréchet Inc. Dist. Inception Score


ARCNN [7] 103.509 3.943
IRCNN [8] 95.149 4.048
IEGAN [45] 95.9492 2.794
Ours 128.480 3.118

Table 5.3: Fréchet Inception Distance and Inception Scores for different algorithms for
the SCFace dataset. As expected, our algorithm has very high FID and low IS since it is
not targeted to enhance perceptual quality.

Quantitative Results

Figure 5.9 shows an important result of our method. The green line shows the ROC for
the enhanced images and the blue line shows the ROC for the original image. The AUC
of the green curve is 4.2% higher than the blue curve, which implies a better separation
between different identities and thus better face recognition performance. Our method
boosts the performance of ArcFace [2] significantly, and its ROC is shown in Figure 5.6.
Table 5.1 shows the cosine similarity scores for input images and enhanced images with
respect to the gallery images for both positive pairs and negative pairs. From Table
5.2 we can observe that the scores have increased for the positive pairs which imply
better matching. At the same time, we can see that the margin for the negative pairs
has also increased after enhancement, which implies better separation. Table 5.6 shows


Components              TAR%@FAR
Dis    L_MET     10%      1%      0.1%    0.01%   0.001%
✗      ✓        82.62    61.97   46.36   32.90   23.39
✓      ✗        84.62    63.95   48.74   31.92   19.47
✓      ✓        88.02    67.90   51.01   39.19   33.35

Table 5.4: True Accept Rate (TAR) at different False Accept Rates (FAR) with different
settings for the network and loss functions on the SCFace dataset. The last row shows the
results of the complete framework.

Algorithm dlib [103] SSD [29] MTCNN [32] S3FD [1]


Original 90.35% 59.27% 91.96% 96.01%
ARCNN [7] 87.38% 55.91% 90.94% 96.01%
IRCNN [8] 87.66% 59.09% 90.56% 95.84%
IEGAN [45] 89.02% 50.91% 87.76% 92.69%
Ours 95.31% 86.36% 95.36% 99.09%

Table 5.5: Detection rates for different algorithms on SCFace dataset for different image
enhancement algorithms.

Algorithm          TAR%@FAR
                   5%      1%      0.1%    0.01%   0.001%
ARCNN [7] 77.39 65.25 50.62 38.39 29.72
IRCNN [8] 75.78 61.29 47.41 33.64 21.15
IEGAN [45] 76.82 63.09 48.82 34.71 20.21
Ours 82.06 67.90 51.01 39.19 33.35

Table 5.6: Comparison of the recognition performance of our method with the state-of-
the-art artifact and noise removal algorithms for SCFace Dataset.


Figure 5.7: ROC for CosFace on SCFace dataset. XYZ→ Φ denotes that the image was
enhanced using algorithm XYZ and then recognition performance was calculated using
algorithm Φ (CosFace). There is an improvement of 1.95% for CosFace [77] for the
AUC after feature restoration. Best viewed in color.

the performance comparisons with different state-of-the-art artifact removal algorithms. Thus, the
identity features of the faces at a low level are important for superior performance on
recognition problems.
Face Detection: Table 5.5 shows the detection performance for various algorithms,
which improves by a substantial margin when the images are enhanced using our method,
especially for SSD+MobileNetV2 [29, 104] where there is an improvement of 27.09%.
Note that images enhanced using perceptual image enhancement methods [7, 8, 45] per-
form worse on detection algorithms, as these methods do not operate on facial features but on
perceptual quality, which is useless for this purpose. This highlights the importance of
facial feature enhancement.
Face Recognition: Table 5.6 shows the recognition performance after image enhance-
ment using different algorithms. The numbers in bold denote the best result. We can
see in Figure 5.8 that the area under the ROC increases using our method, which im-
plies a better separation between different identities. This leads to a significant boost
of face recognition performance for state-of-the-art face recognition algorithms. Thus,
the identity features of the faces at a low level are essential for superior performance on
detection and recognition problems.

Figure 5.8: ROC for FaceNet on SCFace dataset. XYZ→ Φ denotes that the image was
enhanced using algorithm XYZ and then recognition performance was calculated using
algorithm Φ (FaceNet). There is an improvement of 0.9% for FaceNet [65] for the AUC
after feature restoration. Best viewed in color.

Qualitative Results

It is to be noted that this method does not guarantee good visual or perceptual quality
of the image to human eyes. Most artifact and noise removal algorithms try to achieve
good perceptual quality, and that is evident from Table 5.3. Figure 5.5 shows the
qualitative comparison between algorithms trying to achieve better perceptual quality
and our algorithm, which achieves better recognition performance.

5.4 Conclusion
We presented a deep generative adversarial network that sets a new state-of-the-art for
face recognition enhancement for low-quality images by restoring their facial features.
We have highlighted some limitations of the existing face recognition algorithms, chiefly
the drop in accuracy on degraded images. We have tried to overcome those limitations
by augmenting the input images to increase the separation among different classes. We
have used a GAN to recreate the facial images in a way that enhances the discriminative
features of the facial image. In particular, we have explored the possibility of improving
face recognition performance on those images for which the current state-of-the-art fails
to deliver good results. We have formulated a loss function involving metric learning to
learn the important discriminative features of the face. Our proposed loss function works
best on real-world images that are inherently of low quality, such as those in SCFace.
The use of metric learning during training helps to generate facial images that preserve
their identity. This, rather than good perceptual quality, is what is essential for improving
face recognition performance, and our proposed algorithm successfully demonstrates
that. It is also worth noting that this method can be used in conjunction with any existing
face recognition algorithm.

Figure 5.9: A closer look at the ROC curve for different enhancement algorithms on
SCFace dataset. Best viewed in color.

Chapter 6

Synthesis of Frontal Face Pose by


Incremental Addition of Information
from Multiple Occluded Images using
GANs

6.1 Introduction
Face detection and recognition from low-quality and occluded images are challenging
problems. A viable solution for these kinds of problems is highly demanded in real-
world applications, especially in the field of security. Occluded faces reduce the perfor-
mance of face recognition algorithms because many facial details can be absent in the
images, which are crucial for the recognition of faces with high confidence. Good de-
tection and recognition performance is highly desirable in these scenarios and we have
explored this possibility in this chapter.
To solve the problem of detecting/recognising faces with high accuracy, different
approaches have been proposed [1], [2], [65], [72], [68], [32]. Most recognition
algorithms attempt to minimise the distance between similar identities and maximise
the distance between dissimilar identities, thus maximising the overall face recognition
performance. These algorithms perform very well on clean facial images, but the per-
formance decreases on degraded images. We have shown in the last chapter that we can
overcome some limitations by using a generative adversarial network which restores
some discriminative facial features and helps in better face recognition. There are still
some limitations to this method, particularly for images with occlusions. Our goal in
this chapter is to deal with occluded images for improved face recognition performance.

Figure 6.1: Our proposed algorithm takes multiple occluded images and generates a
frontal face as output while preserving the identity. Best viewed in color.
Several solutions have been proposed to solve the problem of facial occlusion over
the years. They range from sophisticated statistical methods to dividing the face image
into a set of local regions. The first reconstruction-based solution from a single image
was proposed by Jia et al. [78] in 2008. Although it performed fairly for very small
occlusions, it was unsuitable for larger occlusions. Ekenel et al. [79] have shown that
the main reason for the low performance is erroneous facial feature localisation, which
was an important discovery. Andres et al. [80] tackled the problem using compressed
sensing theory which was robust only to certain types and levels of occlusion. Su et
al. [81] proposed a solution which used the information from the non-occluded part
of the face for recognition. The work of Wen et al. [82] involves structured occlusion
coding, which also revolves around using the non-occluded information to recognise the
face correctly. Wan et al. offered a solution called MaskNet [83] which is a masking
based solution that masks the occluded area and is assigned a lower importance. Song et
al. [84] proposed a similar kind of solution which is based on a mask learning strategy
to find and discard corrupted feature elements for recognition. The latest research done
in this area was by Xiao et al. [85] which was based on occlusion area detection and
recovery. It is quite evident from these previous works that the focus of the research
was mostly on avoiding the occlusion and making use of the rest of the information.


However, our interest lies in restoring the information lost due to occlusion.
Occlusion results in complete modification of information. Neighbouring pixels can
be used for interpolation to restore the information at a local level as long as the oc-
clusion is very small, but for larger occlusions, the restoration is very difficult. As we
have seen, most of the previous research was based on using the visible information
available. The main problem with that is a low confidence score for face matching and
a high number of false positives. There is also an important factor to be taken into con-
sideration: face recognition for occluded faces requires completely different algorithms
which are trained to work with occluded faces. It is highly likely that these algorithms
will be unable to perform well for the recognition of complete unoccluded faces. Our
aim is to fix the problem at the root, i.e., the occlusion. This can be considered more
like a preprocessing step where we are making sure that the face recognition algorithm
receives only high-quality unoccluded faces for evaluation.
Since most of the lost information cannot be recovered from a single image, using
multiple images is a possible option to restore the lost information. Here we are solving
the problem from a new direction. Instead of using limited available information, we
are making use of more information that can be gathered from other images. The basis
of this research approach has been motivated from information theory. We have shown
in Section 6.3.1 why the best solution cannot be obtained using a single image and why
we need multiple images, if available, for a better solution to this problem.
In this chapter, we propose a method which uses multiple images of the same iden-
tity with different poses and occlusions to synthesise a frontal face pose image of the
concerned identity. The main idea of this research is to collect useful information from
each image and incrementally add information to create the frontal face pose. The infor-
mation lost in a certain part of an image due to occlusion can be retrieved from another
image if that part remains unoccluded. We achieve this by using a generative adversar-
ial network, which takes multiple inputs and gives a single image at the output of the
generator. Our network gathers useful information from each of those input images and
incrementally adds the information to get a final output which resembles the frontal face
of the person.
To sum up, the main contributions are as follows.

• To the best of our knowledge, we are the first to propose a GAN based solution to
restore the identity by reconstructing the occluded area. Our method synthesises
the frontal face pose by extraction of useful information from multiple occluded
images.

• We have shown the working principles, behaviour, and limitations of our proposal
by making use of the science behind information theory.

• We have verified and shown through images and experiments that the increased
number of input images improves the quality of the output.

6.2 Related Work


To the best of our knowledge, the synthesis of frontal face poses using multiple occluded
images has not been carried out before. However, several past works are worth studying.
These provided us with a base upon which we built our motivation for this research. We
will describe that in this section.
Jia et al. [78] first proposed a reconstruction-based solution from a single image.
This performed fairly with very small occlusions (maximum 10×10 pixels), but the per-
formance fell drastically with larger occlusions. This is quite expected as information
retrieval for smaller occlusions can be done by using neighbouring pixels’ information.
However, the difficulty and error probability increases with increased occlusion. This
also leads to worse recognition performance. Ekenel and Stiefelhagen [79] have inves-
tigated the main reasons for the low recognition performance of occluded images. They
have shown that the main reason for the low performance is erroneous facial feature lo-
calisation. They also discovered that improved face alignment increases the recognition
rate. Although not reconstruction-based solutions, several other studies have been car-
ried out to tackle the occlusion problem. Andres et al. [80] dealt with the problem of
face recognition from occluded images using compressed sensing theory. This method
is only robust to certain types and levels of occlusion and this becomes a big limitation
for real-world scenarios. At a certain point, regression analysis became quite popular
for dealing with occlusions. The basic idea was to recover the occluded area by using
clean training samples and use the reconstructed images for recognition. However,
this method is not very practical in real-world scenarios because, in most cases, we will
not have the clean training samples that are used in the regression analysis. Su et
al. [81] have shown that doing this can degrade the performance since a lot of noise was
introduced in the recovery procedure. They used the information from the non-occluded
part of the face for recognition. Wen et al. [82] proposed a structured occlusion coding
to address the problem of recognition of occluded faces. For their solution, they have
considered that occlusion is predictable, which in the real world scenario might not be
true. This method also revolves around using the non-occluded information to recognise
the face correctly. Wan et al. offered a solution called MaskNet [83] which is a masking
based solution that masks the occluded area. It is trained to generate different feature
map masks for different occluded face images and automatically assign lower weights
to those that are activated by the occluded facial parts and higher weights to the hidden
units activated by the non-occluded facial parts. Song et al.’s [84] solution is similar,
which is based on a mask learning strategy. It finds and discards corrupted feature ele-
ments for recognition. The latest research done in this area is by Xiao et al. [85] which
is based on occlusion area detection and recovery. They used a robust principal compo-
nent analysis algorithm and a cluster-based saliency detection algorithm to achieve the
processing of face images with occlusion. These methods mostly consider synthetic oc-
clusion or other forms of occlusion like sunglasses or scarves. However, none of these
methods have considered a major form of occlusion, which is the nonfrontal face pose.
Since nonfrontal face pose is a major form of face occlusion, our proposed method
heavily relies on frontal face synthesis. Several methods have been proposed for frontal
face view synthesis till date. This problem is quite challenging due to its ill-posed na-
ture. There are several traditional methods available which address this problem by
statistical modelling [105] or by a 2D or 3D local texture warping [106], [107]. Huang
et al. [13] proposed a method for photorealistic frontal view synthesis from a single face.
They proposed a generative adversarial network called TPGAN [13] for photorealistic
frontal view synthesis by perceiving global structures and local details simultaneously.
The generator of the TPGAN consists of two branches, one of which deals with the
local consistency and the other deals with the global consistency of the facial structure.
Facial landmark detection and alignment plays an important role in frontal face synthe-
sis. Several methods have been proposed in the literature, for example, the MTCNN [32]
detector proposed by Zhang et al., where they propose a deep cascaded multitask frame-
work, which exploits the inherent correlation between detection and alignment to boost
up their performance. S3FD [1] is another state-of-the-art face detection network
proposed by Zhang et al., which is a single shot scale-invariant face detector. Bulat and Tz-
imiropoulos proposed an algorithm called Face Alignment Network (FAN) [108] where
a simple approach is taken for face recognition which gradually integrates features from
different layers of a facial landmark localisation network into different layers of the
recognition network.
To improve the quality of our solution, we are using a generative adversarial net-
work. The discriminator is an essential part of our network and various design guide-
lines are found in the literature. Radford et al. [33] have proposed some effective dis-
criminator design guidelines for improved training of a GAN. Schönfeld et al. [109] pro-
posed a U-net based discriminator which evaluates the generator output from a global
perspective as well as a local perspective. This has been proved to be very effective in
driving the generator to create more accurate solutions.

6.3 Method
In this section, we will discuss in detail how we have designed the algorithms and the
loss functions. First, we explain the basis and motivation of this research
direction from an information theory perspective. This will help us understand why
multiple images will help boost the recognition performance. Then, we will move on to
the network design of the generator and the discriminator and then to the loss functions.

6.3.1 Hypothesis
We have an image I G which is either occluded or corrupted with artifacts and noise. If
XIG is the set of all information {IiG = I1G , ..., InG } that I G could provide, and P(IiG ) is
the probability of IiG ∈ XIG , the entropy of our information source will be denoted by H
where
H = -\sum_i P(I_i^G) \log\big(P(I_i^G)\big).
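As a small concrete illustration of this quantity (purely illustrative; the probabilities are placeholders):

import numpy as np

def entropy(p):
    """H = -sum_i p_i log p_i for a discrete information source (in nats)."""
    p = np.asarray(p, dtype=np.float64)
    p = p[p > 0]                      # ignore zero-probability events
    return float(-np.sum(p * np.log(p)))

print(entropy([0.5, 0.25, 0.25]))     # ~1.04 nats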

Ideally, we are aiming to obtain I R,S for best recognition performance where I R,S
is a hypothetical image which can provide best recognition performance for the same
identity as I G , also described as the mugshot as seen in the previous chapter. Let {IiR,S =
I1R,S , ..., InR,S } be the set of all information that I R,S provides, and by definition, this is
the information we need for best recognition of the identity in question. According to
information theory, the mutual information of I G relative to I R,S is

I(I^G; I^{R,S}) = \mathbb{E}_{I^G, I^{R,S}}\big[SI(I_i^G, I_i^{R,S})\big] = \sum_i P(I_i^G, I_i^{R,S}) \log \frac{P(I_i^{R,S} \mid I_i^G)}{P(I_i^{R,S})}. \qquad (6.1)

Here I denotes the mutual information and SI denotes the specific or pointwise mu-
tual information (PMI). PMI is a measure of association which refers to single events,
and mutual information is the expectation over all PMIs. Although, theoretically, I(I G ; I R,S ) ≤
I(I R,S ; I R,S ), for practical cases we can safely assume that I(I G ; I R,S ) < I(I R,S ; I R,S ).
Let us consider that I(I G ; I R,S ) + α = I(I R,S ; I R,S ), where α is the set of useful informa-
tion that contains in I(I R,S ; I R,S ) but is missing from I(I G ; I R,S ). Our target is to design a
function that can recover all or most of the information contained in α. For ease of the
reader, we denote I(I R,S ; I R,S ) as I(I R,S ) and I(I G ; I R,S ) as I(I G ) from now on.
Since it holds true for all real cases that information cannot be created from thin
air, for any given function f , I( f (I G )) < I(I R,S ) if I(I G ) < I(I R,S ). Let there be a set
of C different images belonging to the same identity. Considering G to be a generator
function, I(G(I G j )) < I(I R,S ) for each j, where j is the index of the images in set C.
The set of mutual information X that can be obtained from all the images in set C
will be the union of the mutual information in I(G(I G j )) for all j. Mathematically, this
can be expressed as

X = \bigcup_{j=1}^{C} I\big(G(I^{G_j})\big).

This gives us two possibilities.

• Case 1: The first possibility is that the information contained in X is less than the
information contained in the hypothetical image I R,S . In that case, we have

\bigcup_{j=1}^{C} I\big(G(I^{G_j})\big) = I(I^{R,S}) - \alpha \qquad (6.2)

• Case 2: The second possibility is that the information contained in X is equal to
the information contained in the hypothetical image I R,S . In that case, we have

\bigcup_{j=1}^{C} I\big(G(I^{G_j})\big) = I(I^{R,S}) \qquad (6.3)

Either of the above cases are possible in a real world scenario, which is completely
dependent on the available input information, i.e., {I G1 , ..., I GC }. It can be assumed
that if we have a sufficient number of input images from different times and angles, we
can gather sufficient information for each individual region of the face. This means the
probability of satisfying equation 6.3 is high. If it is difficult to obtain the information
of each individual region of the face from all available input images, as it might happen
in some cases, we assume that equation 6.2 is satisfied. This assumption leads to the
following equation
I\big(G(I^{G_j})\big) < \bigcup_{j=1}^{C} I\big(G(I^{G_j})\big) \leq I(I^{R,S}). \qquad (6.4)

Since our generator takes multiple images as inputs, we can modify equation 6.4 ac-
cordingly to get our final equation

I\big(G(I^{G_j})\big) < I\big(G(\{I^{G_1}, ..., I^{G_C}\})\big) \leq I(I^{R,S}). \qquad (6.5)

We have carried out our research based on the equation 6.5 and our objective is to find
the parameters θ of G such that θ = \arg\max_{\theta} I\big(G(\{I^{G_1}, ..., I^{G_C}\})\big).
We aim to estimate a clean and occlusion free facial image f (I G ) from an image I G ,
in which much information is lost due to various artifacts and/or occlusions. The first
step is to crop and align the faces in the frame. This step is important to maintain the
consistency of the geometric coordinates of different parts of the face like eyes, nose,
mouth, etc. The next step is to get the available landmarks or coordinates of the eyes,
nose, and mouth. These two steps are the preprocessing part of the pipeline and are
accomplished using the Face Alignment Network [108] by Bulat and Tzimiropoulos.
Once the landmarks are obtained, we crop out four sections of the face, viz. the left eye, right
eye, nose, and mouth. The cropping is done with sufficient padding to make sure that
the whole eye, nose, or mouth is in the cropped part. Next, these parts are passed on
to a rotation network where they process the local textures for the four main landmarks
so that their orientation matches the one from the frontal face pose. Each part is now
assembled at its proper location to get the frontal face pose from all input images.
In parallel to the rotation network, the whole input image is passed to another network
which processes the global structure of the face. These images are now passed to the
fusion network to get a single output image. Figure 6.2 shows the block diagram of the
complete pipeline.

Figure 6.2: Block diagram of the proposed network.
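A minimal sketch of the part-cropping step, given landmark coordinates, is shown below; crop_part is a hypothetical helper and the landmark positions are placeholders, since the actual landmarks come from the Face Alignment Network [108], which is not reproduced here:

import numpy as np

def crop_part(image, center_xy, size=32, pad=8):
    """Crop a square patch of (size + 2*pad) pixels around a landmark centre
    (e.g. an eye, the nose tip, or the mouth), clipped to the image borders."""
    x, y = int(center_xy[0]), int(center_xy[1])
    half = size // 2 + pad
    h, w = image.shape[:2]
    x0, x1 = max(0, x - half), min(w, x + half)
    y0, y1 = max(0, y - half), min(h, y + half)
    return image[y0:y1, x0:x1]

# Example: crop the four parts from a 128x128 aligned face given landmark centres.
face = np.zeros((128, 128, 3), dtype=np.uint8)
landmarks = {"left_eye": (44, 52), "right_eye": (84, 52),
             "nose": (64, 74), "mouth": (64, 98)}
parts = {name: crop_part(face, xy) for name, xy in landmarks.items()}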

6.3.2 Approach
In this section, we will describe each component of the whole image reconstruction
pipeline shown in Figure 6.2 and explain how each of those works. The first part of the
pipeline consists of an align and crop section. The main function of this block, as the
name suggests, is to align the face images horizontally and vertically. This ensures that
the eyes, nose, and the mouth are approximately in the same position in every image.
The irrelevant parts of the image are also cropped out to make sure that all images are
of the same scale. This is done with the help of a pretrained network. Thus, this step is
more of a preprocessing step.
Once the images are aligned and cropped, they are then sent to the main network for
further processing. The main network consists of a generator and a discriminator. The
generator consists of two branches, viz. a global network and a local network. The local
network processes the image in specific areas. It first cuts out the two eyes, nose and
the mouth according to the landmark coordinates which are obtained during alignment
and cropping. Next, they are sent to the part rotation network for front pose generation.

Figure 6.3: Figure explaining the process of alignment, cropping and segmentation.

Part Rotation Network

The part rotation network consists of a UNet based architecture. The downsampling
layers consist of blocks of convolution, batchnorm, relu, and a residual block. The
upsampling layers consist of blocks of convolution, batchnorm, relu, residual block,
transposed convolution, batchnorm, and a final activation. At the output of the network,
there are 2 branches of the network. One gives the feature vector of the output image,
and the other branch is passed on to a convolution layer to give the final output im-
age. Figure 6.4 shows the architecture of the part rotation network. The parts are then
recombined to get the image of the full face.
The global network tries to rotate the face from a global perspective. Figure 6.5
shows the complete generator block. The global network is a UNet based structure
which works on different scales. We can see in the figure that there are several resize
blocks at the input which downscales the image so that the global network can work
on different scales. The outputs from the global and local branches are then passed on
to the fusion network. This part of the network combines all information together to
obtain the final image. This image is then sent to the discriminator for evaluation and
feedback at a local and global level. The details of the main network are discussed in
the following sections. To estimate the enhanced image for a given corrupted
image, we train a generator network as a feed-forward CNN GθG parameterised by θG .
Here θG = W1:N ; b1:N denotes the weights and biases of an N-layer deep network and
is obtained by optimising a loss function L. The training is done using two sets of n
images {IkGT : k = 1, 2, ..., n} and {IkG : k = 1, 2, ..., n} such that IkGT = GθG (IkG ) (where
IkGT and IkG are corresponding pairs) and by solving

\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{n} \sum_{k=1}^{n} L\big(I_k^{GT},\, G_{\theta_G}(I_k^G),\, I_k^{R,S}\big). \qquad (6.6)

We have dropped the subscript k for simplicity and will continue to do so; the symbols
will always refer to corresponding pairs unless otherwise stated.

6.3.3 Generator
Our generator architecture is inspired by the generator proposed by Huang et al. [13]. It
consists of two branches, one of which processes parts of the image, i.e., the two eyes,
nose and mouth, and another branch which processes the input image as a whole. The
outputs of these two branches are then concatenated together. What we do next is to use
all images in the batch and take point-wise maximum in the axis along the batch. This is
where the information from multiple images are collected and the useful information is
fused together. They are then passed through some convolutional layers to get the final
output image. Details of the branches are described in the following sections.

Generator Global Network

The global network consists of an encoder and a decoder, which are connected through
a bottleneck which is a fully connected layer. The final layer of the encoder contains
a fully connected layer with 512 outputs. The next layer is a max-out function fmaxout


Figure 6.4: Network architecture of the facial landmarks part rotation network.


Figure 6.5: Network architecture of the generator inspired by [13]


which operates in the following way:

f_{maxout}(l) = \big[\, \max(l[0],\, l[256]),\ \max(l[1],\, l[1+256]),\ \ldots,\ \max(l[255],\, l[255+256]) \,\big]^{T}

Here l is the fully connected layer immediately before the max-out layer. fmaxout (l) is
then fed directly into the decoder. Several layers from the encoder are fed to the decoder
through skip connections having residual blocks in between. Figure 6.5 shows the de-
tailed architecture of the global network. The output of the global generator network is
then fed to the fusion network.
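The max-out operation above has a one-line NumPy equivalent (assuming the 512-dimensional fully connected output described in the text):

import numpy as np

def f_maxout(l):
    """Element-wise maximum of the two 256-d halves of a 512-d vector."""
    l = np.asarray(l)
    return np.maximum(l[:256], l[256:])

bottleneck = np.random.randn(512)
print(f_maxout(bottleneck).shape)  # (256,)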

Generator Local Network

The local generator network acts as a low-level transformation network whose job is de-
tailed frontal pose feature representation synthesis of the eyes, nose, and mouth. Figure
6.4 shows the detailed architecture of the local generator network. This network also
takes the form of an encoder-decoder structure, but it is a fully convolutional network.
Skip connections are present in the network, which facilitates direct information prop-
agation from the lower layers to the upper layers. There are 2 outputs for this branch.
One is the feature representation of the face part in question, and the other is the output
image itself. The penultimate layer gives the features, which is followed by a convo-
lution layer with 3 output channels giving the required image representation. The next
step is the assembly of the different parts in their respective positions. This
is done using the landmark information which we have already acquired earlier. Once
this is done, the result is then fed to the fusion network for further processing.

Fusion Network

The main function of the fusion network is to combine multiple output images into a
single frontal pose image. The output of the generator global and local network is fed
to the fusion network. This part of the generator consists of a residual block followed
by a convolution with batch normalisation and ReLU activation. This is followed by
a batch-wise max-out layer. This layer takes the maximum value of the corresponding
cell from all image features and returns the feature representation for a single frontal
pose image. This is then passed through another residual block followed by a couple of
convolution layers to get the final output.
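The batch-wise max-out at the heart of the fusion can be sketched in NumPy as follows (an illustration only; in the actual network it operates on the feature maps produced by the preceding residual block rather than on raw arrays):

import numpy as np

def batchwise_maxout(features):
    """features: (num_input_images, H, W, C) feature maps of one identity.
    Returns a single (H, W, C) map taking the point-wise maximum over the
    input images, fusing the most salient information from each view."""
    return np.max(features, axis=0)

feats = np.random.rand(4, 32, 32, 64)   # features from 4 occluded inputs
fused = batchwise_maxout(feats)          # (32, 32, 64)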

6.3.4 Discriminator
The discriminator is an essential part of a generative adversarial network which is in
charge of evaluating how well the generator is performing. Inspired by the architec-
tural guidelines set by Schönfeld et al. [109], we design our discriminator in the form
of a u-net with skip connections. The discriminator has an encoder which performs a
high-level evaluation of the image and a decoder which performs a low-level evalua-
tion. Outputs from the various layers of the encoder are directly fed to the decoder for
information propagation.

Encoder Part

The architecture of the encoder part of the discriminator is in line with the guideline set
by Huang et al. [13]. The encoder consists of 3 convolution blocks with leaky ReLU
activation and strides=2. The first convolution layer consists of 64 filters, and each
succeeding layer contains twice the number of filters of the preceding block. The filters
are of size 3 × 3. These layers are followed by a couple of residual blocks, followed by
another convolution block. The final layers consist of a convolution layer with 1 filter
of size 1 × 1 and stride=1. This gives an output of size 2 × 2. In parallel to this final
layer, there is another fully connected layer which is connected to the decoder.

Decoder Part

The architecture of the decoder part of the discriminator is the same as the decoder
shown in Figure 6.5.

6.3.5 Generator Loss Functions


The generator loss consists of five counterparts, each one performing a specific task
to drive the generation of images towards the frontal pose manifold. The following
subsections describe each loss counterpart in detail.

Pixel-wise Loss

Since the information in the input images is corrupted, it is important to learn to re-
construct the occluded regions and drive the reconstruction towards the clean image
manifold. While learning this task, it is important to preserve the geometrical consis-
tency of the image both from a local and a global perspective. This is done by introducing a
pixel-wise loss L_{px} in the cost function, where

L_{px}(G_{\theta_G}) = \left\| I^{GT} - G_{\theta_G}(I^G) \right\|_2. \qquad (6.7)

Symmetry Loss

Human faces are inherently quite symmetrical. This loss counterpart makes sure that
each side of the face is a mirror image of each other. This loss represents the pixel-wise
difference between the two longitudinal halves of the image. For an image with height
H and width W , this loss can be expressed as
L_{sym}(G_{\theta_G}) = \frac{1}{(W/2) \times H} \sum_{x=1}^{W/2} \sum_{y=1}^{H} \left\| G_{\theta_G}(I_i^G)_{x,y} - G_{\theta_G}(I_i^G)_{W-x+1,\,y} \right\|_2, \qquad (6.8)

where x, y are the pixel indexes of the output image along the row and column, respec-
tively.
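A NumPy sketch of this symmetry term is given below (illustrative; it assumes an even image width and a channel-last image array):

import numpy as np

def symmetry_loss(img):
    """Mean per-pixel L2 difference between the left half of the generated
    image and the mirrored right half (cf. Eq. 6.8); assumes even width."""
    h, w = img.shape[:2]
    left = img[:, : w // 2].astype(np.float64)
    right_mirrored = img[:, w // 2:][:, ::-1].astype(np.float64)
    return float(np.mean(np.linalg.norm(left - right_mirrored, axis=-1)))

print(symmetry_loss(np.random.rand(128, 128, 3)))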

Identity Feature Loss

As the preservation of identity is important for the network, the objective function needs
to evaluate the identity vector of the generated images with respect to the ground truth
image. Following Equation 5.2, the identity feature loss LID (GθG ) is given by the dis-
tance between the vector representations of the generated face and the ground truth. If
we consider Φ to be a pretrained network which gives the vector representation of the
facial images, then the Euclidean distance will be given by

L_{ID}(G_{\theta_G}) = \left\| \Phi(I_i^{GT}) - \Phi\big(G_{\theta_G}(I_i^G)\big) \right\|_2 \qquad (6.9)


and the cosine distance will be given by

L_{ID}(G_{\theta_G}) = 1 - \frac{\Phi(I_i^{GT}) \cdot \Phi\big(G_{\theta_G}(I_i^G)\big)}{\sqrt{\Phi(I_i^{GT}) \cdot \Phi(I_i^{GT})}\, \sqrt{\Phi\big(G_{\theta_G}(I_i^G)\big) \cdot \Phi\big(G_{\theta_G}(I_i^G)\big)}} \qquad (6.10)

The properties of Φ determine which distance metric to use during training.

Total Variation Regularisation

Total variation regularisation can be used to remove some noise and unwanted spikes
from the image, which will lead to a more consistent image representation. Aly et
al. [110] have shown in their work that a variation regulariser of the functional form

J_s(f) = \int_{W} L\big(\lVert \nabla_x f(x) \rVert\big)\, dx

(refer to [110] for details) can be optimised to remove unwanted details while preserving
important details such as edges without introducing ringing or other artifacts. Using a
similar idea for digital images, the total variation can be given by

L_{TV}(G_{\theta_G}) = \sum_{i=1}^{x} \big| \Phi\{G_{\theta_G}(I^G)\}_{i,j} - \Phi\{G_{\theta_G}(I^G)\}_{i+1,j} \big|
                     + \sum_{j=1}^{y} \big| \Phi\{G_{\theta_G}(I^G)\}_{i,j} - \Phi\{G_{\theta_G}(I^G)\}_{i,j+1} \big|, \qquad (6.11)

where x, y are the pixel index along the row and column, respectively.
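A NumPy sketch of this regulariser applied directly to an image array follows (a simplification; Eq. 6.11 is written in terms of the feature maps Φ of the generated image, whereas here the differences are taken over raw pixel values):

import numpy as np

def total_variation(img):
    """Anisotropic total variation: sum of absolute differences between
    vertically and horizontally neighbouring values (cf. Eq. 6.11)."""
    img = img.astype(np.float64)
    dv = np.abs(img[1:, :] - img[:-1, :]).sum()   # row-wise neighbours
    dh = np.abs(img[:, 1:] - img[:, :-1]).sum()   # column-wise neighbours
    return float(dv + dh)

print(total_variation(np.random.rand(128, 128, 3)))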

Adversarial Loss

The adversarial loss consists of the local loss and the global loss counterparts of the
discriminator. The details about the loss of the discriminator are discussed in detail in
Section 6.3.6. The global loss is given by the encoder part of the network and can be
expressed as

L_{glob}(D_{\theta_E}) = -\frac{1}{4} \sum_{p=1}^{2} \sum_{q=1}^{2} \big\{ C \cdot \log\big(D_{\theta_E}(I)_{p,q}\big) + (1 - C) \cdot \log\big(1 - D_{\theta_E}(I)_{p,q}\big) \big\}, \qquad (6.12)


where DθE is the encoder part of the discriminator network, C = 1 for I = IiGT and
C = 0 for I = GθG (IiG ). The local loss is given by the decoder part of the network and is
expressed as

L_{loc}(D_{\theta_D}) = -\frac{1}{x \cdot y} \sum_{i=1}^{x} \sum_{j=1}^{y} \big\{ C \cdot \log\big(D_{\theta_D}(I)_{i,j}\big) + (1 - C) \cdot \log\big(1 - D_{\theta_D}(I)_{i,j}\big) \big\}, \qquad (6.13)

where D_{\theta_D} is the discriminator network, i, j are the pixel indexes for row and column
respectively, and x, y are the rows and columns, respectively. The adversarial loss is
calculated by considering how far the output image is from the real one. Thus, the
closer the output image is to the real one, the smaller the adversarial loss. Using this
concept, we can replace C by 1, which gives us

L_{glob}(D_{\theta_E}) = -\frac{1}{4} \sum_{p=1}^{2} \sum_{q=1}^{2} \log\big(D_{\theta_E}(I)_{p,q}\big) \qquad (6.14)

L_{loc}(D_{\theta_D}) = -\frac{1}{x \cdot y} \sum_{i=1}^{x} \sum_{j=1}^{y} \log\big(D_{\theta_D}(I)_{i,j}\big), \qquad (6.15)

and the final adversarial loss is the sum of the two losses, which gives us
L_{adv} = -\left( \frac{1}{4} \sum_{p=1}^{2} \sum_{q=1}^{2} \log\big(D_{\theta_E}(I)_{p,q}\big) + \frac{1}{x \cdot y} \sum_{i=1}^{x} \sum_{j=1}^{y} \log\big(D_{\theta_D}(I)_{i,j}\big) \right). \qquad (6.16)
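A NumPy sketch of this combined adversarial term is shown below, assuming the discriminator has already produced a 2 × 2 global score map and a per-pixel local score map for a generated image, both with values in (0, 1):

import numpy as np

def generator_adversarial_loss(d_global, d_local, eps=1e-12):
    """d_global: (2, 2) encoder scores, d_local: (H, W) decoder scores.
    Implements Eq. (6.16): the generator is rewarded when both score maps
    approach 1, i.e. when the discriminator believes the image is real."""
    l_glob = -np.mean(np.log(d_global + eps))
    l_loc = -np.mean(np.log(d_local + eps))
    return float(l_glob + l_loc)

d_global = np.random.uniform(0.01, 0.99, size=(2, 2))
d_local = np.random.uniform(0.01, 0.99, size=(128, 128))
print(generator_adversarial_loss(d_global, d_local))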

6.3.6 Discriminator Loss Functions


Inspired by the work of Schönfeld et al. [109], the discriminator loss in our network
consists of the local loss and the global loss counterparts and a regularisation term. The
following subsections describe each of them in detail.

Global Decision Loss

The global decision is obtained from the output generated by the encoder part of the
discriminator. Instead of using a single output in the global decision loss, we have broken the decision into 4 regions, as described by Shrivastava et al. [111]. The output of each region is evaluated separately to ensure a better global-level evaluation by


the discriminator encoder. The global loss is given by the encoder part of the network
and can be expressed as

L_{glob}(D_{\theta_E}) = -\frac{1}{4} \sum_{p=1}^{2} \sum_{q=1}^{2} \left\{ C \cdot \log(D_{\theta_E}(I)_{p,q}) + (1 - C) \cdot \log(1 - D_{\theta_E}(I)_{p,q}) \right\} , \qquad (6.17)

where D_{\theta_E} is the encoder part of the discriminator network, C = 1 for I = I^{GT} and C = 0 for I = G_{\theta_G}(I^G).

Local Decision Loss

The local decision is obtained from the output generated by the decoder part of the
discriminator. The local loss is given by the decoder part of the network and is expressed
as
L_{loc}(D_{\theta_D}) = -\frac{1}{x \cdot y} \sum_{i=1}^{x} \sum_{j=1}^{y} \left\{ C \cdot \log(D_{\theta_D}(I)_{i,j}) + (1 - C) \cdot \log(1 - D_{\theta_D}(I)_{i,j}) \right\} , \qquad (6.18)

where D_{\theta_D} is the decoder part of the discriminator network, i, j are the pixel indexes for the row and column, respectively, and x, y are the numbers of rows and columns, respectively. The local decision loss is evaluated on a per-pixel basis and is more effective when the images are augmented using the CutMix [112] technique, which is described in the following subsection.

Consistency Regularisation

Unlike the approaches proposed by Yun et al. [112] and Schönfeld et al. [109], we use both the real and the fake image in our augmentation strategy. For each real image,
a rectangular or square patch is replaced with a fake image patch corresponding to the
same region and from the same class. This is explained in Figure 6.6. For the 128 × 128
real image in the top row, we have a rectangular patch of 85 × 60, which is replaced by
a fake image. The black part represents a fake patch and the white part represents the
real part of the image. The encoder output ideally produces

\begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} for fake images, and \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} for real images.


Figure 6.6: Supplementary image for CutMix augmentation [112] with real and fake
images as explained in section 6.3.6

The red line shown in Figure 6.6 divides the whole image into 4 equal parts. Therefore, in this case, we are looking for an encoder output value of

\begin{bmatrix} 1 & 1 - \frac{(85-64) \times 60}{64 \times 64} \\ 1 & 1 - \frac{64 \times 60}{64 \times 64} \end{bmatrix} = \begin{bmatrix} 1 & 0.69 \\ 1 & 0.06 \end{bmatrix}


Similarly, for the image in the second row, we have a fake image of 128 × 128, which contains a rectangular patch of 65 × 80, and ideally this will give an encoder output of

\begin{bmatrix} 1 & \frac{(80-64) \times 60}{64 \times 64} \\ \frac{1 \times 64}{64 \times 64} & \frac{1 \times (80-64)}{64 \times 64} \end{bmatrix} = \begin{bmatrix} 1 & 0.23 \\ 0.016 & 0.004 \end{bmatrix}

For the decoder output, pixels in the fake part will ideally have a value of 0, while pixels in the real part will have a value of 1. Accordingly, we can substitute the value of C in Equations 6.17 and 6.18.
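
The sketch below illustrates this real/fake mixing and the corresponding encoder and decoder targets for Equations 6.17 and 6.18, assuming 128 × 128 images and a 2 × 2 encoder grid; the function and variable names are illustrative assumptions rather than the thesis implementation.

```python
import tensorflow as tf

def cutmix_real_fake(real, fake, top, left, height, width):
    # real, fake: [128, 128, 3] images of the same identity.
    # Paste a (height x width) fake patch into the real image at (top, left).
    hole = tf.zeros([height, width, 1])
    mask = tf.pad(hole,
                  [[top, 128 - top - height], [left, 128 - left - width], [0, 0]],
                  constant_values=1.0)               # 1 = real pixel, 0 = fake pixel
    mixed = mask * real + (1.0 - mask) * fake
    decoder_target = mask[..., 0]                    # per-pixel C for Eq. 6.18
    # Encoder target C for Eq. 6.17: fraction of real pixels in each 64 x 64 quadrant.
    quadrants = tf.reshape(mask[..., 0], [2, 64, 2, 64])
    encoder_target = tf.reduce_mean(quadrants, axis=[1, 3])  # shape [2, 2]
    return mixed, encoder_target, decoder_target
```

For example, a fully real quadrant yields a target of 1, while a quadrant covered by a 64 × 60 fake patch yields 1 − (64 × 60)/(64 × 64) ≈ 0.06, matching the worked example above.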

6.4 Experiments
Our research is based on solving the problem of occlusion without any (or minimum)
constraints. For example, we do not make a prior guess that the occlusion will be limited
to sunglasses or scarves or hats. In the real world, occlusion can appear in several forms,
ranging from a part of the face being hidden behind an object or a part of the body, to
various poses where only one side of the face is visible. We also consider the fact
that we do not have any prior information about the faces being evaluated. The input
images can be individual frames from a video or separate images from different cameras
and viewing angles. The main motivation of this research is to explore the possibility
of reconstructing the lost information in occluded images and still make sure that the
reconstructed face will have a high recognition accuracy when evaluated with different
face recognition networks. Our solution has been designed with these points in mind, and this is reflected in our experimental setup.
To validate the effectiveness of our algorithms, we have conducted extensive experiments on relevant datasets. We have also conducted ablation studies and shown how each part of the network contributes to the final results.

6.4.1 Data & Experimental Setup


Our network was trained using a private dataset of around 6.7 million images of 72,412
identities which were obtained from various sources on the internet. The faces of the
training dataset were aligned and cropped using MTCNN [32]. Random blocks of
Gaussian noise were applied to the images in the training dataset. Some images were


Algorithm      Rank   Avg. Input Accuracy   Output Accuracy              Output Accuracy
                      of all images         (Discriminator without U-net) (Discriminator with U-net)
FaceNet [65]    1     0.384                 0.391                        0.491
                3     0.532                 0.569                        0.657
                5     0.602                 0.661                        0.729
Anv v6          1     0.764                 0.734                        0.762
                3     0.839                 0.856                        0.873
                5     0.865                 0.894                        0.908
ArcFace [2]     1     0.883                 0.769                        0.765
                3     0.908                 0.873                        0.869
                5     0.916                 0.901                        0.902

Table 6.1: Recognition performance for different face recognition networks for our proposed algorithm. This table shows the results for 15 input images evaluated on the WebFace dataset.

kept untouched to add variation to the training dataset. We trained the network on an NVIDIA DGX-Station. Each training image is 128 × 128 pixels, and the network was trained with a batch size of 24. For optimisation, we use Stochastic Gradient Descent with a learning rate of 10^{-4}. We have carried out our tests on subsets of the WebFace dataset [101] and the CelebA dataset [113], a public benchmark of face images. Both the WebFace and CelebA subsets consist of 7,500 images of 150 unique identities. Our implementation is based on TensorFlow. For performance evaluation, we have compared our method with FaceNet [65], ArcFace [2], CosFace [77] and a private network called Anv v6, which was developed by Anyvision.
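
As an illustration, the random noise-block corruption described above could be implemented along the following lines; the block sizes, noise scale and probability of leaving an image untouched are illustrative assumptions, since the exact settings are not reproduced here.

```python
import numpy as np

def corrupt_with_noise_blocks(image, max_blocks=3, max_size=48, sigma=0.2,
                              keep_prob=0.1, rng=None):
    # image: [128, 128, 3] float array in [0, 1]; returns a corrupted copy.
    rng = rng or np.random.default_rng()
    if rng.random() < keep_prob:                    # some images are kept untouched
        return image.copy()
    out = image.copy()
    for _ in range(rng.integers(1, max_blocks + 1)):
        h, w = rng.integers(8, max_size + 1, size=2)
        top = rng.integers(0, 128 - h + 1)
        left = rng.integers(0, 128 - w + 1)
        out[top:top + h, left:left + w] += rng.normal(0.0, sigma, size=(h, w, 3))
    return np.clip(out, 0.0, 1.0)
```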

6.4.2 Ablation Studies


Our ablation studies focus on two main factors. The first is the effect of a varying number of input images on the performance of the network. The second is the effect of using the discriminator with and without the U-net structure. For the first test, we varied the number of input images from 5 to 50 and observed how the performance changes.
The recognition accuracy was calculated for FaceNet [65], ArcFace [2] and the private
Anv v6 network. These networks use different approaches for calculating the similarity


Algorithm      Rank   Avg. Input Accuracy   Output Accuracy              Output Accuracy
                      of all images         (Discriminator without U-net) (Discriminator with U-net)
FaceNet [65]    1     0.384                 0.424                        0.514
                3     0.532                 0.617                        0.705
                5     0.602                 0.709                        0.774
Anv v6          1     0.764                 0.790                        0.807
                3     0.839                 0.884                        0.902
                5     0.865                 0.927                        0.932
ArcFace [2]     1     0.883                 0.811                        0.817
                3     0.908                 0.894                        0.899
                5     0.916                 0.921                        0.929

Table 6.2: Recognition performance for different face recognition networks for our proposed algorithm. This table shows the results for 25 input images evaluated on the WebFace dataset.

Algorithm      Rank   Avg. Input Accuracy   Output Accuracy              Output Accuracy
                      of all images         (Discriminator without U-net) (Discriminator with U-net)
FaceNet [65]    1     0.384                 0.433                        0.541
                3     0.532                 0.624                        0.713
                5     0.602                 0.710                        0.789
Anv v6          1     0.764                 0.787                        0.821
                3     0.839                 0.901                        0.905
                5     0.865                 0.932                        0.936
ArcFace [2]     1     0.883                 0.837                        0.830
                3     0.908                 0.920                        0.905
                5     0.916                 0.937                        0.935

Table 6.3: Recognition performance for different face recognition networks for our proposed algorithm. This table shows the results for 35 input images evaluated on the WebFace dataset.


Figure 6.7: Visual example of how the output image improves with the number of input
images.

between faces. For example, FaceNet uses the Euclidean distance for calculating the
similarity between two faces, while ArcFace and Anv v6 use the angular distance. This
ensures that our method is robust to various face recognition algorithms. As expected,
we see an increase of accuracy with the increasing number of images until it saturates at
the point where all information has been gathered. Figure 6.7 shows how the output im-
age improves with increasing number of input images. For the second test, we analysed
the effect of the discriminator with skip connections on the performance of the network.
In Tables 6.1, 6.2 and 6.3, we can see that our network improves the recognition rate compared to the average accuracy of the input images. When the network is trained with the discriminator with skip connections (U-net discriminator), the performance increases in most of the cases. We have even managed to outperform ArcFace for rank-3 and rank-5 accuracy when the input is 35 images, and we closely match it in most other cases.
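
For reference, rank-k identification accuracy under either distance metric can be computed roughly as follows; the embedding inputs, names and the single-embedding-per-face gallery are illustrative assumptions.

```python
import numpy as np

def rank_k_accuracy(probe_emb, probe_ids, gallery_emb, gallery_ids, k=1,
                    metric='cosine'):
    if metric == 'euclidean':                          # e.g. FaceNet embeddings
        d = np.linalg.norm(probe_emb[:, None, :] - gallery_emb[None, :, :], axis=-1)
    else:                                              # angular (cosine) distance, e.g. ArcFace
        p = probe_emb / np.linalg.norm(probe_emb, axis=1, keepdims=True)
        g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
        d = 1.0 - p @ g.T
    top_k = np.argsort(d, axis=1)[:, :k]               # indexes of the k closest gallery faces
    hits = [probe_ids[i] in np.asarray(gallery_ids)[top_k[i]]
            for i in range(len(probe_ids))]
    return float(np.mean(hits))
```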

6.4.3 Final Results


From Section 6.4.2, we can see that the U-net discriminator described in Section 6.3.6 performs best with the proposed generator. Thus, for our final experiments, we evaluate our results based on this network combination. There are two main points we need to verify to prove the effectiveness of our proposed solution. The first is that the accuracy of the generated images must be greater than the average accuracy of all input images. This should be true for a variety of networks.


FR Algorithm    Rank   Avg. Input Accuracy          Accuracy of output image for n input images
                       of all images (50 images)    n=15     n=25     n=35     n=45

WebFace Dataset
FaceNet [65]     1     0.384                        0.491    0.514    0.541    0.536
                 3     0.532                        0.657    0.705    0.713    0.711
                 5     0.602                        0.729    0.774    0.789    0.795
Anv v6           1     0.764                        0.762    0.807    0.821    0.821
                 3     0.839                        0.873    0.902    0.905    0.906
                 5     0.865                        0.908    0.932    0.936    0.935
ArcFace [2]      1     0.883                        0.765    0.817    0.830    0.836
                 3     0.908                        0.869    0.899    0.905    0.905
                 5     0.916                        0.902    0.929    0.935    0.933
CosFace [77]     1     0.356                        0.487    0.510    0.519    0.521
                 3     0.483                        0.633    0.635    0.659    0.660
                 5     0.549                        0.685    0.689    0.711    0.712

CelebA Dataset
FaceNet [65]     1     0.535                        0.625    0.643    0.653    0.653
                 3     0.702                        0.778    0.805    0.811    0.809
                 5     0.771                        0.845    0.864    0.864    0.867
Anv v6           1     0.889                        0.896    0.911    0.925    0.925
                 3     0.931                        0.945    0.955    0.959    0.959
                 5     0.944                        0.957    0.965    0.965    0.965
ArcFace [2]      1     0.945                        0.898    0.918    0.922    0.921
                 3     0.956                        0.933    0.941    0.948    0.948
                 5     0.959                        0.951    0.953    0.959    0.959
CosFace [77]     1     0.451                        0.561    0.589    0.599    0.599
                 3     0.597                        0.697    0.726    0.734    0.733
                 5     0.665                        0.745    0.766    0.769    0.771

Table 6.4: Rank 1, 3 and 5 accuracy evaluated on different face recognition networks for our output images. Performance is evaluated using the WebFace and CelebA datasets.


Figure 6.8: Change of recognition accuracy with the number of input images. The panels show FaceNet, Anv v6, ArcFace and CosFace, each evaluated on the WebFace and CelebA datasets.


We have validated this using the FaceNet algorithm [65], the ArcFace algorithm [2] and a private network called Anv v6, developed by Anyvision. We have validated this for various numbers of input images, i.e., for 15, 25, 35, and 45 images. Table 6.4 shows the results for the FaceNet [65], Anv v6, ArcFace [2] and CosFace [77] networks. It is to be noted that ArcFace was trained on 112 × 112 images and we used the original trained model for our experiments. To keep things as fair as possible, we resized our images to 112 × 112, although originally the images and our network were prepared for evaluation at 128 × 128. Even though this is a disadvantage for us, our images compete well with the performance of ArcFace.
The second point is that the accuracy must increase with the number of input images. With every input image, some incremental information is added to the final image. However, after a certain point, when all information has been collected, additional images will only provide duplicate information. Thus, after a certain number of input images, the accuracy should remain roughly constant irrespective of how many further images are added. This should also be true for a variety of networks, and we have validated this using the FaceNet [65], ArcFace [2] and Anv v6 algorithms. Figure 6.8 shows the results for FaceNet [65], Anv v6, CosFace [77] and ArcFace [2]. The graphs plot the change of accuracy with respect to the number of input images, which has been varied from 5 to 50. From the graphs, we can see that the accuracy gradually increases with the number of input images. However, the accuracy starts to saturate when the number of input images is around 35. Adding further input images does not increase the recognition accuracy, because any further images provide information that has already been supplied by the earlier ones; this duplicate information does not help in increasing the accuracy further.

6.5 Conclusion
In this chapter, we have presented a generative adversarial network which is able to
extract relevant information from multiple images of the same identity and combine
them to synthesise a single image which is a frontal pose representation of the identity
in question. We have also shown that the image generated provides better recognition


performance compared to the average of all the input images. We have validated our
proposal by extensive experiments and have proved that information is gradually added with each image up to a certain point, after which all information received from further images is duplicate information that has already been processed. This algorithm can have useful applications in areas where we can obtain occluded faces from various angles but a single face cannot provide enough information for recognition with high confidence. The proposed method can also be used in other fields of application, which we will discuss in chapter 7.

Chapter 7

Conclusion and Future Work

Generative adversarial networks are one of the most sophisticated approaches to genera-
tive modelling using deep learning. They are quite difficult to train, but at the same time
can be quite useful for image synthesis. In chapter 3 we have described a deep genera-
tive adversarial network with skip connections that sets a new state of the art on public
benchmark datasets when evaluated with respect to perceptual quality. This network is
the first framework which successfully recovers images from artifacts and at the same time super-resolves them, thus performing two different tasks in a single-shot operation.
Using different combinations of loss functions and by using the discriminator both in
feature and pixel space, we confirm that IEGAN reconstructions for corrupted images
are superior by a considerable margin and more photorealistic than the reconstructions
obtained by the current state-of-the-art methods. Although this method was quite effi-
cient for artifact removal, it was not as good for noise removal or for simultaneous noise and artifact removal. In chapter 4 we have described a deep generative adversarial network which is the first framework to successfully recover images from artifacts, noise, or both combined, while super-resolving at the same time, thus performing different tasks simultaneously in a single-shot operation.
Moving this idea forward, we used this process to clean facial images aiming for a
better recognition performance. Unfortunately, it resulted in worse recognition perfor-
mance. This helped us better understand the nature of GANs and human perception in
general. What humans perceive as clean might often correspond to a loss of information from an information-theoretic perspective. To deal with this issue, we present another
deep generative adversarial network in chapter 5 that sets a new state-of-the-art for face

recognition enhancement for low-quality images by restoring their facial features. We
have highlighted some limitations of the existing face recognition algorithms and we
have tried to overcome those limitations by augmenting the input images to increase
the separation between different classes. In particular, we have explored the possibil-
ity of improving face recognition performance on those images for which the current
state-of-the-art fails to deliver good results. Our proposed loss function works best on
real-world images that are inherently low quality, such as those in SCFace. We use metric
learning to train the network to generate facial images preserving their identity, which
is more important for improving face recognition performance compared to good per-
ceptual quality, and our proposed algorithm successfully demonstrates that. Looking
at this problem from a different angle, it can be said that noise or artifacts are a kind
of micro-level occlusion where some information is hidden or distorted. This
motivated us to deal with occlusion in general. We often come across situations where
a facial image is occluded due to the presence of various objects including, but not lim-
ited to, sunglasses, hats, scarves, masks, etc. Since these kinds of occlusions remove all the information from the occluded areas, and they cover a wide area, we cannot use
neighbouring information to reconstruct those areas. For example, if we encounter a
face with sunglasses, it might be possible to reconstruct the eyes, but it will be impossi-
ble to reconstruct the same eyes belonging to that person, unless it happens by chance.
Here comes the idea of using multiple images to accumulate information and create a
single image for recognition.
In chapter 6 we have presented a generative adversarial network which is able to
extract relevant information from multiple images of the same identity and combine
them to synthesise a single image which is a frontal pose representation of the identity
in question. We have also shown that the image generated provides better recognition
performance compared to the average of all input images.
We have shown in our experiments that our method works successfully for different
face recognition networks. We have also shown that novel information can boost the
recognition accuracy and duplicate information is not helpful, which aligns with our
proposed theory. This algorithm can have useful applications in areas where we can
obtain occluded faces from various angles, but a single face cannot provide enough
information for recognition with high confidence. The proposed method can also be
used in other fields of application, which we discuss in the following section.


7.1 Future Work and Other Areas of Application


This research opens the door to many further research opportunities. It would be in-
teresting to see how the recognition performance varies with different facial feature
extraction algorithms. For temporal superresolution, pose grouping can be an important
thing to look into. Since the geometrical structure of different facial parts changes with
the change of pose angle, pose grouping can be applied for a range of pose angles and
each pose group can be treated separately to obtain better images. Another point of
interest is to consider the pose angle along the vertical axis. To date, our research has mainly considered the face pose along the horizontal axis. However,
extending it to the vertical axis can be an option. In addition, this opens the opportunity
to explore facial reconstruction in three-dimensional space.
The face is not the only area of application for quality enhancement using information from multiple images. Agrawal et al. [114] have used temporal superresolution techniques to mitigate motion blur, allowing conventional cameras to be used as high-speed cameras. This concept can be extended to various other areas of application, some of which are discussed in the following subsections.

7.1.1 Astronomical Photography


Temporal superresolution has been explored in the area of astro-photography in the
past. Hirsch et al. [115] presented a method for multiframe deconvolution which yields
deblurred images of astronomical objects. A single snapshot of an astronomical object is generally of poor quality due to atmospheric interference, light pollution and exposure limitations. Thus, multiple images can be used to get
a better representation of the object in space. Similar problems can be solved using our
proposed method. We can use multiple images at different times and incrementally add
information to get a better picture of the objects in space.

7.1.2 Automatic License Plate Recognition


Automatic license plate recognition (ALPR) is a technology that uses optical character
recognition on images to read vehicle registration plates. ALPR is often a difficult
task in the real world since cars moving at high speed can introduce motion blur and


number plates can often be occluded by other traffic. Temporal superresolution can help in this scenario by gathering temporal data and incrementally aggregating information to synthesise the final, complete license plate.

7.1.3 Unwanted Object Removal


Object removal is another field where the method of temporal superresolution can be
used. This application case is quite similar to what we have done in the sense that
the unwanted object can be regarded as the occlusion and the task will be to clear the
occlusion. Lee et al. [116] have proposed a deep video inpainting method which removes unwanted objects from videos. Similar problems can be solved using the incremental addition of information from multiple images.
These are just some of the possible areas of application. There are
numerous other areas where this idea can be applied and further research can be done.

Appendix A

Author’s Publications

1. Soumya Shubhra Ghosh, Yang Hua, Sankha Subhra Mukherjee, and Neil Robert-
son, “IEGAN: Multi-purpose Perceptual Quality Image Enhancement Using Gen-
erative Adversarial Network,” in WACV, Hawaii, USA, Jan. 2019.

2. Soumya Shubhra Ghosh, Yang Hua, Sankha Subhra Mukherjee, and Neil Robert-
son, “Improving Detection and Recognition of Degraded Faces by Discriminative
Feature Restoration Using GAN,” accepted in ICIP, Abu Dhabi, UAE, Oct. 2020.

3. Soumya Shubhra Ghosh, Yang Hua, Sankha Subhra Mukherjee, and Neil Robert-
son, “All in One: A GAN based Versatile Image Enhancement Framework using
Edge Information,” IEEE Transactions on Image Processing, 2020, submitted.

References

[1] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, “S3fd: Single shot
scale-invariant face detector,” in ICCV, 2017.

[2] J. Deng, J. Guo, X. Niannan, and S. Zafeiriou, “Arcface: Additive angular margin
loss for deep face recognition,” in CVPR, 2019.

[3] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta,


A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-
resolution using a generative adversarial network,” in CVPR, 2017.

[4] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep
convolutional networks,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, pp. 295–307, 2016.

[5] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in CVPR, 2018.

[6] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in
ICCV, 1998.

[7] C. Dong, Y. Deng, C. C. Loy, and X. Tang, “Compression artifacts reduction by


a deep convolutional network,” in ICCV, 2015.

[8] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for
image restoration,” in CVPR, 2017, pp. 3929–3938.

[9] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-
scale image recognition,” in ICLR, 2015.

[10] H. R. Sheikh, Z. Wang, L. Cormack, and A. C. Bovik, “Live image quality as-
sessment database release 2,” 2005.


[11] R. Reisenhofer, S. Bosse, G. Kutyniok, and T. Wiegand, “A haar wavelet-based


perceptual similarity index for image quality assessment,” Signal Processing:
Image Communication, vol. 61, pp. 33 – 43, 2018. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0923596517302187

[12] M. Grgic, K. Delac, and S. Grgic, “Scface - surveillance cameras face database,”
Multimedia Tools Appl., vol. 51, pp. 863–879, 02 2011.

[13] R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and local
perception gan for photorealistic and identity preserving frontal view synthesis,”
in ICCV, 2017, pp. 2439–2448.

[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representa-


tions by error propagation,” California Univ San Diego La Jolla Inst for Cognitive
Science, Tech. Rep., 1985.

[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep


convolutional neural networks,” in NIPS, 2012.

[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied


to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–
2324, 1998.

[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van-


houcke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015.

[18] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift,” in ICML, 2015.

[19] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the


inception architecture for computer vision,” in CVPR, 2016.

[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recogni-
tion,” in CVPR, 2016.

[21] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-


resnet and the impact of residual connections on learning,” arXiv preprint
arXiv:1602.07261, 2016.


[22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected
convolutional networks,” in CVPR, 2017.

[23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,


A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.

[24] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv


preprint arXiv:1411.1784, 2014.

[25] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference


on computer vision, 2015.

[26] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for
accurate object detection and semantic segmentation,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2014.

[27] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of


the IEEE international conference on computer vision, 2017.

[28] D. Yoo, S. Park, J.-Y. Lee, A. S. Paek, and I. So Kweon, “Attentionnet: Aggre-
gating weak directions for accurate object detection,” in ICCV, 2015.

[29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg,
“SSD: Single shot multibox detector,” in ECCV, 2016.

[30] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Uni-
fied, real-time object detection,” in CVPR, 2016.

[31] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in CVPR, 2017.

[32] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment us-
ing multitask cascaded convolutional networks,” IEEE Signal Processing Letters,
vol. 23, no. 10, pp. 1499–1503, Oct 2016.

[33] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with


deep convolutional generative adversarial networks,” in ICLR, 2016.

[34] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for
improved quality, stability, and variation,” ICLR, 2018.


[35] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing


and improving the image quality of stylegan,” in CVPR, 2020.

[36] H. Reeve and J. Lim, “Reduction of blocking effect in image coding,” in IEEE
International Conference on Acoustics, Speech, and Signal Processing, 1983.

[37] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz, “Adaptive de-


blocking filter,” IEEE Transactions on Circuits and Systems for Video Technol-
ogy, pp. 614–619, 2003.

[38] C. Wang, J. Zhou, and S. Liu, “Adaptive non-local means filter for image deblock-
ing,” Signal Processing: Image Communication, vol. 28, pp. 522–530, 2013.

[39] A.-C. Liew and H. Yan, “Blocking artifacts suppression in block-coded images
using overcomplete wavelet representation,” IEEE Transactions on Circuits and
Systems for Video Technology, pp. 450–461, 2004.

[40] A. Foi, V. Katkovnik, and K. Egiazarian, “Pointwise shape-adaptive dct for high-
quality denoising and deblocking of grayscale and color images,” IEEE Transac-
tions on Image Processing, pp. 1395–1411, 2007.

[41] P. Svoboda, M. Hradiš, D. Bařina, and P. Zemčík, “Compression artifacts re-


moval using convolutional neural networks,” Journal of WSCG, vol. 24, pp. 63–
72, 2016.

[42] J. Bruna, P. Sprechmann, and Y. LeCun, “Super-resolution with deep convolu-


tional sufficient statistics,” arXiv preprint arXiv:1511.05666, 2015.

[43] A. Dosovitskiy and T. Brox, “Generating images with perceptual similarity met-
rics based on deep networks,” in NIPS, 2016.

[44] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer
and super-resolution,” in ECCV. Springer, 2016.

[45] S. S. Ghosh, Y. Hua, S. S. Mukherjee, and N. Robertson, “Iegan: Multi-purpose


perceptual quality image enhancement using generative adversarial network,” in
WACV, 2019.


[46] G. Lu, W. Ouyang, D. Xu, X. Zhang, Z. Gao, and M.-T. Sun, “Deep kalman
filtering network for video compression artifact reduction,” in ECCV, 2018.

[47] J. W. Soh, J. Park, Y. Kim, B. Ahn, H.-S. Lee, Y.-S. Moon, and N. I. Cho, “Re-
duction of video compression artifacts based on deep temporal networks,” IEEE
Access, vol. 6, pp. 63 094–63 106, 2018.

[48] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse


3-d transform-domain collaborative filtering,” IEEE Transactions on Image Pro-
cessing, vol. 16, no. 8, pp. 2080–2095, 2007.

[49] A. Chambolle, “An algorithm for total variation minimization and applications,”
Journal of Mathematical imaging and vision, vol. 20, no. 1-2, pp. 89–97, 2004.

[50] D. Zoran and Y. Weiss, “From learning models of natural image patches to whole
image restoration,” in ICCV, 2011.

[51] M. Elad and M. Aharon, “Image denoising via sparse and redundant representa-
tions over learned dictionaries,” IEEE Transactions on Image Processing, vol. 15,
no. 12, pp. 3736–3745, 2006.

[52] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denois-
ing,” in CVPR, 2005.

[53] A. Foi, V. Katkovnik, and K. Egiazarian, “Pointwise shape-adaptive dct denoising


with structure preservation in luminance-chrominance space,” in International
Workshop on Video Processing and Quality Metrics for Consumer Electronics,
2006.

[54] X. Li and M. T. Orchard, “New edge-directed interpolation,” IEEE Transactions


on Image Processing, pp. 1521–1527, 2001.

[55] J. Allebach and P. W. Wong, “Edge-directed interpolation,” in ICIP, 1996.

[56] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, “Deep networks for image
super-resolution with sparse prior,” in CVPR, 2015.


[57] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolu-


tional neural network,” in ECCV, 2016.

[58] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and


Z. Wang, “Real-time single image and video super-resolution using an efficient
sub-pixel convolutional neural network,” in CVPR, 2016.

[59] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang, “Fsrnet: End-to-end learning face
super-resolution with facial priors,” in CVPR, 2018.

[60] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind”


image quality analyzer,” IEEE Signal processing letters, vol. 20, no. 3, pp. 209–
212, 2012.

[61] W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude similarity


deviation: A highly efficient perceptual image quality index,” IEEE Transactions
on Image Processing, 2014.

[62] D. Kundu, L. K. Choi, A. C. Bovik, and B. L. Evans, “Perceptual quality eval-


uation of synthetic pictures distorted by compression and transmission,” Signal
Processing: Image Communication, vol. 61, pp. 54–72, 2018.

[63] J. Hu, J. Lu, and Y.-P. Tan, “Deep transfer metric learning,” in CVPR, 2015.

[64] Y. Yang, S. Liao, Z. Lei, and S. Li, “Large scale similarity learning using similar
pairs for person verification,” in AAAI, 2016.

[65] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for


face recognition and clustering,” in CVPR, 2015.

[66] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach
for deep face recognition,” in ECCV, 2016.

[67] G. Hu, Y. Hua, Y. Yuan, Z. Zhang, Z. Lu, S. S. Mukherjee, T. M. Hospedales,


N. M. Robertson, and Y. Yang, “Attribute-enhanced face recognition with neural
tensor fusion networks,” in ICCV, 2017.


[68] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hyper-
sphere embedding for face recognition,” in CVPR, 2017.

[69] O. Barkan, J. Weill, L. Wolf, and H. Aronowitz, “Fast high dimensional vector
multiplication face recognition,” in ICCV, 2013.

[70] X. Cao, D. Wipf, F. Wen, G. Duan, and J. Sun, “A practical transfer learning
algorithm for face verification,” in ICCV, 2013.

[71] D. Chen, X. Cao, F. Wen, and J. Sun, “Blessing of dimensionality: High-


dimensional feature and its efficient compression for face verification,” in CVPR,
2013.

[72] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to
human-level performance in face verification,” in CVPR, 2014.

[73] X. Wang and A. Gupta, “Unsupervised learning of visual representations using


videos,” in ICCV, 2015.

[74] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer,


“Discriminative learning of deep convolutional feature point descriptors,” in
ICCV, 2015.

[75] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric learning via
lifted structured feature embedding,” in CVPR, 2016.

[76] C. Huang, C. C. Loy, and X. Tang, “Local similarity-aware deep feature embed-
ding,” in Advances in neural information processing systems, 2016.

[77] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface:
Large margin cosine loss for deep face recognition,” in CVPR, 2018.

[78] H. Jia and A. M. Martinez, “Face recognition with occlusions in the training
and testing sets,” in IEEE International Conference on Automatic Face Gesture
Recognition, 2008.

[79] H. K. Ekenel and R. Stiefelhagen, “Why is facial occlusion a challenging prob-


lem?” in Advances in Biometrics. Springer Berlin Heidelberg, 2009, pp. 299–
308.


[80] A. Andres, S. Padovani, M. Tepper, and J. Jacobo-Berlles, “Face recognition on partially occluded images using compressed sensing,” Pattern Recognition Letters, vol. 36, pp. 235–242, 2014.

[81] Y. Su, Y. Yang, Z. Guo, and W. Yang, “Face recognition with occlusion,” in IAPR
Asian Conference on Pattern Recognition, 2015.

[82] Y. Wen, W. Liu, M. Yang, Y. Fu, Y. Xiang, and R. Hu, “Structured occlusion
coding for robust face recognition,” Neurocomputing, vol. 178, pp. 11–24, 2016.

[83] W. Weitao and C. Jiansheng, “Occlusion robust face recognition based on mask
learning,” in ICIP, 2017.

[84] L. Song, D. Gong, Z. Li, C. Liu, and W. Liu, “Occlusion robust face recognition
based on mask learning with pairwise differential siamese network,” in ICCV,
2019.

[85] Y. Xiao, D. Cao, and L. Gao, “Face detection based on occlusion area detection
and recovery,” Multimedia Tools and Applications, vol. 79, pp. 16 531–16 546,
2020.

[86] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assess-


ment: from error visibility to structural similarity,” IEEE Transactions on Image
Processing, pp. 600–612, 2004.

[87] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with
conditional adversarial networks,” in CVPR, 2017.

[88] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural


network acoustic models,” in ICML, 2013.

[89] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for


biomedical image segmentation,” in International Conference on Medical image
computing and computer-assisted intervention. Springer, 2015.

[90] S. Xie and Z. Tu, “Holistically-nested edge detection,” in ICCV, 2015.


[91] J. Canny, “A computational approach to edge detection,” IEEE Transactions on


Pattern Analysis and Machine Intelligence, pp. 679–698, 1986.

[92] D. Berthelot, T. Schumm, and L. Metz, “Began: Boundary equilibrium generative


adversarial networks,” arXiv preprint arXiv:1703.10717, 2017.

[93] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-
representations,” in International conference on curves and surfaces. Springer,
2010.

[94] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented


natural images and its application to evaluating segmentation algorithms and
measuring ecological statistics,” in ICCV, 2001.

[95] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,


A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recogni-
tion challenge,” International Journal of Computer Vision, vol. 115, pp. 211–252,
2015.

[96] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” 2014.

[97] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch, “EnhanceNet: Single Image


Super-Resolution through Automated Texture Synthesis,” in ICCV, 2017.

[98] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for


image quality assessment,” in Conference Record of the Thirty-Seventh Asilomar
Conference on Signals, Systems and Computers, 2004., 2003.

[99] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assess-


ment: from error visibility to structural similarity,” IEEE Transactions on Image
Processing, pp. 600–612, 2004.

[100] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial net-


works,” in ICML, 2017.

[101] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,”
arXiv preprint arXiv:1411.7923, 2014.


[102] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools, 2000.

[103] D. E. King, “Dlib-ml: A machine learning toolkit,” Journal of Machine Learning


Research, vol. 10, pp. 1755–1758, 2009.

[104] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2:


Inverted residuals and linear bottlenecks,” in CVPR, 2018.

[105] C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic, “Robust statistical face


frontalization,” in ICCV, 2015, pp. 3871–3879.

[106] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li, “High-fidelity pose and expression
normalization for face recognition in the wild,” in CVPR, 2015, pp. 787–796.

[107] T. Hassner, S. Harel, E. Paz, and R. Enbar, “Effective face frontalization in un-
constrained images,” in CVPR, 2015, pp. 4295–4304.

[108] A. Bulat and G. Tzimiropoulos, “How far are we from solving the 2d & 3d face
alignment problem? (and a dataset of 230,000 3d facial landmarks),” in ICCV,
2017.

[109] E. Schonfeld, B. Schiele, and A. Khoreva, “A u-net based discriminator for gen-
erative adversarial networks,” in CVPR, 2020.

[110] H. A. Aly and E. Dubois, “Image up-sampling using total-variation regularization


with a new observation model,” IEEE Transactions on Image Processing, vol. 14,
no. 10, pp. 1647–1659, 2005.

[111] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learn-


ing from simulated and unsupervised images through adversarial training,” in
CVPR, 2017, pp. 2107–2116.

[112] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization
strategy to train strong classifiers with localizable features,” in ICCV, 2019.

[113] Z. Liu, P. Luo, X. Wang, and X. Tang, “Large-scale celebfaces attributes (celeba) dataset,” 2015.


[114] A. Agrawal, M. Gupta, A. Veeraraghavan, and S. G. Narasimhan, “Optimal coded


sampling for temporal super-resolution,” in CVPR, 2010.

[115] M. Hirsch, S. Harmeling, S. Sra, and B. Schölkopf, “Online multi-frame blind
deconvolution with super-resolution and saturation correction,” A&A, vol. 531,
p. A9, 2011.

[116] S. Lee, S. W. Oh, D. Won, and S. J. Kim, “Copy-and-paste networks for deep
video inpainting,” in ICCV, 2019.

