Individual Recognition From Low Quality and Occluded Images and Videos Using GAN
Doctor of Philosophy
Face recognition in the wild is one of the most interesting problems in the
present world. Many algorithms with high performance have already been
proposed and applied in real-world applications. However, the problem of
recognising degraded faces from low-quality images and videos mostly re-
mains unsolved. A possible intuitive solution can be the recovery of finer
texture details and sharpness of degraded images with low resolution. In
recent times, we have seen a breakthrough in the perceptual quality of im-
age enhancement with the recovery of the fine texture details and sharpness
of degraded images. Several state-of-the-art algorithms, like ARCNN, IRCNN, SRGAN, etc., provide excellent results in the field of image enhancement. However, when it comes to face recognition, these algorithms fail miserably. This leaves two possibilities: either the perceptual quality enhancement achieved to date is not good enough, or perceptual quality enhancement is not the key to better face recognition performance. In this
thesis, we have explored both possibilities. Existing image enhancement
approaches often focus on minimising the pixel-wise reconstruction error,
which results in a high peak signal-to-noise ratio. However, they often fail
to match the quality expected in a photorealistic image. Furthermore, an
end-to-end solution for simultaneous superresolution of finer texture de-
tails and sharpness for degraded images with low resolution has not been
explored yet. Therefore, we propose a versatile framework capable of
inferring photorealistic natural images with both artifact removal and su-
perresolution simultaneously. This includes a new loss function consisting
of a combination of reconstruction loss, feature loss, and an edge loss coun-
terpart. The feature loss helps to push the output image to the natural image
manifold and the edge loss preserves the sharpness of the output image. The
reconstruction loss provides low-level semantic information to the genera-
tor regarding the quality of the generated images. Our approach has been
experimentally proven to recover photorealistic textures from heavily com-
pressed low-resolution images on public benchmarks. However, even with
this algorithm, the recognition performance could not be improved. This
motivated us to design an algorithm specific to the enhancement of faces for better recognition performance. Thus, we present a versatile GAN capable of recovering facial features from degraded videos and
images. Specifically, we use an effective method involving metric learn-
ing and different loss function components operating on different parts of
the generator. The proposed method helps to push the facial features of
the output image to the cluster containing faces of the same identity, and
at the same time, increases the angular margin between different identities.
In other words, it enhances degraded faces by restoring their lost facial features rather than their perceptual quality, which ultimately leads to better
performance of any existing face recognition algorithm. Our approach has
been experimentally proven to enhance face detection and recognition, e.g.,
the face detection rate is improved by 3.08% for S3FD [1] and the area un-
der the ROC curve for recognition is improved by 2.55% for ArcFace [2],
evaluated on the SCFace dataset. The limitation of using a single image is that
there is an upper bound to which we can enhance the images. Any informa-
tion that is not present cannot be generated out of thin air. This motivated
us to develop the idea of synthesising the frontal face pose from multiple
occluded images of the same identity, taken at different times and from dif-
ferent angles. This helps to further push the upper bound. Our network
crowdsources useful information from all images and rejects information
that is not useful for recognition. We build our generator on top of TPGAN
and use the concept of a U-net discriminator which evaluates the output
both at a global and local level, thus boosting the image to be consistent at
a global and local level.
Acknowledgements
First of all, I would like to thank my supervisor, Prof. Neil Robertson, for his guidance and support during my PhD research. I am pleased to have such a unique mentor. I would also like to express my sincere gratitude to my second supervisor, Dr. Yang Hua, for his constant support, encouragement and guidance. He helped me a lot in making my publications stronger. I would also like to thank my friend Dr. Sankha Subhra Mukherjee, who supported me in all possible ways during my PhD and definitely made my life easier. I am
also very grateful to Anyvision for offering me the opportunity to conduct
my research with an exposure to its cutting-edge technology, state-of-the-
art hardware, good food, drinks, parties and an amazing group. Thanks to
John, Jerry and Dimitrios, who evaluated my work every year and helped
me progress through my research by providing useful feedback. A special thanks goes to my colleagues and friends at Anyvision, Alessandro, David, Piotr, Tomaš, Romain, Elyor, Guosheng, Steven, Tanya and the entire group of colleagues from Queen’s University Belfast, particularly my
dearest friends Rachael Abbott, Andrew Moyes, Stuart Millar, Dr. Xinshao
Wang, Dr. Waqar Saadat, Dr. Umair Naeem and Dr. Sumit Raurale, who
have all contributed significantly to make this journey much more fun. Last
but certainly not least, I am grateful to my beloved family, my friends and my beloved wife, Suparna Chowdhury, for their endless support and trust. Without their support, life during these few years would not have been as easy.
Table of Contents
Table of Contents iv
List of Tables ix
List of Figures xi
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation, Approach, Importance and Impact . . . . . . . . . . . . . . 2
1.2.1 Basic Design Principles and Pros of our Proposed Method . . . 5
1.3 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Thesis Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Background 11
2.1 A short history of neural networks . . . . . . . . . . . . . . . . . . . . 11
2.2 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3 Convolutional Neural Networks (CNN) . . . . . . . . . . . . . . . . . 14
2.4 Different CNN Architectures . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.1 LeNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.3 VGGNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.4 GoogLeNet/Inception v1 . . . . . . . . . . . . . . . . . . . . . 17
2.4.5 ResNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.6 Generative Adversarial Networks (GAN) . . . . . . . . . . . . 19
2.4.7 Conditional GANs . . . . . . . . . . . . . . . . . . . . . . . . 20
References 131
List of Tables
5.1 Cosine similarity scores w.r.t. gallery for input images and enhanced
images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.2 Extension of Table 5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3 Fréchet Inception Distance and Inception Scores for different algorithms
for the SCFace dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4 True Accept Rate (TAR) at different False Accept Rate (FAR) with dif-
ferent settings for the network . . . . . . . . . . . . . . . . . . . . . . 94
5.5 Detection rates for different algorithms for different image enhancement
algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.6 Comparison of the recognition performance of our method with the
state-of-the-art artifact and noise removal algorithms . . . . . . . . . . 94
List of Figures
Chapter 1
Introduction
1.1 Introduction
Face recognition in real-world scenarios is not an easy task to solve, as various factors like occlusion, extreme head pose, illumination, etc., come into play. In this thesis, we
have solved the problem to a significant extent by gathering relevant facial information
from multiple different images of the same identity. This research takes us a step further
towards autonomous security and surveillance where the recognition performance can
be improved significantly. Additionally, this method works in conjunction with any face
detection or recognition algorithm to boost their performance.
Living in an era with an ever-increasing population in ever-increasing urban areas,
security and preservation of law and order have become one of the most important so-
cial aspects. Over the last decade, the number of surveillance cameras grew to un-
precedented levels, both in public and private spaces. Thus, face recognition became an
important biometric trait used in identifying human individuals. Face recognition is one
of the most critical activities forensic examiners conduct when there is a video or pic-
ture available from the crime scene during their investigation. Although automatic face
recognition and verification systems are used, examiners still manually examine facial images or videos for a match against a large database of suspect identification photographs. In airports, shopping malls, and other important indoor environments, the task
1.2 Motivation, Approach, Importance and Impact
Figure 1.2: Example of output images when a pipeline of image enhancement algorithms is used. Details in Chapter 3. Best viewed in color.
easy and is prone to human errors, especially when such a task must be performed for long hours. Any mistake in this situation can be expensive. Thus, it is important to
determine a solution to close this gap in the field of security, and this research plays a
significant role in doing so.
In the last few years, many robust face detection and recognition algorithms have
been proposed. They have very competitive results on benchmark datasets. However, it
has been observed that the performance of these algorithms falls with degrading image
quality. The biggest challenge lies in real-world scenarios, where, unlike benchmark datasets, the input images are uncontrolled, the quality of images varies, and faces often have occlusions or nonfrontal poses. Unfortunately, those algorithms perform poorly in
these scenarios. This motivates us to find a solution to boost the performance of these
algorithms in the real world scenario. A naive approach to solving this problem is to
use image enhancement algorithms and enhance the quality of these images in the hope
of better recognition performance. Hence, our first step to approach this problem is by
exploring photorealistic image enhancement.
Photo-Realistic image enhancement can be broadly classified into three domains:
(1) Super-resolution (SR) [3, 4] focuses on the task of estimating a high-resolution im-
age from its low-resolution counterpart; (2) Noise removal (NR) [5, 6] aims at cleaning
different noises present in the image, e.g., Gaussian noise, Poisson noise, etc.; (3) Ar-
tifact removal (AR) [7, 8] concentrates on estimating an artifact-free sharp image from
its corrupted counterpart, e.g., JPEG artifacts.
A possible intuitive solution can be the recovery of finer texture details and sharp-
ness of degraded images with low resolution. In recent times, we have seen break-
throughs in the perceptual quality of image enhancement with the recovery of the fine
and we have shown this in Figure 1.2 (details can be found in Chapters 3 and 4). This is in addition to a higher total computation time, which is a direct summation of the computation times of each algorithm in the pipeline.
One of the biggest questions is how effective these methods are for face recognition. Although macro-level features can be useful in object detection, in the case of
face recognition, the detailed features of faces are essential and the features involved in
the high perceptual quality of the images may not directly map to the microlevel features
required for successful face recognition or verification. Our research shows that perceptual image enhancement actually removes a lot of information from the face that is important for face recognition. Although the images look better to the human eye after
enhancement, the recognition performance falls.
Figure 1.4: Example of synthesis of frontal face pose from multiple occluded images
using GANs where partial information from multiple faces is aggregated to create the
final output. Details in Chapter 6. Best viewed in color.
or extreme head poses to a significant extent by taking advantage of the multiple input
images from the video feed. Thirdly, the concept of using aggregated information from
multiple images to generate a single image can be used in other fields of application,
which we have described in detail in Chapter 7.
1.3 Goals
The primary goal of this research is twofold. First and most important is to enhance face recognition performance on faces for which the current state-of-the-art is limited by the various kinds of noise, occlusions, and pose limitations present in the facial images. It is important to understand in which scenarios an algorithm works and
where it fails. As discussed in the last section, the state-of-the-art algorithms perform
well in benchmark datasets, but fail to keep up in real world scenarios. Instead of de-
signing a new face recognition algorithm, our goal is to enhance the image so that it can
work with any existing algorithm, or even with future algorithms which are yet to be
invented, in real world scenarios. It is very important for these algorithms to perform
well in practical scenarios so that we can have a better and safer world.
Second, we want to gain a low-level understanding of how face recognition works and of what is important for recognition algorithms to perform successfully. This understanding can help future researchers identify which features and properties of faces are important for a face recognition algorithm to perform well in environments that are beyond our control.
While keeping our primary goals in mind, we also aim to design our algorithms to be robust to various kinds of noise, illumination and races, and fast enough that these methods will find practical use in daily-life scenarios.
1.4 Thesis Roadmap
• We propose a new and improved perceptual loss function which is the sum of the
reconstruction loss of the discriminator, the feature loss from the VGG network
[9] and the edge loss from the edge detector. This novel loss function preserves
the sharpness of the enhanced image, which is often lost during enhancement.
However, we found that a drawback of this method is that it fails to clean the image in the presence of random noise. This led to our next investigation.
In Chapter 4, we present a more versatile framework capable of superresolving images and removing various kinds of noise and artifacts simultaneously in a single end-to-end network, inferring photorealistic natural images. This offers a substantial
advantage since it is computationally much less expensive and faster compared to a cas-
cade of several networks solving each problem separately. Our algorithm contains an
effective method involving edge reconstruction using a generative adversarial approach.
Our approach has been experimentally proven to recover photorealistic textures from
low-resolution, heavily compressed, and noisy images on several public benchmark
datasets. For example, with 4x superresolution and removal of speckle noise on the
LIVE1 dataset [10], the HaarPSI score [11] for the output images from our framework
is 12.8% better than the next best performing pipeline which is a cascade of SRGAN [3]
and IRCNN [8]. To sum up, our main contribution is threefold:
• Extensive ablation studies and experiments show the effectiveness and superiority
of our proposed framework on several datasets.
However, when we applied this algorithm to enhance faces for better face recognition, we found that the performance actually became worse. This leads to an important conclusion: human visual perception can be tricked, and perceptual quality is not the answer to solving low-level computer vision problems. This motivated our next line of research.
In Chapter 5, we present a deep generative adversarial network for facial feature
enhancement that sets a new state-of-the-art for the detection and recognition of low-
quality images by restoring their facial features. We have highlighted some limitations
of the existing algorithms and overcame those by reconstructing the facial features and
increasing the separation between different classes. Specifically, we have explored the
possibility of improving face recognition performance on those images for which the
current state-of-the-art methods fail to deliver good results. Our proposed loss function
works very well on real-world images, which are inherently of low quality. We
use metric learning to train the network to generate facial images preserving their iden-
tity, which is more important for improving face detection and recognition performance
rather than perceptual quality, and our proposed algorithm successfully demonstrates
that. Our approach has been experimentally proven to enhance face recognition perfor-
mance on several public benchmarks, e.g., the area under the ROC curve for face recog-
nition is improved by 4.2% on the SCFace [12] dataset when used in conjunction with
ArcFace [2] compared to the recognition performance of ArcFace alone. The detection
performance has been increased by 3.08% when used in conjunction with S3FD [1] on
the SCFace dataset compared to detection using S3FD alone. In a nutshell, our novel
framework offers
Although this research provided an answer to the question of facial feature enhancement, the conclusion was that the recognition performance is bounded by an upper limit when a single image is used. This motivated us to pursue further research and led to our next work.
Chapter 6 deals with the synthesis of a frontal face pose from multiple occluded im-
ages using GANs where partial information from multiple faces is aggregated to create
the final output. In this chapter, we present a deep generative adversarial network which
takes multiple faces as input, each having different poses and occlusion, and combines
the useful information from each of the images and synthesises an output which closely
resembles a suspect identification photograph, i.e. a facial image having a frontal pose
with no occlusions or facial expressions. We use the architecture of TPGAN [13] as
our base and build on that to create our final generator network. We also use a U-net
discriminator which evaluates the output of the generator from a local and global per-
spective. To sum up, the main contributions in this chapter are as follows.
• To the best of our knowledge, we are the first to propose a GAN based solution
to remove facial occlusion by reconstructing the occluded area. Our method syn-
thesises the frontal face pose by extraction of useful information from multiple
occluded images.
• We have shown the working principles, behaviour, and limitations of our proposal by making use of information theory.
• We have verified and shown through images and experiments that an increased number of input images improves the quality of the output.
Finally, in Chapter 7, we draw conclusions and discuss future possibilities and extensions of this research.
Chapter 2
Background
This chapter introduces some of the previous work that is most relevant to my research.
I have provided some preliminary background on artificial neural networks and how they led to deep neural networks, including some well-known deep network architectures.
I review the contributions in the field of generative adversarial networks, particularly in
the field of noise removal, artifact removal, and superresolution and how they affect face
recognition. I have also provided a critique of existing methods, covering as much relevant material as possible while avoiding excessive historical context to keep the review concise. Finally, I discuss in depth the gap between the state-of-the-art and the proposed area of research.
2.1 A short history of neural networks

In 1943, Warren McCulloch and Walter Pitts proposed the first mathematical model of a neuron. Their model has the following properties:

• It has a binary output y ∈ {0, 1}, where y = 1 indicates that the neuron fires and y = 0 that it is at rest.

• It has a threshold value Θ. The neuron fires only if the sum of its inputs is larger than this critical value:

σ(x) = 1 if ∑_{k=1}^{n} x_k > Θ and i = 0, and σ(x) = 0 otherwise,   (2.1)

where n is the number of excitatory inputs and i the number of active inhibitory inputs.

In the 1960s, an American psychologist called Frank Rosenblatt came up with the idea of the perceptron, which was inspired by the Hebbian theory of synaptic plasticity, i.e., the adaptation of brain neurons during the learning process. In the 1980s, the idea of artificial neural networks was further developed, thanks to the work of Hopfield and the development of the error backpropagation algorithm for training multilayer perceptrons.

This was an important discovery, and then in 1989, Yann LeCun came up with the idea of convolutional neural networks. A convolutional neural network is a class of deep neural networks, most commonly applied to analysing visual imagery. This approach became the foundation of modern computer vision.

In 2014, Ian Goodfellow came up with the idea of generative adversarial networks, which is one of the most, if not the most, sophisticated generative algorithms in the field of computer vision. Figure 2.1 shows a timeline of the history of neural networks.

Figure 2.1: Timeline of the history of neural networks. Source: https://towardsdatascience.com/rosenblatts-perceptron-the-very-first-neural-network-37a3ec09038a

2.2 Artificial Neural Networks
Artificial neural networks learn mappings from a given input space to a desired output space. They involve a training phase, which can be supervised or unsupervised, or a combination of both, and a prediction phase. Figure 2.2 shows the structure of a simple artificial neuron.
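As a toy illustration, both the threshold unit from Section 2.1 and a simple weighted neuron like the one in Figure 2.2 can be written in a few lines of Python (the function names are ours, and a sigmoid activation is assumed):

```python
import math

def threshold_neuron(inputs, theta):
    """Threshold unit: fires (outputs 1) only if the input sum exceeds theta."""
    return 1 if sum(inputs) > theta else 0

def artificial_neuron(x, w, b):
    """Simple artificial neuron: weighted sum of the inputs plus a bias,
    passed through a sigmoid activation."""
    s = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1.0 / (1.0 + math.exp(-s))
```

During training, the weights w and the bias b are adjusted (e.g., by backpropagation) so that the mapping from inputs to outputs approaches the desired one.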
2.3 Convolutional Neural Networks (CNN)
• Strides: Stride is the number of pixels shifted over the input matrix.
• Padding: Sometimes the filter does not perfectly fit the input image. Therefore, we pad the image with zeros (zero-padding) so that it fits.
• Pooling: Pooling is used to reduce the spatial dimension of the feature maps. The
most commonly used pooling methods are max-pooling and average pooling.
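These hyper-parameters fully determine the size of each feature map. The sketch below (hypothetical helper names) computes the spatial output size of a convolution and applies max-pooling; for instance, a 32 × 32 input convolved with a 5 × 5 filter at stride 1 and no padding yields a 28 × 28 map:

```python
def conv_output_size(n, k, stride=1, pad=0):
    """Output size of a convolution along one spatial dimension:
    floor((n + 2*pad - k) / stride) + 1."""
    return (n + 2 * pad - k) // stride + 1

def max_pool2d(feature_map, k=2, stride=2):
    """Max-pooling: keep the maximum of each k x k window."""
    h = (len(feature_map) - k) // stride + 1
    w = (len(feature_map[0]) - k) // stride + 1
    return [[max(feature_map[i * stride + di][j * stride + dj]
                 for di in range(k) for dj in range(k))
             for j in range(w)]
            for i in range(h)]
```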
2.4 Different CNN Architectures

2.4.1 LeNet
This architecture was proposed by LeCun et al. [16] in 1998. However, due to lim-
ited computation capabilities and memory availability, it was successfully implemented
in 2010 using the back-propagation algorithm. It was applied to the handwritten digit
recognition task which defined the state-of-the-art. The overall number of weights is
431K. Its architecture is shown in Figure 2.7 (middle) and includes a couple of convo-
lutional layers, two subsampling layers, two fully connected layers, and the final output
layer.
2.4.2 AlexNet
Designed by the group of Alex Krizhevsky, this CNN won the ImageNet Large Scale
Visual Recognition Challenge (ILSVRC) in 2012, defining a new state-of-the-art in
visual object recognition. Dropout and Local Response Normalisation (LRN) were introduced in this network. A dropout layer was applied at the fully connected (FC)
layer. LRN is applied either intrachannel or across feature maps. The input size is
224 × 224 × 3. The first convolutional layer produces 96 output feature maps of size
55 × 55 by using filters with size 11 × 11 with a stride of 4. The filter size for the
max-pooling layer is 3 × 3 with a stride of 2. The subsequent convolutional layers use
filters with size 3 × 3 or 5 × 5. This network contains 61 million learnable parameters.
AlexNet architecture is shown in Figure 2.4.
2.4.3 VGGNet
The VGGNet, shown in Figure 2.7, was proposed in 2014 by the Visual Geometry
Group (VGG) at Oxford University. It is a very heavy network, but it contributed to showing that network depth is an essential factor in performing well in visual tasks. Three
models have been proposed, VGG-11, VGG-16, and VGG-19, having, respectively, 11,
16, and 19 layers overall. While the number of convolutional layers varies for the three
models (8, 13, and 16, respectively), they all end with three fully connected layers and
use the ReLU activation function. The convolutional filters are 3 × 3 with stride 1, followed by 2 × 2 max-pooling with stride 2. The overall number of weights for VGG-19 is about 144M.
2.4.4 GoogLeNet/Inception v1
GoogLeNet [17] was introduced by Google, and it won the ILSVRC competition in
2014. It represented a breakthrough in the effort for reducing the computation com-
plexity of CNNs. Indeed, the overall number of parameters of this architecture is 7M,
much lower than for AlexNet and VGG-19. Moreover, it had the highest number of layers overall compared to other CNNs of the time, achieved by stacking inception layers. The
network architecture of GoogLeNet is shown in Figure 2.10. Although the details in the
image are not very clear due to page size limitations, the image gives an overall idea of
the architecture of the network. This network exploits the concept of allowing variable
receptive fields by kernels with different sizes, to perform dimension reduction before
more intense computation layers.
2.4.5 ResNet
ResNet [20] was introduced by He et al. and won the ILSVRC 2015 competition.
It is robust to the vanishing gradient problem, thanks to the idea of residual learning. A
residual block is where the activation of a layer is fast-forwarded to a deeper layer in the
neural network via a skip connection. The network architecture of ResNet is shown in
Figure 2.11. Different versions of ResNet with different depths have been proposed; ResNet50 is one that has gained popularity. It contains 49 convolutional layers and one fully connected layer, with 25.5 million learnable parameters. Variants of residual models also exist, some of which combine residual blocks with inception units.
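The residual idea can be illustrated with a toy block. In this sketch, plain matrix multiplications stand in for the convolutional layers (an assumption for brevity); the essential point is the skip connection that adds the input back before the final activation:

```python
import numpy as np

def relu(x):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Basic residual block: two transforms with a ReLU in between, plus a
    skip connection that fast-forwards the input to the output."""
    out = relu(x @ w1)    # first layer
    out = out @ w2        # second layer, pre-activation
    return relu(out + x)  # add the input (skip connection), then activate
```

Because the input is added back, the block only needs to learn the residual mapping, which keeps gradients flowing even in very deep networks.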
2.4.6 Generative Adversarial Networks (GAN)
Given a training set, a GAN learns to generate new data with the same statistics as the training set. The main concept of a GAN is that two neural networks contest with each other in a zero-sum game. The two-player min-max game of a GAN has been described as follows:
• The generative model is analogous to a team of counterfeiters, trying to produce fake currency and use it without detection.

• The discriminative model is analogous to the police, trying to detect the counterfeit currency.
• Competition in this game drives both teams to improve their methods until the
counterfeits are indistinguishable from the genuine articles.
For example, a GAN trained on photographs can generate new photographs that look
at least superficially authentic to human observers, having many realistic characteris-
tics. Figure 2.8 show some of those examples where I trained a GAN on faces and it
generated those outputs. Although originally proposed as a form of a generative model
for unsupervised learning, GANs have also proven useful for semisupervised learning,
fully supervised learning, and reinforcement learning. GANs have gone a long way
since their inception and now can generate images that look like real images. Let the
generator be denoted by G and the discriminator by D. D is trained to maximise the probability of assigning the correct label, and G is simultaneously trained to maximise the probability of D making a mistake. The objective function of a GAN is
min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]   (2.3)
Figure 2.7: Hand written digits generated by a simple GAN. The network was trained
on the MNIST dataset.
where D is the discriminator, G is the generator, x is a real image drawn from the data distribution p_data, and z is a noise vector sampled from the prior p_z (e.g., a normal distribution). The optimal discriminator maximises this value function while the optimal generator minimises it.
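A mini-batch estimate of the value function in Eq. (2.3) can be sketched as follows; `d` and `g` are placeholder callables standing in for the discriminator and generator networks, and the small `eps` guards the logarithms:

```python
import numpy as np

def gan_value(d, g, real_batch, noise_batch, eps=1e-8):
    """Empirical estimate of V(D, G): the mean of log D(x) over real
    samples plus the mean of log(1 - D(G(z))) over noise samples."""
    real_term = np.mean(np.log(d(real_batch) + eps))
    fake_term = np.mean(np.log(1.0 - d(g(noise_batch)) + eps))
    return real_term + fake_term
```

In training, D takes a gradient ascent step on this value while G takes a descent step, and the two alternate.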
Figure 2.9 shows some examples from a conditional GAN. The first row shows the condition images, the middle row the ground truth, and the last row the images generated by the proposed algorithm.
The above described networks are some of the most representative and powerful
models in deep learning. The main advantages of neural networks are
Figure 2.8: Faces generated by different algorithms by tweaking the parameters and loss
functions.
Figure 2.9: Cityscapes generated by a conditional GAN. The first row of segmented images is the input to the generator and the last row is the output. The middle row shows the ground truth.
• deep CNNs can jointly optimise several related tasks together, for example, Fast
RCNN [25] jointly performs classification and bounding box regression.
CNNs have been widely used in many research areas due to these benefits. The fields include, but are not limited to, superresolution, general image enhancement, classification, object detection, face recognition, person reidentification, etc. The primary focus of our
work is face detection and recognition. In the following sections, we will go through
and analyse some related works which led to this research.
Figure 2.10: Block diagram of GoogLeNet architecture. Source: https://developer.ridgerun.com/wiki/index.
php?title=GstInference/Supported_architectures/InceptionV2
Figure 2.11: Block diagram of VGG-19 and ResNet architecture. Source: https://mc.ai/what-are-deep-residual-
networks-or-why-resnets-are-important/
2.5 Related Work
well on clean images. However, the gap in the current state-of-the-art lies in the fact that detection performance degrades as image quality degrades.
have been proposed. Before the advent of deep learning, filter-based approaches were
quite popular. In the spatial domain, different kinds of filters [36–38] have been pro-
posed to adaptively deal with blocking artifacts in specific regions. In the frequency
domain, wavelet transform has been utilised to derive thresholds at different wavelet
scales for deblocking and denoising [39, 40]. However, the problem with these methods is that they cannot reproduce sharp edges and tend to produce overly smooth texture regions. In the recent past, different algorithms for image enhancement using deep
learning have been proposed. Dong et al. [7] showed that directly applying the SRCNN
architecture for JPEG AR resulted in undesired noisy patterns in the reconstructed im-
age, and thus proposed a new improved model. Svoboda et al. [41] proposed a novel
method of image restoration using convolutional networks that achieved a significant quality improvement over the state-of-the-art methods. The drawback, however, is that it works only on monochromatic images. Moreover, pixel-wise loss functions like MSE or
L1 loss were quite common in those methods, making the images softer. Pixel-wise
loss functions are unable to recover the lost high-frequency details in an image and en-
courage finding pixel-wise averages of possible solutions which are generally smooth
having poor perceptual quality [3, 42–44]. Johnson et al. [44] and Bruna et al. [42]
proposed extracting the features from a pretrained VGG network instead of using pixel-
wise error. They proposed a perceptual loss function based on the Euclidean distance
between feature maps extracted from the VGG19 [9] network. This loss proved to be
quite efficient in capturing the details of an image as it uses the intermediate activations
in the backbone of the VGG network, which represents the deep features. Zhang et al.
proposed an algorithm [8] to solve different problems related to image artifacts. This
method aims to train a set of fast and effective deep learning-based denoisers and inte-
grate them into model-based optimisation methods to solve inverse problems. A recent
paper by Ghosh et al. [45] uses edge loss on top of perceptual loss to simultaneously
remove artifacts and superresolve images. This seemed to be an interesting approach as
including the edge loss in the loss function acted as a booster in preserving the sharp-
ness of the images, thus enhancing the overall quality of the output. Lu et al. [46] model the video artifact reduction task as a Kalman filtering procedure and restore the decoded frames through a deep Kalman filtering network. Soh et al. [47] exploit a simple CNN structure and introduce a new training strategy that captures the tempo-
ral correlation of consecutive frames in a video. They aggregate similar patches from
neighbouring frames and use them to reduce artifacts from the videos. GANs have proved to be quite effective in generating clean images; however, GAN-based artifact removal, including artifact removal for colour images, still remains relatively unexplored. This is a significant gap in the state of the art which needs immediate attention.
vation [54, 55] also fail to produce photorealistic images. Recently, convolutional neu-
ral network (CNN) based SR algorithms have shown excellent performance. Wang et
al. [56] showed that the sparse coding model for SR can be represented as a neural net-
work and improved results can be achieved. Dong et al. [4] used bicubic interpolation
to upscale an input image and trained a three-layer deep fully convolutional network
end-to-end achieving state-of-the-art SR performance. Their method SRCNN learns an
end-to-end mapping between low-resolution and high-resolution images directly. They
achieved enough speed for practical online usage. Later, Dong et al. [57] and Shi et
al. [58] demonstrated that upscaling filters can be learnt for SR to obtain an increased
performance. To upscale the low resolution feature maps into the high resolution out-
put, they added an effective subpixel convolution layer that learnt an array of upscaling
filters. This effectively replaced the handcrafted bicubic filters in the SR pipeline with
more complex upscaling filters that can be explicitly trained for each feature map. Si-
multaneously, the overall computational complexity of the superresolution operation
was also reduced. The studies of Johnson et al. [44] and Bruna et al. [42] relied on loss
functions that focus on perceptual similarity to recover HR images that are more pho-
torealistic. A recent work by Ledig et al. [3] presented the first framework capable of
generating photorealistic images for 4× upscaling factors. In this method, a perceptual
loss feature with adversarial loss and content loss is proposed. Using a discriminator
network that is trained to distinguish between superresolved images and original photo-
realistic images, the adversarial loss moves the solution to the natural image manifold.
Furthermore, rather than pixel space similarity, they used a content loss inspired by
perceptual similarity. This was a very sophisticated method for photorealistic image
superresolution and set a new standard for SR performance. Face superresolution is a domain-specific superresolution problem, and significant advances have also been made in this field. Chen et al. proposed a novel face superresolution algorithm [59] that
uses facial landmark heat maps and parsing maps to superresolve very low-resolution
face images. Although significant research has been done in the field of superresolution, it fails to answer the most important question for face superresolution, i.e., the recognition performance. Facial image enhancement without improving the recognition performance is practically useless. A high PSNR or SSIM does not guarantee higher
recognition performance. This is a critical shortcoming in the current state-of-the-art
that needs to be addressed.
matrices of dissimilar pairs from the similar pairs. They use log-likelihood ratio during
training to aid a faster and scalable learning process.
The second one is formed by methods that learn a metric jointly with the embedding
in a deep architecture. Usually, the methods belonging to the second category are better
optimised, which is typically achieved by Siamese-based architectures. They exploit
the contrastive loss (or triplet loss) that allows a distance relation between pairs (or
triplets) of feature points to be learned at training time [65, 66]. A second approach to joint metric learning is to integrate the metric learning scheme into the feature extraction network, with end-to-end training performed by the optimisation process. This technique tends to overfit to the training
data and thus needs weight constraints for better generalisation.
However, it is worth mentioning that the deep metric learning techniques suffer from
slow convergence caused by a subset of trivial samples that has negligible contribution
to the training. To counteract this problem, most of these techniques apply sample
mining strategies by selecting nontrivial samples, commonly known as hard or semihard
pairs. This accelerates the learning process and reduces overfitting.
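The contrastive/triplet formulation and the semi-hard mining criterion described above can be sketched in a few lines. The following is an illustrative NumPy version; the function names and the margin value are ours, not taken from the cited works:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss: pull anchor-positive embeddings together and push
    anchor-negative embeddings apart by at least `margin`, using squared
    Euclidean distances."""
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)
    d_an = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_ap - d_an + margin, 0.0).mean()

def is_semi_hard(anchor, positive, negative, margin=0.2):
    """A 'semi-hard' negative is farther from the anchor than the positive
    but still inside the margin, so it yields a non-zero, non-trivial loss."""
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)
    d_an = np.sum((anchor - negative) ** 2, axis=-1)
    return (d_an > d_ap) & (d_an < d_ap + margin)
```

Mining strategies simply restrict the triplets used for training to those for which `is_semi_hard` (or a hard-negative criterion) holds, which removes the trivial, zero-loss triplets that slow convergence.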
input is maximised. Wen et al. developed an algorithm [66] that used the centre loss
to enhance the discrimination of deep features. The centre loss is a loss function that efficiently pulls the deep features of the same class to their class centre.
The deep metric learning techniques suffer from a common drawback in both the end-to-end integration-based approaches and the Siamese network-based approaches: slow convergence, caused by trivial samples that contribute little during the training phase. The possible solution for most of these techniques is
to apply sample mining strategies by selecting nontrivial samples which can accelerate
convergence and reduce overfitting. These nontrivial samples are called semi-hard nega-
tives/positives or hard negatives/positives. Different sample strategies target either hard
negatives [73–76], or semi-hard negatives [65], or hard positives [74, 76] or semi-hard
positives.
Some recent methods in face recognition using deep metric learning proved to be
very effective. A popular line of research is to incorporate margins in well-established
loss functions to maximise face class separability. Liu et al. presented an algorithm
(SphereFace [68]) which assumes that the deep features can be used as a representation
of the centre of the given class/label of faces in a polar coordinate system. They intro-
duced a loss function called angular softmax loss that helps networks to learn angular
discriminative features. Wang et al. [77] proposed a novel algorithm, called Large Mar-
gin Cosine Loss (LMCL), which takes the normalised features as input to learn highly
discriminative features by maximising the interclass cosine margin. Recently, Deng et
al. [2] proposed an algorithm called Additive Angular Margin Loss (ArcFace in short)
to obtain highly discriminative features for face recognition. The proposed algorithm
has a very clear interpretation of geometry. The main concept revolves around the use of
the exact correspondence to the geodesic distance on a hypersphere. This has a substan-
tial advantage in face recognition compared to vanilla softmax. Although the softmax
loss provides decent separable feature embedding, it produces noticeable ambiguity in
decision boundaries. However, the ArcFace loss can enforce a better separation between
the nearest classes.
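The additive angular margin idea can be illustrated on a single target-class logit. This is a hedged NumPy sketch of our own making; the margin m = 0.5 and scale s = 64 are common choices reported in the ArcFace experiments, not values fixed by this thesis:

```python
import numpy as np

def arcface_logit(embedding, class_weight, margin=0.5, scale=64.0):
    """ArcFace-style additive angular margin for the target class:
    compute cos(theta + m) on the L2-normalised feature and class weight,
    where theta is the angle (geodesic distance on the unit hypersphere)."""
    e = embedding / np.linalg.norm(embedding)
    w = class_weight / np.linalg.norm(class_weight)
    theta = np.arccos(np.clip(np.dot(e, w), -1.0, 1.0))  # angle between e and w
    return scale * np.cos(theta + margin)
```

The key design choice is that the margin is added to the *angle* rather than to the cosine (CosFace/LMCL) or to the multiplicative angle (SphereFace), which is what gives the loss its exact correspondence to geodesic distance on the hypersphere.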
2.6 Critical Thinking and Analysis: Summary
state-of-the-art, we will have a detailed discussion in the following chapters about the
research performed towards achieving our goal and addressing the gaps present in the
current development.
Chapter 3
3.1 Introduction
Photorealistic image enhancement is challenging but in high demand in real-world
applications. Image enhancement can be broadly classified into two domains: super-
resolution (SR) and artifact removal (AR). The task of estimating a high-resolution im-
age from its low-resolution (LR) counterpart is the SR and estimating an artifact-free
sharp image from its corrupted counterpart is the AR. The AR problem is particularly
prominent for highly compressed images and videos, for which texture detail in the re-
constructed images is typically absent. The same problem persists for SR as well. One
major problem with the current state-of-the-art is that there does not exist any end-to-
end network which can solve the problem of AR and SR simultaneously, thus requiring
two different algorithms to be applied to the image if both AR and SR are desired.
This is an important problem which needs to be addressed. In this era of a data-
driven world, images play an important role. A Forbes report from mid 2018 mentions
that more than 300 million photos get uploaded per day, most of them from mobile phones. The number of videos uploaded every day is also comparable. Current mobile phones are capable of creating images which are more than 15
megapixels on average, and can also produce HD to 4K video. Thus, saving all images and videos in full resolution can be costly. Therefore, most images and videos are downscaled and compressed by several factors before saving, thus losing a lot of detail. This also introduces compression artifacts like JPEG and MPEG artifacts. This is the most common problem in images on the Internet (for instance, on Twitter, Instagram, etc.). This problem also persists for object recognition and classification in surveillance
videos where some people/objects are typically far away from the camera and appear
small in the image. A simultaneous superresolution and artifact removal is highly useful
in these scenarios and we have explored this possibility in this chapter. To cope with the
problem of generating high perceptual quality images, different approaches have been
proposed [42–44]. These approaches deal either with SR or with AR, but not both.
Supervised image enhancement algorithms [4, 7, 41] generally try to minimise the mean squared error (MSE) between the recovered high-resolution (HR) image and the ground truth, thus maximising the PSNR. However, the ability of MSE to capture perceptually relevant differences, such as high texture detail, is insufficient, as it is defined based on pixel-wise image differences. This leads to an image having an inferior
perceptual quality. Recently, deep learning has shown impressive results. In particular,
the Super Resolution Convolutional Neural Network (SRCNN) proposed by Dong et
al. [4] shows the potential of an end-to-end deep convolutional network in SR. Ledig et
al. [3] presented a framework called SRGAN which is capable of generating photoreal-
istic images for 4× up-scaling factors, but there are several problems of this framework
when used for SR in conjunction with an AR framework. Dong et al. [7] discovered that
SRCNN directly applied for compression artifact reduction leads to undesirable noisy
patterns, thus proposing a new improved model called Artifact Reduction Convolutional
Neural Networks (ARCNN), which showed better performance. Svoboda et al. [41] proposed the L4 and L8 architectures, which have better results compared to ARCNN but still fail to completely remove all artifacts for highly compressed JPEG images. A ma-
jor drawback for all successful methods till date is that all the proposed methods work
on the luma channel (channel Y in YCbCr colour space which is monochrome), but
none of them reports the performance on colour images, although AR in colour images
is more relevant. To the best of our knowledge, a versatile, robust algorithm which solves all kinds of image enhancement problems is yet to be proposed. Thus, we see
that there is a significant gap in the state of the art for general image enhancement problems, particularly an end-to-end solution for the superresolution and artifact removal problems.
In this chapter, we propose a novel Image Enhancing Generative Adversarial Net-
work (IEGAN) using U-net like generator with skip connections and an autoencoder-
like discriminator. This is a multi-purpose image enhancement network which is ca-
pable of removing artifacts and superresolving with high sharpness and detail in an
end-to-end manner, simultaneously, within a single network. Our main contributions
are summarised as follows:
• We propose a new and improved perceptual loss function which is the sum of the
reconstruction loss of the discriminator, the feature loss from the VGG network
[9] and the edge loss from the edge detector. This novel loss function preserves
the sharpness of the enhanced image, which is often lost during enhancement.
• We also create a benchmark dataset named World100 for testing the performance
of our algorithm on high-resolution images.
3.2 Related Work
past, JPEG compression AR algorithm involving deep learning has been proposed. De-
signing a deep model for AR requires a deep understanding of the different artifacts.
Dong et al. [7] showed that directly applying the SRCNN architecture for JPEG AR re-
sulted in undesired noisy patterns in the reconstructed image, and thus proposed a new
improved model. Svoboda et al. [41] proposed a novel method of image restoration us-
ing convolutional networks that had a significant quality advancement compared to the
state-of-the-art methods. They trained a network with eight layers in a single step and in
a relatively short time by combining residual learning, skip architecture, and symmetric
weight initialisation.
a pretrained VGG network instead of using pixel-wise error. They proposed a percep-
tual loss function based on the Euclidean distance between feature maps extracted from
the VGG19 [9] network. Ledig et al. [3] proposed a GAN-based network optimized for
perceptual losses which are more invariant to changes in pixel space, obtaining better
visual results.
3.3 Our Approach
To estimate the enhanced image for a given low-quality image, we train the generator network as a feed-forward CNN $G_{\theta_G}$ parameterised by $\theta_G$. Here $\theta_G = \{W_{1:L}; b_{1:L}\}$ denotes the weights and biases of an $L$-layer deep network, obtained by optimising a loss function $F_{loss}$. The training is done using two sets of $n$ images $\{I_i^{GT} : i = 1, 2, \ldots, n\}$ and $\{I_j^{LR} : j = 1, 2, \ldots, n\}$ such that $I_i^{GT} = G_{\theta_G}(I_j^{LR})$ (where $I_i^{GT}$ and $I_j^{LR}$ are corresponding pairs) and by solving

$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{\substack{i,j=1 \\ i=j}}^{n} F_{loss}\big(I_i^{GT}, G_{\theta_G}(I_j^{LR})\big) \quad (3.1)$$
Following the work of Goodfellow et al. [23] and Isola et al. [87], we also add a
discriminator network DθD to assess the quality of images generated by the generator
network GθG .
The generative network is trained to generate the target images such that the differ-
ence between the generated images and the ground truth is minimised. While training
the generator, the discriminator is trained in an alternating manner such that the proba-
bility of error of the discriminator (between the ground truth and the generated images)
is minimised. With this adversarial min-max game, the generator can learn to create so-
lutions that are highly similar to real images. This also encourages perceptually superior
solutions residing in the manifold of natural images.
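The empirical objective of Equation 3.1 simply averages $F_{loss}$ over corresponding image pairs. A minimal sketch of that averaging, with an identity "generator" and MSE standing in for the real $G_{\theta_G}$ and $F_{loss}$ (both are placeholders for illustration, not the networks or losses used in this work):

```python
import numpy as np

def empirical_loss(gt_images, lr_images, generator, f_loss):
    """Mean of f_loss over corresponding (ground-truth, low-quality)
    image pairs, as in Equation 3.1."""
    assert len(gt_images) == len(lr_images)
    losses = [f_loss(gt, generator(lr)) for gt, lr in zip(gt_images, lr_images)]
    return float(np.mean(losses))

def identity_generator(x):
    """Placeholder generator: returns its input unchanged."""
    return x

def mse(a, b):
    """Placeholder loss: pixel-wise mean squared error."""
    return float(np.mean((a - b) ** 2))
```

During training, minimising this quantity over $\theta_G$ alternates with discriminator updates, which is the adversarial min-max game described above.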
Figure 3.1: The overall architecture of our proposed network. The convolution layers of the generator have a kernel size of 3 × 3 and a stride of 1. The number of filters for each layer is indicated in the illustration, e.g., n32 refers to 32 filters. For the discriminator network, the stride is 1 except for the layers indicated otherwise, e.g., n64s2 refers to 64 filters and stride=2
we found that it is more useful to add skip connections following the general shape of a U-Net [89]. As the name implies, skip connections in deep architectures skip some layers and feed the output of one layer as the input to a later, non-adjacent layer. Thus, we give the gradient an alternative path during backpropagation. This approach generally aids model convergence, which has been proven experimentally. Specifically, we add skip connections between each layer n and layer L − n, where L is the total number of layers. Each skip connection simply concatenates all channels at layer n with those at layer L − n. This type of skip connection is motivated by the fact that it provides an uninterrupted gradient flow from the first to the last layer, thereby addressing the vanishing gradient problem. Concatenative skip connections also allow features of the same dimension from previous layers to be reused. The proposed deep generator network GθG
is illustrated in Figure 3.1. The generator has an additional block containing two sub-
pixel convolution layers immediately before the last layer for cases where p > 0, i.e.,
where the size of the output is greater than the input. These layers are called pixel-
shuffle layers, as proposed by Shi et al. [58]. Pixel shuffle transformation reorganises
the low-resolution image channels to produce a larger image with fewer channels. We
can reorganise the set of pixels into a single larger image by introducing a pixel shuffle
layer after a series of small processing steps. The main advantage of using this trans-
formation is that it increases the computational efficiency of the neural network model.
Each pixel shuffle layer in our network increases the resolution of the image by 2×. In
Figure 3.1, we show two such layers, which super-resolve the image by 4×. If p = 0,
we do not need any such block since the size of the output image is equal to the input
image.
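The pixel-shuffle rearrangement described above is a pure depth-to-space reshape, and can be written directly; a NumPy sketch of our own (a framework implementation such as `torch.nn.PixelShuffle` would be used in practice):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r), as in the
    subpixel convolution layer of Shi et al.: each group of r*r channels
    is interleaved into an r-times-larger spatial grid."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    out = x.reshape(c, r, r, h, w)       # split channels into (C, r, r)
    out = out.transpose(0, 3, 1, 4, 2)   # reorder to (C, H, r, W, r)
    return out.reshape(c, h * r, w * r)  # merge into the upscaled image
```

Each call increases the spatial resolution by a factor of r while dividing the channel count by r²; two consecutive layers with r = 2 therefore give the 4× upscaling used in the generator.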
The discriminator in our framework is crucial to the performance and consists of an encoder and a decoder, a typical autoencoder architecture. The output
of the autoencoder is the reconstructed image of its input which is either the ground
truth or the generator output. The discriminator output can then be used to differentiate
between the loss distribution of the reconstructed real image and the reconstructed gen-
erator image. This helps the discriminator to pass back a lot of semantic information to
the generator regarding the quality of the generated images, which is not possible with
a binary discriminator. Detailed analysis of the advantage of this type of discriminator
has been shown in Section 3.3.2. Our proposed discriminator contains eighteen convo-
lutional layers with an increasing number of 3 × 3 filter kernels. The specific number of
filters are indicated in Figure 3.1. Strided convolutions with stride=2 are used to reduce
the feature map size, and pixel-shuffle layers [58] are used to increase them. The overall
architecture of the proposed framework is shown in Figure 3.1 in detail.
Feature Loss
We choose the feature loss based on the ReLU activation layers of the pretrained 19-layer VGG network described in Simonyan and Zisserman [9]. This loss is described as VGG loss by Ledig et al. [3] and is mathematically expressed as

$$C_{loss}^{VGG_{i,j}}\big(I^{GT}, G_{\theta_G}(I^{LR})\big) = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \Big(\phi_{i,j}(I^{GT})_{x,y} - \phi_{i,j}\big(G_{\theta_G}(I^{LR})\big)_{x,y}\Big)^2 \quad (3.2)$$

where $\phi_{i,j}$ is the feature map obtained by the $j$th convolution (after activation) before the $i$th max-pooling layer within the pretrained VGG19 network, and $W_{i,j}$ and $H_{i,j}$ represent the width and height of the respective feature maps.
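Given feature maps already extracted from the pretrained VGG19 (in practice the intermediate activations φ_{i,j}; here passed in as plain arrays so the sketch stays self-contained), Equation 3.2 is just a normalised squared difference:

```python
import numpy as np

def feature_loss(phi_gt, phi_gen):
    """Equation 3.2: squared difference between VGG feature maps of the
    ground truth and the generated image, normalised by the feature-map
    width and height. Inputs are (H_ij, W_ij) activation maps."""
    assert phi_gt.shape == phi_gen.shape
    h, w = phi_gt.shape[:2]
    return float(np.sum((phi_gt - phi_gen) ** 2) / (w * h))
```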
Edge Loss
Preservation of edge information is essential for the generation of sharp and clear im-
ages. Thus, we add an edge loss to the feature loss counterpart. There are several edge detectors available in the literature; we designed our edge loss function around two candidates: the state-of-the-art Holistically-Nested Edge Detection (HED) algorithm proposed by Xie and Tu [90], and the classical Canny edge detection algorithm [91], chosen for its effectiveness and simplicity. Experimental results show that the Canny algorithm preserves sharpness similarly well but with greater speed and fewer resource requirements compared to HED. The detailed comparison results are discussed further in Section 3.4.3. For the Canny algorithm,
a Gaussian filter of size 3 × 3 with σ = 0.3 was chosen as the kernel. This loss is
mathematically expressed as
$$E_{loss}^{edge}\big(I^{GT}, G_{\theta_G}(I^{LR})\big) = \frac{1}{W H} \sum_{x=1}^{W} \sum_{y=1}^{H} \Big|\Theta(I^{GT})_{x,y} - \Theta\big(G_{\theta_G}(I^{LR})\big)_{x,y}\Big| \quad (3.3)$$

where $\Theta$ denotes the edge detection operator, and $W$ and $H$ are the width and height of the image.
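Equation 3.3 compares edge maps Θ(·) of the two images. In our implementation Θ is the Canny detector (in code, e.g., OpenCV's `cv2.Canny` after the 3 × 3 Gaussian blur with σ = 0.3); the sketch below substitutes a simple gradient-magnitude edge map for Θ so that it stays self-contained:

```python
import numpy as np

def edge_map(img):
    """Stand-in for the Canny operator: gradient magnitude via finite
    differences. It plays the role of Theta in Equation 3.3."""
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy)

def edge_loss(i_gt, i_gen):
    """Equation 3.3: mean absolute difference between the edge maps of
    the ground truth and the generated image."""
    h, w = i_gt.shape
    return float(np.sum(np.abs(edge_map(i_gt) - edge_map(i_gen))) / (w * h))
```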
Reconstruction Loss
Unlike most other algorithms, our discriminator provides a reconstructed image of the
discriminator input. Modifying the idea of Berthelot et al. [92], we design the discrim-
inator to differentiate between the loss distribution of the reconstructed real image and
the reconstructed fake image. Thus, we have the reconstruction loss function as

$$L_D = \big|L_D^{real} - k_t \times L_D^{fake}\big| \quad (3.4)$$

where $L_D^{real}$ is the loss distribution between the input ground truth image and the reconstructed output of the ground truth image, mathematically expanded as

$$L_D^{real} = r \times E_{loss}^{edge}\big(D_{\theta_D}(I^{GT}), D_{\theta_D}(G_{\theta_G}(I^{GT}))\big) + (1 - r) \times C_{loss}^{VGG_{i,j}}\big(D_{\theta_D}(I^{GT}), D_{\theta_D}(G_{\theta_G}(I^{GT}))\big) \quad (3.5)$$

$L_D^{fake}$ is the loss distribution between the generator output image and the reconstructed output of the same, expanded as

$$L_D^{fake} = r \times E_{loss}^{edge}\big(D_{\theta_D}(I^{LR}), D_{\theta_D}(G_{\theta_G}(I^{LR}))\big) + (1 - r) \times C_{loss}^{VGG_{i,j}}\big(D_{\theta_D}(I^{LR}), D_{\theta_D}(G_{\theta_G}(I^{LR}))\big) \quad (3.6)$$

and $k_t$ is a balancing parameter at the $t$th iteration which controls the amount of emphasis put on $L_D^{fake}$:

$$k_{t+1} = k_t + \lambda\big(\gamma L_D^{real} - L_D^{fake}\big) \quad \forall \text{ step } t \quad (3.7)$$

$\lambda$ is the learning rate of $k$, which is set to $10^{-3}$ in our experiments. Details about this can be found in [92].
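The discriminator loss of Equation 3.4 and the balancing update of Equation 3.7 are easy to state directly in code. A sketch (the clamping of k to [0, 1] follows the BEGAN formulation of Berthelot et al. [92]; the parameter values are the ones quoted in this section):

```python
def discriminator_loss(loss_real, loss_fake, k):
    """Equation 3.4: L_D = |L_real - k * L_fake|."""
    return abs(loss_real - k * loss_fake)

def update_k(k, loss_real, loss_fake, gamma=0.7, lam=1e-3):
    """Equation 3.7: k_{t+1} = k_t + lambda * (gamma * L_real - L_fake),
    clamped to [0, 1] as in BEGAN-style training."""
    k = k + lam * (gamma * loss_real - loss_fake)
    return min(max(k, 0.0), 1.0)
```

Starting from k = 0, the update gradually shifts emphasis between reconstructing real images well and penalising well-reconstructed fakes, until γ·L_real ≈ L_fake at equilibrium.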
Learning Parameters

The reconstruction loss involves several learning parameters: $k$, $\lambda$ and $\gamma$. In this section, we explain the function of each of these parameters and how they affect the learning process. As mentioned earlier, $k$ is a balancing parameter which controls the amount of emphasis put on $L_D^{fake}$; its value is not fixed, as at every step it needs to balance the real and the fake reconstruction losses. $\gamma$ maintains the equilibrium between the real image reconstruction loss and the fake image reconstruction loss, i.e., $\gamma L_D^{real} = L_D^{fake}$. A high value of $\gamma$ puts more emphasis on learning to reconstruct the real images better because it pushes $L_D^{real}$ towards lower values. Similarly, a low value of $\gamma$ helps learning to reconstruct the fake images. In our experiments, we have used $\gamma = 0.7$. Ideally, for a trained discriminator, $\gamma L_D^{real} - L_D^{fake} = 0$. While the network is being trained, this is not the case, and there exists a significant difference, but with learning, this difference becomes smaller. This difference decides the value of $k$: we start with $k = 0$, and at every step, the difference is added to the previous value of $k$. To control the learning of $k$, the difference is multiplied by $\lambda$, which is set to $10^{-3}$.
We formulate the final perceptual loss $F_{loss}$ as the weighted sum of the feature loss $C_{loss}$ and the edge loss $E_{loss}$ components added to the reconstruction loss, such that

$$F_{loss} = r \times E_{loss}^{edge} + (1 - r) \times C_{loss}^{VGG_{i,j}} + L_D \quad (3.8)$$

Substituting the values from Equations 3.2, 3.3 and 3.4, we have

$$F_{loss} = r \times \frac{1}{W H} \sum_{x=1}^{W} \sum_{y=1}^{H} \Big|\Theta(I^{GT})_{x,y} - \Theta\big(G_{\theta_G}(I^{LR})\big)_{x,y}\Big| + (1 - r) \times \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \Big(\phi_{i,j}(I^{GT})_{x,y} - \phi_{i,j}\big(G_{\theta_G}(I^{LR})\big)_{x,y}\Big)^2 + \big|L_D^{real} - k_t \times L_D^{fake}\big| \quad (3.9)$$
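Given the component losses computed separately, assembling the final perceptual loss is a one-liner; a sketch (the weight r = 0.5 here is illustrative, not a value fixed by the text):

```python
def perceptual_loss(e_loss, c_loss, l_real, l_fake, k, r=0.5):
    """Equation 3.9 as a function of precomputed component losses:
    F_loss = r * E_loss + (1 - r) * C_loss + |L_real - k * L_fake|."""
    return r * e_loss + (1.0 - r) * c_loss + abs(l_real - k * l_fake)
```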
3.4 Experiments
AR experiments for all the datasets are performed by degrading JPEG images to a qual-
ity factor of 10% (i.e., 90% degradation), the SR experiments are performed with an
upscaling factor of 4, and for AR+SR, the dataset is degraded to a quality factor of 10%
and the resolution is reduced by a factor of 4, which corresponds to a 16 times reduction
in image pixels.
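The degradations described above can be reproduced when preparing the data. JPEG re-encoding at quality factor 10 would use an image library (e.g., Pillow's `img.save(buf, "JPEG", quality=10)`); the 4× resolution reduction below is sketched as a simple box average in NumPy:

```python
import numpy as np

def downscale(img, factor=4):
    """Reduce the resolution of a 2-D image by `factor` via box averaging.
    A factor of 4 corresponds to a 16x reduction in pixel count."""
    h, w = img.shape[:2]
    h, w = h - h % factor, w - w % factor        # crop to a multiple of factor
    img = img[:h, :w]
    return img.reshape(h // factor, factor,
                       w // factor, factor).mean(axis=(1, 3))
```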
Figure 3.2: Comparison of edge detection for Canny and HED. Left to Right - Image,
edge output of Canny, edge output of HED. Best viewed in pdf.
Table 3.1: Performance comparison of the VGG+Canny and VGG+HED loss functions for AR and SR-4×. For GMSD, a lower value is better.

AR
Loss        PSNR    SSIM    GMSD ↓  HaarPSI
VGG+Canny   27.31   0.8124  0.0685  0.7533
VGG+HED     27.27   0.8180  0.0705  0.7481

SR-4x
Loss        PSNR    SSIM    GMSD ↓  HaarPSI
VGG+Canny   25.03   0.7346  0.0850  0.7297
VGG+HED     25.03   0.7457  0.0861  0.7279
G + Dv1
Loss        PSNR    SSIM   GMSD ↓  HaarPSI
VGG         27.12   0.801  0.074   0.737
L1          27.45   0.803  0.079   0.725
Canny+VGG   27.31   0.803  0.073   0.739
Canny+L1    27.74   0.811  0.075   0.738

G + Dv2
Loss        PSNR    SSIM   GMSD ↓  HaarPSI
VGG         27.26   0.806  0.0740  0.7358
L1          27.68   0.810  0.0753  0.7374
Canny+VGG   27.41   0.810  0.0739  0.7384
Canny+L1    27.73   0.810  0.0750  0.7380

Table 3.2: Performance of AR with different discriminators and loss functions, evaluated on the Y channel (luminance) for the LIVE1 dataset. The numbers in bold signify the best performance. For GMSD, a lower value is better.
Figure 3.3: Results of JPEG AR for different algorithms, with panels including IEGAN (RGB) and the Ground Truth. The Ground Truth was degraded to 10% of its original quality. Note that for IEGAN, the image is sharper. The IEGAN (BW) (black and white) image is provided for a fair comparison with the rest of the images. Best viewed in pdf.
Loss Function: We use a weighted combination of the VGG feature maps with
the Canny edge detector as a loss function. We also study the performance using VGG
and L1 separately combined with Canny to validate our claim that combining Canny
Table 3.4: Performance of state-of-the-art algorithms for SR on the Set14 dataset for RGB images. For GMSD, a lower value is better.

LIVE1
Algorithm      PSNR    SSIM    GMSD ↓  HaarPSI
ARCNN+SRGAN    21.61   0.5284  0.1980  0.4112
SRGAN+ARCNN    22.70   0.6417  0.1457  0.5302
IEGAN          22.57   0.6319  0.1404  0.5504

World100
Algorithm      PSNR    SSIM    GMSD ↓  HaarPSI
ARCNN+SRGAN    25.51   0.6809  0.1668  0.4792
SRGAN+ARCNN    27.16   0.7861  0.1059  0.6320
IEGAN          25.62   0.7651  0.1009  0.6429

Table 3.5: Performance of IEGAN for simultaneous AR+SR compared to other state-of-the-art algorithms for the benchmark LIVE1 dataset and the World100 dataset. For GMSD, a lower value is better.
Figure 3.4: Results of SR for different algorithms. The perceptual quality of SRGAN
and IEGAN outputs are visually comparable. Left to Right - SRCNN, SRGAN, IEGAN,
Ground Truth. Best viewed in pdf.
enhances the perceptual quality of the images. Table 3.2 shows the quantitative per-
formance of the algorithm with various discriminator and loss functions. We also ex-
periment with Holistically Nested Edge Detection (HED) [90] by replacing the Canny counterpart. HED is a state-of-the-art edge detection algorithm with better edge detection capability than Canny. However, from Figure 3.2, we can
observe that both HED and Canny successfully produce the required edge information,
which is perceptually indistinguishable to the human eye. In other words, the different
edge methods are not critical to the overall performance, which is also proved in Table
3.1. Thus, we choose the simple yet fast Canny method in our final framework.
The results from Table 3.1 and Table 3.2 confirm that the GAN with discriminator
Dv1 using a weighted combination of VGG with the Canny loss function gives the best
GMSD and HaarPSI score. The majority of the AR algorithms proposed till date work
on black and white images. Our proposed algorithm works with colour images as well.
The highest PSNR and SSIM values are obtained from the framework having the dis-
criminator Dv1 with the Canny+L1 loss function. PSNR is a common measure used to
evaluate AR and SR algorithms. However, the ability of PSNR to capture perceptually
relevant differences is very limited, as it is defined based on pixel-wise image differences [3, 98, 99]. We will be using Dv1 with loss VGG + Canny for the rest of the
Figure 3.5: Results for simultaneous SR+AR of RGB images using various algorithms.
Rows 1, 3 & 4 are from World100 and row 2 from the LIVE dataset. Note that the textures and details in the output images from IEGAN are far superior to those of the others. Best
viewed in pdf.
3.5 Conclusion
We have described a deep generative adversarial network with skip connections that sets
a new state of the art on public benchmark datasets when evaluated with respect to per-
ceptual quality. We have introduced an edge detector function in the loss which results
in sharper outputs at the edges. This network is the first framework which success-
fully recovers images from artifacts and at the same time super-resolves, thus having
a single-shot operation performing two different tasks. We have also shown that the
network is capable of solving the problem of artifact removal or superresolution indi-
vidually, and the results are comparable to the state-of-the-art, if not better. We have
highlighted some limitations of the existing loss functions used for training any image
enhancement network and introduced IEGAN, which augments the feature loss func-
tion with an edge loss during the training of the GAN. Our ablation study shows that a
critic-based discriminator, which are typical to Wasserstein GANs can provide better re-
sults compared to binary discriminators. Using different combinations of loss functions
and by using the discriminator both in feature and pixel space, we confirm that IEGAN
reconstructions for corrupted images are superior by a considerable margin and more
photorealistic than the reconstructions obtained by the current state-of-the-art methods.
Chapter 4
4.1 Introduction
Photorealistic enhancement of low-quality images is challenging but in high demand in real-world applications. It can be broadly classified into three domains:
(1) Super-resolution (SR) [3, 4] focuses on the task of estimating a high-resolution im-
age from its low-resolution counterpart; (2) Noise removal (NR) [5, 6] aims at cleaning
different noises present in the image, e.g., Gaussian noise, Poisson noise, etc.; (3) Ar-
tifact removal (AR) [7, 8] concentrates on estimating an artifact-free sharp image from
its corrupted counterpart, e.g., JPEG artifacts.
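For reference, the noise component of these degradations can be simulated when building training pairs; a NumPy sketch of our own (parameter values illustrative, with pixel intensities assumed in [0, 255]):

```python
import numpy as np

def add_gaussian_noise(img, sigma=10.0, rng=None):
    """Additive Gaussian noise (the NR setting), clipped to valid range."""
    rng = rng or np.random.default_rng(0)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0, 255)

def add_poisson_noise(img, rng=None):
    """Signal-dependent Poisson (shot) noise: each pixel value is used
    as the rate of a Poisson draw."""
    rng = rng or np.random.default_rng(0)
    return np.clip(rng.poisson(img).astype(float), 0, 255)
```

JPEG artifacts (the AR setting) would instead be produced by re-encoding at a low quality factor with an image library such as Pillow.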
To cope with the problem of generating high perceptual quality images from low-
quality ones, different methods have been proposed. In the spatial domain, different
kinds of filters [36–38] have been proposed to adaptively deal with blocking artifacts in
specific regions. In the frequency domain, wavelet transform has been utilized to derive
thresholds at different wavelet scales for deblocking and denoising [39, 40]. Recently,
deep learning-based approaches [3–5, 7, 8, 42–44] have shown impressive results. For
image superresolution, the seminal work, Super Resolution Convolutional Neural Net-
work (SRCNN) [4], shows the potential of an end-to-end deep convolutional network in
Figure 4.1: Illustration of inputs (i.e., low-resolution images with noises and artifacts)
and outputs (i.e., images after super-resolving, removing different noises and artifacts
simultaneously) of our proposed framework. Images in the first row are original low-
resolution noisy images. Images in the second row are a 4x zoomed version of original
images for proper visualisation of the noise. Images in the third row are the correspond-
ing super-resolved and clean output images obtained from our proposed algorithm. Best
viewed in color.
SR. Ledig et al. [3] presented a framework called SRGAN, which is capable of gener-
ating photorealistic images for 4× up-scaling factors. In the field of noise and artifact
removal from images, several attempts have been made. Zhang et al. [8] proposed a
CNN denoiser integrated with model-based optimization method to solve different in-
verse problems that show significant improvement compared to the existing methods.
Dong et al. [7] discovered that SRCNN directly applied for compression artifact re-
duction leads to undesirable noisy patterns, thus proposing an improved model called
Artifact Reduction Convolutional Neural Network (ARCNN), which shows better per-
formance. Svoboda et al. [41] proposed the L4 and L8 architectures which have better
results compared to ARCNN. Deep Image Prior [5] by Ulyanov et al. shows promising
results for various image enhancement tasks.
The above-mentioned image enhancement approaches provide state-of-the-art
results while only solving a single problem, e.g., either SR, NR or AR, but not all of them
simultaneously. However, in real-world scenarios, it is common for low-quality
images to arise from different sources, e.g., out-of-date camera sensors, poor-
lighting environments, etc., with several mixed issues. Therefore, conducting multiple
types of image enhancement to generate photorealistic images in a single framework is
highly demanded in practical applications. A naive approach to solve this practical prob-
lem is to build a pipeline consisting of a cascade of superresolution and image cleaning
algorithms, where each of the components incrementally performs its own task. How-
ever, the run-time of such a pipeline would be extremely long, since it combines several
different backbones. Furthermore, some components in the pipeline can also bring in
unexpected effects on the following ones, e.g., undesired artifacts from the superreso-
lution module [7], which cause the poor perceptual quality of the output images (see
Section 4.4 for detailed illustrations).
To address these issues, in this chapter, we propose a novel and effective end-to-end
image enhancement framework using a generative adversarial network (GAN). Specif-
ically, an encoder-decoder generator model is utilized to superresolve and clean the
images from noise and artifacts simultaneously, while a discriminator is designed to
help the generator further learn to generate outputs in the natural image manifold. Fur-
thermore, an edge detection module is also introduced to keep the sharpness and texture
details in the enhanced outputs, which is often lost during the enhancement procedure
in other methods. Figure 4.1 shows examples of inputs (i.e., low-resolution images with
noises and artifacts) and outputs (i.e., images after superresolving, removing different
noises and artifacts simultaneously) of our proposed framework.
To sum up, in this chapter, our main contributions are threefold:
• Extensive ablation studies and experiments show the effectiveness and superiority
of our proposed framework on several datasets.
4.2 Related Work
work independently on every channel or mainly focus on modeling the gray image prior.
Since the color channels are highly correlated, working on all channels simultaneously
produces better performance than independently handling each color channel [53].
4.3 Our Approach
between feature maps extracted from the VGG19 [9] network. Ledig et al. [3] pro-
posed a GAN-based network optimized for perceptual losses which are more invariant
to changes in pixel space, obtaining better visual results.
notes the weights and biases of an L-layer deep network and is obtained by optimising a
loss function F_loss. The training is done using two sets of n images {I_i^GT : i = 1, 2, ..., n}
and {I_j^LR : j = 1, 2, ..., n} such that I_i^GT = G_θG(I_j^LR) (where I_i^GT and I_j^LR are correspond-
ing pairs) and by solving

$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{\substack{i,j=1 \\ i=j}}^{n} F_{loss}\!\left(I_i^{GT},\, G_{\theta_G}(I_j^{LR})\right) \tag{4.1}$$
Following the work of Goodfellow et al. [23] and Isola et al. [87], we also add a
discriminator network DθD to assess the quality of images generated by the generator
network GθG .
The generative network is trained to generate the target images such that the differ-
ences between the generated images and the ground truth are minimised. While training
the generator, the discriminator is trained in an alternating manner such that the proba-
bility of error of the discriminator (between the ground truth and the generated images)
is minimised. With this adversarial min-max game, the generator can learn to create so-
lutions that are highly similar to real images. This also encourages perceptually superior
solutions residing in the manifold of natural images.
We follow the architectural guidelines of GAN proposed by Radford et al. [33].
For the generator we use a convolutional layer with small 3 × 3 kernels and stride=1
followed by a batch normalisation layer [18] and Leaky ReLU [88] as the activation
function. The number of filters per convolution layer is indicated in Figure 3.1.
For image enhancement problems, even though the input and output differ in appear-
ance, both are actually renderings of the same underlying structure. Therefore, the input
is more or less aligned with the output. We design the generator architecture keeping
these in mind. For many image translation problems, there is a lot of low-level informa-
tion shared between the input and output, and it will be helpful to pass this information
directly across the network. Ledig et al. had used residual blocks and a skip connection
in their SRGAN [3] framework to help the generator carry this information. However,
we found that it is more useful to add skip connections following the general shape of
a U-Net [89]. Specifically, we add skip connections between each layer n and layer
L − n, where L is the total number of layers. Each skip connection simply concatenates
all channels at layer n with those at layer L − n. The proposed deep generator network
GθG is illustrated in Figure 3.1. The generator has an additional block containing two
subpixel convolution layers immediately before the last layer for cases where p > 0,
i.e., where the size of the output is greater than the input. These layers are called pixel-
shuffle layers, as proposed by Shi et al. [58]. Each pixel shuffle layer increases the
resolution of the image by 2×. In Figure 3.1, we show two such layers, which super-
resolve the image by 4×. If p = 0, we do not need any such block since the size of the
output image is equal to the input image.
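The pixel-shuffle operation of Shi et al. [58] rearranges an (H, W, C·r²) tensor into an (H·r, W·r, C) one. A minimal NumPy sketch of the rearrangement follows; the channel ordering chosen here is one possible convention and may differ from the TensorFlow implementation used in our framework:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (H, W, C*r*r) -> (H*r, W*r, C), upscaling resolution by r."""
    H, W, Cr2 = x.shape
    C = Cr2 // (r * r)
    x = x.reshape(H, W, r, r, C)    # split the channel axis into (r, r, C)
    x = x.transpose(0, 2, 1, 3, 4)  # interleave: (H, r, W, r, C)
    return x.reshape(H * r, W * r, C)

# Two such layers in sequence upscale by 4x, as in the generator block.
lr = np.zeros((32, 32, 3 * 16))             # 3 output channels, 4x overall
hr = pixel_shuffle(pixel_shuffle(lr, 2), 2)
print(hr.shape)                             # (128, 128, 3)
```

Note that the operation only rearranges existing values; the preceding convolutions are responsible for producing the r²-fold channel expansion that the shuffle consumes.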
The discriminator in our framework is crucial to the overall performance and is de-
signed in the form of an autoencoder. Thus, the output of the autoencoder is the recon-
structed image of its input which is the ground truth or the generator output. This helps
the discriminator to pass back a lot of semantic information to the generator regarding
the quality of the generated images, which is not possible with a binary discriminator.
Our proposed discriminator contains eighteen convolutional layers with an increasing
number of 3 × 3 filter kernels. The specific number of filters are indicated in Figure
3.1. Convolutions with stride=2 are used for feature map reduction, and pixel-shuffle
layers [58] are used to increase them. The architecture of the proposed framework is
shown in detail in Figure 3.1.
Feature Loss
We choose the feature loss based on the ReLU activation layers of the pretrained 19-
layer VGG network described in Simonyan and Zisserman [9]. This loss is described as
$$C_{loss}^{VGG_{i,j}}\!\left(I^{GT},\, G_{\theta_G}(I^{LR})\right) = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{GT})_{x,y} - \phi_{i,j}(G_{\theta_G}(I^{LR}))_{x,y} \right)^2 \tag{4.2}$$
where φ_{i,j} is the feature map obtained by the j-th convolution (after activation) before
the i-th max-pooling layer within the pretrained VGG19 network, and W_{i,j} and H_{i,j} represent
the width and height of the respective feature map.
It is an alternative to pixel-wise loss, in that it attempts to be closer to perceptual
similarity. This feature loss assists in determining which features are present in the
ground truth image and evaluating how well the model’s predicted features match these.
This enables the model trained with this loss function to generate much finer detail in
the generated features and, thus, in the overall output.
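With the feature maps φ_{i,j}(·) already extracted from a fixed, pretrained VGG19 (the extraction itself is omitted here), the feature loss reduces to a normalised sum of squared differences. A minimal sketch, assuming a (W, H, C) feature-map layout for illustration:

```python
import numpy as np

def feature_loss(phi_gt, phi_gen):
    """Squared-difference loss between two VGG19 feature maps of shape (W, H, C),
    normalised by the feature-map width and height (cf. Equation 4.2)."""
    W, H = phi_gt.shape[0], phi_gt.shape[1]
    return float(((phi_gt - phi_gen) ** 2).sum() / (W * H))
```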
Edge Loss
Preservation of edge information is essential for the generation of sharp and clear im-
ages. Thus, we add an edge loss counterpart to the feature loss. Several edge
detectors are available in the literature, and we have designed our edge loss func-
tion around the state-of-the-art edge detection algorithm proposed by Xie and Tu, called
Holistically-nested Edge Detection (HED) [90], and the classical Canny edge detection
algorithm [91], the latter chosen for its effectiveness and simplicity. Experimental results
show that the Canny algorithm preserves sharpness equally well but with
greater speed and fewer resource requirements compared to HED. For the Canny al-
gorithm, a Gaussian filter of size 3 × 3 with σ = 0.3 was chosen as the kernel. For a
very high resolution image, a bigger kernel will produce better results. We used a lower
threshold of 0.050 and an upper threshold of 0.40. This loss is mathematically expressed
as
$$E_{loss}^{edge}\!\left(I^{GT},\, G_{\theta_G}(I^{LR})\right) = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| \Theta(I^{GT})_{x,y} - \Theta(G_{\theta_G}(I^{LR}))_{x,y} \right| \tag{4.3}$$

where Θ(·) denotes the binary edge map of an image.
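Given precomputed binary edge maps Θ(·) (Canny in our setting; the detector itself is abstracted away here), the loss is simply the mean absolute difference between the two maps:

```python
import numpy as np

def edge_loss(edges_gt, edges_gen):
    """Mean absolute difference between two binary (0/1) edge maps (Equation 4.3)."""
    assert edges_gt.shape == edges_gen.shape
    return float(np.abs(edges_gt.astype(float) - edges_gen.astype(float)).mean())
```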
It is common for SR algorithms to produce images with soft edges; if the pictures
contain noise, the noise is enhanced as well, i.e., any imperfection in the image is su-
[Figure 4.2 panels: Input | Ground Truth | SRGAN+IRCNN | Ours]
Figure 4.2: Canny edges for the input, ground truth, output of SRGAN+IRCNN and our
proposed algorithm. We have highlighted the areas of significant difference with red
boxes. Best viewed in color.
perresolved. This is quite evident in the images shown in Figure 4.4, where the edges of
the images are either very soft (e.g., SRCNN + DIP, Bicubic + IRCNN, etc.) or show unwanted
distortions like random streaks (e.g., SRGAN + IRCNN, E-net + ARCNN). This obser-
vation motivated us to use the edge loss counterpart to solve this problem. Since the
edge map used in our loss is binary (i.e., it consists of 0s and 1s only), the neurons are acti-
vated wherever the edges differ. Thus, the loss can minimise unwanted artifacts
and maximise the sharpness of the output image. Figure 4.2 illustrates this: the edge loss
counterpart helps to reduce the difference between the edge map of the output and the
ground truth, and as a result, the edge map of our output is quite similar to that of the
ground truth. Other algorithms, such as the cascade of SRGAN+IRCNN cannot use
the edge information, thus resulting in inferior output images. This is clearly visible in
the areas highlighted within the red boxes where our output image does not deviate a
lot from the ground truth. Among the compared methods, ours produces the best
results.
The Canny edge detector works best at region boundaries in the image, i.e., those
areas which correspond to Canny edges. As Canny edges are sparse, this loss counter-
part does not contribute much to the fine internal texture of the image. However, when
the feature loss is combined with the edge loss, it is able to reproduce the image with
high sharpness in the region boundaries as well as the fine internal texture.
Reconstruction Loss
Unlike most other algorithms, our discriminator provides a reconstructed image of the
discriminator input. This is typical of Wasserstein GANs [100], where the discriminator
acts as a critic of how well the generated image matches the ground truth. Modifying the idea
of Berthelot et al. [92], we design the discriminator to differentiate between the loss
distribution of the reconstructed real image and the reconstructed fake image. Thus, we
have the reconstruction loss function as

$$L_D = \left| L_D^{real} - k_t \times L_D^{fake} \right| \tag{4.4}$$

where $L_D^{real}$ is the loss distribution between the input ground truth image and the recon-
structed output of the ground truth image, mathematically expanded as

$$L_D^{real} = r \times E_{loss}^{edge}\!\left( D_{\theta_D}(I^{GT}),\, D_{\theta_D}(G_{\theta_G}(I^{GT})) \right) + (1 - r) \times C_{loss}^{VGG_{i,j}}\!\left( D_{\theta_D}(I^{GT}),\, D_{\theta_D}(G_{\theta_G}(I^{GT})) \right) \tag{4.5}$$

and $L_D^{fake}$ is the loss distribution between the generator output image and the reconstructed
output of the same, expanded as

$$L_D^{fake} = r \times E_{loss}^{edge}\!\left( D_{\theta_D}(I^{LR}),\, D_{\theta_D}(G_{\theta_G}(I^{LR})) \right) + (1 - r) \times C_{loss}^{VGG_{i,j}}\!\left( D_{\theta_D}(I^{LR}),\, D_{\theta_D}(G_{\theta_G}(I^{LR})) \right) \tag{4.6}$$
and $k_t$ is a balancing parameter at the $t$-th iteration, which controls the amount of emphasis
put on $L_D^{fake}$. The value of $k$ is not fixed at every step, as it needs to balance the real and
fake reconstruction losses. We start with the value $k = 0$, and with every step the weighted
difference of the two loss terms is added to the previous value of $k$. It gets updated as per
the following equation:

$$k_{t+1} = k_t + \lambda \left( \gamma L_D^{real} - L_D^{fake} \right) \quad \forall\, \text{step } t \tag{4.7}$$

where $\gamma$ maintains the equilibrium between the real image reconstruction loss and the
fake image reconstruction loss, i.e., $\gamma L_D^{real} = L_D^{fake}$. A high value of $\gamma$ will put more
emphasis on learning to reconstruct the real images better, because it will push $L_D^{real}$ to have lower
values. Similarly, a low value of γ will help learning to reconstruct the fake images. In
our experiments, we have used the value of 0.7. λ is the learning rate of k which is set
as 10−3 in our experiments. Further details about this can be found in [92].
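One update step of this balancing parameter can be sketched as follows; clamping k to [0, 1] follows BEGAN [92] and is an assumption here, since the text above only fixes the starting value k = 0:

```python
def update_k(k, loss_real, loss_fake, gamma=0.7, lam=1e-3):
    """One step of Equation 4.7: k_{t+1} = k_t + lambda * (gamma * L_real - L_fake).
    gamma = 0.7 and lambda = 1e-3 are the values used in our experiments."""
    k = k + lam * (gamma * loss_real - loss_fake)
    return min(max(k, 0.0), 1.0)  # clamp to [0, 1], as in BEGAN (assumed)

k = 0.0                                   # starting value, as in the text
k = update_k(k, loss_real=1.0, loss_fake=0.5)  # approximately 2e-4
```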
We formulate the final perceptual loss F_loss as the weighted sum of the feature loss C_loss
and the edge loss E_loss components added to the reconstruction loss, such that

$$F_{loss} = r \times E_{loss} + (1 - r) \times C_{loss} + L_D \tag{4.8}$$

Substituting the values from Equations 4.2, 4.3 and 4.4, we have

$$F_{loss} = r \times \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| \Theta(I^{GT})_{x,y} - \Theta(G_{\theta_G}(I^{LR}))_{x,y} \right| + (1 - r) \times \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \left( \phi_{i,j}(I^{GT})_{x,y} - \phi_{i,j}(G_{\theta_G}(I^{LR}))_{x,y} \right)^2 + \left| L_D^{real} - k_t \times L_D^{fake} \right| \tag{4.9}$$
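With the individual loss terms precomputed, the final combination is a one-liner; the weight r = 0.4 used in our experiments is taken as the default:

```python
def total_loss(e_loss, c_loss, l_real, l_fake, k, r=0.4):
    """Equation 4.9: weighted edge + feature losses plus the reconstruction term."""
    return r * e_loss + (1.0 - r) * c_loss + abs(l_real - k * l_fake)
```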
4.4 Experiments
To validate the performance of our proposed algorithm, we have performed extensive
experiments on public benchmark datasets for various combinations of degradation.
This includes noise removal with superresolution and noise with artifact removal and
superresolution simultaneously. The main tasks of this empirical study are to verify
how well our proposed networks perform for each of the noise cases combined with
superresolution. We will also verify how our method performs when images contain
three forms of degradation together, i.e., noise, compression artifacts, and downsam-
pling. Although our method can be tested on any set of images, we have selected some
particular benchmark image sets which have been used extensively in the research com-
munity to test image enhancement. We discuss further details about the datasets and the
implementation in the following section.
• For JPEG degradation, images were degraded with a JPEG quality factor of 10%
(i.e., 90% degradation).
• For Gaussian noise, a Gaussian noise distribution with mean=0 and variance=0.01
was added to the image.
• For Poisson noise, a Poisson noise distribution with mean=0 and variance=0.01
was added to the image.
• For Speckle noise, a uniform noise distribution with mean=0 and variance=0.01
was added to the image using out = image + n × image, where n is the uniform
noise with the specified mean and variance.
• For salt and pepper noise, random pixels were selected and the value was replaced
with either 0 or 1. The proportion of image pixels replaced with noise is 0.5.
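The degradations above can be reproduced in a few lines of NumPy. This is an illustrative reconstruction of the described noise models (function names and the fixed seed are our own), not the exact data-generation code used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(img, var=0.01):
    """Additive zero-mean Gaussian noise, clipped to [0, 1]."""
    return np.clip(img + rng.normal(0.0, np.sqrt(var), img.shape), 0.0, 1.0)

def speckle_noise(img, var=0.01):
    """Multiplicative uniform noise: out = image + n * image, where n is zero-mean
    uniform with the given variance (uniform on [-a, a] has variance a^2 / 3)."""
    a = np.sqrt(3.0 * var)
    n = rng.uniform(-a, a, img.shape)
    return np.clip(img + n * img, 0.0, 1.0)

def salt_and_pepper(img, amount=0.5):
    """Replace a proportion `amount` of randomly selected pixels with 0 or 1."""
    out = img.copy()
    mask = rng.random(img.shape) < amount
    out[mask] = rng.integers(0, 2, size=int(mask.sum())).astype(img.dtype)
    return out
```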
For Gaussian, Poisson, and speckle noise, all outputs were clipped to the range [0, 1]. Af-
ter adding the noise, the resolutions of the images were downscaled by a factor of 4,
which corresponds to a 16× reduction in the number of pixels. For comparison, we used
bicubic, E-Net [97], SRCNN [4] and SRGAN [3] for super-resolution and bilateral [6],
ARCNN [7], IRCNN [8] and Deep Image Prior [5] for noise and artifact removal. Since
there are no end-to-end networks available for solving these problems simultaneously,
we have used a series of networks for solving each problem separately. For each com-
bination of the superresolution algorithm and artifact/noise removal, we used a cascade
of the superresolution algorithm first and then the artifact/noise removal algorithm. We
have also experimented with the reverse combination, i.e., the artifact/noise removal al-
gorithm first and then the superresolution algorithm, but the results were much worse.
[Figure 4.3 panels: Input | Input magnified | Ground Truth]
Figure 4.3: Visual results for 4xSR + Salt and Pepper NR + JPEG AR of RGB images
combining various algorithms. The image is taken from Set14 dataset and cropped for
better visualisation. For each image, from left to right, the scores are PSNR, SSIM,
GMSD and HaarPSI. Best viewed in color.
Thus, we have skipped those due to space limitations. This has also been shown and
verified by Ghosh et al. in [45]. In the case of simultaneous noise removal, artifact removal,
and superresolution, the experiments have been compared with a cascade of 3 different
networks consisting of the best available method for superresolution (SRGAN), noise
removal (IRCNN) and artifact removal (ARCNN). We have shown the results for both
combinations, i.e., SRGAN-ARCNN-IRCNN and SRGAN-IRCNN-ARCNN whose re-
sults have been analysed in the following sections.
We trained all networks on an NVIDIA DGX-Station using a random sample of
60,000 images from the ImageNet dataset [95]. For each minibatch, we cropped random
128 × 128 HR sub-images of distinct training images. Our generator model can be
applied to images of arbitrary size as it is a fully convolutional network. We scaled
the range of the image pixel values to [-1, 1]. When feeding the outputs to the VGG
network for calculating the loss function, we scale them back to [0, 1], since the VGG
network inherently handles image pixels in the range [0, 1]. For optimisation, we use Adam [96]
with β1 = 0.9. The value of r in Equation 4.8 is selected as 0.4. The network was trained
with a learning rate of 10−4 and with 105 update iterations. Our implementation is based
on TensorFlow.
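The two range conversions mentioned above are simple affine maps; for clarity:

```python
def to_generator_range(img01):
    """Scale pixel values from [0, 1] to the [-1, 1] range used by the network."""
    return img01 * 2.0 - 1.0

def to_vgg_range(img_pm1):
    """Scale network outputs from [-1, 1] back to the [0, 1] range expected by VGG."""
    return (img_pm1 + 1.0) / 2.0
```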
Table 4.1: Performance of our proposed algorithm for simultaneous Gaussian noise removal and 4x superresolution com-
pared to other state-of-the-art algorithms for the benchmark LIVE1 dataset, Set14 dataset and the BSD100 dataset. For
GMSD, lower value is better.
Method | LIVE1: PSNR SSIM GMSD HaarPSI | Set14: PSNR SSIM GMSD HaarPSI | BSD100: PSNR SSIM GMSD HaarPSI
Bicubic + ARCNN 22.851 0.580 0.210 0.346 22.824 0.574 0.194 0.391 23.621 0.572 0.202 0.398
SRGAN + ARCNN 22.349 0.571 0.200 0.348 22.389 0.563 0.188 0.391 22.001 0.550 0.196 0.381
SRCNN + ARCNN 22.942 0.596 0.200 0.363 23.135 0.593 0.183 0.409 23.258 0.578 0.187 0.405
ENet + ARCNN 22.233 0.564 0.201 0.359 22.369 0.560 0.189 0.401 21.884 0.524 0.194 0.398
Bicubic + IRCNN 22.834 0.581 0.198 0.359 22.851 0.578 0.187 0.398 23.723 0.589 0.189 0.412
SRGAN + IRCNN 22.069 0.546 0.189 0.360 22.073 0.537 0.186 0.397 21.066 0.495 0.189 0.385
SRCNN + IRCNN 22.869 0.586 0.187 0.371 23.130 0.586 0.174 0.413 21.066 0.495 0.189 0.385
ENet + IRCNN 21.622 0.525 0.193 0.369 21.776 0.522 0.188 0.406 20.924 0.475 0.195 0.399
Bicubic + DIP 22.466 0.554 0.212 0.331 22.217 0.545 0.203 0.352 22.823 0.503 0.214 0.359
SRGAN + DIP 20.379 0.448 0.249 0.241 20.234 0.436 0.228 0.279 20.960 0.425 0.232 0.290
SRCNN + DIP 21.755 0.513 0.225 0.293 21.515 0.503 0.214 0.322 22.135 0.472 0.210 0.338
ENet + DIP 20.807 0.462 0.243 0.255 20.515 0.447 0.225 0.290 21.721 0.443 0.224 0.314
Ours 23.206 0.662 0.177 0.424 22.386 0.614 0.164 0.419 22.945 0.595 0.180 0.440
Table 4.2: Performance of our proposed algorithm for simultaneous Poisson noise removal and 4x superresolution compared
to other state-of-the-art algorithms for the benchmark LIVE1 dataset, Set14 dataset and the BSD100 dataset. For GMSD,
lower value is better.
Method | LIVE1: PSNR SSIM GMSD HaarPSI | Set14: PSNR SSIM GMSD HaarPSI | BSD100: PSNR SSIM GMSD HaarPSI
Bicubic + ARCNN 22.984 0.585 0.211 0.347 23.007 0.581 0.194 0.395 23.836 0.582 0.203 0.401
SRGAN + ARCNN 22.463 0.578 0.201 0.348 22.502 0.571 0.187 0.390 21.843 0.549 0.192 0.383
SRCNN + ARCNN 23.101 0.604 0.201 0.364 23.325 0.602 0.182 0.414 23.490 0.590 0.189 0.409
ENet + ARCNN 22.381 0.571 0.202 0.361 22.542 0.569 0.188 0.407 22.223 0.542 0.194 0.402
Bicubic + IRCNN 22.991 0.590 0.199 0.361 23.058 0.594 0.186 0.403 23.994 0.609 0.190 0.419
SRGAN + IRCNN 22.194 0.558 0.190 0.360 22.201 0.553 0.185 0.399 21.340 0.524 0.187 0.389
SRCNN + IRCNN 22.194 0.558 0.190 0.360 23.362 0.605 0.173 0.420 23.444 0.588 0.176 0.417
ENet + IRCNN 21.796 0.539 0.194 0.372 21.970 0.540 0.186 0.412 21.373 0.507 0.192 0.407
Bicubic + DIP 22.672 0.568 0.210 0.339 22.475 0.562 0.198 0.363 23.048 0.520 0.212 0.365
SRGAN + DIP 20.466 0.454 0.249 0.240 20.205 0.441 0.232 0.274 20.996 0.432 0.232 0.291
SRCNN + DIP 21.732 0.513 0.225 0.290 21.664 0.515 0.212 0.324 22.211 0.476 0.211 0.339
ENet + DIP 20.899 0.468 0.242 0.257 20.649 0.459 0.225 0.289 21.817 0.451 0.223 0.316
Ours 22.944 0.638 0.178 0.417 22.672 0.631 0.163 0.423 23.229 0.617 0.179 0.448
Table 4.3: Performance of our proposed algorithm for simultaneous salt and pepper noise removal and 4x superresolution
compared to other state-of-the-art algorithms for the benchmark LIVE1 dataset, Set14 Dataset and the BSD100 dataset. For
GMSD, lower value is better.
Method | LIVE1: PSNR SSIM GMSD HaarPSI | Set14: PSNR SSIM GMSD HaarPSI | BSD100: PSNR SSIM GMSD HaarPSI
Bicubic + ARCNN 22.681 0.574 0.211 0.343 22.566 0.566 0.194 0.387 23.373 0.560 0.202 0.394
SRGAN + ARCNN 22.239 0.561 0.200 0.343 22.176 0.552 0.188 0.386 21.540 0.514 0.192 0.377
SRCNN + ARCNN 22.760 0.587 0.201 0.358 22.838 0.580 0.183 0.405 23.042 0.563 0.188 0.401
ENet + ARCNN 22.059 0.553 0.202 0.356 22.106 0.548 0.190 0.396 21.586 0.502 0.197 0.391
Bicubic + IRCNN 22.642 0.569 0.199 0.352 22.561 0.564 0.187 0.393 23.418 0.566 0.190 0.405
SRGAN + IRCNN 21.953 0.530 0.190 0.352 21.849 0.520 0.187 0.391 20.913 0.468 0.193 0.377
SRCNN + IRCNN 22.655 0.569 0.188 0.363 22.800 0.568 0.175 0.406 22.905 0.546 0.176 0.401
ENet + IRCNN 21.434 0.507 0.193 0.362 21.490 0.503 0.189 0.398 20.557 0.443 0.201 0.388
Bicubic + DIP 22.277 0.545 0.215 0.325 22.027 0.536 0.203 0.354 22.534 0.488 0.218 0.350
SRGAN + DIP 20.349 0.442 0.250 0.237 20.214 0.435 0.230 0.279 21.013 0.420 0.236 0.285
SRCNN + DIP 21.529 0.499 0.232 0.283 21.312 0.496 0.212 0.320 21.981 0.464 0.213 0.332
ENet + DIP 20.766 0.459 0.244 0.254 20.493 0.449 0.226 0.288 21.588 0.435 0.226 0.308
Ours 22.800 0.619 0.179 0.410 22.205 0.601 0.166 0.415 22.775 0.578 0.181 0.435
Table 4.4: Performance of our proposed algorithm for simultaneous speckle noise removal and 4x superresolution compared
to other state-of-the-art algorithms for the benchmark LIVE1 dataset, Set14 dataset and the BSD100 dataset. For GMSD,
GMSD, lower value is better.
Method | LIVE1: PSNR SSIM GMSD HaarPSI | Set14: PSNR SSIM GMSD HaarPSI | BSD100: PSNR SSIM GMSD HaarPSI
Bicubic + ARCNN 22.927 0.584 0.211 0.347 23.420 0.569 0.190 0.393 23.763 0.579 0.203 0.400
SRGAN + ARCNN 22.445 0.577 0.201 0.348 22.510 0.572 0.187 0.391 21.728 0.547 0.191 0.382
SRCNN + ARCNN 23.046 0.602 0.202 0.364 23.274 0.601 0.182 0.414 23.443 0.588 0.189 0.408
ENet + ARCNN 22.347 0.571 0.202 0.360 22.534 0.569 0.188 0.408 22.178 0.540 0.194 0.401
Bicubic + IRCNN 22.939 0.589 0.200 0.360 23.014 0.592 0.186 0.403 23.926 0.607 0.191 0.419
SRGAN + IRCNN 22.188 0.557 0.190 0.360 22.237 0.553 0.185 0.399 21.238 0.521 0.187 0.388
SRCNN + IRCNN 23.008 0.597 0.188 0.375 23.321 0.603 0.173 0.421 23.404 0.586 0.177 0.416
ENet + IRCNN 21.777 0.539 0.193 0.372 21.994 0.541 0.186 0.414 21.334 0.504 0.192 0.407
Bicubic + DIP 22.726 0.574 0.206 0.347 22.418 0.558 0.199 0.361 23.143 0.529 0.209 0.375
SRGAN + DIP 20.510 0.453 0.247 0.243 20.248 0.439 0.231 0.277 20.876 0.431 0.232 0.291
SRCNN + DIP 21.700 0.511 0.228 0.291 21.679 0.514 0.212 0.325 22.328 0.485 0.209 0.344
ENet + DIP 20.948 0.468 0.243 0.257 20.596 0.454 0.224 0.289 21.764 0.447 0.221 0.316
Ours 23.159 0.661 0.177 0.423 22.632 0.630 0.163 0.423 23.177 0.615 0.179 0.448
Table 4.5: Performance of our proposed algorithm for simultaneous JPEG Artifact removal and 4x superresolution compared
to other state-of-the-art algorithms for the benchmark LIVE1 dataset, Set14 dataset and the BSD100 dataset. For GMSD,
the lower value is better.
Method | LIVE1: PSNR SSIM GMSD HaarPSI | Set14: PSNR SSIM GMSD HaarPSI | BSD100: PSNR SSIM GMSD HaarPSI
Bicubic + ARCNN 22.769 0.579 0.210 0.330 22.818 0.575 0.193 0.387 23.400 0.570 0.202 0.376
SRGAN + ARCNN 22.287 0.570 0.200 0.336 22.389 0.567 0.187 0.386 21.594 0.533 0.191 0.378
SRCNN + ARCNN 22.850 0.595 0.201 0.344 23.115 0.594 0.182 0.403 23.153 0.579 0.187 0.396
ENet + ARCNN 22.170 0.564 0.201 0.342 22.383 0.564 0.187 0.398 21.799 0.527 0.194 0.388
Bicubic + IRCNN 22.736 0.580 0.199 0.341 22.835 0.585 0.186 0.393 23.464 0.587 0.191 0.390
SRGAN + IRCNN 22.003 0.548 0.189 0.347 22.087 0.547 0.184 0.394 21.010 0.501 0.189 0.382
SRCNN + IRCNN 22.769 0.587 0.188 0.353 23.107 0.595 0.173 0.408 23.153 0.579 0.187 0.396
ENet + IRCNN 21.584 0.529 0.192 0.352 21.802 0.534 0.185 0.403 20.870 0.483 0.194 0.391
Bicubic + DIP 22.619 0.575 0.202 0.338 22.198 0.548 0.202 0.351 22.580 0.499 0.216 0.347
SRGAN + DIP 20.460 0.451 0.248 0.240 20.185 0.440 0.230 0.273 20.991 0.427 0.233 0.289
SRCNN + DIP 21.437 0.502 0.231 0.277 21.516 0.511 0.214 0.320 22.030 0.472 0.210 0.334
ENet + DIP 20.734 0.464 0.243 0.251 20.482 0.452 0.222 0.289 21.645 0.444 0.225 0.308
Ours 22.659 0.635 0.181 0.387 22.329 0.612 0.167 0.407 22.747 0.592 0.182 0.423
[Figure 4.4 panels: Input | Input magnified | Bicubic+Bilateral (18.78/0.57/0.27/0.39) | Bicubic+ARCNN (18.56/0.59/0.28/0.41)]
Figure 4.4: Results for 4xSR + Gaussian NR of RGB images combining various algorithms.
The image is taken from BSD100 dataset. For each image, from left to right, the scores are
PSNR, SSIM, GMSD and HaarPSI. The scores for perceptual quality and structural consistency
of the image using our algorithm are superior to the rest. Best viewed in color.
is given by the image obtained from Bicubic+IRCNN, which appears to be quite blurry,
whereas our image is free from all kinds of noises or artifacts and is reasonably sharp.
Note that the output is obtained from a noisy image and is superresolved 4 times each
side (i.e., 16x increase in overall pixels) simultaneously using a single network. Some
of the cascade networks, for example, SRGAN+IRCNN, E-Net+IRCNN, etc., produce
sharp results, but unwanted streaks of different colors throughout the image degrade the
perceptual quality of the image.
Furthermore, we have shown the results for images degraded with salt and pepper
noise and JPEG compression artifacts along with 4x size reduction in Table 4.6. We
choose the best performing algorithms (i.e., SRGAN for superresolution, IRCNN for
noise removal, and ARCNN for artifact removal) for solving each problem and add
them into the cascaded pipeline. Even then, our GMSD and HaarPSI scores still
outperform those of the counterparts. The perceptual quality of the images is shown in Figure
4.3. Our output image is free of undesirable marks and is the closest to the ground truth
in terms of human perceptual quality.
4.5 Conclusion
We have described a deep generative adversarial network with skip connections that
sets a new state of the art on public benchmark datasets when evaluated with respect to
perceptual quality. This is the first framework that is able to successfully recover images
from artifacts and/or various noises and super-resolve them at the same time, thus
performing two to three different tasks in a single-shot operation. We have highlighted
some limitations of the existing loss functions used for training any image enhancement
network and introduced our network, which augments the feature loss function with an
edge loss during the training of the GAN. Using different combinations of loss functions
and by using the discriminator both in feature and pixel space, we confirm that our
proposed network’s reconstructions for corrupted images are superior by a considerable
margin and more photorealistic than the reconstructions obtained by the current state-
of-the-art methods.
Chapter 5
5.1 Introduction
Face detection and recognition from low-quality images is challenging but in high de-
mand in real-world applications. These problems are particularly prominent for highly
compressed images and videos, where details are typically absent. This is most common
in surveillance videos where the camera sensors are not of very high quality. Good de-
tection and recognition performance is highly desirable in these scenarios and we have
explored this possibility in this chapter.
To solve the problem of detecting/recognising faces with high accuracy, different
approaches have been proposed [1,2,32,65,68,72]. Most recognition algorithms attempt
to minimise the distance between similar identities and maximise the distance between
dissimilar identities, thus maximising the overall face recognition performance. The
limitation of these methods is that their performance drops significantly on degraded
images.
Pixel-wise loss functions like MSE or L1 loss are unable to recover the high-frequency
details in an image. These loss functions encourage finding pixel-wise averages of pos-
sible solutions. This alone neither leads to the reconstruction of discriminative facial fea-
tures nor to better perceptual quality [3, 42–44]. Johnson et al. [44] and Bruna et
al. [42] proposed a perceptual loss function based on the Euclidean distance between
feature maps extracted from the VGG19 [9] network. Zhang et al. proposed an algo-
rithm [8] to solve different problems related to image artifacts. A recent paper by Ghosh
et al. [45] uses edge loss on top of perceptual loss to simultaneously remove artifacts
and superresolve images. A major drawback of these perceptual image enhancement al-
gorithms is that they are not suitable for applications regarding face detection or recog-
nition. Improving perceptual quality does not necessarily lead to better face detection
or recognition; rather, performance can even degrade, as we show in Section 5.3.
To date, several algorithms related to face recognition have been proposed, most
of which do not specifically target facial images with degraded quality such as noise or
compression artifacts. There are a few pre-deep-learning-era algorithms which deal with
face recognition for low-quality images, but their performance is not adequate for
practical requirements. Some proposed algorithms deal with face hallucination
or superresolution, for example FSRNet [59]. Thus, an end-to-end method solving the
problem of low-quality face recognition is highly desirable. In this chapter, we propose
a novel discriminative facial feature restoring GAN, which enhances the discriminative
features of degraded face images, leading to better detection and recognition perfor-
mance. Our major contribution lies in the fact that we propose a generative method,
where our network enhances the degraded face. This is of utmost importance, as the
enhanced image can be fed into any existing algorithm, or into algorithms that are yet to be
proposed. Thus, this method has the ability to work in conjunction with multiple algo-
rithms and is not limited to a single algorithm. To the best of our knowledge, the enhancement
of face detection and recognition by restoring the lost features of the input images using
GANs has not been explored yet. In a nutshell, our novel framework offers
• enhancement of facial features for very low quality images,
5.2 Facial Feature Restoration Using GAN
5.2.1 Method
In our work, we aim to estimate a clean facial image f(I^G) from an image I^G which
is corrupted with various artifacts, including but not limited to MPEG compression
artifacts, interlacing artifacts, artifacts due to the inherent quality of camera sensors, etc.
Furthermore, I GT is the ground truth image which is the clean and artifact-free version
of I G and I R,S is the mugshot of the same identity in I G . Here we define a mugshot as a
clean, artifact-free, and properly illuminated frontal face pose of the person in question
with neutral facial expression and without any kind of facial occlusion. For an image
with c channels, we describe I G and I GT by a real-valued tensor such that I ∈ Rw×h×c .
To estimate the enhanced image for a given corrupted image, we train a generator
network as a feed-forward CNN $G_{\theta_G}$ parameterised by $\theta_G$. Here $\theta_G = \{W_{1:N}; b_{1:N}\}$ denotes
the weights and biases of an $N$-layer deep network and is obtained by optimising a loss
function $L$. The training is done using two sets of $n$ images $\{I_i^{GT} : i = 1, 2, \ldots, n\}$ and
$\{I_i^G : i = 1, 2, \ldots, n\}$, where $I_i^{GT}$ and $I_i^G$ are corresponding pairs,
and by solving

$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{n} \sum_{i=1}^{n} L\left(I_i^{GT}, G_{\theta_G}(I_i^G), I_i^{R,S}\right). \tag{5.1}$$
We also have a function $\Phi$, a facial feature extraction model, such that
$\Phi(I) = v_{ID}$, where $v_{ID} \in \mathbb{R}^d$ is the facial feature vector of dimension $d$.
Following the work of Goodfellow et al. [23], we also add a discriminator network $D_{\theta_D}$
to evaluate the images generated by the generator network $G_{\theta_G}$. The gener-
ative network is trained to generate the target image $G_{\theta_G}(I^G)$ such that it is similar to
the ground truth. Ideally, we want the cosine distance between $\Phi(G_{\theta_G}(I^G))$ and $\Phi(I^{GT})$
to be 0. This is achieved with a novel framework which enhances
the quality while preserving the identity of the image. While training the generator, the
discriminator is trained in an alternating manner such that the probability of error of the
discriminator (between the ground truth and the generated images) is minimised. With
this adversarial min-max game, the generator learns to generate images with restored
facial features. This also encourages the generation of perceptually superior solutions
residing in the manifold of clean facial images.

Figure 5.1: The generator architecture of the proposed network. Best viewed in color.
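The alternating update described above can be sketched schematically. The `disc_step` and `gen_step` callables here are hypothetical placeholders (each performs one optimiser update on a batch and returns its scalar loss); they are assumptions for illustration, not APIs from the thesis.

```python
# Schematic sketch of the alternating min-max training described above.
# `disc_step` and `gen_step` are hypothetical callables: each performs one
# optimiser update on a batch and returns its scalar loss.
def adversarial_round(disc_step, gen_step, batch):
    d_loss = disc_step(batch)  # first minimise the discriminator's probability of error
    g_loss = gen_step(batch)   # then push the generator toward the clean-face manifold
    return d_loss, g_loss
```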
The prior net carries prior information from the coarse network to the encoded im-
age. High-quality prior information can boost the performance of the network. Thus,
we minimise the facial feature loss between the ground truth and the coarse image.
The prior network helps to preserve the facial landmarks which are obtained from the
coarse image. The hourglass block acts as an attention network and creates an informa-
tion bottleneck where the most important facial features are allowed to pass and other
unimportant or redundant features are filtered out. Figure 5.2 shows a couple of visual
examples of the prior network output. It is clearly visible in the images that the most
important facial landmarks get highlighted. Facial landmarks are key points that are
used to localise and represent salient regions of the face, such as the eyes, eyebrows, nose,
and mouth. The output of the prior network is a single-channel array, which is then
added to all channels of the encoded image.
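The addition of the single-channel prior map to every channel of the encoded image can be sketched with broadcasting; the shapes here are our assumption for illustration.

```python
import numpy as np

# A minimal sketch (shapes are our assumption) of how the single-channel
# prior map is added to every channel of the encoded image by broadcasting.
def add_prior(encoded, prior):
    """encoded: (H, W, C) feature map; prior: (H, W) landmark map."""
    return encoded + prior[:, :, None]   # broadcast over the C channels
```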
The encoder encodes the images obtained from the coarse network to extract
latent features from the degraded faces. Figure 5.2 shows a couple of visual exam-
ples of the encoder network output. The encoder network consists of a convolution-
batchnorm-ReLU layer followed by thirteen residual blocks. Each residual block is a series
of two convolution-batchnorm-ReLU layers. The final layer of the encoder block is a
convolution layer with a 3-channel output.
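The encoder just described can be sketched in PyTorch as follows. This is a sketch under our assumptions (a 3-channel input, 64 filters with 3x3 kernels and stride 1 as stated later in the text, and an additive skip inside each residual block), not the exact thesis implementation.

```python
import torch
import torch.nn as nn

def cbr(cin, cout):
    # one convolution-batchnorm-ReLU unit (3x3 kernels, stride 1, per the text)
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 1, 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class ResBlock(nn.Module):
    # a residual block: a series of two convolution-batchnorm-ReLU layers
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(cbr(ch, ch), cbr(ch, ch))

    def forward(self, x):
        return x + self.body(x)  # additive skip (our assumption)

class Encoder(nn.Module):
    # conv-batchnorm-ReLU, thirteen residual blocks, then a 3-channel conv
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(cbr(3, ch),
                                 *[ResBlock(ch) for _ in range(13)],
                                 nn.Conv2d(ch, 3, 3, 1, 1))

    def forward(self, x):
        return self.net(x)
```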
The final block of the network is a decoder. The outputs of the prior network
and the encoder are added and then fed to the decoder to obtain the final output.
This sum of the outputs from the two blocks contains important facial fea-
tures and landmarks, which are essential for a human face to be recognised with high
confidence. The decoder consists of a convolution-batchnorm-ReLU layer followed by another
convolution layer. The output from this layer is then passed to three consecutive resid-
ual layers, followed by a final convolution layer. The output of the decoder is the facial
image with enhanced features, which helps achieve better recognition performance.
Our network does not use any down-sampling or up-sampling layers (apart
from the hourglass block). Moreover, we use a single hourglass block in the prior net
compared to two blocks in FSRNet. The most important differentiating factor is that our
prior net is based on an unsupervised method to create a facial landmark distribution,
whereas FSRNet uses a supervised method. The number of filters per convolution layer
in the coarse, encoder, and decoder networks is 64, with 3 × 3 kernels and stride 1. For
the prior network, the number of filters is 128, with 3 × 3 kernels and stride 1, except for
the first convolution layer, where the kernel is 7 × 7. The hourglass block consists of a
U-shaped network (with skip connections) with 4 blocks of convolution and 2 × 2 max-pool
layers, followed by 4 blocks of convolution and 2 × 2 nearest-neighbour upsampling layers.
A convolution is also applied to the output of each skip connection. The number of
filters is 128 for each convolution layer, with kernel size 1 × 1 and stride 1.

Figure 5.2: Visual examples of the outputs from different network components - (a) Fa-
cial landmark distribution from the prior network. (b) Latent features from the encoder
network. Best viewed in color.
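The hourglass block can be sketched in PyTorch as below. This follows the stated structure (four conv + 2x2 max-pool stages down, four conv + 2x2 nearest-neighbour upsampling stages up, a convolution on each skip connection, 128 filters, 1x1 kernels, stride 1); merging the skip branches by addition is our assumption, as the text does not specify it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass(nn.Module):
    """Sketch of the single hourglass block described in the text.
    Skip merging by addition is an assumption."""
    def __init__(self, ch=128):
        super().__init__()
        self.down = nn.ModuleList([nn.Conv2d(ch, ch, 1) for _ in range(4)])
        self.up = nn.ModuleList([nn.Conv2d(ch, ch, 1) for _ in range(4)])
        self.skip = nn.ModuleList([nn.Conv2d(ch, ch, 1) for _ in range(4)])

    def forward(self, x):
        skips = []
        for conv, sk in zip(self.down, self.skip):
            x = conv(x)
            skips.append(sk(x))       # convolution applied on the skip branch
            x = F.max_pool2d(x, 2)    # 2x2 max-pool
        for conv, sk in zip(self.up, reversed(skips)):
            x = F.interpolate(conv(x), scale_factor=2, mode='nearest')
            x = x + sk                # merge skip connection (assumed additive)
        return x
```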
ReLU activation is used throughout the network. To encourage the generation of
realistic images, a discriminator is added to the network. The design of our discriminator
is in line with the architectural guidelines of the discriminator proposed in [33]. The
network is then trained in an adversarial manner so that the features of the generated
images are driven towards the natural facial feature space.
Although inspired by [65], this method is very different from triplet loss. The con-
strained metric learning relies on two fixed anchors that guide the feature generation in
the correct direction. The feature vector $\Phi(I)$ (see Section ??) embeds an image $I$ in a
$d$-dimensional Euclidean space. Let us define $\Phi(I_i^{R,S})$ as the reference point, where $I_i^{R,S}$
is the unique and clean mugshot for identity $i$, with $i \in \{1, 2, \ldots, C\}$ (considering there
are $C$ classes). We aim to ensure that the features of all generated images of any given
class $i$ are as close as possible to $\Phi(I_i^{R,S})$. These pairs are called similar
pairs. Let us denote a generated similar counterpart as $G_{\theta_G}(I_i^G)$, where $I_i^G$ is the input
image containing artifacts. The generated similar counterpart must simultaneously
be as far as possible from the mugshot features of every class other than $i$; these are called
dissimilar pairs. Let us denote a dissimilar reference as $\Phi(I_j^{R,D})$, where $I_j^{R,D}$ is the
unique and clean mugshot for identity $j$, with $j \in \{1, 2, \ldots, C\}$ and $i \neq j$. The distance
considered here is the cosine distance, and the distance between similar pairs $D_{sim}$ can
be formulated as
$$D(I_i^{R,S}, I_i^G, G_{\theta_G}) = 1 - \frac{\sum_{k=1}^{d} \Phi(I_i^{R,S})_k \, \Phi\{G_{\theta_G}(I_i^G)\}_k}{\sqrt{\sum_{k=1}^{d} \Phi^2(I_i^{R,S})_k} \sqrt{\sum_{k=1}^{d} \Phi^2\{G_{\theta_G}(I_i^G)\}_k}}, \tag{5.2}$$
where $n_i$ is the number of images in the $i$th class. Similarly, the distance between dis-
similar pairs $D_{dsim}$ can be formulated as
$$D(I_j^{R,D}, I_i^G, G_{\theta_G}) = 1 - \frac{\sum_{k=1}^{d} \Phi(I_j^{R,D})_k \, \Phi\{G_{\theta_G}(I_i^G)\}_k}{\sqrt{\sum_{k=1}^{d} \Phi^2(I_j^{R,D})_k} \sqrt{\sum_{k=1}^{d} \Phi^2\{G_{\theta_G}(I_i^G)\}_k}}. \tag{5.3}$$
The feature $\Phi\{G_{\theta_G}(I_i^G)\}$ is thus constrained in space by the similar and dissimilar
pairs, and the loss $L_{MET}$ is formed from these two distances.
Since one dissimilar counterpart is randomly selected for each similar counterpart,
the total number of triplets is $\sum_{i=1}^{C} \sum_{j=1}^{N_i} 1$, which is equal to the total number of training
images, where $N_i$ is the number of images in class $i$.
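The cosine distance shared by Eqs. (5.2) and (5.3) can be sketched in numpy; the feature extractor Phi is abstracted away here, so the inputs are assumed to already be the d-dimensional feature vectors.

```python
import numpy as np

# A numpy sketch of the cosine distance used in Eqs. (5.2) and (5.3); the
# feature extractor Phi is abstracted away, so the inputs are assumed to be
# the d-dimensional feature vectors themselves.
def cosine_distance(u, v):
    """1 - cosine similarity, for one similar or dissimilar pair."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

The metric-learning objective then pushes this distance toward 0 for similar pairs and toward larger values for dissimilar pairs.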
As the preservation of identity is important for the network, the objective function needs
to evaluate the identity vector of the generated images with respect to the ground truth
image. If the ground truth image is denoted by $I_i^{GT}$, the identity feature loss is
computed between $\Phi(I_i^{GT})$ and $\Phi\{G_{\theta_G}(I_i^G)\}$.
Figure 5.3: The proposed metric learning is different from triplet loss. It has two fixed
anchors, minimising the cosine distance between similar pairs and maximising the
cosine distance between dissimilar pairs. Images are taken from the SCFace dataset for the
purpose of explanation. The model was not trained using the SCFace dataset. Best viewed
in color.
Since the images are corrupted with artifacts, it is important for the network
to learn to reconstruct the corrupted images and drive the reconstruction towards the
natural image manifold. This can be done by introducing a pixel-wise identity loss in the
cost function, which is the Euclidean norm of the difference between the generated image
and the ground truth image (and, likewise, the mugshot):

$$L_{px} = \left\| I_i^{GT} - G_{\theta_G}(I_i^G) \right\|_2 + \left\| I_i^{R,S} - G_{\theta_G}(I_i^G) \right\|_2 \tag{5.8}$$
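The pixel-wise loss can be sketched directly in numpy; the images are assumed here to be aligned arrays of identical shape.

```python
import numpy as np

# A sketch of the pixel-wise loss in Eq. (5.8): Euclidean norms of the
# differences between the generated image and the ground truth / mugshot
# (assumed here to be aligned arrays of identical shape).
def pixel_loss(generated, ground_truth, mugshot):
    return (np.linalg.norm(ground_truth - generated)
            + np.linalg.norm(mugshot - generated))
```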
Perceptual Loss
To ensure a high-quality output image, we choose a feature loss based on the ReLU
activation layers of the pretrained 19-layer VGG network described in Simonyan and
Zisserman [9]. This loss was referred to as VGG loss by Ledig et al. [3]:

$$L_{VGG_{m,n}}(I_i^{GT}, G_{\theta_G}(I_i^G)) = \frac{1}{W_{m,n} H_{m,n}} \sum_{x=1}^{W_{m,n}} \sum_{y=1}^{H_{m,n}} \left( \phi_{m,n}(I_i^{GT})_{x,y} - \phi_{m,n}(G_{\theta_G}(I_i^G))_{x,y} \right)^2 \tag{5.9}$$

where $\phi_{m,n}$ is the feature map obtained by the $n$th convolution (after activation) before
the $m$th max-pooling layer within the pretrained VGG19 network, and $W_{m,n}$ and $H_{m,n}$ repre-
sent the width and height of the feature map, respectively.
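Eq. (5.9) can be sketched as below; `phi` stands in for the VGG19 feature map $\phi_{m,n}$, and any callable returning a 2-D feature array can be substituted for the pretrained network, which keeps the sketch self-contained.

```python
import numpy as np

# A sketch of Eq. (5.9). `phi` stands in for the VGG19 feature map
# phi_{m,n}; any callable returning a 2-D feature array can be substituted
# for the pretrained network.
def vgg_loss(phi, ground_truth, generated):
    f_gt, f_gen = phi(ground_truth), phi(generated)
    h, w = f_gt.shape
    return np.sum((f_gt - f_gen) ** 2) / (w * h)
```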
Adversarial Loss
For training the discriminator, we use a binary cross-entropy loss, which predicts whether the
image is a real one or a generated one. A sigmoid function is applied to the output of
the final layer of the discriminator, and the resulting output is used to calculate the loss,
which is represented by

$$L_{adv} = -\left( t_1 \log(s_1) + (1 - t_1) \log(1 - s_1) \right)$$

where $t_1$ and $s_1$ are the ground truth and the generated image score for class 1.
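The standard binary cross-entropy on the sigmoid score can be sketched as:

```python
import numpy as np

# A sketch of the discriminator's binary cross-entropy loss: t is the
# ground-truth label and s the sigmoid output of the final layer. The eps
# term is a numerical-stability assumption, not from the thesis.
def bce(t, s, eps=1e-12):
    return -(t * np.log(s + eps) + (1.0 - t) * np.log(1.0 - s + eps))
```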
Final Loss
We formulate the final loss as the weighted sum of all described cost function compo-
nents. To enhance the prior information, we also minimise the errors after the coarse
network. Let the coarse network be denoted by $G^C_{\theta_G}$. The pixel-wise feature loss is cal-
culated for the output of the coarse network, and the angular metric learning on the
features of the coarse image is also carried out at this stage. This ensures a better coarse
image, which in turn helps the final image to be enhanced further. The cost function for
the coarse image is formulated accordingly, where $G^C_{\theta_G}$ is the coarse network and $I_i^{G,C}$
is the coarse image obtained from the coarse network.
At the output of the network, all losses are combined to form the final loss function.
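The weighted combination can be sketched schematically; the loss names and weight values below are illustrative assumptions, not the ones used in the thesis.

```python
# A schematic of the final objective as a weighted sum of the loss
# components; keys and weight values are illustrative assumptions.
def final_loss(components, weights):
    """components and weights are dicts keyed by loss name."""
    return sum(weights[k] * components[k] for k in components)
```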
Figure 5.4: Change of the value of different loss function components with iterations
during training, including (e) $L_{MET}$ at the coarse network and (f) the identity feature
loss at the coarse network.
5.3 Experiments
Table 5.1: Cosine similarity score w.r.t. the gallery for input images and enhanced
images for positive pairs; the score for a random negative pair is also shown for reference.
Columns: gallery, positive pair, enhanced positive pair, negative pair, enhanced negative
pair. The enhanced images are obtained from the network trained using our proposed loss
function. Shown images are taken from the SCFace dataset. Best viewed in color.
Figure 5.5: A qualitative comparison of the image-level recovery compared to artifact
and noise removal algorithms. Since our algorithm is not targeted towards perceptual
recovery of the image, our output is not very pleasing to the eye. Best viewed in color.
Figure 5.6: ROC for ArcFace on SCFace dataset. XYZ→ Φ denotes that the image was
enhanced using algorithm XYZ and then recognition performance was calculated using
algorithm Φ (ArcFace). There is an improvement of 2.55% for ArcFace [2] for the AUC
after feature restoration. Best viewed in color.
Table 5.2: This table is a supplement to Table 5.1. The rows in the first column are the
corresponding IDs. The second column is the score enhancement for the positive pair
(Col. 2 and Col. 3 in Table 5.1). The third column is the positive-negative pair margin
for non-enhanced images (Col. 2 and Col. 4 in Table 5.1), and the fourth column is the
positive-negative pair margin for enhanced images (Col. 3 and Col. 5 in Table 5.1).
Table 5.3: Fréchet Inception Distance and Inception Scores for different algorithms for
the SCFace dataset. As expected, our algorithm has very high FID and low IS since it is
not targeted to enhance perceptual quality.
Quantitative Results
Figure 5.9 shows an important result of our method. The green line shows the ROC for
the enhanced images and the blue line shows the ROC for the original images. The AUC
of the green curve is 4.2% higher than that of the blue curve, which implies a better separation
between different identities and thus better face recognition performance. Our method
boosts the performance of ArcFace [2] significantly, and its ROC is shown in Figure 5.6.
Table 5.1 shows the cosine similarity scores for input images and enhanced images with
respect to the gallery images for both positive pairs and negative pairs. From Table
5.2, we can observe that the scores have increased for the positive pairs, which implies
better matching. At the same time, we can see that the margin for the negative pairs
has also increased after enhancement, which implies better separation. Table 5.6 shows
Components              TAR%@FAR
Dis   L_MET   10%      1%       0.1%     0.01%    0.001%
 ✗     ✓      82.62    61.97    46.36    32.90    23.39
 ✓     ✗      84.62    63.95    48.74    31.92    19.47
 ✓     ✓      88.02    67.90    51.01    39.19    33.35

Table 5.4: True Accept Rate (TAR) at different False Accept Rates (FAR) with different
settings for the network and loss functions on the SCFace dataset. The last row shows the
results of the complete framework.
Table 5.5: Detection rates on the SCFace dataset for different detection algorithms after
applying different image enhancement algorithms.
Algorithm       TAR%@FAR
                5%       1%       0.1%     0.01%    0.001%
ARCNN [7]       77.39    65.25    50.62    38.39    29.72
IRCNN [8]       75.78    61.29    47.41    33.64    21.15
IEGAN [45]      76.82    63.09    48.82    34.71    20.21
Ours            82.06    67.90    51.01    39.19    33.35
Table 5.6: Comparison of the recognition performance of our method with the state-of-
the-art artifact and noise removal algorithms for SCFace Dataset.
Figure 5.7: ROC for CosFace on SCFace dataset. XYZ→ Φ denotes that the image was
enhanced using algorithm XYZ and then recognition performance was calculated using
algorithm Φ (CosFace). There is an improvement of 1.95% for CosFace [77] for the
AUC after feature restoration. Best viewed in color.
the performance comparisons with different state-of-the-art artifact removal algorithms. Thus, the
identity features of the faces at a low level are important for superior performance on
recognition problems.
Face Detection: Table 5.5 shows the detection performance for various algorithms,
which improves by a substantial margin when the images are enhanced using our method,
especially for SSD+MobileNetV2 [29, 104], where there is an improvement of 27.09%.
Note that images enhanced using perceptual image enhancement methods [7, 8, 45] per-
form worse with detection algorithms, as these methods do not operate on facial features but on
perceptual quality, which does not help for this purpose. This highlights the importance of
facial feature enhancement.
Face Recognition: Table 5.6 shows the recognition performance after image enhance-
ment using different algorithms. The numbers in bold denote the best result. We can
see in Figure 5.8 that the area under the ROC increases using our method, which im-
plies a better separation between different identities. This leads to a significant boost
of face recognition performance for state-of-the-art face recognition algorithms. Thus,
the identity features of the faces at a low level are essential for superior performance on
Figure 5.8: ROC for FaceNet on SCFace dataset. XYZ→ Φ denotes that the image was
enhanced using algorithm XYZ and then recognition performance was calculated using
algorithm Φ (FaceNet). There is an improvement of 0.9% for FaceNet [65] for the AUC
after feature restoration. Best viewed in color.
Qualitative Results
It is to be noted, though, that this method does not guarantee a good visual or percep-
tual quality of the image to human eyes. Most artifact and noise removal algorithms try
to achieve good perceptual quality, and that is evident from Table 5.3. Figure 5.5 shows
the qualitative comparison between algorithms trying to achieve better perceptual qual-
ity and our algorithm, which achieves better recognition performance.
5.4 Conclusion
We presented a deep generative adversarial network that sets a new state-of-the-art for
face recognition enhancement for low-quality images by restoring their facial features.
We have highlighted some limitations of the existing face recognition algorithms, which
mainly involve the accuracy drop on degraded images. We have tried to overcome
Figure 5.9: A closer look at the ROC curve for different enhancement algorithms on
SCFace dataset. Best viewed in color.
those limitations by augmenting the input images to increase the separation among dif-
ferent classes. We have used a GAN to recreate the facial images in a way that
enhances the discriminative features of the facial image. In particular, we have explored
the possibility of improving face recognition performance on those images for which
the current state-of-the-art fails to deliver good results. We have formulated a loss func-
tion involving metric learning to learn the important discriminative features of the face.
Our proposed loss function works best on real-world images which are inherently of
low quality, such as those in SCFace. The use of metric learning during training helps to
generate facial images preserving their identity. This is more essential for improving face
recognition performance than good perceptual quality, and our proposed algo-
rithm successfully demonstrates that. It is also worth noting that this method can be
used in conjunction with any existing face recognition algorithm.
Chapter 6
6.1 Introduction
Face detection and recognition from low-quality and occluded images are challenging
problems. A viable solution for these kinds of problems is in high demand in real-
world applications, especially in the field of security. Occluded faces reduce the perfor-
mance of face recognition algorithms because many facial details, which are crucial for
recognising faces with high confidence, can be absent from the images. Good de-
tection and recognition performance is highly desirable in these scenarios, and we have
explored this possibility in this chapter.
For solving the problem of detecting/recognising faces with high accuracy, differ-
ent approaches have been proposed [1], [2], [65], [72], [68], [32]. Most recognition
algorithms attempt to minimise the distance between similar identities and maximise
the distance between dissimilar identities, thus maximising the overall face recognition
performance. These algorithms perform very well on clean facial images, but the per-
formance decreases on degraded images. We have shown in the last chapter that we can
Figure 6.1: Our proposed algorithm takes multiple occluded images and generates a
frontal face as output while preserving the identity. Best viewed in color.
However, our interest lies in restoring the information lost due to occlusion.
Occlusion results in a complete modification of information. Neighbouring pixels can
be used for interpolation to restore the information at a local level as long as the oc-
clusion is very small, but for larger occlusions, the restoration is very difficult. As we
have seen, most of the previous research was based on using the visible
information available. The main problem with that is a low confidence score for face
matching and a high number of false positives. There is also an important factor to be taken into con-
sideration: face recognition for occluded faces requires completely different algorithms
that are trained to work with occluded faces. It is highly likely that these algorithms
will be unable to perform well for the recognition of completely unoccluded faces. Our
aim is to fix the problem at the root, i.e., the occlusion. This can be considered more
like a preprocessing step, where we make sure that the face recognition algorithm
receives only high-quality unoccluded faces for evaluation.
Since most of the lost information cannot be recovered from a single image, using
multiple images is a possible option to restore the lost information. Here we are solving
the problem from a new direction. Instead of using limited available information, we
are making use of more information that can be gathered from other images. The basis
of this research approach has been motivated by information theory. We have shown
in Section 6.3.1 why the best solution cannot be obtained using a single image and why
we need multiple images, if available, for a better solution to this problem.
In this chapter, we propose a method which uses multiple images of the same iden-
tity with different poses and occlusions to synthesise a frontal face pose image of the
concerned identity. The main idea of this research is to collect useful information from
each image and incrementally add information to create the frontal face pose. The infor-
mation lost in a certain part of an image due to occlusion can be retrieved from another
image if that part remains unoccluded. We achieve this by using a generative adversar-
ial network, which takes multiple inputs and gives a single image at the output of the
generator. Our network gathers useful information from each of those input images and
incrementally adds the information to get a final output which resembles the frontal face
of the person.
To sum up, the main contributions are as follows.
• To the best of our knowledge, we are the first to propose a GAN based solution to
restore the identity by reconstructing the occluded area. Our method synthesises
the frontal face pose by extraction of useful information from multiple occluded
images.
• We have shown the working principles, behaviour, and limitations of our proposal
by making use of information theory.
• We have verified and shown through images and experiments that the increased
number of input images improves the quality of the output.
6.2 Related Work
al. [81] have shown that doing this can degrade the performance since a lot of noise was
introduced in the recovery procedure. They used the information from the non-occluded
part of the face for recognition. Wen et al. [82] proposed a structured occlusion coding
to address the problem of recognition of occluded faces. For their solution, they have
assumed that occlusion is predictable, which in real-world scenarios might not be
true. This method also revolves around using the non-occluded information to recognise
the face correctly. Wan et al. offered a solution called MaskNet [83] which is a masking
based solution that masks the occluded area. It is trained to generate different feature
map masks for different occluded face images and automatically assign lower weights
to those that are activated by the occluded facial parts and higher weights to the hidden
units activated by the non-occluded facial parts. Song et al.'s [84] solution is similar,
being based on a mask learning strategy: it finds and discards corrupted feature ele-
ments for recognition. The latest research in this area is by Xiao et al. [85], which
is based on occlusion area detection and recovery. They used a robust principal component
analysis algorithm and a cluster-based saliency detection algorithm to achieve the
processing of face images with occlusion. These methods mostly consider synthetic oc-
clusion or other forms of occlusion like sunglasses or scarves. However, none of these
methods have considered a major form of occlusion, which is the nonfrontal face pose.
Since nonfrontal face pose is a major form of face occlusion, our proposed method
heavily relies on frontal face synthesis. Several methods have been proposed for frontal
face view synthesis to date. This problem is quite challenging due to its ill-posed na-
ture. There are several traditional methods available which address this problem by
statistical modelling [105] or by a 2D or 3D local texture warping [106], [107]. Huang
et al. [13] proposed a method for photorealistic frontal view synthesis from a single face.
They proposed a generative adversarial network called TPGAN [13] for photorealistic
frontal view synthesis by perceiving global structures and local details simultaneously.
The generator of the TPGAN consists of two branches, one of which deals with the
local consistency and the other deals with the global consistency of the facial structure.
Facial landmark detection and alignment play an important role in frontal face synthe-
sis. Several methods have been proposed in the literature, for example, the MTCNN [32]
detector proposed by Zhang et al., where a deep cascaded multitask framework
exploits the inherent correlation between detection and alignment to boost
their performance. S3FD [1] is another state-of-the-art face detection network proposed
by Zhang et al., which is a single-shot scale-invariant face detector. Bulat and Tzimiropoulos
proposed an algorithm called Face Alignment Network (FAN) [108], where
a simple approach is taken for face recognition which gradually integrates features from
different layers of a facial landmark localisation network into different layers of the
recognition network.
To improve the quality of our solution, we are using a generative adversarial net-
work. The discriminator is an essential part of our network and various design guide-
lines are found in the literature. Radford et al. [33] have proposed some effective dis-
criminator design guidelines for improved training of a GAN. Schönfeld et al. [109] pro-
posed a U-net based discriminator which evaluates the generator output from a global
perspective as well as a local perspective. This has been proved to be very effective in
driving the generator to create more accurate solutions.
6.3 Method
In this section, we will discuss in detail how we have designed the algorithms and the
loss functions. First, we will explain the basis and motivation of this research
direction from an information theory perspective. This will help us understand why
multiple images will help boost the recognition performance. Then, we will move on to
the network design of the generator and the discriminator, and then to the loss functions.
6.3.1 Hypothesis
We have an image $I^G$ which is either occluded or corrupted with artifacts and noise. If
$X_{I^G}$ is the set of all information $\{I_1^G, \ldots, I_n^G\}$ that $I^G$ could provide, and $P(I_i^G)$ is
the probability of $I_i^G \in X_{I^G}$, the entropy of our information source is

$$H = -\sum_i P(I_i^G) \log P(I_i^G).$$
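The source entropy defined above can be sketched numerically for a discrete distribution:

```python
import numpy as np

# A numeric sketch of the source entropy H defined above, for a discrete
# distribution P over the information set.
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # the 0 log 0 terms are taken as 0
    return -np.sum(p * np.log(p))
```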
Ideally, we are aiming to obtain $I^{R,S}$ for the best recognition performance, where $I^{R,S}$
is a hypothetical image which can provide the best recognition performance for the same
identity as $I^G$, also described as the mugshot as seen in the previous chapter. Let
$\{I_1^{R,S}, \ldots, I_n^{R,S}\}$ be the set of all information that $I^{R,S}$ provides; by definition, this is
the information we need for the best recognition of the identity in question. According to
information theory, the mutual information of $I^G$ relative to $I^{R,S}$ is

$$I(I^G; I^{R,S}) = E_{I^G, I^{R,S}}\left[ SI(I_i^G, I_i^{R,S}) \right] = \sum_i P(I_i^G, I_i^{R,S}) \log \frac{P(I_i^{R,S} \mid I_i^G)}{P(I_i^{R,S})}. \tag{6.1}$$
Here $I$ denotes the mutual information and $SI$ denotes the specific or pointwise mu-
tual information (PMI). PMI is a measure of association which refers to single events,
and mutual information is the expectation over all PMI. Theoretically, $I(I^G; I^{R,S}) \leq
I(I^{R,S}; I^{R,S})$, but for practical cases we can safely assume that $I(I^G; I^{R,S}) < I(I^{R,S}; I^{R,S})$.
Let us consider that $I(I^G; I^{R,S}) + \alpha = I(I^{R,S}; I^{R,S})$, where $\alpha$ is the set of useful informa-
tion that is contained in $I(I^{R,S}; I^{R,S})$ but missing from $I(I^G; I^{R,S})$. Our target is to design a
function that can recover all or most of the information contained in $\alpha$. For the reader's
convenience, we denote $I(I^{R,S}; I^{R,S})$ as $I(I^{R,S})$ and $I(I^G; I^{R,S})$ as $I(I^G)$ from now on.
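Eq. (6.1) can be sketched for a discrete joint distribution given as a 2-D array; the continuous case is analogous. The array layout is our assumption for illustration.

```python
import numpy as np

# A sketch of Eq. (6.1) computed from a discrete joint distribution
# P(I^G, I^RS) given as a 2-D array (rows: states of I^G, columns:
# states of I^RS).
def mutual_information(joint):
    joint = np.asarray(joint, dtype=float)
    p_g = joint.sum(axis=1, keepdims=True)   # marginal of I^G
    p_r = joint.sum(axis=0, keepdims=True)   # marginal of I^RS
    with np.errstate(divide='ignore', invalid='ignore'):
        ratio = (joint / p_g) / p_r          # P(I^RS | I^G) / P(I^RS)
        terms = np.where(joint > 0, joint * np.log(ratio), 0.0)
    return float(np.sum(terms))
```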
Since, in all real cases, information cannot be created from thin
air, for any given function $f$ we have $I(f(I^G)) < I(I^{R,S})$ if $I(I^G) < I(I^{R,S})$. Let there be a set
of $C$ different images belonging to the same identity. Considering $G$ to be a generator
function, $I(G(I^{G_j})) < I(I^{R,S})$ for each $j$, where $j$ is the index of the images in the set.
The set of mutual information $X$ that can be obtained from all the images in the set
is the union of the mutual information $I(G(I^{G_j}))$ over all $j$. Mathematically, this
can be expressed as

$$X = \bigcup_{j=1}^{C} I(G(I^{G_j})).$$
• Case 1: The first possibility is that the information contained in $X$ is less than the
information contained in the hypothetical image $I^{R,S}$. In that case, we have

$$\bigcup_{j=1}^{C} I(G(I^{G_j})) = I(I^{R,S}) - \alpha \tag{6.2}$$
• Case 2: The second possibility is that the information contained in $X$ is equal to
the information contained in the hypothetical image $I^{R,S}$. In that case, we have

$$\bigcup_{j=1}^{C} I(G(I^{G_j})) = I(I^{R,S}) \tag{6.3}$$
Either of the above cases is possible in a real-world scenario, which depends entirely
on the available input information, i.e., $\{I^{G_1}, \ldots, I^{G_C}\}$. It can be assumed
that if we have a sufficient number of input images from different times and angles, we
can gather sufficient information for each individual region of the face. This means the
probability of satisfying equation 6.3 is high. If it is difficult to obtain the information
of each individual region of the face from all available input images, as may happen
in some cases, we assume that equation 6.2 is satisfied. This assumption leads to the
following equation

$$I(G(I^{G_j})) < \bigcup_{j=1}^{C} I(G(I^{G_j})) \leq I(I^{R,S}). \tag{6.4}$$

Since our generator takes multiple images as inputs, we can modify equation 6.4 ac-
cordingly to get our final equation

$$I(G(I^{G_j})) < I(G(\{I^{G_1}, \ldots, I^{G_C}\})) \leq I(I^{R,S}). \tag{6.5}$$

We have carried out our research based on equation 6.5, and our objective is to find
the parameters $\theta$ of $G$ such that $\theta = \arg\max_\theta I(G(\{I^{G_1}, \ldots, I^{G_C}\}))$.
We aim to estimate a clean and occlusion-free facial image $f(I^G)$ from an image $I^G$,
in which much information is lost due to various artifacts and/or occlusions. The first
step is to crop and align the faces in the frame. This step is important to maintain the
consistency of the geometric coordinates of different parts of the face, like the eyes, nose,
and mouth. The next step is to get the available landmarks or coordinates of the eyes,
nose, and mouth. These two steps are the preprocessing part of the pipeline and are
accomplished using the Face Alignment Network [108] by Bulat and Tzimiropoulos.
Once the landmarks are obtained, we crop out four sections of the face, viz. the left eye, right
eye, nose, and mouth. The cropping is done with sufficient padding to make sure that
the whole eye, nose, or mouth is in the cropped part. Next, these parts are passed on
to a rotation network, which processes the local textures of the four main landmarks
so that their orientation matches the one from the frontal face pose. Each part is now
assembled at their proper location to get the frontal face pose from all input images.
In parallel to the rotation network, the whole input image is passed to another network
which processes the global structure of the face. These images are now passed to the
fusion network to get a single output image. Figure 6.2 shows the block diagram of the
complete pipeline.
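The cropping step above can be sketched with plain numpy, assuming 68-point FAN landmarks given as (x, y) pixel coordinates; the part index ranges follow the common iBUG 68-point convention, and the padding value is illustrative rather than taken from the thesis:

```python
import numpy as np

# iBUG 68-point landmark index ranges (the convention FAN uses).
PART_INDICES = {
    "left_eye": list(range(36, 42)),
    "right_eye": list(range(42, 48)),
    "nose": list(range(27, 36)),
    "mouth": list(range(48, 68)),
}

def crop_part(image, landmarks, part, pad=8):
    """Crop one facial part with enough padding that the whole eye,
    nose, or mouth lies inside the crop.

    image: (H, W, C) array; landmarks: (68, 2) array of (x, y) points.
    """
    pts = landmarks[PART_INDICES[part]]
    x0, y0 = np.floor(pts.min(axis=0)).astype(int) - pad
    x1, y1 = np.ceil(pts.max(axis=0)).astype(int) + pad
    h, w = image.shape[:2]
    x0, y0 = max(x0, 0), max(y0, 0)      # clamp to the image boundary
    x1, y1 = min(x1, w), min(y1, h)
    return image[y0:y1, x0:x1]
```

The same bounding-box-plus-padding rule is applied to all four parts before they are sent to the rotation network.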
6.3.2 Approach
In this section, we will describe each component of the whole image reconstruction
pipeline shown in Figure 6.2 and explain how each of those works. The first part of the
pipeline consists of an align and crop section. The main function of this block, as the
name suggests, is to align the face images horizontally and vertically. This ensures that
the eyes, nose, and the mouth are approximately in the same position in every image.
The irrelevant parts of the image are also cropped out to make sure that all images are
of the same scale. This is done with the help of a pretrained network. Thus, this step is
more of a preprocessing step.
Once the images are aligned and cropped, they are then sent to the main network for
Figure 6.3: Figure explaining the process of alignment, cropping and segmentation.
further processing. The main network consists of a generator and a discriminator. The
generator consists of two branches, viz. a global network and a local network. The local
network processes the image in specific areas. It first cuts out the two eyes, nose and
the mouth according to the landmark coordinates which are obtained during alignment
and cropping. Next, they are sent to the part rotation network for front pose generation.
The part rotation network consists of a UNet based architecture. The downsampling
layers consist of blocks of convolution, batchnorm, ReLU, and a residual block. The
upsampling layers consist of blocks of convolution, batchnorm, ReLU, residual block,
transposed convolution, batchnorm, and a final activation. At the output of the network,
there are 2 branches of the network. One gives the feature vector of the output image,
and the other branch is passed on to a convolution layer to give the final output im-
age. Figure 6.4 shows the architecture of the part rotation network. The parts are then
recombined to get the image of the full face.
The global network tries to rotate the face from a global perspective. Figure 6.5
shows the complete generator block. The global network is a UNet based structure
which works on different scales. We can see in the figure that there are several resize
blocks at the input which downscale the image so that the global network can work
on different scales. The outputs from the global and local branches are then passed on
to the fusion network. This part of the network combines all information together to
obtain the final image. This image is then sent to the discriminator for evaluation and
feedback at a local and global level. The details of the main network are discussed in
the following sections. To estimate the enhanced image for a given corrupted
image, we train a generator network as a feed-forward CNN GθG parameterised by θG .
Here θ_G = {W_1:N, b_1:N} denotes the weights and biases of an N-layer deep network and
is obtained by optimising a loss function L. The training is done using two sets of n
images {I_k^GT : k = 1, 2, ..., n} and {I_k^G : k = 1, 2, ..., n}, where I_k^GT and I_k^G
are corresponding pairs, by solving

    θ̂_G = arg min_{θ_G} (1/n) ∑_{k=1}^{n} L(I_k^GT, G_θG(I_k^G), I^R,S).   (6.6)
We drop the subscript k from here on for simplicity; images will always refer to corre-
sponding pairs unless otherwise stated.
6.3.3 Generator
Our generator architecture is inspired by the generator proposed by Huang et al. [13]. It
consists of two branches, one of which processes parts of the image, i.e., the two eyes,
nose and mouth, and another branch which processes the input image as a whole. The
outputs of these two branches are then concatenated together. What we do next is to use
all images in the batch and take the point-wise maximum along the batch axis. This is
where the information from multiple images is collected and the useful information is
fused together. They are then passed through some convolutional layers to get the final
output image. Details of the branches are described in the following sections.
The global network consists of an encoder and a decoder, which are connected through
a bottleneck which is a fully connected layer. The final layer of the encoder contains
a fully connected layer with 512 outputs. The next layer is a max-out function fmaxout
Figure 6.4: Network architecture of the facial landmarks part rotation network.
Here l is the fully connected layer immediately before the max-out layer. fmaxout (l) is
then fed directly into the decoder. Several layers from the encoder are fed to the decoder
through skip connections having residual blocks in between. Figure 6.5 shows the de-
tailed architecture of the global network. The output of the global generator network is
then fed to the fusion network.
The local generator network acts as a low-level transformation network whose job is to
synthesise detailed frontal-pose feature representations of the eyes, nose, and mouth. Figure
6.4 shows the detailed architecture of the local generator network. This network also
takes the form of an encoder-decoder structure, but it is a fully convolutional network.
Skip connections are present in the network, which facilitate direct information propa-
gation from the lower layers to the upper layers. There are 2 outputs for this branch.
One is the feature representation of the face part in question, and the other is the output
image itself. The penultimate layer gives the features, which is followed by a convo-
lution layer with 3 output channels giving the required image representation. The next
step is the assembly of the different parts in their respective positions. This
is done from the landmark information which we have already acquired earlier. Once
this is done, the result is then fed to the fusion network for further processing.
Fusion Network
The main function of the fusion network is to combine multiple output images into a
single frontal pose image. The outputs of the generator's global and local networks are fed
to the fusion network. This part of the generator consists of a residual block followed
by a convolution with batch normalisation and ReLU activation. This is followed by
a batch-wise max-out layer. This layer takes the maximum value of the corresponding
cell from all image features and returns the feature representation for a single frontal
pose image. This is then passed through another residual block followed by a couple of
convolution layers to get the final output.
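The batch-wise max-out described above reduces, in essence, to a point-wise maximum along the batch axis. An illustrative numpy sketch (the shapes and toy values are assumptions, not the thesis implementation):

```python
import numpy as np

def batchwise_maxout(features):
    """Fuse feature maps from N input images of the same identity.

    features: array whose first axis indexes the N input images.
    Returns one feature map holding, at every cell, the maximum
    response observed across the N images, so that information visible
    in any single input survives the fusion.
    """
    return features.max(axis=0)

# Two toy "feature maps": each input image contributes a different half.
a = np.array([[1.0, 0.0], [1.0, 0.0]])
b = np.array([[0.0, 2.0], [0.0, 2.0]])
fused = batchwise_maxout(np.stack([a, b]))   # [[1., 2.], [1., 2.]]
```

In the actual network this maximum is taken over high-dimensional feature maps, not raw pixels, before the final convolution layers.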
6.3.4 Discriminator
The discriminator is an essential part of a generative adversarial network which is in
charge of evaluating how well the generator is performing. Inspired by the architec-
tural guidelines set by Schönfeld et al. [109], we design our discriminator in the form
of a u-net with skip connections. The discriminator has an encoder which performs a
high-level evaluation of the image and a decoder which performs a low-level evalua-
tion. Outputs from the various layers of the encoder are directly fed to the decoder for
information propagation.
Encoder Part
The architecture of the encoder part of the discriminator is in line with the guideline set
by Huang et al. [13]. The encoder consists of 3 convolution blocks with leaky ReLU
activation and strides=2. The first convolution layer consists of 64 filters, and each
succeeding layer contains twice the number of filters of the preceding block. The filters
are of size 3 × 3. These layers are followed by a couple of residual blocks, followed by
another convolution block. The final layers consist of a convolution layer with 1 filter
of size 1 × 1 and stride=1. This gives an output of size 2 × 2. In parallel to this final
layer, there is another fully connected layer which is connected to the decoder.
Decoder Part
The architecture of the decoder part of the discriminator is the same as the decoder
shown in Figure 6.5.
Pixel-wise Loss
Since the information in the input images is corrupted, it is important to learn to re-
construct the occluded regions and drive the reconstruction towards the clean image
manifold. While learning this task, it is important to preserve the geometrical consis-
tency of the image both from a local and a global perspective. This is done by introducing
a pixel-wise loss L_px in the cost function, where

    L_px(G_θG) = ‖ I^GT − G_θG(I^G) ‖_2.   (6.7)
Symmetry Loss
Human faces are inherently quite symmetrical. This loss term makes sure that each side
of the face is approximately a mirror image of the other. It represents the pixel-wise
difference between the two longitudinal halves of the image. For an image with height
H and width W, this loss can be expressed as

    L_sym(G_θG) = (1/((W/2) × H)) ∑_{x=1}^{W/2} ∑_{y=1}^{H} ‖ G_θG(I^G)_{x,y} − G_θG(I^G)_{W−x+1,y} ‖_2,   (6.8)
where x, y are the pixel indexes of the output image along the row and column, respec-
tively.
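Equation 6.8 can be exercised in a few lines of numpy. The sketch below uses a mean absolute difference instead of the per-pixel ‖·‖_2 of the equation, purely for readability; the mirroring logic is the same:

```python
import numpy as np

def symmetry_loss(img):
    """Compare each pixel (x, y) in the left half of the image against
    its mirror pixel (W - x + 1, y) in the right half, in the spirit of
    the symmetry loss above. A perfectly symmetric face scores 0.
    """
    w = img.shape[1]
    left = img[:, : w // 2]
    right_mirrored = img[:, ::-1][:, : w // 2]   # right half, flipped
    return np.abs(left - right_mirrored).mean()

print(symmetry_loss(np.array([[1.0, 2.0, 2.0, 1.0]])))   # symmetric row -> 0.0
```

In training, this term only nudges the generator: real faces are not exactly symmetric, so it is weighted against the other loss counterparts.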
As the preservation of identity is important for the network, the objective function needs
to evaluate the identity vector of the generated images with respect to the ground truth
image. Following Equation 5.2, the identity feature loss LID (GθG ) is given by the dis-
tance between the vector representations of the generated face and the ground truth. If
we consider Φ to be a pretrained network which gives the vector representation of the
facial images, then the Euclidean distance will be given by
    L_ID(G_θG) = ‖ Φ(I^GT) − Φ(G_θG(I^G)) ‖_2.   (6.9)
Total variation regularisation can be used to remove some noise and unwanted spikes
from the image, which will lead to a more consistent image representation. Aly et
al. [110] have shown in their work that a variation regulariser of the functional form
    J_s(f) = ∫_W L(‖∇f(x)‖) dx
(refer to [110] for details) can be optimised to remove unwanted details while preserving
important details such as edges without introducing ringing or other artifacts. Using a
similar idea for digital images, the total variation can be given by
    L_TV(G_θG) = ∑_{i=1}^{x} ‖ Φ(G_θG(I^G))_{i,j} − Φ(G_θG(I^G))_{i+1,j} ‖
               + ∑_{j=1}^{y} ‖ Φ(G_θG(I^G))_{i,j} − Φ(G_θG(I^G))_{i,j+1} ‖,   (6.11)
where x, y are the pixel indexes along the row and column, respectively.
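Equation 6.11 applies the neighbour-difference penalty to feature maps Φ(·); the sketch below shows the plain image-domain total variation, which conveys the same mechanism:

```python
import numpy as np

def total_variation(img):
    """Anisotropic total variation: sum of absolute differences between
    vertically and horizontally adjacent pixels. Penalising it removes
    isolated noise spikes while leaving large consistent structures,
    such as edges, comparatively cheap.
    """
    dv = np.abs(img[1:, :] - img[:-1, :]).sum()   # vertical neighbours
    dh = np.abs(img[:, 1:] - img[:, :-1]).sum()   # horizontal neighbours
    return dv + dh
```

A single bright pixel in a dark image contributes to four neighbour differences, which is why isolated spikes are suppressed first.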
Adversarial Loss
The adversarial loss consists of the local loss and the global loss counterparts of the
discriminator. The details about the loss of the discriminator are discussed in detail in
Section 6.3.6. The global loss is given by the encoder part of the network and can be
expressed as
    L_glob(D_θE) = −(1/4) ∑_{p=1}^{2} ∑_{q=1}^{2} { C·log(D_θE(I)_{p,q}) + (1 − C)·log(1 − D_θE(I)_{p,q}) },   (6.12)
where D_θE is the encoder part of the discriminator network, C = 1 for I = I^GT and
C = 0 for I = G_θG(I^G). The local loss is given by the decoder part of the network and is
expressed as
    L_loc(D_θD) = −(1/(x·y)) ∑_{i=1}^{x} ∑_{j=1}^{y} { C·log(D_θD(I)_{i,j}) + (1 − C)·log(1 − D_θD(I)_{i,j}) },   (6.13)
where D_θD is the decoder part of the discriminator network, i, j are the pixel indexes
for row and column, respectively, and x, y are the numbers of rows and columns, respec-
tively. The adversarial loss reflects how far the output image is from the real one: the
closer the output image is to the real one, the smaller the adversarial loss. Using this
concept, we can replace C by 1, which gives us
    L_glob(D_θE) = −(1/4) ∑_{p=1}^{2} ∑_{q=1}^{2} log(D_θE(I)_{p,q}),   (6.14)

    L_loc(D_θD) = −(1/(x·y)) ∑_{i=1}^{x} ∑_{j=1}^{y} log(D_θD(I)_{i,j}),   (6.15)

and the final adversarial loss is the sum of the two losses, which gives us

    L_adv = −( (1/4) ∑_{p=1}^{2} ∑_{q=1}^{2} log(D_θE(I)_{p,q}) + (1/(x·y)) ∑_{i=1}^{x} ∑_{j=1}^{y} log(D_θD(I)_{i,j}) ).   (6.16)
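The global and local losses are both binary cross-entropy terms averaged over decision maps of different resolutions. A minimal numpy sketch; the map shapes and the assumption that both maps already hold sigmoid probabilities are illustrative:

```python
import numpy as np

def adversarial_losses(enc_out, dec_out, c):
    """Global (Eq. 6.12) and local (Eq. 6.13) discriminator losses.

    enc_out: (2, 2) encoder decision map, one probability per quadrant.
    dec_out: (H, W) decoder decision map, one probability per pixel.
    c: 1 for a ground-truth image, 0 for a generated one.
    """
    def bce(p):
        # Binary cross-entropy averaged over all cells of the map.
        return -(c * np.log(p) + (1 - c) * np.log(1 - p)).mean()
    return bce(enc_out), bce(dec_out)

# A maximally uncertain discriminator scores -log(0.5) on both maps.
enc = np.full((2, 2), 0.5)
dec = np.full((8, 8), 0.5)
l_glob, l_loc = adversarial_losses(enc, dec, c=1)
```

The only difference between the two terms is the resolution of the decision map being averaged: 2 × 2 quadrants for the encoder and one cell per pixel for the decoder.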
The global decision is obtained from the output generated by the encoder part of the
discriminator. Instead of using a single output in the global decision loss, we have
broken the decision into 4 regions, as described by Shrivastava et al. [111]. The
output of each region is evaluated separately to ensure a better global level evaluation by
the discriminator encoder. The global loss is given by the encoder part of the network
and can be expressed as
    L_glob(D_θE) = −(1/4) ∑_{p=1}^{2} ∑_{q=1}^{2} { C·log(D_θE(I)_{p,q}) + (1 − C)·log(1 − D_θE(I)_{p,q}) },   (6.17)
where DθE is the encoder part of the discriminator network, C = 1 for I = I GT and C = 0
for I = GθG (I G ).
The local decision is obtained from the output generated by the decoder part of the
discriminator. The local loss is given by the decoder part of the network and is expressed
as
    L_loc(D_θD) = −(1/(x·y)) ∑_{i=1}^{x} ∑_{j=1}^{y} { C·log(D_θD(I)_{i,j}) + (1 − C)·log(1 − D_θD(I)_{i,j}) },   (6.18)

where D_θD is the decoder part of the discriminator network, i, j are the pixel indexes for
row and column, respectively, and x, y are the numbers of rows and columns, respec-
tively. The local decision loss is
evaluated on a per-pixel basis and is more effective when the images are augmented
using the CutMix [112] technique which is described in the following subsection.
Consistency Regularisation
Unlike what Yun et al. proposed in [112], and Schönfeld et al. performed in [109],
we use the real and the fake image for the augmentation strategy. For each real image,
a rectangular or square patch is replaced with a fake image patch corresponding to the
same region and from the same class. This is explained in Figure 6.6. For the 128 × 128
real image in the top row, we have a rectangular patch of 85 × 60, which is replaced by
a fake image. The black part represents a fake patch and the white part represents the
real part of the image. The encoder output ideally produces

    [ 0  0 ]                       [ 1  1 ]
    [ 0  0 ]  for fake images and  [ 1  1 ]  for real images.
Figure 6.6: Supplementary image for CutMix augmentation [112] with real and fake
images, as explained in Section 6.3.6.
The red line shown in Figure 6.6 divides the whole image into 4 equal parts. Therefore,
in this case, we are looking for an encoder output value of

    [ 1   1 − ((85 − 64) × 60)/(64 × 64) ]     [ 1  0.69 ]
    [ 1   1 − (64 × 60)/(64 × 64)        ]  =  [ 1  0.06 ]
Similarly, for the image in the second row, we have a fake image of 128 × 128, which
contains a rectangular patch of 65 × 80 and ideally, this will give an encoder output of
" # " #
1 ((80 − 64) × 60)/(64 × 64) 1 0.23
=
(1 × 64)/(64 × 64) (1 × (80 − 64))/(64 × 64) 0.016 0.004
For the decoder output, pixels in the fake part will ideally have a value of 0, and pixels
in the real part a value of 1. Accordingly, we can substitute the value of C in equations
6.17 and 6.18.
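These encoder targets can be reproduced mechanically: split the 128 × 128 real/fake mask into four 64 × 64 quadrants and take the real-pixel fraction of each. A numpy sketch, with the patch position chosen (an assumption) so that the numbers of the first worked example emerge:

```python
import numpy as np

def quadrant_targets(real_mask):
    """Per-quadrant real-pixel fraction of a 128x128 CutMix mask
    (1 = real pixel, 0 = fake pixel). The resulting 2x2 array is the
    ideal encoder output used as the target C in Eqs. 6.17 and 6.18.
    """
    quads = real_mask.reshape(2, 64, 2, 64)   # 2x2 grid of 64x64 quadrants
    return quads.mean(axis=(1, 3))

# Real image whose right-hand side carries an 85 x 60 fake patch.
mask = np.ones((128, 128))
mask[43:128, 64:124] = 0                      # 85 rows x 60 columns of fake pixels
targets = quadrant_targets(mask)              # approx. [[1.0, 0.69], [1.0, 0.06]]
```

The same routine, with the roles of real and fake swapped, yields the targets of the second example.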
6.4 Experiments
Our research is based on solving the problem of occlusion without any (or minimum)
constraints. For example, we do not make a prior guess that the occlusion will be limited
to sunglasses or scarves or hats. In the real world, occlusion can appear in several forms,
ranging from a part of the face being hidden behind an object or a part of the body, to
various poses where only one side of the face is visible. We also consider the fact
that we do not have any prior information about the faces being evaluated. The input
images can be individual frames from a video or separate images from different cameras
and viewing angles. The main motivation of this research is to explore the possibility
of reconstructing the lost information in occluded images and still make sure that the
reconstructed face will have a high recognition accuracy when evaluated with different
face recognition networks. Our solution has been designed keeping these in mind and
this will be reflected in our experimental setup.
To validate the effectiveness of our algorithms, we have conducted extensive exper-
iments on relevant datasets. We have also conducted some ablation studies and showed
how each part of the network is effective in attaining the final results.
Algorithm      Rank   Avg. input accuracy   Output accuracy    Output accuracy
                      of all images         (discriminator     (discriminator
                                            without U-Net)     with U-Net)
FaceNet [65]     1          0.384                0.391              0.491
                 3          0.532                0.569              0.657
                 5          0.602                0.661              0.729
Anv v6           1          0.764                0.734              0.762
                 3          0.839                0.856              0.873
                 5          0.865                0.894              0.908
ArcFace [2]      1          0.883                0.769              0.765
                 3          0.908                0.873              0.869
                 5          0.916                0.901              0.902
Table 6.1: Recognition performance for different face recognition networks for our pro-
posed algorithm. This table shows the results for 15 input images evaluated on the
WebFace dataset.
kept untouched to add variation to the training dataset. We trained the network on an
NVIDIA DGX-Station. Each training image is 128 × 128 pixels and was trained with
a batch size of 24. For optimisation, we use Stochastic Gradient Descent and a learn-
ing rate of 10^−4. We have carried out our tests on subsets of the WebFace [101] and
CelebA [113] datasets, two public face-image benchmarks. Both the WebFace and the
CelebA subset consist of 7,500 images of 150 unique identities. Our im-
plementation is based on Tensorflow. For performance evaluation, we have compared
our method with FaceNet [65], ArcFace [2], CosFace [77] and a private network called
Anv v6 which was developed by Anyvision.
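Rank-k recognition accuracy, as reported in the tables of this section, can be computed directly from embedding distances. An illustrative numpy sketch using Euclidean distance (FaceNet-style; an ArcFace-style pipeline would rank by angular distance instead); the function and variable names are hypothetical:

```python
import numpy as np

def rank_k_accuracy(probe_emb, probe_ids, gallery_emb, gallery_ids, k):
    """Fraction of probes whose true identity appears among the k
    nearest gallery entries under Euclidean distance."""
    gallery_ids = np.asarray(gallery_ids)
    hits = 0
    for emb, true_id in zip(probe_emb, probe_ids):
        d = np.linalg.norm(gallery_emb - emb, axis=1)   # distance to gallery
        hits += true_id in gallery_ids[np.argsort(d)[:k]]
    return hits / len(probe_ids)

# Toy gallery of three identities and one probe of identity "b".
gallery = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]])
ids = ["a", "b", "c"]
probes, probe_ids = np.array([[9.0, 9.0]]), ["b"]
print(rank_k_accuracy(probes, probe_ids, gallery, ids, k=1))   # 1.0
```

Rank-3 and rank-5 accuracy simply widen the candidate list, which is why they are never lower than rank-1 accuracy for the same embeddings.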
Algorithm      Rank   Avg. input accuracy   Output accuracy    Output accuracy
                      of all images         (discriminator     (discriminator
                                            without U-Net)     with U-Net)
FaceNet [65]     1          0.384                0.424              0.514
                 3          0.532                0.617              0.705
                 5          0.602                0.709              0.774
Anv v6           1          0.764                0.790              0.807
                 3          0.839                0.884              0.902
                 5          0.865                0.927              0.932
ArcFace [2]      1          0.883                0.811              0.817
                 3          0.908                0.894              0.899
                 5          0.916                0.921              0.929
Table 6.2: Recognition performance for different face recognition networks for our pro-
posed algorithm. This table shows the results for 25 input images evaluated on the
WebFace dataset.
Algorithm      Rank   Avg. input accuracy   Output accuracy    Output accuracy
                      of all images         (discriminator     (discriminator
                                            without U-Net)     with U-Net)
FaceNet [65]     1          0.384                0.433              0.541
                 3          0.532                0.624              0.713
                 5          0.602                0.710              0.789
Anv v6           1          0.764                0.787              0.821
                 3          0.839                0.901              0.905
                 5          0.865                0.932              0.936
ArcFace [2]      1          0.883                0.837              0.830
                 3          0.908                0.920              0.905
                 5          0.916                0.937              0.935
Table 6.3: Recognition performance for different face recognition networks for our pro-
posed algorithm. This table shows the results for 35 input images evaluated on the
WebFace dataset.
Figure 6.7: Visual example of how the output image improves with the number of input
images.
between faces. For example, FaceNet uses the Euclidean distance for calculating the
similarity between two faces, while ArcFace and Anv v6 use the angular distance. This
ensures that our method is robust to various face recognition algorithms. As expected,
we see an increase of accuracy with the increasing number of images until it saturates at
the point where all information has been gathered. Figure 6.7 shows how the output im-
age improves with increasing number of input images. For the second test, we analysed
the effect of the discriminator with skip connections on the performance of the network.
In Tables 6.1, 6.2 and 6.3, we can see that our network improves the recognition rate com-
pared to the average accuracy of the input images. When the network is trained with the
discriminator with skip connections (U-net discriminator), the performance increases in
most of the cases. Still, we have managed to outperform ArcFace for rank-3 and rank-5
accuracy when the input is 35 images, and achieve a close match in most other cases.
[Table 6.4 lists, for each face recognition algorithm and for recognition ranks 1, 3, and 5,
the average input accuracy of all images (50 images) and the output-image accuracy for
n = 15, 25, 35, and 45 input images, on the WebFace and CelebA datasets.]
Table 6.4: Rank 1, 3 and 5 accuracy evaluated on different face recognition networks for
our output images. Performance is evaluated using the WebFace and CelebA datasets.
Figure 6.8: Change of recognition accuracy with the number of input images.
We have validated this using the FaceNet algorithm [65], ArcFace algorithm [2] and
a private network called Anv v6 algorithm which is developed by a company called
Anyvision. We have validated this for various numbers of input images, i.e., for 15,
25, 35, and 45 images. Table 6.4 shows the results for FaceNet [65], Anv v6 network,
ArcFace [2] and CosFace [77] network. It is to be noted that ArcFace was trained on
112 × 112 images and we used the original trained model for our experiments. To keep
things as fair as possible, we resized our images to 112 × 112, although originally the
images and our network were prepared for evaluation on 128 × 128. Even though this
is a disadvantage for us, our images compete well with the performance of ArcFace.
The second point is that the accuracy must increase with the number of input im-
ages. With every input image, some incremental information is being added to the final
image. However, after a certain point when all information has been collected, addi-
tional images will only provide duplicate information. Thus, after a certain number of
input images, the accuracy should tend to remain constant irrespective of the number
of further additions of images. This should also be true for a variety of networks and
we have validated this using the FaceNet algorithm [65], ArcFace algorithm [2] and the
Anv v6 algorithm. Figure 6.8 shows the results for Facenet [65], Anv v6, CosFace [77]
and ArcFace [2] respectively. The figures show the graph for the change of accuracy
with respect to the number of input images. The number of input images has been var-
ied consecutively from 5 to 50. From the graphs, we can see that the accuracy gradually
increases with respect to the number of input images. However, the accuracy starts to
saturate when the number of input images is around 35. Adding further input images
does not increase the recognition accuracy, because any further input image provides
the same information that earlier input images have already contributed. This duplicate
information therefore does not help in increasing the accuracy further.
6.5 Conclusion
In this chapter, we have presented a generative adversarial network which is able to
extract relevant information from multiple images of the same identity and combine
them to synthesise a single image which is a frontal pose representation of the identity
in question. We have also shown that the image generated provides better recognition
performance compared to the average of all the input images. We have validated our
proposal by extensive experiments and have proved that information is actually gradu-
ally added with each image up to a certain point after which all information received
from further images is duplicated information which have already been processed be-
fore. This algorithm can have useful application in areas where we can obtain occluded
faces from various angles but a single face cannot provide enough information for recog-
nition with high confidence. The proposed method can also be used in other fields of
application which we will be discussing in chapter 7.
Chapter 7
Generative adversarial networks are one of the most sophisticated approaches to genera-
tive modelling using deep learning. They are quite difficult to train, but at the same time
can be quite useful for image synthesis. In chapter 3 we have described a deep genera-
tive adversarial network with skip connections that sets a new state of the art on public
benchmark datasets when evaluated with respect to perceptual quality. This network is
the first framework, which successfully recovers images from artifacts and at the same
time super-resolves, thus having a single-shot operation performing two different tasks.
Using different combinations of loss functions and by using the discriminator both in
feature and pixel space, we confirm that IEGAN reconstructions for corrupted images
are superior by a considerable margin and more photorealistic than the reconstructions
obtained by the current state-of-the-art methods. Although this method was quite effi-
cient for artifact removal, it was not as good for noise removal purposes or simultaneous
noise and artifact removal. In chapter 4 we have described a deep generative adversar-
ial network which is the first framework to successfully recover images from artifacts or
noise, or both combined and at the same time super-resolves, thus having a single-shot
operation performing different tasks simultaneously.
Moving this idea forward, we used this process to clean facial images aiming for a
better recognition performance. Unfortunately, it resulted in worse recognition perfor-
mance. This helped us better understand the nature of GANs and human perception in
general. What humans perceive as clean might often correspond to a loss of informa-
tion from an information-theoretic perspective. To deal with this issue, we present another
deep generative adversarial network in chapter 5 that sets a new state-of-the-art for face
recognition enhancement for low-quality images by restoring their facial features. We
have highlighted some limitations of the existing face recognition algorithms and we
have tried to overcome those limitations by augmenting the input images to increase
the separation between different classes. In particular, we have explored the possibil-
ity of improving face recognition performance on those images for which the current
state-of-the-art fails to deliver good results. Our proposed loss function works best on
real-world images that are inherently low quality, such as those in SCFace. We use metric
learning to train the network to generate facial images preserving their identity, which
is more important for improving face recognition performance compared to good per-
ceptual quality, and our proposed algorithm successfully demonstrates that. Looking
at this problem from a different angle, it can be said that noise or artifacts are a kind
of micro-level occlusion where some information is hidden or distorted. This
motivated us to deal with occlusion in general. We often come across situations where
a facial image is occluded due to the presence of various objects including, but not lim-
ited to, sunglasses, hats, scarves, masks, etc. Since these kinds of occlusions remove the
entire information from the occluded areas, and they cover a wide area, we cannot use
neighbouring information to reconstruct those areas. For example, if we encounter a
face with sunglasses, it might be possible to reconstruct the eyes, but it will be impossi-
ble to reconstruct the same eyes belonging to that person, unless it happens by chance.
Here comes the idea of using multiple images to accumulate information and create a
single image for recognition.
In chapter 6 we have presented a generative adversarial network which is able to
extract relevant information from multiple images of the same identity and combine
them to synthesise a single image which is a frontal pose representation of the identity
in question. We have also shown that the image generated provides better recognition
performance compared to the average of all input images.
We have shown in our experiments that our method works successfully for different
face recognition networks. We have also shown that novel information can boost the
recognition accuracy and duplicate information is not helpful, which aligns with our
proposed theory. This algorithm can have useful applications in areas where we can
obtain occluded faces from various angles, but a single face cannot provide enough
information for recognition with high confidence. The proposed method can also be
used in other fields of application which we will be discussing in chapter 7.
7.1 Future Work and Other Areas of Application
numbers often can be occluded by other traffic. Temporal superresolution can help in
this scenario by gathering temporal data and incrementally aggregating information to
synthesise the final complete license plate.
Appendix A
Author’s Publications
1. Soumya Shubhra Ghosh, Yang Hua, Sankha Subhra Mukherjee, and Neil Robert-
son, “IEGAN: Multi-purpose Perceptual Quality Image Enhancement Using Gen-
erative Adversarial Network,” in WACV, Hawaii, USA, Jan. 2019.
2. Soumya Shubhra Ghosh, Yang Hua, Sankha Subhra Mukherjee, and Neil Robert-
son, “Improving Detection and Recognition of Degraded Faces by Discriminative
Feature Restoration Using GAN,” accepted in ICIP, Abu Dhabi, UAE, Oct. 2020.
3. Soumya Shubhra Ghosh, Yang Hua, Sankha Subhra Mukherjee, and Neil Robert-
son, “All in One: A GAN based Versatile Image Enhancement Framework using
Edge Information,” IEEE Transactions on Image Processing, 2020, submitted.
References
[1] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, “S3fd: Single shot
scale-invariant face detector,” in ICCV, 2017.
[2] J. Deng, J. Guo, X. Niannan, and S. Zafeiriou, “Arcface: Additive angular margin
loss for deep face recognition,” in CVPR, 2019.
[4] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep
convolutional networks,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, pp. 295–307, 2016.
[5] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” in CVPR, 2018.
[6] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in
ICCV, 1998.
[8] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for
image restoration,” in CVPR, 2017, pp. 3929–3938.
[9] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-
scale image recognition,” in ICLR, 2015.
[10] H. R. Sheikh, Z. Wang, L. Cormack, and A. C. Bovik, “Live image quality as-
sessment database release 2,” 2005.
[12] M. Grgic, K. Delac, and S. Grgic, “SCface - surveillance cameras face database,”
Multimedia Tools and Applications, vol. 51, pp. 863–879, 2011.
[13] R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and local
perception gan for photorealistic and identity preserving frontal view synthesis,”
in ICCV, 2017, pp. 2439–2448.
[18] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift,” in ICML, 2015.
[20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recogni-
tion,” in CVPR, 2016.
[22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected
convolutional networks,” in CVPR, 2017.
[26] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for
accurate object detection and semantic segmentation,” in CVPR, 2014.
[28] D. Yoo, S. Park, J.-Y. Lee, A. S. Paek, and I. So Kweon, “Attentionnet: Aggre-
gating weak directions for accurate object detection,” in ICCV, 2015.
[29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg,
“SSD: Single shot multibox detector,” in ECCV, 2016.
[30] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Uni-
fied, real-time object detection,” in CVPR, 2016.
[31] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in CVPR, 2017.
[32] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment us-
ing multitask cascaded convolutional networks,” IEEE Signal Processing Letters,
vol. 23, no. 10, pp. 1499–1503, Oct 2016.
[34] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for
improved quality, stability, and variation,” ICLR, 2018.
[36] H. Reeve and J. Lim, “Reduction of blocking effect in image coding,” in IEEE
International Conference on Acoustics, Speech, and Signal Processing, 1983.
[38] C. Wang, J. Zhou, and S. Liu, “Adaptive non-local means filter for image deblock-
ing,” Signal Processing: Image Communication, vol. 28, pp. 522–530, 2013.
[39] A.-C. Liew and H. Yan, “Blocking artifacts suppression in block-coded images
using overcomplete wavelet representation,” IEEE Transactions on Circuits and
Systems for Video Technology, pp. 450–461, 2004.
[40] A. Foi, V. Katkovnik, and K. Egiazarian, “Pointwise shape-adaptive DCT for high-
quality denoising and deblocking of grayscale and color images,” IEEE Transac-
tions on Image Processing, pp. 1395–1411, 2007.
[43] A. Dosovitskiy and T. Brox, “Generating images with perceptual similarity met-
rics based on deep networks,” in NIPS, 2016.
[44] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer
and super-resolution,” in ECCV. Springer, 2016.
[46] G. Lu, W. Ouyang, D. Xu, X. Zhang, Z. Gao, and M.-T. Sun, “Deep kalman
filtering network for video compression artifact reduction,” in ECCV, 2018.
[47] J. W. Soh, J. Park, Y. Kim, B. Ahn, H.-S. Lee, Y.-S. Moon, and N. I. Cho, “Re-
duction of video compression artifacts based on deep temporal networks,” IEEE
Access, vol. 6, pp. 63094–63106, 2018.
[49] A. Chambolle, “An algorithm for total variation minimization and applications,”
Journal of Mathematical Imaging and Vision, vol. 20, no. 1-2, pp. 89–97, 2004.
[50] D. Zoran and Y. Weiss, “From learning models of natural image patches to whole
image restoration,” in ICCV, 2011.
[51] M. Elad and M. Aharon, “Image denoising via sparse and redundant representa-
tions over learned dictionaries,” IEEE Transactions on Image Processing, vol. 15,
no. 12, pp. 3736–3745, 2006.
[52] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denois-
ing,” in CVPR, 2005.
[56] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, “Deep networks for image
super-resolution with sparse prior,” in CVPR, 2015.
[59] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang, “Fsrnet: End-to-end learning face
super-resolution with facial priors,” in CVPR, 2018.
[63] J. Hu, J. Lu, and Y.-P. Tan, “Deep transfer metric learning,” in CVPR, 2015.
[64] Y. Yang, S. Liao, Z. Lei, and S. Li, “Large scale similarity learning using similar
pairs for person verification,” in AAAI, 2016.
[66] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach
for deep face recognition,” in ECCV, 2016.
[68] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere
embedding for face recognition,” in CVPR, 2017.
[69] O. Barkan, J. Weill, L. Wolf, and H. Aronowitz, “Fast high dimensional vector
multiplication face recognition,” in ICCV, 2013.
[70] X. Cao, D. Wipf, F. Wen, G. Duan, and J. Sun, “A practical transfer learning
algorithm for face verification,” in ICCV, 2013.
[72] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to
human-level performance in face verification,” in CVPR, 2014.
[75] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric learning via
lifted structured feature embedding,” in CVPR, 2016.
[76] C. Huang, C. C. Loy, and X. Tang, “Local similarity-aware deep feature embed-
ding,” in NIPS, 2016.
[77] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu, “Cosface:
Large margin cosine loss for deep face recognition,” in CVPR, 2018.
[78] H. Jia and A. M. Martinez, “Face recognition with occlusions in the training
and testing sets,” in IEEE International Conference on Automatic Face Gesture
Recognition, 2008.
[81] Y. Su, Y. Yang, Z. Guo, and W. Yang, “Face recognition with occlusion,” in IAPR
Asian Conference on Pattern Recognition, 2015.
[82] Y. Wen, W. Liu, M. Yang, Y. Fu, Y. Xiang, and R. Hu, “Structured occlusion
coding for robust face recognition,” Neurocomputing, vol. 178, pp. 11–24, 2016.
[83] W. Wan and J. Chen, “Occlusion robust face recognition based on mask
learning,” in ICIP, 2017.
[84] L. Song, D. Gong, Z. Li, C. Liu, and W. Liu, “Occlusion robust face recognition
based on mask learning with pairwise differential siamese network,” in ICCV,
2019.
[85] Y. Xiao, D. Cao, and L. Gao, “Face detection based on occlusion area detection
and recovery,” Multimedia Tools and Applications, vol. 79, pp. 16531–16546,
2020.
[87] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with
conditional adversarial networks,” in CVPR, 2017.
[93] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-
representations,” in International conference on curves and surfaces. Springer,
2010.
[96] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” arXiv
preprint arXiv:1412.6980, 2014.
[101] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,”
arXiv preprint arXiv:1411.7923, 2014.
[102] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools, 2000.
[106] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li, “High-fidelity pose and expression
normalization for face recognition in the wild,” in CVPR, 2015, pp. 787–796.
[107] T. Hassner, S. Harel, E. Paz, and R. Enbar, “Effective face frontalization in un-
constrained images,” in CVPR, 2015, pp. 4295–4304.
[108] A. Bulat and G. Tzimiropoulos, “How far are we from solving the 2d & 3d face
alignment problem? (and a dataset of 230,000 3d facial landmarks),” in ICCV,
2017.
[109] E. Schonfeld, B. Schiele, and A. Khoreva, “A u-net based discriminator for gen-
erative adversarial networks,” in CVPR, 2020.
[112] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization
strategy to train strong classifiers with localizable features,” in ICCV, 2019.
[113] Z. Liu, P. Luo, X. Wang, and X. Tang, “Large-scale CelebFaces Attributes
(CelebA) dataset,” 2015.
[115] M. Hirsch, S. Harmeling, S. Sra, and B. Schölkopf, “Online multi-frame blind
deconvolution with super-resolution and saturation correction,” A&A, vol. 531,
p. A9, 2011.
[116] S. Lee, S. W. Oh, D. Won, and S. J. Kim, “Copy-and-paste networks for deep
video inpainting,” in ICCV, 2019.