
The 3rd International Conference on Intelligent Autonomous Systems

GAN-based End-to-End Unsupervised Image Registration for RGB-Infrared Image

Kanti Kumari
Department of Engineering Design
Indian Institute of Technology Madras
Chennai, India
e-mail: kanti@smail.iitm.ac.in

Ganapathy Krishnamurthi
Department of Engineering Design
Indian Institute of Technology Madras
Chennai, India
e-mail: gankrish@iitm.ac.in

Abstract—Image registration is a pre-processing step used in various computer vision applications. This paper presents an unsupervised image registration method for a given pair of RGB and infrared images, with the RGB image used as the reference for the infrared. The method exploits the GAN architecture with a spatial transformer module to synthesize the transformed image using an unsupervised loss criterion. The loss used for error backpropagation is taken as a linear combination of an adversarial loss; a Mean Squared Error (MSE) loss between the input RGB image and the image synthesized by the generator; a KL divergence loss between the IR image and the synthesized image; and another MSE loss estimated using feature maps extracted from a pretrained VGG-16. The adversarial loss forces the generator to output an IR-like image, with the input IR image and the generated IR image labelled as real and fake, respectively. The other three losses are backpropagated through the generator network to learn the transformation as well as to preserve the structure and resolution of the generated image. This unsupervised learning process is stopped after a specified number of iterations based on a validation set. A supervised method has also been developed for comparison with the presented method. The SSIM and PSNR values estimated between the predicted registered image and its ground truth have been used as evaluation criteria. The unsupervised method scored 0.8351±0.06 and 35.2723±0.68 for SSIM and PSNR respectively, while the supervised method scored 0.7620±0.08 and 15.8978±2.21.

Keywords—unsupervised image registration, convolutional neural network, spatial transformer network, GANs

I. INTRODUCTION

Registration of images plays a fundamental role in many computer vision applications, for example, medical image processing or remote sensing. It is required to combine imaging data captured with different sensors, so that all the information is available on a single platform. One of the images is referred to as the fixed image, while the other is the moving image.

Image registration takes two images as input and tries to achieve the maximum overlap between their common features. Its objective is to estimate an optimal mapping function to transform the moving image. Affine transformations are a well-known family of transformation functions, with translation, scale, rotation, and shear as parameters. The proposed method estimates affine parameters as its intermediate result.

Conventionally, image registration has been performed using image intensity details. With recent technological advancements, many deep learning-based approaches have been explored for registration. Elhanany et al. [1] presented a fully connected neural network that predicts translation, rotation, and scaling affine parameters with the Discrete Cosine Transform as its input. Fully connected neural networks fail to capture spatial details, which can be captured by Convolutional Neural Networks (CNNs). Various CNN-based methods have been developed over time to learn affine parameters with the Fourier transform [2], Zernike moments [3], etc. of images as network inputs. Simonovsky et al. [4] proposed a CNN-based method that generates a dissimilarity map for a given moving-fixed image pair for patch-wise registration. Similarly, the estimation of a displacement vector field has been performed by Sokooti et al. [5] for non-rigid image registration.

CNNs capture spatial relationships but fail to learn image-to-image mappings across different domains. This challenge can be overcome with GAN-based methods [6]. Tanner et al. [7] and Mahapatra et al. [8] proposed GAN-inspired networks for deformable registration of multi-modal images. The method proposed by Mahapatra et al. uses images of two different modalities as input to the generator, which outputs a deformation field. Applying this deformation field to the moving image generates a transformed image, which is then fed to the discriminator to differentiate between the original and fake moving images.

In this paper, we propose a GAN-based image registration method. The major highlight of this approach is that it does not need training before inference. It is also an unsupervised approach that makes a GAN learn the transformation for a given image pair without providing the ground truth transformation or the ground truth registered image.

The rest of the paper is organized as follows: Section II gives an overview of the data used along with the preprocessing performed, Section III explains the proposed method, Section IV showcases the outcomes of the unsupervised image registration compared with the supervised technique, and finally we conclude the paper in Section V.

II. DATA

A mixture of four different publicly available RGB and IR image datasets has been used in this paper, as follows:

1) Video Analytics Dataset by INO [9],
2) Freiburg forest dataset by AIS [10],
3) OSU color-thermal database [11], and
4) RGB-INR Scene dataset [12].
After combining the datasets, the full data consists of 22076 pre-registered pairs of visible-infrared images. The datasets mentioned above include images of different sizes; hence every image has been resized to 256x256 with appropriate zero-padding, which ensures a loss-less transformation of the image after its geometric transformation.
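The resizing convention is not spelled out beyond "256x256 with zero-padding"; a minimal preprocessing sketch is given below, where the content margin (200 px) and the use of OpenCV are illustrative assumptions.

```python
import cv2
import numpy as np

def pad_and_resize(img, out_size=256, content_size=200):
    """Fit the image inside a zero (black) canvas of out_size x out_size.
    The zero margin leaves room for rotation, scaling, and translation
    without pushing content off the canvas; content_size is an assumed
    value, not taken from the paper."""
    h, w = img.shape[:2]
    scale = content_size / max(h, w)
    resized = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))))
    canvas = np.zeros((out_size, out_size) + img.shape[2:], dtype=img.dtype)
    y0 = (out_size - resized.shape[0]) // 2
    x0 = (out_size - resized.shape[1]) // 2
    canvas[y0:y0 + resized.shape[0], x0:x0 + resized.shape[1]] = resized
    return canvas
```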
Train and validation sets with ground truth registered images have been used for training the supervised method. The held-out test set has been used to evaluate the outputs generated by both the unsupervised and supervised methods [13].
III. METHODOLOGY

The developed GAN-based image registration methods aim to produce a registered image for a given image pair. The pairs were initially aligned to each other; hence the fixed images have been generated by applying transformations that induce misalignment. Random affine transform parameters have been used in this process, with different values for translation, scale, and rotation sampled from the ranges specified in Table I.

TABLE I
THE RANGES OF VALUES USED FOR THE TRANSFORM PARAMETERS THAT GENERATE THE FIXED IMAGE.

Affine Parameter            Range of Values
Translation                 [0, 10]
Scale                       [0.8, 1.5]
Rotation angle (degrees)    [0, 90]
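A minimal sketch of this misalignment step is shown below; uniform sampling and the order in which rotation, scale, and translation are composed are assumptions, since the paper only states the parameter ranges.

```python
import numpy as np

def random_affine_matrix(rng=np.random.default_rng()):
    """Sample misalignment parameters from the Table I ranges and
    build the corresponding 2x3 affine matrix."""
    tx, ty = rng.uniform(0.0, 10.0, size=2)     # translation in pixels
    s = rng.uniform(0.8, 1.5)                   # isotropic scale
    theta = np.deg2rad(rng.uniform(0.0, 90.0))  # rotation angle
    c, sn = np.cos(theta), np.sin(theta)
    return np.array([[s * c, -s * sn, tx],
                     [s * sn,  s * c, ty]])
```

Such a matrix can then be applied with, e.g., cv2.warpAffine to turn a registered frame into the misaligned fixed image.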
The objective of this paper is to train a GAN-based model whose generator can take an unregistered RGB-infrared pair and output the registered infrared image. Let $G_{\theta_G}$ and $D_{\theta_D}$ represent the Generator and Discriminator networks, respectively, parameterized by their corresponding $\theta$. Fig. 1 gives the general view of the developed methods for end-to-end image registration.

Fig. 1. Flowchart of a typical end-to-end image registration system.

Two major variants of GAN-based registration have been developed: supervised and unsupervised. As the names suggest, the supervised variant is designed around ground truth registered images, while the latter does not have access to them.

A. Supervised GAN-based method

The supervised image registration model has been developed to learn the mapping from the moving image to the registered image. It consists of two interconnected modules (Fig. 2):
1) an encoder-decoder like architecture for the Generator, and
2) a deep convolutional neural network as the Discriminator.
The Generator includes a pretrained ResNet-50, which is connected to two strided convolutional and four convolutional transpose layers, along with batch normalization and a few ReLU activation layers, as given in Fig. 2. An upsampling layer has been added to obtain an image of the same size as the input infrared. The network has been trained with a loss that is a linear combination of an adversarial loss, a KL (Kullback-Leibler) divergence loss, and an SSIM (structural similarity index) loss. This combined loss, referred to as the alignment loss, is calculated as

$L_{\text{alignment,supervised}} = L_{\text{adversarial}} + L_{KL} + L_{\text{SSIM}}$   (1)

where

$L_{\text{adversarial}} = \min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{M' \sim p_{\text{train}}(M')}[\log D_{\theta_D}(M')] + \mathbb{E}_{(M,F) \sim p_G(M,F)}[\log(1 - D_{\theta_D}(G_{\theta_G}(M, F)))]$   (2)

In (2), $F$, $M$, and $M'$ denote the fixed, moving, and transformed images, respectively; the transformed image is the output of the generator network.

The KL divergence has been estimated between the output of the generator and the moving image to ensure that the generator gives an output whose features and intensity distribution are similar to those of the moving image. It is formulated as the dissimilarity between the pixel probability distributions of the two images:

$L_{KL}(P_{M'}, P_M) = \sum_{i=0}^{255} P_{M'}(X{=}i) \log \frac{P_{M'}(X{=}i)}{P_M(X{=}i)}$   (3)

where $P_{M'}(X{=}i)$ and $P_M(X{=}i)$ represent the probabilities of the discrete pixel value $i$ in $M'$ and $M$, respectively.
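A small sketch of this histogram-based KL term follows; the 256-bin histogram, the epsilon for numerical stability, and the use of NumPy are assumptions beyond what the paper states.

```python
import numpy as np

def kl_divergence_loss(warped, moving, eps=1e-10):
    """Eq. (3): KL divergence between the pixel-intensity
    distributions of M' (warped) and M (moving), both uint8 images."""
    p = np.bincount(warped.ravel(), minlength=256).astype(np.float64)
    q = np.bincount(moving.ravel(), minlength=256).astype(np.float64)
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```

Note that a hard histogram is not differentiable with respect to pixel values, so a training implementation would need a soft histogram or a similar relaxation; that detail is not specified in the paper.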
The SSIM loss is computed between the warped and fixed images so that their common features are appropriately aligned in the output of the generator network. SSIM is a similarity index whose values range between 0 (zero structural similarity) and 1 (maximum similarity). The loss is its inversion, obtained by subtracting the value from 1:

$L_{\text{SSIM}} = 1 - \text{SSIM}(M', F)$   (4)
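Assuming a standard SSIM implementation such as scikit-image's (the paper does not name one), the loss in (4) reduces to:

```python
from skimage.metrics import structural_similarity

def ssim_loss(warped, fixed):
    """Eq. (4): 1 - SSIM(M', F) for single-channel uint8 images;
    pass channel_axis=-1 for color inputs."""
    return 1.0 - structural_similarity(warped, fixed, data_range=255)
```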
The discriminator network has been trained with the ground truth registered image and the output of the generator, labelled as real and fake, respectively.

Fig. 2. The network architecture of the Generator and Discriminator developed for the supervised image registration method, with each layer's kernel size (f), number of filters (n), and stride (s). FC represents a fully-connected layer.

Fig. 3. The network architecture of the Generator and Discriminator developed for the unsupervised image registration model.

An Adam optimizer with a learning rate of 0.0002 and a batch size of 10 has been used for training the supervised image registration model. The values of these hyperparameters have been selected by trial and error.
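For reference, a schematic PyTorch training step for the supervised model is sketched below. The module interfaces (G, D), the binary cross-entropy form of the adversarial objective, and the unweighted sum of the loss terms are assumptions; kl_loss and ssim_loss stand for differentiable versions of (3) and (4).

```python
import torch
import torch.nn.functional as F_t

def supervised_step(G, D, opt_G, opt_D, moving, fixed, gt_registered,
                    kl_loss, ssim_loss):
    """One training step: the discriminator sees the ground truth as
    real and the generator output as fake; the generator minimizes
    the alignment loss of eq. (1)."""
    warped = G(moving, fixed)
    # --- discriminator update ---
    real, fake = D(gt_registered), D(warped.detach())
    loss_D = (F_t.binary_cross_entropy_with_logits(real, torch.ones_like(real))
              + F_t.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # --- generator update: adversarial + KL + SSIM ---
    fake = D(warped)
    loss_G = (F_t.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))
              + kl_loss(warped, moving) + ssim_loss(warped, fixed))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_G.item(), loss_D.item()
```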
B. Unsupervised GAN-based method

The unsupervised model has generator and discriminator architectures similar to those of the supervised method. The generator consists of two parts: a block of convolutional layers and a geometric transformer layer. The convolutional block outputs the first two rows of the affine transformation matrix, which is then paired with the moving image and forwarded to the transformer layer (Fig. 3). The transformer layer applies the given transformation to the moving image and generates the transformed image $M'$. After completion of one forward pass, the loss is calculated, and the moving image is replaced with the warped image for the next iteration. The discriminator follows a CNN architecture that takes $M$ and $M'$ as inputs, labelled as real and fake, respectively.

The loss of the network is a combination of the adversarial, KL-divergence, and MSE losses:

$L_{\text{alignment,unsupervised}} = L_{\text{adversarial}} + L_{KL} + L_{MSE}$   (5)

The KL divergence loss $L_{KL}$ captures intensity details, while the MSE ($L_{MSE}$) and adversarial ($L_{\text{adversarial}}$) losses focus on fidelity information. The KL divergence loss has been estimated between $M$ and $M'$ as given in (3).
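The geometric transformer layer can be realized with the standard spatial transformer sampling operations; a minimal PyTorch sketch, assuming the conventions of torch.nn.functional.affine_grid, is:

```python
import torch
import torch.nn.functional as F_t

def warp(moving, theta):
    """Apply a batch of 2x3 affine matrices (as predicted by the
    convolutional block) to the moving image.  affine_grid expects
    normalized [-1, 1] coordinates, so pixel-space parameters must
    be rescaled accordingly before this call."""
    grid = F_t.affine_grid(theta, moving.size(), align_corners=False)
    return F_t.grid_sample(moving, grid, align_corners=False)

# e.g. moving: (B, 1, 256, 256) infrared batch, theta: (B, 2, 3)
```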

Fig. 4. Comparison of the outputs generated by the different registration algorithms, along with the inputs and ground truth: (a) fixed or reference RGB image, (b) moving infrared image, (c) ground truth registered image, (d) output of the supervised model trained without the VGG-MSE loss, (e) output of the supervised model trained with the VGG-MSE loss, (f) image generated by the unsupervised registration model without the VGG-MSE loss, and (g) output generated by the unsupervised model with the VGG-MSE loss. The images have been cropped to remove the extra zero-padding for better presentation.

The MSE component is computed between $F$ and $M'$ in a pixel-to-pixel manner as given below:

$L_{MSE} = \frac{1}{C \times H \times W} \sum_{i=1}^{C} \sum_{j=1}^{H} \sum_{k=1}^{W} \left(M'[j,k,i] - F[j,k,i]\right)^2$

where $C$, $H$, and $W$ represent the size of the image in channels, height, and width; their values are 3, 256, and 256, respectively.
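In a tensor framework this is a one-liner, since the mean already divides by $C \times H \times W$:

```python
import torch

def mse_loss(warped, fixed):
    """Pixel-wise L_MSE between M' and F (e.g. 3x256x256 tensors)."""
    return torch.mean((warped - fixed) ** 2)
```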
The optimization of the loss has been carried out using stochastic gradient descent with a learning rate of 0.005. A few dropout layers have also been added to both the generator and the discriminator, with a dropout rate of 0.2. Both modules consist of strided convolutional layers, which make the networks learn the downsampling of the image across the layers.
In addition to all the loss components explained above, experiments have been carried out to estimate the change in output when a VGG-MSE loss is added to both the supervised and unsupervised models. This loss is computed between feature maps extracted from a pretrained VGG-16 with the top classifier layers removed, which gives a total of 512 feature maps of shape 7x7 for an input of size 256x256x3. It is estimated as

$L_{\text{VGG-MSE}} = \frac{1}{512} \sum_{i=1}^{512} \frac{(F_{M'}^{(i)} - F_M^{(i)})^2}{7 \times 7}$   (6)

where $F_{M'}^{(i)}$ and $F_M^{(i)}$ represent the $i$-th feature maps corresponding to the warped and moving images, extracted from the modified VGG-16 network. The comparisons have been made between both networks, with and without the VGG-MSE loss, for each method.
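A sketch of this term with torchvision is given below; interpreting (6) as the mean squared feature difference, torch.mean performs the normalization over the 512 maps and the 7x7 grid in one call. The adaptive 7x7 pooling mirrors VGG-16's own avgpool stage and is an assumption about how the 7x7 maps were obtained from 256x256 inputs.

```python
import torch
import torchvision.models as models

# Pretrained VGG-16 with the classifier removed: convolutional
# features followed by a 7x7 adaptive average pool.
_vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
_pool = torch.nn.AdaptiveAvgPool2d(7)

def vgg_mse_loss(warped, moving):
    """Eq. (6): MSE over the 512 feature maps of shape 7x7.
    Inputs are (B, 3, 256, 256); a single-channel IR image would be
    replicated to three channels (an assumption)."""
    f_w = _pool(_vgg(warped))          # gradients flow to the warped image
    with torch.no_grad():
        f_m = _pool(_vgg(moving))
    return torch.mean((f_w - f_m) ** 2)
```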

IV. RESULTS AND DISCUSSION

The comparison between the developed methods has been performed on the held-out test set, which consists of a total of 11038 video frames for both modalities.

TABLE II
AVERAGE SSIM AND PSNR ESTIMATED ON THE TEST SET BETWEEN THE GROUND TRUTH REGISTERED IMAGE AND THE NETWORK OUTPUT.

Model                     SSIM     PSNR      Processing Time (s)
Supervised                0.7615   15.8889   0.0301
Supervised + VGG-MSE      0.7620   15.8978   0.0311
Unsupervised              0.8287   35.1335   1.7554
Unsupervised + VGG-MSE    0.8351   35.2723   1.8355

The SSIM similarity index and the PSNR have been used to compare the outcomes of each method. It has been observed that the constraints on the adversarial loss, such as the KL-divergence and MSE terms, ensure that the generated image looks realistic. Also, the methods with the VGG-MSE loss generate images with better contrast as well as improved SSIM and PSNR values after registration. The unsupervised method with the VGG-MSE loss yields better SSIM and PSNR values than the other methods (Table II), while the unsupervised method without this loss has comparable performance. Table II also reports the average processing time taken to register a pair of images (after completion of the training phase in the case of the supervised methods). From Table II, it is clear that the supervised methods register an image faster than the unsupervised one, but the time taken by the supervised method during its training phase on an NVIDIA Tesla K40c with 12 GB of GPU memory cannot be neglected, while the unsupervised model does not go through the overhead of model training and does not need the registered image as ground truth.
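The per-pair optimization that produces loss curves like the one in Fig. 5 can be sketched as follows; the fixed iteration budget, the module interfaces, and the omission of the discriminator update are assumptions, with warp being the transformer layer sketched in Section III-B and loss_fn a stand-in for eq. (5).

```python
import torch

def register_pair(G, opt_G, rgb_fixed, ir_moving, loss_fn, n_iters=10):
    """Iteratively register one RGB-IR pair: after each forward pass
    the moving image is replaced by the warped output."""
    moving = ir_moving
    for _ in range(n_iters):
        theta = G(rgb_fixed, moving)     # predicted 2x3 affine rows
        warped = warp(moving, theta)
        loss = loss_fn(warped, moving, rgb_fixed)   # eq. (5)
        opt_G.zero_grad(); loss.backward(); opt_G.step()
        moving = warped.detach()
    return moving
```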
Fig. 5. Loss of the unsupervised model (with the VGG-MSE loss) over consecutive iterations.

Warped images generated by all four models are compared in Fig. 4. The figure shows the fixed, moving, and ground truth images as Fig. 4(a), 4(b), and 4(c), respectively, while Fig. 4(d), 4(e), 4(f), and 4(g) are the outputs of the supervised method without the VGG-MSE loss, the supervised method with the VGG-MSE loss, the unsupervised method without the VGG-MSE loss, and the unsupervised method with the VGG-MSE loss, respectively. The outputs generated by the supervised methods appear more like manipulated versions of the RGB image than transformed versions of the infrared, as the distinctive properties of their features show similarity with the RGB image. For example, the transformed image has parallel strips on the road and building-like features that resemble the RGB rather than the IR image (Fig. 4(d) and 4(e)). From this observation, it can be stated that the network learns the mapping from the fixed image to the transformed image but is not able to preserve the distinctive features present in the infrared. In both versions of the unsupervised model, on the other hand, all the features present in the original image remain present after the transformation, as shown in Fig. 4(f) and 4(g).

Visually, it is not easy to differentiate between the outputs of the two variants of the unsupervised model, with and without the VGG-MSE loss. This loss also increases the processing time taken for registration.

Fig. 5 shows the loss of the model over consecutive iterations during registration. In this example, the network was able to learn the transformation in only 7 iterations. The graph was created for the best performing unsupervised model.

V. CONCLUSION

An unsupervised image registration method inspired by the architecture of GANs has been presented in this paper. The network learns the required geometric transformation by reducing an adversarial loss combined with KL-divergence, MSE, and VGG-MSE losses. The proposed method has been compared with a GAN-based supervised image registration model, which uses the ground truth for its training. Publicly available datasets of RGB-infrared images, already registered in the raw data, have been used. A random transformation has been applied to the RGB images to add misalignment; the transformed RGB has been taken as the fixed image, while the original infrared image is used as the moving image. The proposed method does not need to be trained and stored before being used for registration. Our results show that the unsupervised methods outperform the developed supervised models, with higher SSIM and PSNR values. From this, it can be stated that the supervised method learns the synthesis from RGB to infrared, but the required transformations are performed better by the proposed GAN-based image registration method.

REFERENCES

[1] I. Elhanany, M. Sheinfeld, A. Beck, Y. Kadmon, N. Tal, and D. Tirosh, "Robust image registration based on feedforward neural networks," in Proc. 2000 IEEE International Conference on Systems, Man and Cybernetics (SMC), vol. 2, Oct. 2000, pp. 1507-1511.
[2] A. B. Abche, F. Yaacoub, A. Maalouf, and E. Karam, "Image registration based on neural network and Fourier transform," in 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, Aug. 2006, pp. 4803-4806.
[3] J. Wu and J. Xie, "Zernike moment-based image registration scheme utilizing feedforward neural networks," in Fifth World Congress on Intelligent Control and Automation, vol. 5, June 2004, pp. 4046-4048.
[4] M. Simonovsky, B. Gutiérrez-Becker, D. Mateus, N. Navab, and N. Komodakis, "A deep metric for multimodal registration," CoRR, vol. abs/1609.05396, 2016. [Online]. Available: http://arxiv.org/abs/1609.05396
[5] H. Sokooti, B. de Vos, F. Berendsen, B. P. F. Lelieveldt, I. Išgum, and M. Staring, "Nonrigid image registration using multi-scale 3D convolutional neural networks," in Medical Image Computing and Computer Assisted Intervention (MICCAI 2017), M. Descoteaux, L. Maier-Hein, A. Franz, P. Jannin, D. L. Collins, and S. Duchesne, Eds. Cham: Springer International Publishing, 2017, pp. 232-239.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672-2680. [Online]. Available: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
[7] C. Tanner, F. Özdemir, R. Profanter, V. Vishnevsky, E. Konukoglu, and O. Göksel, "Generative adversarial networks for MR-CT deformable image registration," CoRR, vol. abs/1807.07349, 2018. [Online]. Available: http://arxiv.org/abs/1807.07349
[8] D. Mahapatra, S. Sedai, and R. Garnavi, "Elastic registration of medical images with GANs," CoRR, vol. abs/1805.02369, 2018. [Online]. Available: http://arxiv.org/abs/1805.02369
[9] "Video analytics dataset," www.ino.ca/en/video-analytics-dataset/.
[10] A. Valada, G. Oliveira, T. Brox, and W. Burgard, "Deep multispectral semantic scene understanding of forested environments using multimodal fusion," in International Symposium on Experimental Robotics (ISER 2016), 2016, pp. 465-477.
[11] J. W. Davis and V. Sharma, "Background-subtraction using contour-based fusion of thermal and visible imagery," Computer Vision and Image Understanding, vol. 106, pp. 162-182, 2007.
[12] M. Brown and S. Susstrunk, "Multi-spectral SIFT for scene category recognition," in CVPR 2011, 2011, pp. 177-184.
[13] K. Kumari and G. Krishnamurthi, "An end-to-end unsupervised RGB-IR image registration," Journal of the Indian Society of Remote Sensing (JISRS), 2019, in press.

