
sensors

Article
Visible-Image-Assisted Nonuniformity Correction of Infrared
Images Using the GAN with SEBlock
Xingang Mou, Tailong Zhu and Xiao Zhou *

School of Mechanical and Electronic Engineering, Wuhan University of Technology, Wuhan 430070, China
* Correspondence: zhouxiao@whut.edu.cn; Tel.: +86-134-0711-2626

Abstract: Aiming at reducing image detail loss and edge blur in the existing nonuniformity correction
(NUC) methods, a new visible-image-assisted NUC algorithm based on a dual-discriminator genera-
tive adversarial network (GAN) with SEBlock (VIA-NUC) is proposed. The algorithm uses the visible
image as a reference for better uniformity. The generative model downsamples the infrared and
visible images separately for multiscale feature extraction. Then, image reconstruction is achieved
by decoding the infrared feature maps with the assistance of the visible features at the same scale.
During decoding, SEBlock (a channel attention mechanism) and skip connections are used to ensure
that more distinctive channel and spatial features are extracted from the visible features. Two dis-
criminators based on vision transformer (Vit) and discrete wavelet transform (DWT) were designed,
which perform global and local judgments on the generated image from the texture features and
frequency domain features of the model, respectively. The results are then fed back to the generator
for adversarial learning. This approach can effectively remove nonuniform noise while preserving
the texture. The performance of the proposed method was validated using public datasets. The
average structural similarity (SSIM) and average peak signal-to-noise ratio (PSNR) of the corrected
images exceeded 0.97 and 37.11 dB, respectively. The experimental results show that the proposed
method improves the metric evaluation by more than 3%.

Keywords: generative adversarial network; infrared image; nonuniformity correction; visible image;
vision transformer

1. Introduction

Infrared thermography (IRT), as a nondestructive detection technology, can effectively complete such tasks as navigation, remote sensing, and reconnaissance [1]. Ideally, when an infrared detector receives uniform infrared radiation, its imaging is also uniform. However, the response of different detection units varies according to material, process, and circuit design. This manifests in the image as fringe noise or low-frequency noise, which can be collectively referred to as nonuniform noise [2].

Currently, deep learning-based algorithms for infrared image correction have shown promising results, but they suffer from problems such as texture degradation and detail loss. In this paper, we propose the visible-image-assisted nonuniformity correction of infrared images using a GAN [3] with SEBlock [4]. The generator outputs the ideal image by correcting the infrared images with the assistance of the visible images, and in order to introduce visible image features more effectively, SEBlock is used to train the weights of the feature maps. Then, the generator works with the discriminators to realize adversarial learning. Vit [5] and DWT [6] were introduced into the discriminators, respectively, to realize the global and local discrimination of the generated image from the perspectives of spatial texture and noise residue. Infrared images corrected by the proposed method have clear textures, sharp edges, strong robustness, and high correction efficiency.

The main ideas and contributions of this paper are summarized as follows:

• In order to solve the problems of edge degradation and texture detail loss, visible
images are introduced as an assistant. Infrared images with different noise intensities
and visible images with the same field of view and resolution are used as input image
pairs. Visible images are used to assist the network in better learning information
about the texture of the infrared image;
• Two discriminators were devised, including Vit-Discriminator (VITD) and DWT-
Discriminator (DWTD). The VITD is used for the judgment of spatial texture and
introduces Vit to avoid the problem of inconsistent convergence between the generator
and discriminator in GAN. The DWTD transforms the input images by using DWT
first and then determines whether residual noise exists in the generated image by
exploiting the directionality of the noise;
• To solve the problem of weighting the infrared and visible information, we use SEBlock to modify the channel weights of the visible feature maps. This helps the model make better use of the visible information and prevents excessive visible information from degrading the infrared image correction.
The rest of the paper is organized as follows. In Section 2, related works in this field
are introduced. In Section 3, the principles and theoretical support of this method are
presented. In Section 4, the experimental procedure is detailed, the infrared images are
corrected by comparing various algorithms, and the results are analyzed. Finally, the
conclusion is drawn in Section 5.

2. Related Work
Common correction algorithms can be divided into three categories, namely, calibration-
based, scene-based, and deep learning-based.
Calibration-based correction algorithms obtain the nonuniform response of the detec-
tion unit through a homogeneous radiation source before the system runs. The principle
of this approach is simple and easy to apply. Common algorithms include one-point
correction [7], two-point correction [8], and multipoint correction [9]. Currently, nonlin-
ear response models and real-time correction are mainly studied. For example, in 2020,
Huang Yu et al. proposed adaptive multipoint calibration by improving the selection of
the standard point [9]. Guan et al. applied two-point calibration to an online detection
process in 2021 [10]. However, this approach requires that the calibration environment be
consistent with the working environment. The system needs to be recalibrated if the work
environment changes or if the working time is too long.
Scene-based calibration methods solve the recalibration problem by continuously
updating the calibration parameters based on the imaging results. Common algorithms
include constant statistics [11], time-domain Gaussian filtering [12], Kalman filtering [13],
etc. However, most of these correction methods have to obtain parameters based on
the results of the previous frame or several frames. Therefore, when the scene changes
suddenly, there will be a “Ghost” left by the last frame in the corrected images. Moreover,
such methods are based on multiframe processing, which requires a large amount of
computation and a highly stable working environment.
Deep learning-based correction algorithms use autoencoders for single-frame image
correction. For example, Wang et al. used a convolutional neural network to realize infrared
image correction [14], avoiding the problem of a lack of prior knowledge in traditional
algorithms. Dan et al. proposed a multiscale residual network [15] based on an attention
mechanism to obtain better texture information. However, these algorithms are highly
demanding on datasets. When the quality of the datasets is poor, the correction results
are not ideal. Moreover, when the structure of the feature extraction is relatively simple,
the reconstructed image is blurred, and “Checkerboard Artifacts” occur. The correction
network proposed by Lu et al. is based on the Unet network [16], which effectively solves
the “Checkerboard Artifacts” problem, but still relies highly on datasets. Cui et al. used the
generative adversarial network for the nonuniformity correction of infrared images [17] and
introduced multilevel residual connections to help the network make better use of context

information. This approach has certain robustness, but its discriminator is simple, and the
model will easily collapse during training if the convergence of the generator and discrimi-
nator is inconsistent. So, it is necessary to keep adjusting the hyperparameters to achieve
better results. In the application of an infrared nonuniform correction GAN network,
Chen et al. optimized its loss function [18], and Liu et al. introduced multiple discriminator
architectures [19] to continuously improve the effect of adversarial learning. There are also
other improvements to GAN itself, such as CGAN [19], WGAN [20], SGAN [21], etc.
As the Transformer [22] has been widely studied in recent years, its contribution to the field of image processing has grown steadily. Some scholars combined it with the Unet network and proposed the TUnet network [23] for image generation. Despite the good feature extraction capability of the Transformer, its network model is relatively large and requires a large amount of data, which makes it difficult to apply in practical work.
The ViTGAN model [24] proposed by Lee et al. combined Vit with GAN and effectively
improved the training instability problem of GAN, and the dependence on datasets could be
slightly reduced by using adversarial training. However, this discriminator only performs
binary classification on the generated images, which does not utilize Vit’s feature extraction
capability effectively.
Overall, deep learning-based algorithms achieve better single-frame correction results than traditional infrared nonuniformity correction algorithms. By combining the ideas of the GAN, visible-image guidance, and Vit, the method proposed in this paper obtains a better correction network and better correction results.

3. The Proposed VIA-NUC Method


3.1. Algorithm Flow and Principle
At present, many infrared NUC algorithms have problems, such as noise residuals
and texture fading. The cause lies not only in correction algorithms but also in the infrared
images themselves, which often lack sufficient texture information, hierarchical informa-
tion, or edge information. In contrast, visible images, with better structures, more detailed
information, and higher resolutions, can effectively reduce the problems of infrared im-
ages [25]. Based on the above analysis, in the proposed algorithm, in order to have better
texture and structure information, a visible image is introduced as guide information.
The implementation of the algorithm is based on GAN, which was proposed by
Goodfellow et al. in 2014. GAN is an unsupervised deep learning model that is commonly
used for data generation tasks. Its basic idea is to design a generator and a discriminator.
Then, adversarial training between the generator and the discriminator is used to help the
generator generate samples that follow the distribution of the real data. The discriminator is
used to distinguish the authenticity of the input data and provide feedback to the generator.
Its training process can be divided into two steps. The first step is to fix the generator
and train the discriminator. The discriminator performs forward propagation on the real
data and the generated data separately. Then, based on the outputs of the two times, the
discrimination loss is calculated, and the parameters are updated by backpropagation. The
second step is to fix the discriminator and train the generator. The generator generates
data that conform to the true data distribution from the initial noise input and passes the
generated data to the discriminator. The generator calculates the adversarial loss based on
the output of the discriminator and updates the parameters by backpropagation. The two
models are alternately trained, improving their abilities synchronously and finally reaching
the Nash equilibrium.
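The alternating scheme described above can be summarized in a short sketch. The following is a minimal illustration in TensorFlow (the framework used in Section 4); the generator, discriminator, and optimizer objects are placeholders for any standard GAN, not the exact models of this paper.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(real_images, noise, generator, discriminator, g_opt, d_opt):
    """One alternating GAN step: first update the discriminator with the generator fixed,
    then update the generator with the discriminator fixed."""
    with tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        d_real = discriminator(real_images, training=True)
        d_fake = discriminator(fake_images, training=True)
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    with tf.GradientTape() as g_tape:
        d_fake = discriminator(generator(noise, training=True), training=True)
        g_loss = bce(tf.ones_like(d_fake), d_fake)  # non-saturating generator objective
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss
```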
The proposed algorithm is different from the traditional GAN and consists of a gen-
erator and two discriminators. The flow of the proposed algorithm is shown in Figure 1.
First, the generator takes the infrared image to be corrected and its corresponding visible
image as the initial input and extracts the features from the infrared and visible images
separately. Then, through the feature fusion and reconstruction module, the corrected
infrared image is generated. Second, in order to evaluate the performance of the generator,
two discriminators are designed to assess the generated infrared image from the aspects
of texture and noise. One discriminator, VITD, discriminates the texture of the generated image against the ideal infrared image and visible image and introduces the Vit to improve the generalization ability. The other, DWTD, uses the DWT to transform the ideal infrared and generated images and then carries out a global noise-elimination judgment. Finally, through the adversarial loss function, the evaluation results of the two discriminators are fed back to the generator. The generator is instructed to perform adaptive training so that the generated infrared image achieves the best balance between texture and noise. After multiple iterations, the Nash equilibrium between the generator and the two discriminators is finally reached, resulting in a high-quality infrared image correction model.

Figure 1. Algorithm flowchart.

The algorithm in this paper focuses on the design and implementation of the generator and discriminators. In the generator, infrared image reconstruction, assisted by visible features, can effectively reconstruct texture information and achieve better local features for the corrected image.

3.2. Visible Assisted Generator Structure Design
The structure of the generator is shown in Figure 2. There are 11 encoding units and six decoding units in the network. The encoding module processes the infrared and visible channels separately. It achieves multiscale feature extraction by reducing the feature map size and increasing the number of feature map channels. The decoding operation is performed based on the infrared feature channel, and the SEBlock is used to introduce the visible and infrared contextual features. Finally, the model outputs a corrected infrared image of 256 × 256 × 1.

The encoding module is shown in Figure 3. It uses a convolution kernel with a size of 3 × 3 for encoding and the "LeakyReLU" activation function for nonlinear activation. Then, max pooling with a stride of 2 and a window of 2 × 2 is used for downsampling to obtain the encoded features. Batch normalization is performed to ensure network convergence. Finally, the C × C × M feature map is encoded as a C/2 × C/2 × N feature map.

Figure 4 shows the process of feature fusion. After the feature maps are encoded, the infrared and visible feature maps of the same scale are spliced, and a convolution layer is used to obtain the fusion feature. Then, the fusion feature is input into the SEBlock to obtain more details by changing the channel weights.
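As a concrete illustration of one encoding unit and the splicing step, a minimal Keras-style sketch is given below; the function names and channel counts are illustrative assumptions rather than the exact configuration of the generator.

```python
import tensorflow as tf
from tensorflow.keras import layers

def encode_block(x, out_channels):
    """One encoding unit: 3x3 convolution -> LeakyReLU -> 2x2 max pooling (stride 2) -> batch norm.
    Turns a C x C x M feature map into a C/2 x C/2 x N feature map."""
    x = layers.Conv2D(out_channels, 3, padding="same")(x)
    x = layers.LeakyReLU()(x)
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    return layers.BatchNormalization()(x)

def fuse_features(ir_feat, vis_feat, out_channels):
    """Feature fusion: splice (concatenate) same-scale infrared and visible maps,
    then mix them with a convolution before the SEBlock reweighting."""
    x = layers.Concatenate(axis=-1)([ir_feat, vis_feat])
    return layers.Conv2D(out_channels, 3, padding="same")(x)
```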
Figure 2. Structure of the generator.

Figure 3. Structure of the encoding module.

Figure 4. Feature fusion module.

The SEBlock is shown in Figure 5. First, the feature map of H × W × C is compressed to 1 × 1 × C by global average pooling. It is then converted into 1 × 1 × C channel weights, which is realized through a fully connected layer and an activation function. Finally, the channel weights are multiplied by the original feature map, and a new feature map is obtained by changing the weight of the original channels.

Figure 5. Structure of the SEBlock module.

The decoding module is shown in Figure 6. After the fused feature map of the same scale is input into the decoding layer, it is concatenated with the encoded feature map of the previous layer and then fused by the convolutional module. Then, through deconvolution and batch normalization operations, the C × C × M feature map is decoded into a 2C × 2C × N feature map.

Figure 6. Structure of the decoding module.
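The channel reweighting can be illustrated with a short squeeze-and-excitation sketch. It follows the standard two-layer SEBlock form; the reduction ratio is an assumption for illustration and may differ from the configuration used here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def se_block(x, reduction=16):
    """Squeeze-and-excitation: global average pooling to 1x1xC, fully connected layers
    producing 1x1xC channel weights in (0, 1), then rescaling of the original feature map."""
    channels = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)                # squeeze: H x W x C -> C
    w = layers.Dense(channels // reduction, activation="relu")(w)
    w = layers.Dense(channels, activation="sigmoid")(w)   # excitation: per-channel weights
    w = layers.Reshape((1, 1, channels))(w)
    return layers.Multiply()([x, w])                      # reweight the original channels
```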

3.3. Vit-Discriminator Structure Design

In the traditional GAN, the discriminator is used to distinguish the generated data from the real data but focuses only on global features. In 2017, the Pix2Pix [19] algorithm improved the traditional GAN and proposed the PatchGAN model. It introduces local features by taking the real and predicted images as input data. After the convolution operations, a 30 × 30 × 1 feature map is finally generated. Global and local discrimination of the generated image can be achieved by treating each patch as local information and the mean of the entire feature map as global information. However, to prevent overfitting caused by the simplistic structure of the PatchGAN discriminator, the complexity of the discriminator should be increased.

In 2020, the Google team proposed an effective model for applying the Transformer to image recognition, called the Vision Transformer, as shown in Figure 7. The core idea of Vit is to divide the input image into multiple patches and flatten each patch into a vector, which is added to a position encoding and fed into the Transformer Encoder as an input sequence. The Transformer Encoder then extracts global features from the input sequence through the self-attention mechanism. Finally, a one-dimensional feature is obtained as the original image feature. During this process, Vit completely abandons the structure of a convolutional neural network and fully utilizes the Transformer to capture the global and local relationships in the image, thus extracting richer features. In addition, Vit combines the semantic information and spatial information of the image by using image segmentation and token embedding, which can obtain more expressive features. However, Vit requires huge computational resources and data for training, so it cannot perform well in some image processing tasks.

Figure 7. Structure of the Vit module.
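For reference, a minimal sketch of the two Vit building blocks mentioned above (patch embedding with position encoding, and one Transformer Encoder with self-attention) is given below. The dimensions and head counts are illustrative assumptions, and the Keras MultiHeadAttention layer used here requires a relatively recent TensorFlow release.

```python
import tensorflow as tf
from tensorflow.keras import layers

def patch_embed(images, patch_size=16, dim=64):
    """Split images into non-overlapping patches, flatten and project each patch,
    and add a learned position encoding (assumes a fixed, known image size)."""
    patches = tf.image.extract_patches(
        images,
        sizes=[1, patch_size, patch_size, 1],
        strides=[1, patch_size, patch_size, 1],
        rates=[1, 1, 1, 1],
        padding="VALID",
    )
    num_patches = patches.shape[1] * patches.shape[2]
    tokens = layers.Dense(dim)(tf.reshape(patches, (-1, num_patches, patches.shape[-1])))
    positions = layers.Embedding(num_patches, dim)(tf.range(num_patches))
    return tokens + positions

def transformer_encoder(x, dim=64, num_heads=4, mlp_dim=128):
    """One Transformer Encoder block: multi-head self-attention and an MLP,
    each wrapped with layer normalization and a residual connection."""
    a = layers.LayerNormalization()(x)
    a = layers.MultiHeadAttention(num_heads=num_heads, key_dim=dim // num_heads)(a, a)
    x = layers.Add()([x, a])
    m = layers.LayerNormalization()(x)
    m = layers.Dense(mlp_dim, activation="gelu")(m)
    m = layers.Dense(dim)(m)
    return layers.Add()([x, m])
```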

This paper proposes VITD based on the idea of PatchGAN. The main idea of VITD is to use Vit to discriminate the texture of the generated image. Each patch with encoded information in the Vit module represents the local information of the input image, and the mean of all patches represents the global information of the image. The Vit mainly uses the self-attention mechanism to extract features, so it can extract richer and higher-level features. This is very important for image texture discrimination. Moreover, VITD is only used for image texture discrimination and does not need to understand the content or semantics of the image, which means it does not have high requirements for its feature ability. Therefore, fewer Transformer Encoders are needed to meet the design requirements, thus reducing the computational complexity and memory consumption.

The detailed structural design of the VITD model is shown in Figure 8. First, the generated image, ideal infrared image, and visible image are concatenated to obtain the input features. Then, the Vit maps the input features to 30 × 30 patches, passes them through three serial Transformer Encoders, feeds the result to a 900 × 1 fully connected layer, and reshapes it into a 30 × 30 × 1 feature output. At the same time, the input features yield 30 × 30 × 1 features via the residual network. Finally, the residual-connection features and Vit features are fused with a fully connected network so that the 30 × 30 × 1 output features are obtained, each patch of which indicates a local texture repair situation.

Figure 8. Structure of the Vit-Discriminator.

The advantage of VITD is that, by introducing Vit, it can effectively extract high-level semantic information from the input feature tensor and enhance the discrimination and generalization ability of the discriminator. It can also effectively avoid the model collapse caused by the insufficient generalization ability of the discriminator. Moreover, by introducing a residual network, it can ensure the convergence speed and stability of the discriminator, thus avoiding the problems of gradient vanishing or exploding.

3.4. DWT-Discriminator Structure Design

The nonuniform noise in the infrared image is dominated by linear fringe noise, which is distinctly oriented. After correction, the residual noise also has a certain directivity. Therefore, we use the DWT to judge residual noise, new noise, and other problems.

The DWT can effectively extract the frequency domain features of infrared images with nonuniform noise by decomposing the image into four sub-bands, namely, the approximation, vertical, horizontal, and diagonal bands. Among them, the approximation band represents the low-frequency component of the image and reflects the overall brightness of the image. The other three sub-bands represent the high-frequency components of the image in the vertical, horizontal, and diagonal directions, respectively, and reflect the directional details in the image. When the infrared image with nonuniform noise passes through the DWT, four subgraphs are obtained, as shown in Figure 9. The noise is mainly in the vertical subgraph, while the other subgraphs are mainly used to describe the edge information of the image.

Figure 9. The DWT of the strip noise corrupted image. (a) Noise corrupted image; (b) approximation coefficients; (c) horizontal coefficients; (d) vertical coefficients; (e) diagonal coefficients.

Based on the above analysis, the DWTD is designed in this paper, and its structure is shown in Figure 10.
Figure 10. Structure of the DWT-Discriminator.

The idea of DWTD is to use DWT to decompose the image into different sub-bands,
and then perform discrimination on each sub-band. Finally, the global noise residual
discrimination of the image can be obtained. Specifically, DWTD can be divided into the
following four parts:
DWT layer: This layer puts a 256 × 256 × 1 generated image and ideal infrared
image into the DWT layer, respectively, and obtains two 128 × 128 × 4 feature maps. The
DWT layer can effectively extract the frequency domain features of the image and, at the
same time, maintain the spatial information of the image and provide the basis for the
subsequent discrimination.
Group Convolutional Layer: This layer performs cross-concatenation of the output
features from the DWT layer and applies group feature extraction to the corresponding
sub-bands, resulting in four 128 × 128 × 1 feature maps. The Group Convolutional layer
can effectively help DWTD to discriminate different subgraphs separately and enhance the
feature extraction ability of the network.
Convolutional Layer: The role of this layer is to perform global feature extraction
on the output of the group convolutional layer and help the DWTD to perform global
discrimination. Moreover, this layer uses a residual structure, which effectively avoids the
problem of gradient vanishing and enhances the feature extraction ability.
Discrimination Layer: This layer is used to implement the final discrimination of the
generated data. The input features are reduced by the pooling layer and fully connected
layer, and these are nonlinearly mapped by the Sigmoid activation function to obtain
the discrimination result. In order to avoid overfitting, the Discrimination layer uses the
dropout technique in the fully connected layer, which enhances the generalization ability
of the network.
The DWTD discriminates the generated image based on the ideal infrared image and
obtains a numerical output in the range of [0, 1]. The output value represents the noise
residual of the generated data from the frequency domain perspective. The closer the
output is to 1, the closer the generated data is to the ideal infrared image.
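A minimal sketch of this four-part pipeline is given below, assuming a Haar wavelet (via PyWavelets) and illustrative layer sizes; it is not the exact DWTD configuration.

```python
import numpy as np
import pywt
import tensorflow as tf
from tensorflow.keras import layers, Model

def cross_concat_subbands(generated, ideal):
    """Haar DWT of both 256x256 images, cross-concatenated per sub-band:
    channel order (gen_A, ideal_A, gen_H, ideal_H, gen_V, ideal_V, gen_D, ideal_D)."""
    g = pywt.dwt2(generated, "haar")
    i = pywt.dwt2(ideal, "haar")
    g_bands = [g[0], *g[1]]
    i_bands = [i[0], *i[1]]
    pairs = [b for gb, ib in zip(g_bands, i_bands) for b in (gb, ib)]
    return np.stack(pairs, axis=-1).astype("float32")  # 128 x 128 x 8

def build_dwtd():
    """Sketch of the DWT-Discriminator head operating on the cross-concatenated sub-bands."""
    x_in = layers.Input((128, 128, 8))
    x = layers.Conv2D(4, 3, padding="same", groups=4, activation="relu")(x_in)  # one map per sub-band pair
    shortcut = x
    x = layers.Conv2D(4, 3, padding="same", activation="relu")(x)               # global feature extraction
    x = layers.Add()([x, shortcut])                                             # residual connection
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.5)(x)
    out = layers.Dense(1, activation="sigmoid")(x)                              # output in [0, 1]
    return Model(x_in, out)
```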

4. Experimental Results and Analysis


4.1. Implementation Details
4.1.1. Dataset
The dataset in this paper is composed of a public dataset and self-taken images. The
public datasets MSRS [26], M3FD [27], and RoadScene [28] provide rich infrared and visible
images under different scenes and different illumination. The infrared camera for the
self-taken images is a long-wave (8–14 µm) infrared detector developed by our laboratory,
which is based on RTD611 uncooled infrared focal plane array produced by the InfiRay
Company. The output frame rate of our infrared imaging system is 50 Hz, the resolution
of the image is 640 × 512, and the bit width is 16 bits. We then use platform histogram
equalization to convert the output infrared image into an 8-bit grayscale image. The visible
light camera is an A7A20MG9 camera produced by the IRAYPLE Company. After fixing
the infrared camera and the visible light camera to the positioning plate, the correction
was performed. The visible image is registered with the infrared image based on the
image feature points and through bicubic interpolation. We selected a total of 5000 pairs of
infrared and visible images. Through image enhancement technology, 10,000 image pairs
were made as the model training set, which was divided into the training set, test set, and
validation set based on the 6:2:2 ratio.
Since the gray image and visible image have different resolutions, the dataset is
processed as follows, as shown in Figure 11. The visible image was converted into the
grayscale image shown in Figure 11a and cut into a 1:1 square matrix with the size adjusted
to 256 × 256. The infrared image of Figure 11b is adjusted according to the same size. Then,
the fringe noise was added to the infrared image to obtain Figure 11c, which is used as
the noisy image to be corrected. By stitching the noisy, visible, and ideal images into a
three-channel image, as shown in Figure 11d, the dataset in Figure 11e was completed. The
noisy and visible channels were taken as the generator inputs during model training, and
the ideal image was fed as the input into the discriminator.
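A minimal sketch of this sample-construction step is given below; the cropping strategy and the column-wise stripe-noise model are illustrative assumptions.

```python
import cv2
import numpy as np

def make_training_sample(visible_bgr, infrared_gray, noise_strength=0.1):
    """Build one three-channel training sample: (noisy IR, visible gray, ideal IR)."""
    vis = cv2.cvtColor(visible_bgr, cv2.COLOR_BGR2GRAY)
    side = min(vis.shape[:2])
    vis = cv2.resize(vis[:side, :side], (256, 256))   # 1:1 crop, then resize to 256 x 256
    ir = cv2.resize(infrared_gray, (256, 256))

    # Column-wise (vertical stripe) nonuniformity: one gain/offset per detector column.
    gain = 1.0 + noise_strength * np.random.randn(1, 256)
    offset = 255 * noise_strength * np.random.randn(1, 256)
    noisy = np.clip(ir.astype(np.float32) * gain + offset, 0, 255).astype(np.uint8)

    return np.dstack([noisy, vis, ir])                 # channels: noisy, visible, ideal
```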

Figure 11. Dataset. (a) Visible gray image; (b) infrared image; (c) infrared noise image; (d) splicing example diagram; (e) image data set.

4.1.2. Loss Function

The original GAN consists of a generator and a discriminator, denoted as G and D, respectively. The generator is responsible for generating data that are as realistic as possible, and the discriminator is responsible for judging the authenticity of the data. The objective function of the GAN is designed to take advantage of the adversarial relationship between the generator and discriminator, which helps in the training of the network. The objective function of the GAN can be expressed as

G = \arg\max_{D}\min_{G} E_x[\log D(x)] + E_z[\log(1 - D(G(z)))],  (1)

where x is the real data, z is the random noise, and E is the expectation value.

The goal of the generator is to maximize the error rate of the discriminator. That is, the data distribution of G(z) is consistent with x, making it impossible for the discriminator to distinguish between the ideal data and the generated data. At this point, D(G(z)) is maximized. Therefore, the objective function of the generator is

G_G = \arg\min_{G} E_z[\log(1 - D(G(z)))],  (2)

The goal of the discriminator is to maximize the accuracy of the discriminator's judgment. That is, D(x) is maximized, but D(G(z)) is minimized. Therefore, the objective function of the discriminator is

G_D = \arg\max_{D} E_x[\log D(x)] + E_z[\log(1 - D(G(z)))],  (3)

The proposed model differs from the original GAN in an additional discriminator and an input condition. By denoting the generator as G and the discriminators as D_1 and D_2, the input infrared image to be corrected as x, the input random noise image as z, and the ideal infrared image as y, the objective function is obtained according to the idea of the GAN:

G_{ours} = \arg\max_{D_1, D_2}\min_{G} E_{x,y}[\log D_1(x, y)] + E_{x,z}[\log(1 - D_1(x, G(x, z)))] + E_{x,y}[\log D_2(x, y)] + E_{x,z}[\log(1 - D_2(x, G(x, z)))],  (4)

Based on Equation (4), we designed the loss function of the model, including the generator loss and the discriminator loss.

The generator loss consists of the adversarial loss, the L1 loss, and the SSIM loss. The adversarial loss encourages the distribution of the generated images to be as close as possible to the distribution of the ideal images. The L1 loss and the SSIM loss ensure that the generated images preserve the texture information as much as possible.
The adversarial loss function is

L a ( G, D1 , D2 ) = Ex,z [log(1 − D1 ( x, G ( x, z)))] + Ex,z [log(1 − D2 ( x, G ( x, z)))], (5)



When the adversarial loss is small, it means that D1 ( x, G ( x, z)) and D2 ( x, G ( x, z)) are
large. It indicates that the data generated by G can well deceive D1 and D2 .
The L1 loss function is

L_{L1}(G) = E_{x,y,z}[\|y - G(x, z)\|_1],  (6)

The SSIM loss function is

LSSIM ( G ) = (1 − SSIM( G ( x, z), y)), (7)

SSIM evaluates the image quality from the three aspects of brightness, contrast, and structure; the larger the SSIM, the more similar the two images are, so the SSIM loss in Equation (7) decreases as the similarity increases. Its calculation formula is as follows:

SSIM(X, Y) = f(l(X, Y), c(X, Y), s(X, Y)),  (8)

where l(X, Y) represents the brightness of the picture, c(X, Y) represents the contrast of the picture, and s(X, Y) represents the structure of the picture. The formulas are

l(X, Y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1},  (9)

c(X, Y) = \frac{2\delta_x \delta_y + C_2}{\delta_x^2 + \delta_y^2 + C_2},  (10)

s(X, Y) = \frac{\delta_{xy} + C_3}{\delta_x \delta_y + C_3},  (11)
where µ x and µy are the average values of all pixel values of image X and image Y, δx and
δy are the standard deviations of all pixel values of the two images, δxy is the covariance of
the corresponding pixels of image X and image Y, and C1 , C2 and C3 are constant, which
can avoid a zero denominator.
When the SSIM and L1 losses are small, it means that the generated images have a
high similarity to the ideal images, and the generated images are close to the ideal images.
In summary, the total loss function of the generator is

LG ( G, D1 , D2 ) = L a ( G, D1 , D2 ) + λ1 L L1 ( G ) + λ2 LSSI M ( G ), (12)

where λ1 and λ2 are the coefficients of the loss function, and 100 is selected for them in
this paper.
We expect to minimize the generator loss, which means the generated image of G can
not only deceive D1 and D2 but also has better texture information.
The loss functions of the two discriminators are

L D1 ( G, D1 ) = Ex,y [logD1 ( x, y)] + Ex,z [log(1 − D1 ( x, G ( x, z)))], (13)

L D2 ( G, D2 )= E x,y [logD2 ( x, y)] + Ex,z [log(1 − D2 ( x, G ( x, z)))], (14)


Our goal was that, by maximizing the loss of the two discriminators, D1 and D2 could
have a stronger discrimination ability.
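A minimal TensorFlow sketch of these losses is given below. The non-saturating cross-entropy form is used in place of the log(1 − D) terms, as is common in practice, and images are assumed to be scaled to [0, 1] with a channel axis; λ1 = λ2 = 100 follows the setting above.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
LAMBDA_1 = LAMBDA_2 = 100.0   # weights of the L1 and SSIM terms

def generator_loss(d1_fake, d2_fake, generated, ideal):
    """Adversarial terms from both discriminators plus the L1 and SSIM terms of Equation (12)."""
    adversarial = bce(tf.ones_like(d1_fake), d1_fake) + bce(tf.ones_like(d2_fake), d2_fake)
    l1 = tf.reduce_mean(tf.abs(ideal - generated))
    ssim = tf.reduce_mean(tf.image.ssim(ideal, generated, max_val=1.0))
    return adversarial + LAMBDA_1 * l1 + LAMBDA_2 * (1.0 - ssim)

def discriminator_loss(d_real, d_fake):
    """Loss for either discriminator: real pairs pushed toward 1, generated pairs toward 0."""
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
```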

4.1.3. Model Training


The hardware platform for the deep learning network training designed in this paper
is Intel(R) Xeon(R) Platinum 8350C and RTX 3090. The deep learning framework used in
the experiment is tensorflow_gpu-2.3. The detailed parameters are shown in Table 1.

Table 1. Model training environment and parameters.

Indicator Parameters
CPU Intel(R) Xeon(R) Platinum 8350C
GPU RTX 3090
RAM Size 43G
VRAM Size 24G
CUDA 10.0
Learning Framework Tensorflow-gpu-2.3.0
Batch Size 64
Optimizer Adam
Learning Rate 0.0001
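Under the settings in Table 1, the optimizers can be instantiated as follows (a minimal sketch; the batch size of 64 is applied in the input pipeline rather than here):

```python
import tensorflow as tf

# Adam with a learning rate of 1e-4, per Table 1; one optimizer per trained network.
generator_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
vitd_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
dwtd_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
```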

4.1.4. Compared Algorithm


The algorithm proposed in this paper was compared with the following algorithms,
namely, the GAN for improving Vit-Discriminator (Vit-GAN), Pix2Pix algorithm [19], a
histogram equalization correction algorithm (MHE) [29], a residual deep network-based
NUC method (DLS) [30], NUC based on a bilateral filter (BFTH) [31], the depth residual
network NUC algorithm (DMRN) [32], the residual attention network NUC algorithm
(RAN) [15], and convolutional blind denoising (CBDNet) [33].

4.1.5. Evaluation Metrics


The evaluation of the NUC performance of the infrared detectors can be divided into
subjective evaluation and objective evaluation.
The subjective evaluation was mainly to observe the corrected image with the naked
eye and evaluate the corrected image from the aspects of image clarity, texture details, edge,
and noise removal.
The objective evaluation mainly involves two aspects: SSIM and PSNR, where the
PSNR formula is

PSNR = 10 \times \log_{10}\left[\frac{(2^n - 1)^2}{MSE}\right],  (15)

where, n is the number of bits, MSE is the mean square deviation between two images, and
MSE formula is:

MSE = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} \|X(i, j) - Y(i, j)\|^2,  (16)

where X and Y are the compared images; m and n are the width and height of the image.
The higher the PSNR value, the better the NUC quality.
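For reference, a minimal sketch of these two metrics is given below; tf.image.ssim is used as a stand-in for the windowed SSIM of Equations (8)–(11).

```python
import numpy as np
import tensorflow as tf

def psnr(x, y, bits=8):
    """PSNR between two grayscale images following Equations (15) and (16)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(((2 ** bits - 1) ** 2) / mse)

def ssim(x, y, bits=8):
    """Windowed SSIM via tf.image.ssim on single-channel images."""
    x = tf.convert_to_tensor(x[..., np.newaxis], tf.float32)
    y = tf.convert_to_tensor(y[..., np.newaxis], tf.float32)
    return float(tf.image.ssim(x, y, max_val=2 ** bits - 1))
```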

4.2. Experimental Results and Analysis


4.2.1. Network Analysis
In order to better reflect the function of the visible image and SEBlock, we visually
analyzed the trained model and the Pix2Pix model. In the network, the activation function
converts the output of the convolutional layer into features in the [−1, 1] interval. For
better visual presentation, this paper takes advantage of the mean variance standardization
to convert it into a pixel value of [0, 255]. The input image for the generator is shown in
Figure 12.
Figure 13 shows the results of the visible feature layer for the model proposed in this
paper. The model extracted features from visible images, including background features,
texture features, and structural noise. After separating the sky background, the model
extracted the features of streetlights, pedestrians, floors, and road information. The result
shows that with the visible image as assistance, the generator effectively learns the visible-
related features for subsequent feature fusion.
SensorsSensors
2023, 23, x FOR
2023, PEER REVIEW
23, 3282 18 of 32
13 of 22

(a) (b)
Figure 12. Network visual test image. (a) Input infrared image to be corrected; (b) input
image.

Figure 13 shows the results of the visible feature layer for the model proposed
paper. The model extracted features from visible images, including background fe
texture features, and structural noise. After separating the sky background, the mo
(a) (b)
tracted the features of streetlights, pedestrians, floors, and road information. The
Figure 12.12.Network
shows
Figure that
Networkvisual
with test
test image.
the visible
visual (a) Input
image
image. (a) Input infrared
as assistance,
infrared image to corrected;
be corrected;
thetogenerator
image be (b) input
effectively
(b) input visible
learns
visible the v
image.
related features for subsequent feature fusion.
image.

Figure 13 shows the results of the visible feature layer for the model proposed in this
paper. The model extracted features from visible images, including background features,
texture features, and structural noise. After separating the sky background, the model ex-
tracted the features of streetlights, pedestrians, floors, and road information. The result
shows that with the visible image as assistance, the generator effectively learns the visible-
related features for subsequent feature fusion.

Figure
Figure 13.13. Network
Network intermediate
intermediate visible visible feature maps.
feature maps.

Figure 14 shows
Figure the feature
14 shows maps after
the feature the model
maps afterfuses the visible
the model and the
fuses infrared features
visible and infrar
and feature maps of the same layer of the Pix2Pix. In the Pix2Pix, the feature layers are
tures and feature maps of the same layer of the Pix2Pix. In the Pix2Pix,
simply extracted from the infrared image, which does not effectively separate the texture
the feature
are simply
features from extracted from the
the noisy features. infrared
After image,
introducing thewhich
visibledoes
image, not
theeffectively
proportionseparate
of t
ture
noisy
Figure features
Network from
information
13. in thethe noisy
feature
intermediate features.
extraction
visible After introducing
is reduced,
feature maps. the and
and the structural visible
detailimage,
featuresthe prop
inofthenoisy
imageinformation
are effectivelyinobtained.
the feature extraction is reduced, and the structural and
Figure 14
features inshows the feature
the image maps after
are effectively the model fuses the visible and infrared fea-
obtained.
tures and feature maps of the same layer of the Pix2Pix. In the Pix2Pix, the feature layers
are simply extracted from the infrared image, which does not effectively separate the tex-
ture features from the noisy features. After introducing the visible image, the proportion
of noisy information in the feature extraction is reduced, and the structural and detail
features in the image are effectively obtained.
Sensors
Sensors 2023,2023,
2023,
Sensors 23, 3282
23, x23,
FOR PEER
x FOR REVIEW
PEER REVIEW 20 of21
32of
14 32
of 22

(a) (b)
(a) 14. Network intermediate feature maps. (a) Infrared feature
Figure (b) maps of Pix2Pix; (b) the middle
Figure 14. Network intermediate feature maps. (a) Infrared feature maps of Pix2Pix; (b) the middle
feature maps of our model.
Figure 14.
feature mapsNetwork intermediate feature maps. (a) Infrared feature maps of Pix2Pix; (b) the middle
of our model.
feature maps of our model.
Next, this paper conducts a detailed analysis of the effect of SEBlock. To show the effect more clearly, the mean-variance normalization is canceled during visualization, and each feature layer is instead transformed into the range [0, 255] for visual analysis through the following formula:

X(i, j) = (X'(i, j) + 1.0) × 127.5,  (17)

where X'(i, j) is the pixel value of the feature layer and X(i, j) is the corresponding display pixel value.
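For reference, the mapping of Equation (17) can be sketched as follows; the clipping to [0, 255] and the 8-bit cast are our additions for safe display and are not part of the equation.

```python
# Minimal sketch of Eq. (17): feature activations (assumed to lie roughly in [-1, 1])
# are rescaled to 8-bit display values for visualization.
import numpy as np

def feature_to_display(x: np.ndarray) -> np.ndarray:
    y = (x + 1.0) * 127.5          # Eq. (17): X(i, j) = (X'(i, j) + 1.0) * 127.5
    return np.clip(y, 0.0, 255.0).astype(np.uint8)

# toy usage on a random feature map
display = feature_to_display(np.random.uniform(-1.0, 1.0, size=(64, 64)))
```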
The results are shown in Figure 15. After the feature layer of Figure 15a passes through the SEBlock, the output of Figure 15b is obtained. As can be seen, after the corresponding feature layers have passed through the SEBlock module, more structural details appear, as shown in the second row, third column; the third row, second column; and the fourth row, second column. The comparison of the output feature layers in Figure 15b,c shows that adding SEBlock effectively helps the network improve its overall generalization ability and its ability to extract features.

Figure 15. SEBlock feature layer visualization. (a) The feature layer input to the channel attention (SEBlock) module; (b) the output feature layer of SEBlock; (c) the output feature layer of the model without SEBlock.

Finally, to verify the effectiveness of the VITD, the dataset is reduced to 2000 samples and the training time is increased to force the network toward overfitting. The GAN model and the Vit-GAN model are then trained under the same hyperparameters, and their discriminator and generator loss curves are compared. As is shown in Figure 16, the discriminator loss of the GAN model fluctuates strongly at 500 epochs, followed by similar phenomena at 700 and 900 epochs, while the generator loss also exhibits peculiar fluctuations. However, when the Vit-Discriminator is adopted, the overall loss curve is relatively stable, the loss function is optimized, and no over-fitting occurs throughout training. The comparison shows that introducing the Vit-Discriminator improves the generalization ability of the network and effectively avoids the over-fitting caused by a relatively simple discriminator.

Figure 16. Loss function curves of the GAN algorithm and the Vit-GAN algorithm.
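As a rough illustration of the training objective behind these curves, the sketch below shows a generic dual-discriminator setup in which the generator is judged by both a Vit-based and a DWT-based discriminator; the binary cross-entropy form and the L1 weight are assumptions for illustration, not the exact losses used in this paper.

```python
# Schematic adversarial losses for a generator trained against two discriminators.
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real, fake):
    """Standard BCE loss for one discriminator; the generator is not updated here."""
    real_logits = disc(real)
    fake_logits = disc(fake.detach())
    return 0.5 * (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
                  F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def generator_loss(disc_vit, disc_dwt, fake, target, lam=100.0):
    """Generator is judged by both discriminators; an L1 term keeps the output near the ideal image."""
    adv = 0.0
    for disc in (disc_vit, disc_dwt):
        logits = disc(fake)
        adv = adv + F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return adv + lam * F.l1_loss(fake, target)
```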
4.2.2. Subjective Evaluation
In this paper, the proposed algorithm is qualitatively compared with the following six algorithms: BFTH, MHE, DMRN, Pix2Pix, RAN, and CBDNet. Figure 17 shows the dataset (Figure 17a–c) and the selection of the local magnification windows (Figure 17d).

Figure 17. Test data display. (a) Input infrared image to be corrected; (b) the corresponding visible image; (c) ideal output infrared image; (d) windows selection.

Figure 18 shows the ideal result for the enlarged view of Window 1 (Figure 18a), the correction result of our proposed algorithm (Figure 18h), and the correction results of the other six algorithms (Figure 18b–g).
As shown in Figure 18, none of the compared algorithms produced satisfactory results. The result of the BFTH algorithm shows various noise residuals after correction, and the result is over-smoothed, which leads to a loss of texture information. The MHE algorithm results in a significant reduction in image contrast, and the noise removal is not clean enough. The DMRN algorithm significantly reduces the overall noise but has a small noise residual. In the Pix2Pix results, noise residuals are present, the brightness of the image changes considerably, and some information is lost. The mosaic phenomenon appears in the images of the CBDNet algorithm, where the text information is hard to recognize. Although the RAN algorithm produces good correction results, the image brightness is reduced. As is shown, except for the Pix2Pix algorithm, the deep learning-based correction results show a significant improvement. Among all the results in Figure 18, our algorithm achieves the best results, with higher image quality, better texture preservation, and cleaner noise removal than the other algorithms.

Figure 18. Correction results of Window 1. (a) Ideal result; (b) BFTH algorithm correction result; (c) MHE algorithm correction result; (d) DMRN algorithm correction result; (e) Pix2Pix algorithm correction result; (f) CBDNet algorithm correction result; (g) RAN algorithm correction result; (h) our algorithm correction result.

Figure 19 shows an enlarged view of the results for Window 2. The algorithm proposed by us achieves higher image quality and better texture preservation. In contrast, among the other algorithms, the images corrected by the BFTH, MHE, and DMRN algorithms all show significant noise; both the BFTH and CBDNet algorithms lead to the problem of texture distortion; the Pix2Pix algorithm comes with problems of contrast reduction and structural distortion; and after RAN algorithm correction, additional noise appears around the vehicle.
Figure 19. Correction results of Window 2. (a) Ideal result; (b) BFTH algorithm correction result; (c) MHE algorithm correction result; (d) DMRN algorithm correction result; (e) Pix2Pix algorithm correction result; (f) CBDNet algorithm correction result; (g) RAN algorithm correction result; (h) our algorithm correction result.

Subsequently, we performed DWT on the corrected images of each algorithm. Figure 20 shows the results for their vertical subgraphs. As is shown, the BFTH algorithm suffers from a large amount of fringe noise. After the correction of the MHE and Pix2Pix algorithms, the noise residuals are clearly regular. The DMRN algorithm and the RAN algorithm still have some slight bar noise in the vertical sub-band. After CBDNet algorithm correction, the texture information of some images is changed.

Figure 20. Vertical subgraph of correction results. (a) Ideal result; (b) BFTH algorithm correction result; (c) MHE algorithm correction result; (d) DMRN algorithm correction result; (e) Pix2Pix algorithm correction result; (f) CBDNet algorithm correction result; (g) RAN algorithm correction result; (h) our algorithm correction result.
According to the above qualitative evaluation, the image quality of the proposed algorithm was significantly improved after the correction. In the correction results, the background is clean, the texture is preserved, the edges are sharp, and no residual or added noise exists.

In Figure 21, we perform a further qualitative evaluation of the proposed algorithm. The figure shows the correction results of the algorithm when occlusion, parallax, etc., appear in the visible image, where the upper part is the input infrared image with noise, the middle part is the corresponding visible image, and the lower part is the correction result with its SSIM value. When there is a parallax between the visible image and the infrared image, the corrected image still takes the infrared image as the standard. When the visible image has large occlusions, the overall level is still excellent, although the infrared correction results are reduced. Therefore, in the proposed algorithm, the visible image acts more as auxiliary information than as a determining factor.

Figure 21. Correction results under different conditions. (a–c) Infrared and visible images are misregistered; (d) visible image has small-area occlusion; (e) visible image has occlusion; (f) visible image has a large area of occlusion.
4.2.3. Objective Evaluation
A quantitative analysis of the correction algorithms was performed on 100 infrared images with artificially added nonuniform noise. Among them, the visible image without large-area occlusion is selected, and it is registered with the infrared image.
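For illustration, nonuniform (column-wise stripe) noise of this kind can be synthesized as a per-column gain and offset applied to a clean image; the sketch below is a hedged example, and the noise parameters are illustrative assumptions rather than the settings used to build this test set.

```python
# Sketch of adding synthetic column-wise nonuniformity to a clean 8-bit infrared image.
import numpy as np

def add_stripe_noise(img: np.ndarray, gain_std=0.03, offset_std=5.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = img.shape
    gain = 1.0 + rng.normal(0.0, gain_std, size=(1, w))    # per-column gain nonuniformity
    offset = rng.normal(0.0, offset_std, size=(1, w))      # per-column offset nonuniformity
    noisy = img.astype(np.float32) * gain + offset
    return np.clip(noisy, 0, 255).astype(np.uint8)

# toy usage
noisy = add_stripe_noise(np.full((256, 256), 128, dtype=np.uint8))
```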
The specific SSIM values for the various algorithms are shown in Figure 22. The closer the metric is to 1, the closer the corrected image is to the true image. As shown in Figure 22, the BFTH algorithm has the worst SSIM. The Pix2Pix and MHE algorithms have similar correction results, but the Pix2Pix correction is less robust. The stability of the DLS and DMRN algorithms is similar, and their SSIM is significantly higher than that of the BFTH, MHE, and Pix2Pix algorithms, with the DMRN algorithm having the highest. Both the RAN and CBDNet algorithms are close in terms of stability and SSIM, with their average values above 0.93. Among them, the CBDNet algorithm performs slightly worse, and their stability is not as good as that of the DLS and DMRN algorithms. Our algorithm has the highest SSIM; almost all values are above 0.95 and fluctuate gently, indicating that the algorithm is robust.

Figure 22. SSIM of test dataset.

The results of the PSNR comparison are shown in Figure 23. The higher the PSNR, the better the correction result. The figure shows that the Pix2Pix algorithm has the worst measurements, with fluctuations around 31 dB. The BFTH algorithm has a higher PSNR than Pix2Pix due to its smooth transitions. The metrics of the MHE, DMRN, and DLS algorithms are similar, with MHE being the worst. When compared to the RAN algorithm, the PSNR of the CBDNet algorithm fluctuates a lot, which is similar to the SSIM result. Although the PSNR of our algorithm fluctuates, it floats around 37 dB, which is significantly higher than that of the other algorithms.

Figure 23. PSNR of test dataset.
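For reproducibility, the SSIM and PSNR used in this evaluation can be computed with standard tooling; the sketch below uses scikit-image, and the 8-bit data range is an assumption, not a setting taken from the paper.

```python
# Sketch of computing SSIM and PSNR between a corrected image and its ideal reference.
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate(corrected: np.ndarray, ideal: np.ndarray):
    ssim = structural_similarity(ideal, corrected, data_range=255)
    psnr = peak_signal_noise_ratio(ideal, corrected, data_range=255)
    return ssim, psnr

# toy usage with a slightly perturbed 8-bit image
ref = np.random.randint(0, 256, (256, 256), dtype=np.uint8)
noisy = np.clip(ref.astype(np.int16) + np.random.randint(-5, 6, ref.shape), 0, 255).astype(np.uint8)
print(evaluate(noisy, ref))
```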
In order to verify the impact of the visible images on our algorithm, 100 datasets with
unregistered infrared and visible images and 100 datasets with large occlusion areas in the
visible images were selected from the validation set. The average SSIM and PSNR obtained
were compared with those of the other algorithms, and the results are shown in Table 2.

Table 2. SSIM and PSNR of each algorithm.

Algorithm SSIM PSNR (dB)


Ours 0.9661 37.1148
Ours (mis-registration) 0.9643 37.0124
Ours (occlusion) 0.9452 36.5823
RAN 0.9388 35.3755
CBDNet 0.9312 35.7384
Pix2Pix 0.8496 30.9646
MHE 0.8583 33.9248
DLS 0.8870 34.3888
BFTH 0.8282 32.9318
DMRN 0.9032 34.8618

As shown in Table 2, the SSIM and PSNR of the corrected results do not decrease
significantly when the visible and infrared images are not registered. When the visible
image has a large occlusion area, the correction result of the proposed algorithm is reduced
due to the lack of auxiliary information of the partially visible image. However, the values
of the metrics are still in a high range.
As can be seen from the quantitative evaluation, the SSIM and PSNR of our algorithm
are higher than those of the other algorithms, and the overall volatility is stable, with good
correction results and strong robustness.

5. Conclusions
In this paper, we propose an infrared NUC algorithm based on visible image assistance
(VIA-NUC). We designed a generator and two different discriminators based on the GAN.
The generator takes the infrared image to be corrected, together with the assisted coaxial
visible image as the input. The feature map of the visible image is introduced into the
decoding process of the feature map of the infrared image to achieve better uniformity,
and the excessive influence of the visible images can be avoided by using SEBlock to
adjust the channel weights in the skip connections. We designed two discriminators: VITD
and DWTD. VITD was designed based on the idea of PatchGAN, which evaluates the
texture information of the generated image and avoids the overfitting that is typical in the
traditional GAN. DWTD performs the DWT on the input images to determine whether the
generated image has residual noise or additive noise. The experimental results show that
the proposed VIA-NUC achieves better performance when compared to the algorithms
for infrared NUC, and the structural similarity and PSNR of the corrected results were
obviously improved. The correction results of this algorithm are relatively stable, the
texture details of the corrected images are well preserved, and the edges are sharp without
introducing fresh noise. The algorithm is based on end-to-end training, which also avoids
the “Ghost” phenomenon caused by a priori information in the scene-based correction
algorithm and has better robustness.

Author Contributions: Conceptualization, X.M.; methodology, X.M., X.Z; software, T.Z.; validation,
X.M.; formal analysis, X.M.; investigation, T.Z.; resources, X.M.; data curation, X.Z.; writing—original
draft preparation, T.Z.; writing—review and editing, X.M., X.Z.; visualization, T.Z.; supervision,
X.M.; project administration, X.M. All authors have read and agreed to the published version of
the manuscript.
Funding: This research was supported in part by the National Natural Science Foundation of China
(61701357) and the China Scholarship Council.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The synthetic data underlying this article will be shared upon reason-
able request to the corresponding author.
Sensors 2023, 23, 3282 21 of 22

Conflicts of Interest: The authors declare no conflict of interest.

References
1. Guan, J.; Lai, R.; Li, H.; Yang, Y.; Gu, L. DnRCNN: Deep Recurrent Convolutional Neural Network for HSI Destriping. IEEE Trans.
Neural Netw. Learn. Syst. 2022, 33, 1–14. [CrossRef] [PubMed]
2. Mou, X.G.; Lu, J.J.; Zhou, X. Adaptive Correction Algorithm of Infrared Image Based on Encoding and Decoding Residual
Network. Infrared Technol. 2020, 42, 833–839.
3. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial
nets. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2014), Montreal, QC, Canada, 8–13
December 2014; pp. 2672–2680.
4. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
5. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.;
Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
6. Jensen, A.; la Cour-Harbo, A. Ripples in Mathematics: The Discrete Wavelet Transform; Springer Science & Business Media:
Berlin/Heidelberg, Germany, 2001.
7. Schulz, M.J.; Caldwell, L.V. Nonuniformity correction and correctability of infrared focal plane arrays. Infrared Imaging Syst. 1996,
36, 763–777. [CrossRef]
8. Yoshizawa, T.; Zou, R.; Shi, C.; Wei, P.; Zheng, J.; Mao, E. A new two-point correction algorithm for non-uniformity correction
combined with the information of scene. In Proceedings of the SPIE-The International Society for Optical Engineering, Beijing,
China, 15–17 November 2010; Volume 7850, p. 78500J.
9. Huang, Y.; Zhang, B.H.; Wu, J.; Chen, Y.Y.; Ji, L.; Wu, X.D.; Yu, S.K. Adaptive Multipoint Calibration Non-uniformity Correction
Algorithm. Infrared Technol. 2020, 42, 637–643. [CrossRef]
10. Guan, T.H.; Zhang, T.H. A New Real-Time Two-Point Non-Uniformity Correction Method. Aero Weapon. 2021, 28, 112–117.
11. Harris, J.G.; Chiang, Y.M. Nonuniformity correction of infrared image sequences using the constant-statistics constraint. IEEE
Trans. Image Process. 1999, 8, 1148–1151. [CrossRef] [PubMed]
12. Nie, X.; Zhang, H.; Liu, H.; Liang, Y.; Chen, W. An Infrared Image Enhancement Algorithm for Gas Leak Detecting Based on
Gaussian Filtering and Adaptive Histogram Segmentation. In Proceedings of the 2021 IEEE International Conference on Real-time
Computing and Robotics (RCAR), Xining, China, 15–19 July 2021.
13. Jian, Y.B.; Ruan, S.C.; Zhou, H.X. Improved Nonuniformity Correction Algorithm Based on Kalman-filtering. Opto-Electron. Eng.
2008, 35, 131–135.
14. Xiu, L.; Lin, Z.; Song, L.; Wen, G.; Yong, L. Adaptive neural network non-uniformity correction based on edge detection. In
International Congress on Image & Signal Processing; IEEE: Piscataway, NJ, USA, 2013.
15. Ding, D.; Li, Y.; Zhao, P.; Li, K.; Jiang, S.; Liu, Y.X. Single Infrared Image Stripe Removal via Residual Attention Network. Sensors
2022, 22, 8734. [CrossRef] [PubMed]
16. Mou, X.G.; Cui, J.; Zhou, X. Adaptive correction algorithm for infrared image based on generative adversarial network. Laser
Infrared 2022, 52, 427–434.
17. Chen, F.J.; Zhu, F.; Wu, Q.X.; Hao, Y.M.; Wang, E.D. Infrared image data augmentation based on generative adversarial network.
J. Comput. Appl. 2020, 40, 2084–2088.
18. Li, H.; Ma, Y.; Guo, L.; Li, H.; Chen, J.; Li, H. Image restoration for irregular holes based on dual discrimination generation
countermeasure network. Xibei Gongye Daxue Xuebao/J. Northwestern Polytech. Univ. 2021, 39, 423–429. [CrossRef]
19. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A. Image-to-image translation with conditional adversarial networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA,
2017; pp. 1125–1134.
20. Ma, H.; Li, J.; Zhan, W.; Tomizuka, M. Wasserstein generative learning with kinematic constraints for probabilistic interactive
driving behavior prediction. In Proceedings of the IEEE Intelligent Vehicles Symposium, Paris, France, 9–12 June 2019; pp.
2477–2483.
21. Yang, B.; Yan, G.; Wang, P.; Chan, C.Y.; Chen, Y. A novel graph-based trajectory predictor with pseudo-oracle. IEEE Trans. Neural
Netw. Learn. Syst. 2022, 33, 7064–7078. [CrossRef] [PubMed]
22. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In
Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing
Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
23. Sha, Y.Y.; Zhang, Y.H.; Ji, X.Q.; Hu, L. Transformer-Unet: Raw Image Processing with Unet. arXiv 2021, arXiv:2109.08417.
24. Lee, K.; Chang, H.; Jiang, L.; Zhang, H.; Tu, Z.; Liu, C. Vitgan: Training gans with vision transformers. arXiv 2021, arXiv:2107.04589.
25. Liu, Y.; Wu, Z.; Han, X.; Sun, Q.; Zhao, J.; Liu, J. Infrared and Visible Image Fusion Based on Visual Saliency Map and Image
Contrast Enhancement. Sensors 2022, 22, 6390. [CrossRef] [PubMed]
26. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image
fusion network. Inf. Fusion 2022, 82, 28–42. [CrossRef]
Sensors 2023, 23, 3282 22 of 22

27. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware Dual Adversarial Learning and a Multi-scenario
Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Montreal, BC, Canada, 11–17 October 2022; pp. 5802–5811.
28. Xu, H.; Ma, J.; Le, Z.; Jiang, J.; Guo, X. Fusiondn: A unified densely connected network for image fusion. In Proceedings of the
AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12484–12491.
29. Tendero, Y.; Gilles, J.; Landeau, S.; Morel, J.M. Efficient single image non-uniformity correction algorithm. In Proceedings of the
SPIE—The International Society for Optical Engineering, Orlando, FL, USA, 1–5 August 2010; p. 7834.
30. He, Z.W.; Cao, Y.P.; Dong, Y.F.; Yang, J.X.; Cao, Y.L.; Tisse, C.L. Single-image-based nonuniformity correction of uncooled
long-wave infrared detectors: A deep-learning approach. Appl. Opt. 2018, 57, D155–D164. [CrossRef] [PubMed]
31. Zuo, C.; Chen, Q.; Gu, G.; Qian, W. New temporal high-pass filter nonuniformity correction based on bilateral filter. Opt. Rev.
2011, 2, 18. [CrossRef]
32. Chang, Y.; Yan, L.; Liu, L.; Fang, H.; Zhong, S. Infrared aerothermal nonuniform correction via deep multiscale residual network.
IEEE Geosci. Remote Sens. Lett. 2019, 16, 1120–1124. [CrossRef]
33. Guo, S.; Yan, Z.; Zhang, K.; Zuo, W.; Zhang, L. Toward convolutional blind denoising of real photographs. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1712–1722.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
