
Neural Processing Letters (2023) 55:6027–6041

https://doi.org/10.1007/s11063-022-11125-9

CAFNET: Cross-Attention Fusion Network for Infrared and Low Illumination Visible-Light Image

Xiaoling Zhou1,2 · Zetao Jiang1 · Idowu Paul Okuwobi1

Accepted: 11 December 2022 / Published online: 30 December 2022


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022

Abstract
Due to their sampling and pooling operations, deep learning-based infrared and visible-light
image fusion methods often suffer from detail loss, especially under low illumination.
Therefore, we propose a novel cross-attention fusion network (CAFNET) that fuses infrared
and low-illumination visible-light images based solely on the first layer of the pre-trained
VGG16 network. First, features with the same size as the source images are extracted by the
first layer of pre-trained VGG16. Then, based on the extracted features, cross attention
is calculated to distinguish the differences between the infrared and visible-light images,
and spatial attention is computed to reflect the characteristics of the infrared and visible-light
images. After that, weight maps are obtained by modulating the cross attention and
spatial attention, and the source images are pre-fused based on these maps. Finally, Gaussian blur-based
detail injection is performed to further enhance the details of the pre-fused image.
Experimental results show that, compared with traditional multi-scale and state-of-the-art
deep learning-based fusion methods, our approach achieves better performance in both
subjective and objective evaluations.

Keywords Feature extraction · VGG16 network · Cross attention · Spatial attention

1 Introduction

Infrared images contain abundant thermal information that is not affected by illumination but lack
high-frequency components and detail information; visible-light images have rich detail
information but are greatly affected by illumination [1]. The fusion of infrared and visible-light
images can combine their advantages, which is important for target detection and pattern

Corresponding author: Zetao Jiang
zetaojiang@guet.edu.cn
Xiaoling Zhou
xiaolingzhou_p@163.com
Idowu Paul Okuwobi
paulokuwobi@guet.edu.cn
1 Guilin University of Electronic Technology, Jinji Road, Guilin 541000, Guangxi, China
2 Guilin University of Aerospace Technology, Jinji Road, Guilin 541004, Guangxi, China


recognition in many fields such as the modern military and medical imaging. However, both high- and
low-frequency components are affected under low illumination. Researchers have been
exploring how to use the complementary characteristics of infrared and visible-light images
to obtain rich details under low illumination and to expand their applications in military
and civilian fields. The key to visible-light and infrared image fusion is extracting the
details and the salient target areas of the source images and keeping both of them in the fused
image [2].
Currently, image fusion methods can be broadly classified into two categories: traditional
and deep learning-based. Traditional methods mainly include multi-scale
decomposition-based fusion (MDF), region-based fusion (RF), and global optimization-based
fusion (GOF). The key to MDF methods is designing the decomposition so as
to improve detail-extraction ability, e.g., the non-subsampled contourlet transform (NSCT)
[3], multi-resolution singular value decomposition [4], anisotropic diffusion [5], and fourth-order
partial differential equations [6]. However, the decomposition and reconstruction operations
may cause detail loss and distortion. RF methods perform fusion based on saliency
regions derived from a suitable region-segmentation algorithm. By segmenting the source
images into object and background regions, different fusion rules can be designed for different
regions [7]. RF is superior in detail preservation but may be biased toward infrared highlighted
objects or visible scene details because of the limited image features used when calculating the
saliency map [8–10]. GOF methods use a flexibly designed loss function to adjust the
weights of the infrared and visible-light images during fusion [11, 12], which alleviates the
information-asymmetry problem of RF methods to some extent. However, finding the optimal
value is time-consuming and may suffer from convergence problems; moreover, an improper loss
function may cause distortion and over-enhancement.
Deep learning-based fusion approaches can be roughly divided into two categories: hybrid
fusion (HF) and end-to-end fusion (ETEF). HF methods usually combine a deep neural
network with traditional fusion approaches. Liu et al. [13] proposed a pyramid-based
fusion framework for infrared and visible-light images in which the fusion weights of each
detail layer are adaptively adjusted by the structural similarity derived from a Siamese Convolutional
Neural Network (SCNN). Although the quality of the fused image is noticeably improved,
the method inherits the disadvantages of multi-scale decomposition, so
detail loss and distortion inevitably occur in the fused image. Li et al. [14] adopted a pre-trained ResNet
to extract features of the infrared and visible-light images, then applied zero-phase component
analysis and the l1-norm to these features to construct a weight matrix, based on which
weighted-average fusion was performed. HF methods based on pre-trained deep
neural networks eliminate the time-consuming training process, but the loss of scene
details and low contrast remain their main problems, for two reasons: the down-sampling
operations of the ResNet feature extractor discard some details of the source
images, and computing the weight matrix without considering the complementary properties
of the infrared and visible images loses some salient details of the source images. ETEF
methods perform fusion entirely within a deep neural network, including generative
adversarial network-based fusion [15–18] and encoder-decoder-based fusion [19, 20].
For these methods, loss-function design is crucial: since there is no ground
truth for infrared and visible-light image fusion, structural-similarity and
gradient-based indices are usually used instead. Consequently, they face an information-imbalance
problem. To address it, Xu et al. [21] proposed the U2Fusion network, which
uses features extracted by a VGG19 network to adaptively adjust the loss function; however,
detail loss still exists.


In summary, region-based fusion methods are suitable for infrared and visible-light image
fusion but need more image features for saliency-region segmentation to improve detail
preservation. Deep learning-based methods yield relatively natural fused images because their
frameworks extract abundant features, but sampling and pooling operations lead to
detail loss, especially under low illumination. However, the first layer of the VGG16 network [22]
extracts abundant common features with the same size as the source image, so
it is suitable for image fusion because sampling-induced detail loss is avoided. Considering that
VGG16 pre-trained on ImageNet can extract common image features that are
sufficient for our fusion task, so that the pre-training step can be omitted, we propose
a cross-attention fusion network (CAFNET): based on the features extracted by the first layer
of pre-trained VGG16, we calculate cross attention and spatial attention to reflect the
complementary and inherent information of the source images, respectively, and design
a novel two-step fusion strategy upon them. Experimental results show that the fusion images of CAFNET
are natural, with abundant details preserved; in particular, details in low-illumination areas remain
crisp thanks to the proposed cross attention.
The main contributions of this paper are:
(1) Cross attention: considering that complementary information can be distinguished by
cross-comparing the source images simultaneously, we propose to calculate the
similarity between the features of the infrared and visible-light images and term it cross attention.
Since the features extracted by pre-trained VGG16 come from 64 different convolution
kernels, the derived cross attention reflects details of the source images that are less affected
by illumination.
(2) Two-step fusion strategy: we propose to fuse images in two steps: pre-fuse the source
images with the CAFNET network, then inject details from the source images to improve the
quality of the pre-fused image. Since the fusion strategy of CAFNET is a weighted fusion,
detail smoothing inevitably occurs in the fused image. We find
that the saliency map derived from the VGG16 network acts similarly to a Gaussian blur;
it is therefore feasible to extract the details of the source images by subtracting their
Gaussian-blurred versions from the originals and to add these details directly to the pre-fused image.
This effectively improves the quality of the fused image without introducing artifacts.
The paper is organized as follows. Section 2 introduces the framework of the proposed
CAFNET and all of its component modules in detail. Section 3 presents experiments and the
corresponding analysis. Section 4 concludes the paper.

2 The proposed CAFNET

To address the detail-loss problem under low illumination, we propose the cross-attention fusion
network, which mainly includes a feature extraction module, a cross attention module, an attention
allocation module, a spatial attention module, and a weighted fusion module. First, we apply the shallow
layer of pre-trained VGG16 to extract features of the infrared and visible-light images; the outputs
are the same size as the source images, avoiding the detail loss caused by sampling
and pooling operations. Second, we calculate the complementary information of the infrared
and visible-light images with our proposed cross attention module. It is worth mentioning
that low-illumination scene details are highlighted thanks to cross attention. Then,
we design an attention allocation module to highlight the advantageous areas of the original images
according to their contribution to the cross attention. Meanwhile, spatial attention is computed
to reflect the characteristics of the infrared and visible-light images.


Fig. 1 Architecture of the proposed CAFNET. Ft_vis and Ft_ir denote the features extracted from VGG16 conv1_1; SA_vis and SA_ir denote the spatial attention of the visible-light and infrared images, respectively; CA_vis and CA_ir denote the allocated cross attention of the visible-light and infrared images, respectively; W_vis and W_ir denote the weight maps for the visible-light and infrared images, respectively; Pre_fused_image is the pre-fused image; F_image is the final fused image

Finally, to efficiently preserve the details of the source images, we modulate the cross attention with
the spatial attention to construct saliency matrices, based on which the source images are weight-fused. Additionally,
to solve the detail-smoothing problem of weighted-average fusion, we place a detail
injection module on the periphery of CAFNET to further enhance the details and obtain the
final fused image. The overall structure of the proposed CAFNET is shown in Fig. 1.
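To make the feature-extraction step concrete, the following is a minimal PyTorch/torchvision sketch (not the authors' released code) of taking conv1_1 features from a pre-trained VGG16. Replicating the grayscale input to three channels, skipping ImageNet normalization, and the `weights` enum of the newer torchvision API are our assumptions.

```python
import torch
import torchvision

def conv1_1_features(img):
    """img: grayscale tensor of shape (H, W) in [0, 1]; returns (64, H, W) features."""
    # Load ImageNet-pre-trained VGG16; only its first layer is used.
    vgg = torchvision.models.vgg16(
        weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1)
    conv1_1 = vgg.features[0]      # Conv2d(3, 64, 3x3, padding=1): no pooling, no downsampling
    conv1_1.eval()
    # Replicate the single gray channel to 3 channels (assumed handling of grayscale input).
    x = img.unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)   # shape (1, 3, H, W)
    with torch.no_grad():
        feat = conv1_1(x)          # shape (1, 64, H, W); same spatial size as the input
    return feat.squeeze(0)
```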

2.1 Proposed Cross Attention

Mining the complementary information of the infrared and visible-light images is crucial for calculating
saliency regions (also called the saliency map). Local features, e.g., contrast and gradient,
are used to represent saliency information, and commonly only two or three local features
are adopted simultaneously to calculate the saliency map. However, using few features biases
the saliency map toward a specific feature. Moreover, calculating the saliency map
separately from each image's own features cannot reflect the complementary
information between them. Therefore, we propose to cross-calculate the features of the source images so as
to capture their complementary saliency information. The specific calculation is as follows:
the first layer of pre-trained VGG16 is used to extract 64-channel features from the infrared
and visible-light images, respectively, so each pixel of the source images can be represented by
a 64-d vector I_{i,j}^{64×1} or V_{i,j}^{64×1}; next, the similarity of I_{i,j}^{64×1} and V_{i,j}^{64×1} is "cross"-calculated pixel by
pixel, and the output is termed cross attention. The specific structure of the cross attention
module is shown in Fig. 2. Considering that cosine similarity is insensitive to absolute distance,
while the difference in gray value is an important index for measuring the similarity of infrared
and visible images, we design a modified Manhattan distance to calculate the similarity as follows:


Fig. 2 Structure of the cross attention module: cross-calculate the difference between the infrared and visible-light images in the feature space extracted from pre-trained VGG16

Fig. 3 Output of the cross attention module. (a) Infrared image; (b) Visible image; (c) Cross attention

Sim_{i,j}(I, V) = \frac{\left| I_{i,j}^{64\times 1} - V_{i,j}^{64\times 1} \right|}{\left| I_{i,j}^{64\times 1} \right| \times \left| V_{i,j}^{64\times 1} \right|}    (1)

where I_{i,j}^{64×1} and V_{i,j}^{64×1} represent the 64-d feature vectors extracted by VGG16 for the pixel
at position (i, j).
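A minimal NumPy sketch of Eq. (1), assuming |·| denotes the l1 norm of the 64-d feature vector (consistent with the Manhattan-distance interpretation); the small eps term is added here only to avoid division by zero and is not part of the equation.

```python
import numpy as np

def cross_attention(feat_ir, feat_vis, eps=1e-8):
    """Modified Manhattan-distance similarity of Eq. (1), computed pixel by pixel.

    feat_ir, feat_vis: (64, H, W) NumPy arrays of conv1_1 features
    (e.g., the torch features converted with .numpy()).
    Returns an (H, W) cross-attention map; brighter values mark larger
    differences between the infrared and visible-light features.
    """
    num = np.abs(feat_ir - feat_vis).sum(axis=0)                 # l1 norm of (I - V) over 64 channels
    den = np.abs(feat_ir).sum(axis=0) * np.abs(feat_vis).sum(axis=0) + eps
    return num / den
```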
To intuitively discuss the effectiveness of cross attention, we present the output of the cross
attention module for a typical pair of infrared and visible-light images; the results are shown in Fig.
3. Fig. 3(c) shows the cross attention (CA) in graphic form, where brighter areas indicate
greater differences between the source images. We can see that the regions containing humans and cars,
as well as scene details under low illumination, are all highlighted in the CA image, which indicates that
regions with complementary information can be distinguished by our cross attention.

2.2 Attention allocation

As mentioned above, the CA image highlights the complementary information of the infrared
and visible-light images. However, we need an algorithm to judge which source image is the major
contributor to CA and put more weight on it, so that the advantageous details of the original images
are preserved in the fused image. Therefore, we propose an attention allocation module to
allocate the cross attention to the infrared image (denoted CA_ir) and the visible-light image
(denoted CA_vis), respectively.


Fig. 4 Structure of the attention allocation module. Areas with advantageous details are highlighted in CA_vis and CA_ir

Fig. 5 Output of the attention allocation module. (a) Infrared image; (b) Visible image; (c) CA_ir; (d) CA_vis. Areas with abundant details under low illumination are highlighted in the CA_vis map

To put more attention on the low-illumination visible-light image and
utilize the complementary properties, we calculate the local similarity between the visible-light
image and the cross attention image as CA_vis, and set the complement of CA_vis as CA_ir.
The specific structure of the attention allocation module is shown in Fig. 4. Since the
convolution kernel size of VGG16 is 3×3, we set the local window size to 3×3 as well.
The allocation of cross attention is given as:

CA\_vis_{i,j} = \left| CA_{i,j}^{3\times 3} - I\_vis_{i,j}^{3\times 3} \right|    (2)

CA\_ir = 1 - \mathrm{Norm}(CA\_vis)    (3)

where CA_{i,j}^{3×3} denotes the cross-attention values contained in a 3×3 sliding window centered on
the pixel at position (i, j), I_vis_{i,j}^{3×3} denotes the gray values of the visible-light image contained
in the 3×3 sliding window centered on the pixel at position (i, j), and Norm represents the normalization
operation.
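The sketch below illustrates Eqs. (2)-(3) under one reading of the 3×3 window operation, namely a local box average of absolute differences between the cross attention and the visible gray values; the [0, 1] scaling of the visible image and the min-max form of Norm are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def allocate_attention(ca, img_vis):
    """Attention allocation of Eqs. (2)-(3).

    ca:      (H, W) cross-attention map.
    img_vis: (H, W) visible-light image, scaled to [0, 1] here (assumed).
    Returns (ca_vis, ca_ir).
    """
    # Eq. (2), interpreted as a 3x3 local comparison: average absolute
    # difference between cross attention and visible gray values per window.
    ca_vis = uniform_filter(np.abs(ca - img_vis), size=3)   # 3x3 window, matching the conv kernel size
    # Eq. (3): complement of the normalized CA_vis (min-max normalization assumed).
    ca_vis_n = (ca_vis - ca_vis.min()) / (ca_vis.max() - ca_vis.min() + 1e-8)
    ca_ir = 1.0 - ca_vis_n
    # Eq. (2) leaves CA_vis unnormalized; whether to normalize it before
    # the modulation step is not specified in the paper.
    return ca_vis, ca_ir
```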
Figure 5 presents the allocated CA_ir and CA_vis in graphic form for a typical pair of
infrared and visible-light images. Likewise, for the CA_ir and CA_vis images, brighter areas
indicate that greater weights (or attention) will be paid to them. From Fig. 5(c), we can see that
more weight is given to the regions with humans and cars in the CA_ir
image, which is consistent with human visual perception. From Fig. 5(d), it can be seen that,
although the details of the road and constructions are difficult to recognize due to
low illumination, larger weights are still assigned to these areas in the CA_vis
image, which helps preserve the visible-light image details.


Fig. 6 Output of the spatial attention module. (a) Infrared image; (b) SA_ir; (c) Visible-light image; (d) SA_vis. The intrinsic characteristics of the original images are presented, but the details under low illumination are hard to distinguish

2.3 Spatial attention module and modulation

The complementary information of the infrared and visible-light images is specifically highlighted in
CA_vis and CA_ir, but directly fusing the source images based on CA_vis and CA_ir would produce
an unnatural fusion image with shadows and unexpected contours. Spatial attention can reflect
the intrinsic properties of the image itself with smooth edges and regions. Therefore, we
modulate the cross attention and the spatial attention together to form saliency weight maps, based
on which the source images are weight-fused.

2.3.1 Spatial Attention module

To integrate the extracted 64-channel features, we calculate the spatial attention in two steps:
first, a ReLU operation is performed on each feature channel, discarding negative features; then
the l1-norm of each feature vector is calculated pixel by pixel, forming the spatial attention maps
SA_vis and SA_ir:

SA\_vis = \sum_{c=1}^{64} \left| \max\left(0, Ft\_vis^{64\times H\times W}\right) \right|    (4)

SA\_ir = \sum_{c=1}^{64} \left| \max\left(0, Ft\_ir^{64\times H\times W}\right) \right|    (5)

where |·| denotes the absolute-value operation. According to Eqs. (4) and (5), the
calculation of spatial attention is similar to an averaging operation, so the details of the spatial attention
are smoothed, which may deteriorate the fusion result. To increase the contrast, we
use an exponential function to stretch SA_vis and SA_ir. Figure 6 shows the output of the
spatial attention module for a pair of infrared and visible-light images. We can see that the
spatial attention of the infrared image (Fig. 6b) clearly highlights the infrared objects and the background
contours, while the spatial attention of the visible-light image (Fig. 6d) highlights the bright
objects and some of the scene details, but the details under low illumination are hard to
recognize. Therefore, we modulate the cross attention and spatial attention together to
form the final saliency weight maps.
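A sketch of Eqs. (4)-(5) plus the exponential stretch follows; the exact stretch function and its parameter are not given in the paper, so the form and the `alpha` value below are assumptions.

```python
import numpy as np

def spatial_attention(feat, alpha=1.0):
    """Spatial attention of Eqs. (4)-(5): channel-wise l1 norm of ReLU-ed features,
    followed by an exponential stretch to raise contrast.

    feat: (64, H, W) conv1_1 features as a NumPy array.
    """
    relu = np.maximum(feat, 0.0)       # discard negative responses (ReLU)
    sa = np.abs(relu).sum(axis=0)      # l1 norm over the 64 channels, per pixel
    sa = sa / (sa.max() + 1e-8)        # scale to [0, 1] before stretching (assumed)
    return np.exp(alpha * sa)          # exponential contrast stretch (form assumed)
```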

2.3.2 Modulation module

As mentioned above, spatial attention reflects the intrinsic texture details and brightness
distribution of the source images themselves, while cross attention reflects their complementary

Fig. 7 Output of the modulation module. (a) Infrared image; (b) W_ir, the final weight map for the infrared image; (c) Visible-light image; (d) W_vis, the final weight map for the visible-light image

information by considering both of them. Hence, we propose a modulation module to combine
the advantages of spatial attention and cross attention, so as to form the final saliency maps
W_vis and W_ir:

W\_vis = SA\_vis \times (1 + CA\_vis)    (6)

W\_ir = SA\_ir \times (1 + CA\_ir)    (7)
The modulation operation makes W_vis and W_ir reflect both the texture details and the
complementary information well. Figure 7 shows the output of the modulation module for a
typical pair of infrared and visible-light images. Compared with CA_ir in Fig. 5 and SA_ir in
Fig. 6, both the thermal objects and the construction details receive larger weights
in W_ir, which indicates that the calculated weight map W_ir integrates the advantages of
cross attention and spatial attention. Compared with CA_vis in Fig. 5 and SA_vis in Fig. 6,
in W_vis the words on the billboard are clear because it inherits the advantages of
SA_vis, and the areas under low illumination still receive larger weights because it inherits the advantages
of CA_vis, which further indicates that the modulation module integrates the advantages of
cross attention and spatial attention.
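For completeness, the modulation of Eqs. (6)-(7) amounts to a single element-wise operation, sketched below.

```python
def modulation(sa_vis, sa_ir, ca_vis, ca_ir):
    """Eqs. (6)-(7): weight maps combining spatial and cross attention."""
    w_vis = sa_vis * (1.0 + ca_vis)
    w_ir = sa_ir * (1.0 + ca_ir)
    return w_vis, w_ir
```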

2.4 Fusion rule and detail injection

As mentioned above, the proposed saliency weight maps W_vis and W_ir focus on
areas with abundant details, so we adopt a weighted fusion rule to pre-fuse the original images as
follows:

Pre\_fuse = W\_vis \times I\_vis + W\_ir \times I\_ir    (8)

However, the weighted-average operation inevitably blurs the details of the fusion image.
We therefore design a detail injection module on the periphery of the CAFNET network to enhance
the details of the pre-fused image. Considering that the weighted average acts similarly to a Gaussian
blur, we design a Gaussian detail-extraction module that obtains the details by subtracting the
Gaussian-blurred image from the original. Since CAFNET has already preserved abundant details of the
source images in the pre-fused image, injecting the maximum details of the source images into
the pre-fused image strengthens the details without introducing false contours. The final
fused image F_img is given as:

F\_img = Pre\_fuse + \max\left( Img\_vis - G(Img\_vis),\; Img\_ir - G(Img\_ir) \right)    (9)

where G(·) denotes the Gaussian blur operation. Figure 8 shows the final fused
image of our proposed method for a typical pair of infrared and visible-light images. It is
obvious that the advantageous information of the source images has been retained in the fused image
naturally.


Fig. 8 Fusion result of the CAFNET network. (a) Infrared image; (b) Visible-light image; (c) Final fused image; (d) Magnified details of the fused image

Fig. 9 The 8 groups of infrared and low illumination visible-light images from TNO dataset. The left side is
the infrared image and right side is the visible-light image

The colored rectangles are used to emphasize our method's detail-preservation
ability under low illumination: the letters on the lamp (marked in red), the arrow sign
(marked in purple), and the contours of the chairs (marked in blue) are crisp, which validates that our
method is superior in detail preservation.
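Putting the two fusion steps together, the sketch below follows Eqs. (8)-(9); the pixel-wise weight normalization, the Gaussian sigma, and the final clipping are our assumptions, since the paper does not specify them.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fuse(img_vis, img_ir, w_vis, w_ir, sigma=2.0):
    """Two-step fusion: weighted pre-fusion (Eq. 8) plus Gaussian detail injection (Eq. 9).

    img_vis, img_ir: (H, W) source images in [0, 1].
    w_vis, w_ir:     (H, W) saliency weight maps from the modulation module.
    sigma:           Gaussian-blur width; not reported in the paper, so assumed here.
    """
    # Normalizing the weights pixel-wise so they sum to one is our assumption;
    # Eq. (8) uses the weight maps directly.
    s = w_vis + w_ir + 1e-8
    pre_fused = (w_vis / s) * img_vis + (w_ir / s) * img_ir          # Eq. (8)

    # Eq. (9): high-frequency details as "image minus its Gaussian blur",
    # injected via the pixel-wise maximum of the two detail layers.
    det_vis = img_vis - gaussian_filter(img_vis, sigma)
    det_ir = img_ir - gaussian_filter(img_ir, sigma)
    fused = pre_fused + np.maximum(det_vis, det_ir)
    return np.clip(fused, 0.0, 1.0)                                  # clipping assumed
```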

3 Experiments and analysis

To validate the performance of our algorithm, we collect 8 groups of infrared and visible
images under low illumination from the TNO image fusion dataset [23],
including Camp_1826, Kaptein_1123, meting016, 37rad, Street, Road, Movie_01, and
Kaptein_1654. They cover a variety of scenes such as pedestrians, streets, and buildings.
For ease of presentation, the 8 groups of input images are adjusted to the same size, as shown
in Fig. 9. The background details of the visible-light images are clearer than those of the
corresponding infrared images; however, for areas under low illumination, the
details are hard to distinguish in the visible-light image, while part of the contours can be recognized
in the infrared image. Contours of thermal objects are crisp in the infrared images but difficult to
distinguish in the visible-light images, although some details of the thermal objects are captured
in the visible-light images.
The proposed algorithm is compared with the latest and classic fusion algorithms from both
subjective and objective points of view, including the traditional multi-scale fusion methods MSVD
[4], ADF [5], and FPDE [6], the hybrid optimization fusion method IVF [11], and the deep learning-based
fusion methods ResNet-ZCA [14] and U2F [21]. All codes for these algorithms are obtained from the original
authors or implemented based on publicly available code, and parameters are set to their defaults.
Among them, ResNet-ZCA is a typical fusion method based on a pre-trained ResNet network, and
U2F is a typical end-to-end generative network. Since our fusion framework only adopts the
first layer of pre-trained VGG16 (VGG16_conv1_1) to extract image features, there is no need
to adjust the input images to a certain size and no need for model training.


Fig. 10 Experiment results on 8 groups of images from the TNO dataset. (a) Fusion results of MSVD; (b) Fusion results of ADF; (c) Fusion results of FPDE; (d) Fusion results of ResNet-ZCA; (e) Fusion results of U2F; (f) Fusion results of IVF; (g) Fusion results of CAFNET

The implementation of the entire framework does not require additional parameters to be set, making it simple
and effective.

3.1 Subjective evaluation

Subjective evaluation assesses the fusion performance from personal visual perception,
such as color, lightness, fidelity, and other more abstract judgments. Though it may vary with
observers, the subjective impression is consistent in general. Therefore, subjective evaluation is
significant for evaluating fusion results.
Figure 10 shows the fusion results of the compared methods and our proposed method, where
the parts framed in red represent low-illumination details that deserve particular attention and
are zoomed in at the bottom-right corner. All the fusion methods
preserve the bright thermal objects from the infrared image; only the brightness level of the highlighted
target differs slightly. However, the ability to retain the details of the source images differs
considerably, and our method is clearly better.


The first row shows the fusion results on Camp_1826:
the details of the construction framed in red are crisp in our proposed CAFNET,
while in IVF they are over-enhanced. The second row shows the fusion results on
Kaptein_1123: the branches in the sky can be clearly distinguished in CAFNET,
whereas in the multi-scale methods MSVD, ADF, and FPDE the branches are smoothed.
Besides, distorted branches appear in the fusion images of IVF and U2F. The third row shows the
fusion results on meting016: the soldier's facial features are crisp in CAFNET
compared with the other methods. The fourth row shows the fusion results on 37rad: the chair beside
the beach can be recognized and the contour of the thermal object is clear in
CAFNET. The fifth row shows the fusion results on Street: the chairs on the block can be identified
in CAFNET; although the contours of the chairs are clearer in FPDE, its visual perception is
poor and artifacts are introduced. The sixth row shows the fusion results on Road: the
rails of the building are all preserved in our method but lost in U2F. The
seventh row shows the fusion results on Movie_01: the window of the building can be recognized in
CAFNET, while in the other methods it is ambiguous except for U2F. The last row shows the
fusion results on Kaptein_1654: the details of the branches are relatively clear in both U2F and
CAFNET, while over-enhancement exists in IVF. In summary, our proposed
CAFNET network better preserves the thermal objects and background details of the infrared
and visible-light images; in particular, the details of the low-illumination visible-light image
are strengthened in the fusion image to some extent.

3.2 Objective evaluation

To ensure a comprehensive evaluation, four objective quality metrics are used for comparison:
information entropy (EN), feature mutual information (FMI) [24],
the image quality metric Qab/f, and structural similarity (SSIM). Among them, EN indicates the
information richness of the fusion image: the greater the EN, the richer the information.
FMI considers both mutual information and neighborhood features to evaluate detail-preservation
ability; Qab/f mainly measures the retention of edge information from the source images;
SSIM evaluates the similarity between the fused image and the source images in terms of lightness, contrast, and
structure.
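For reference, EN and SSIM can be computed as sketched below (scikit-image assumed for SSIM); FMI and Qab/f require dedicated implementations that are not shown here, and averaging SSIM over both source images is a common convention that the paper does not explicitly state.

```python
import numpy as np
from skimage.metrics import structural_similarity

def entropy(img):
    """Information entropy (EN) of an 8-bit grayscale image."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def ssim_to_sources(fused, img_ir, img_vis):
    """Average SSIM of the fused image against both sources (assumed convention),
    for uint8 grayscale inputs."""
    s_ir = structural_similarity(fused, img_ir, data_range=255)
    s_vis = structural_similarity(fused, img_vis, data_range=255)
    return 0.5 * (s_ir + s_vis)
```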
The comparison results for the quality metrics are shown as dot-line graphs in Fig. 11, where the
proposed CAFNET method is marked with a bold green line. The CAFNET
curve lies essentially on top of the other curves, which indicates that our method is
superior to the comparison methods from an objective point of view.
The average values of the four quality metrics over the fused images of the 8 groups of infrared
and visible images are shown in Table 1. The best values are indicated in red; the second-best
values are denoted in blue. Compared with the other six fusion methods, the proposed CAFNET
network exhibits better performance. Our method achieves the best average values for three metrics (FMI,
Qab/f, and SSIM), which indicates that it has the best detail-preservation ability.
Our method achieves the third-best average entropy, indicating that the information richness of our
fusion image is also competitive.
The above subjective and objective discussions demonstrate that, although the structure of
our CAFNET network is relatively simple, the quality of our fused images is the best. Owing
to the design of the cross attention, complementary information can be distinguished and helps
enrich the details of the saliency weight map; meanwhile, low-illumination details receive
more attention, so the details of the source images are preserved and enhanced in the fused
image. In addition, the modulation of cross attention and spatial attention improves the visual


Table 1 The average values of four quality metrics over the fused images of the 8 groups of infrared and visible images collected from TNO

         MSVD    ADF     FPDE    ResNet  U2F     IVF     CAFNET
EN       6.374   6.408   6.409   6.338   6.819   6.785   6.541
FMI      0.890   0.881   0.878   0.895   0.890   0.886   0.896
Qab/f    0.287   0.398   0.358   0.281   0.408   0.359   0.495
SSIM     0.683   0.663   0.638   0.701   0.6547  0.608   0.705

Fig. 11 Comparison of four quality metrics over the fused images of the 8 groups of infrared and visible images. (a) Comparison results for EN; (b) Comparison results for FMI; (c) Comparison results for Qab/f; (d) Comparison results for SSIM

perception of the fusion image, as it considers the complementary information of the source images
and the intrinsic characteristics of the source images themselves simultaneously.

3.3 Ablation study

To further discuss the effectiveness of our proposed fusion scheme, we perform ablation
experiments: (1) CAFNET with the cross attention branch removed (called SA-inject), which
aims to validate whether the design of cross attention is beneficial for detail preservation;


Table 2 The objective evaluations of the ablation experiments on the 8 groups of infrared and visible images collected from TNO

                    EN      FMI     Qab/f   SSIM
SA-inject           6.488   0.891   0.414   0.698
CA-no-inject        6.521   0.896   0.391   0.703
Proposed CAFNET     6.541   0.896   0.435   0.705

Fig. 12 Comparison of ablation results for one group of infrared and visible-light images. The first row shows the original infrared (left) and visible-light (right) images; the second row shows the corresponding fusion results of the ablation experiments. (a) Fusion result of SA-inject; (b) Fusion result of CA-no-inject; (c) Fusion result of CAFNET

(2) CAFNET with the detail injection module removed (called CA-no-inject), which aims to determine
the major contributor to the performance improvement: cross attention or detail injection.
The corresponding objective results are shown in Table 2, where the best values are marked in red.
The values in row "SA-inject" are the lowest, which means that
removing the cross attention branch causes all the indicators to drop. The values in row "CA-no-inject"
are higher than those in row "SA-inject" except for Qab/f, indicating that
introducing cross attention improves the detail-preservation ability and that detail injection
is good for edge preservation. The values in row "CA-no-inject" are lower than those in row
"Proposed CAFNET", illustrating that the modulation of spatial attention and cross attention
aggregates their advantages and achieves satisfying fusion results.
Figure 12 shows one group of the ablation results, where the details with obvious
differences are framed in red. The first row shows the original infrared and visible-light images;
the facial features in the visible-light image (right) are recognizable but the illumination is
low. The second row shows the ablation results, from left to right: SA-inject, CA-no-inject, and the
proposed CAFNET. It is obvious that the facial details of SA-inject are blurred, more details
are presented in CA-no-inject with the help of cross attention, and even more edge details are preserved
in the proposed CAFNET under the combined effect of cross attention and detail injection.


The above objective and subjective evaluations of the ablation experiments validate that our
proposed cross attention preserves abundant details from the original images in the fusion
image, despite the low-illumination conditions.

4 Conclusion

To solve the detail-loss problem of infrared and low-illumination visible-light image fusion,
we propose a cross attention fusion network (CAFNET). Cross attention is designed to mine
the complementary information of the source images from features extracted by a pre-trained
VGG16 network; it is then modulated with spatial attention to obtain a saliency weight map,
based on which a fused image with satisfactory visual perception is obtained while the details
of the source images are preserved to the greatest extent. Compared with classic and state-of-the-art
methods, the proposed CAFNET performs better in both subjective and objective evaluations,
and the fusion image is consistent with human visual perception. However, due to the weighted fusion
rule, the brightness of thermal objects in the infrared image is weakened in our
method; further work should focus on the preservation of brightness information.
Acknowledgements This work was supported by the National Natural Science Foundation of China under
Grant Nos. 61876049 and 62172118, the Natural Science Key Foundation of Guangxi (2021GXNSFDA196002),
the Guangxi Key Laboratory of Image and Graphic Intelligent Processing under Grants GIIP2006,
GIIP2007, and GIIP2008, and the Innovation Project of Guangxi Graduate Education under Grants YCB2021070,
YCBZ2018052, and 2021YCXS071.

References
1. Liu Y, Zhou D, Nie R, Ding Z, Guo Y, Ruan X, Xia W, Hou R (2022) TSE_Fuse: two stage enhancement method using attention mechanism and feature-linking model for infrared and visible image fusion. Digit Signal Process. https://doi.org/10.1016/j.dsp.2022.103387
2. Zhao F, Zhao W, Yao L, Liu Y (2021) Self-supervised feature adaption for infrared and visible image fusion. Inform Fus 76:189–203. https://doi.org/10.1016/j.inffus.2021.06.002
3. He K, Zhou D, Zhang X, Nie R, Wang Q, Jin X (2017) Infrared and visible image fusion based on target extraction in the nonsubsampled contourlet transform domain. J Appl Remote Sens 11(1):015011. https://doi.org/10.1117/1.JRS.11.015011
4. Naidu V (2011) Image fusion technique using multi-resolution singular value decomposition. Def Sci J 61(5):479. https://doi.org/10.14429/dsj.61.705
5. Bavirisetti DP, Dhuli R (2015) Fusion of infrared and visible sensor images based on anisotropic diffusion and Karhunen-Loeve transform. IEEE Sens J 16(1):203–209. https://doi.org/10.1109/JSEN.2015.2478655
6. Bavirisetti DP, Xiao G, Liu G (2017) Multi-sensor image fusion based on fourth order partial differential equations. In: 2017 20th international conference on information fusion (Fusion), pp 1–9. https://doi.org/10.23919/ICIF.2017.8009719
7. Meher B, Agrawal S, Panda R, Abraham A (2019) A survey on region based image fusion methods. Inform Fus 48:119–132. https://doi.org/10.1016/j.inffus.2018.07.010
8. Cui G, Feng H, Xu Z, Li Q, Chen Y (2015) Detail preserved fusion of visible and infrared images using regional saliency extraction and multi-scale image decomposition. Opt Commun 341:199–209. https://doi.org/10.1016/j.optcom.2014.12.032
9. Zhang Y, Zhang L, Bai X, Zhang L (2017) Infrared and visual image fusion through infrared feature extraction and visual information preservation. Infrared Phys Technol 83:227–237. https://doi.org/10.1016/j.infrared.2017.05.007
10. Meher B, Agrawal S, Panda R, Dora L, Abraham A (2022) Visible and infrared image fusion using an efficient adaptive transition region extraction technique. Eng Sci Technol Int J 29:101037. https://doi.org/10.1016/j.jestch.2021.06.017
11. Li G, Lin Y, Qu X (2021) An infrared and visible image fusion method based on multi-scale transformation and norm optimization. Inform Fus 71:109–129. https://doi.org/10.1016/j.inffus.2021.02.008
12. Shen D, Zareapoor M, Yang J (2021) Multimodal image fusion based on point-wise mutual information. Image Vis Comput 105:104047. https://doi.org/10.1016/j.imavis.2020.104047
13. Liu Y, Chen X, Cheng J, Peng H, Wang Z (2018) Infrared and visible image fusion with convolutional neural networks. Int J Wavel Multiresol Inform Process 16(03):1850018. https://doi.org/10.1142/S0219691318500182
14. Li H, Wu X-J, Durrani TS (2019) Infrared and visible image fusion with ResNet and zero-phase component analysis. Infrared Phys Technol 102:103039. https://doi.org/10.1016/j.infrared.2019.103039
15. Ma J, Yu W, Liang P, Li C, Jiang J (2019) FusionGAN: a generative adversarial network for infrared and visible image fusion. Inform Fus 48:11–26. https://doi.org/10.1016/j.inffus.2018.09.004
16. Fu Y, Wu X-J, Durrani T (2021) Image fusion based on generative adversarial network consistent with perception. Inform Fus 72:110–125. https://doi.org/10.1016/j.inffus.2021.02.019
17. Song A, Duan H, Pei H, Ding L (2022) Triple-discriminator generative adversarial network for infrared and visible image fusion. Neurocomputing 483:183–194. https://doi.org/10.1016/j.neucom.2022.02.025
18. Fu Y, Wu X-J, Durrani T (2021) Image fusion based on generative adversarial network consistent with perception. Inform Fus 72:110–125. https://doi.org/10.1016/j.inffus.2021.02.019
19. Liu L, Chen M, Xu M, Li X (2021) Two-stream network for infrared and visible images fusion. Neurocomputing 460:50–58. https://doi.org/10.1016/j.neucom.2021.05.034
20. Li H, Wu X-J, Kittler J (2021) RFN-Nest: an end-to-end residual fusion network for infrared and visible images. Inform Fus 73:72–86. https://doi.org/10.1016/j.inffus.2021.02.023
21. Xu H, Ma J, Jiang J, Guo X, Ling H (2020) U2Fusion: a unified unsupervised image fusion network. IEEE Trans Pattern Anal Mach Intell 44(1):502–518. https://doi.org/10.1109/TPAMI.2020.3012548
22. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint. https://doi.org/10.48550/arXiv.1409.1556
23. Toet A (2014) TNO image fusion dataset. https://figshare.com/articles/TN_Image_Fusion_Dataset/1008029
24. Haghighat MBA, Aghagolzadeh A, Seyedarabi H (2011) A non-reference image fusion metric based on mutual information of image features. Comput Electr Eng 37(5):744–756. https://doi.org/10.1016/j.compeleceng.2011.07.012

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.
