Communication
The Circular U-Net with Attention Gate for Image Splicing
Forgery Detection
Jin Peng 1,†, Yinghao Li 1,†, Chengming Liu 1,* and Xiaomeng Gao 2

1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450001, China
2 International College, Zhengzhou University, Zhengzhou 450000, China
* Correspondence: cmliu@zzu.edu.cn; Tel.: +86-371-63886568
† These authors contributed equally to this work.

Abstract: With the advent and rapid development of image tampering technology, tampered images have become harmful to many aspects of our society, and image tampering detection has become increasingly important. Although current forgery detection methods have achieved some success, the scales of the tampered areas differ from one forgery image to another, and previous methods do not take this into account. In this paper, we argue that the inability of the network to accommodate tampered regions of various sizes is the main reason for low precision. To address this problem, we propose a neural network architecture called CAU-Net, which adds residual propagation and feedback, an attention gate and Atrous Spatial Pyramid Pooling with CBAM to the U-Net. The Atrous Spatial Pyramid Pooling with CBAM can capture information from multiple scales and adapt to differently sized target areas. In addition, CAU-Net alleviates the vanishing gradient issue and suppresses the weight of untampered regions, and CAU-Net is an end-to-end network without redundant image processing; thus, it detects suspicious images quickly. Finally, we optimize the proposed network structure through an ablation study, and the experimental results and visualizations demonstrate that our network performs better on CASIA and NIST16 than state-of-the-art methods.

Keywords: image forgery detection; residual propagation; attention gate

1. Introduction

With the advent and wide use of image forgery software in recent years, it has become increasingly easy for people to modify and even edit the content of images. Image forgery has a negative impact on many aspects of our lives, such as academic fraud and fake images. These phenomena draw attention to image forgery technology. There are many image forgery techniques [1–4], such as compositing, enhancement and retouching, but in general, they can be divided into three categories: copy-and-paste forgery, splicing forgery and removal forgery. Different types of tampering are not detected in the same way. In this paper, we specifically focus on splicing forgery detection. The process of image splicing forgery is to copy a part of one image into another image to compose a new tampered image. As Figure 1 shows, a deer from an unknown image is copied, and then the deer is pasted into Figure 1a (host image) to merge a new image, Figure 1b (forgery image). Figure 1c (ground-truth) indicates the tampered regions.

In order to address the splicing forgery issue, a great many related methods have been proposed. These methods can be divided into traditional approaches and deep learning-based approaches. The traditional methods mostly depend on a specific feature that can highlight the differences between the untampered areas and tampered areas, such as Color Filter Array (CFA) artifacts [5] and noise inconsistency [6]. However, these hand-designed features have limitations and lack representativeness. In addition, these methods are not robust to various attacks. Recently, the convolutional neural network
(CNN) has achieved great success in the field of computer vision. Its ability to extract image features adaptively has led increasing numbers of researchers to apply it to image tampering detection, so many CNN-based image forgery detection methods have been proposed in recent years [7–12]. Although these approaches achieve strong performance, they pay more attention to learning the inconsistency between the untampered areas and tampered areas while ignoring the different sizes of tampered areas across forgery images. Since targets of different sizes need different receptive fields to be separated from the background relatively easily, we believe that the inability of the network to accommodate forgery regions of various sizes and capture forgery features from multiple scales is the main reason for low precision. Previous work (e.g., DeepLab v2 [13]) proposed the Atrous Spatial Pyramid Pooling (ASPP) module, which can capture multi-scale information and is useful for segmenting objects of different scales. Benefitting from this, we introduce the ASPP module into the forgery detection network to improve its ability to segment tampered regions of different scales. However, simply adding an ASPP module is not enough.

Figure 1. The three images represent the host image, the forgery image and the ground-truth. (a) Host image. (b) Forgery image. (c) Ground-truth.

In order to address the above issues, we propose a network called the Circular U-Net with Attention Gate (CAU-Net). The network incorporates the features of [14] on the basis of [9]. Meanwhile, we introduce ASPP incorporating the Convolutional Block Attention Module (CBAM) [15], called the CBASPP module, into our network to capture information at multiple scales. We optimize the proposed network structure through an ablation study, and our network performs better on NIST16 [16] and CASIA [17] than previous methods. In summary, the key contributions of our work are as follows.
• We modify the ASPP module and apply the resulting CBASPP module to enhance the detection performance of CAU-Net. The CBASPP module samples the input feature map in parallel with atrous (dilated) convolutions at different sampling rates and then concatenates the results to expand the number of channels; thus, the CBASPP module can capture the context of an image at multiple scales.
• We introduce an ingenious mechanism, called an attention gate, between the corresponding encode and decode layers. It enlarges the weights of tampered regions while shrinking those of untampered regions, which enables the neural network to obtain better results.
• We use ResNet with Efficient Channel Attention (ECA) [18] instead of the regular ResNet to improve tampering detection performance without additional parameters.

2. Related Work
In the following, we briefly describe how our ideas were implemented and explain the corresponding novelties. Since it was discovered that forgery features can be extracted from the noise stream, an increasing number of forgery detection approaches choose to gather information from both the noise stream and the RGB stream. Zhou et al. [8] proposed a two-stream network, including an RGB stream and a noise stream generated by the SRM filter [19]. Hu et al. [20] also use a two-stream structure. In [20], feature fusion is adopted at an early stage, whereas [8] performs feature concatenation at a later stage. However, when the spliced regions of a forgery image are taken from the same type of camera, the noise information will be consistent, so the noise stream is useless. Several approaches (e.g., MFCN [7], MVSS [11], ET [12]) have been proposed to learn the forgery edge with deep learning. However, there is a gradient degradation issue as the network gets increasingly deep; thus, the effect of edge learning on deep networks is weak. RRU-Net [9] and MCNL-Net [10] proposed the ringed residual U-Net to solve the gradient degradation issue. Recently, forgery detection networks such as PSCCNet [21] and ObjectFormer [22] have been proposed to localize the tampered regions.

Although these methods achieve promising results, most of them overlook the variable scales of the tampered area, which leads to low precision. Our work is closest to RRU-Net, as we apply the CBASPP module to capture multi-scale information and introduce attention modules to suppress the weight of untampered regions. These modules are described in detail in the following.

3. Network Architecture Overview


We elaborate our network (CAU-Net) in this section. The structure of the proposed method is shown in Figure 2, and we discuss its modules one by one in the following sections. The purple ring structure represents the residual module, which is composed of residual propagation and residual feedback; details of these two mechanisms are introduced in Sections 3.1 and 3.2, respectively. The encode layers and decode layers are combined by the attention gate instead of simple summation to remove redundant information; the details are described in Section 3.3. The ringed structure is downsampled by MaxPool in the encode layers and upsampled by transposed convolution in the decode layers. The CBASPP module is applied in the bottom layer to obtain more feature information at multiple scales, and details of the structure are presented in Section 3.4. Finally, we present the structure of the ECA module in Section 3.5.

Figure 2. The overview of the proposed CAU-Net structure for the task of the location of tampered
regions.

3.1. Residual Propagation


For image forgery detection, the differences in essential image properties between host images and copied images are crucial. Using these differences, we can easily detect and locate tampered areas. However, as the network gets increasingly deep, image
essential properties fade away, and the minor differences between the tampered and untampered regions thus disappear. In order to address this problem, we introduce a classical mechanism called residual propagation [23] in each component block.
The component block is shown in Figure 3. It consists of two convolution blocks as well as residual propagation. The residual propagation mechanism is defined in Equation (1):

$$y_f = F(x, W_i) + M(x), \quad (1)$$

where $x$ and $y_f$ represent the input and output of the component block, respectively, as shown in Figure 3, and $W_i$ denotes the parameters of layer $i$. The function $F(x, W_i)$ is learnable; in this paper, it represents the feature map after two convolutional layers with ReLU. The function $M$ is a learnable linear mapping layer that alters the dimension of the input $x$ to match that of $F(x, W_i)$ so that the two can be added together. Compared with previous plain networks, the residual network introduces a new mechanism called a skip connection. Benefitting from this, more information from the previous residual block can flow unimpeded to the next residual block, which improves the information flow and allows the differences in the essential properties of the images to propagate through the network. Moreover, this avoids the vanishing gradient problem caused by excessive network depth.
The mechanism is just like the human brain's recall mechanism. When we learn a large amount of new knowledge, we may forget part of our previous knowledge; at this point, we need a recall mechanism to help us remember what we learned before. The network behaves similarly, using the residual propagation module to keep crucial information from being forgotten. Meanwhile, there is no doubt that if the differences in essential image properties between host images and copied images do not disappear, the performance of the network will be greatly improved.

Figure 3. Residual propagation.
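To make the mechanism concrete, the following PyTorch sketch shows one plausible form of the component block of Equation (1). The class name, channel arguments and 3 × 3 kernel sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualPropagation(nn.Module):
    """A sketch of the component block of Equation (1): y_f = F(x, W_i) + M(x)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # F(x, W_i): two convolutional layers with ReLU, as described above.
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # M(x): learnable linear mapping (1x1 conv) matching x's dimension to F(x, W_i).
        self.mapping = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x) + self.mapping(x)   # y_f = F(x, W_i) + M(x)
```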

3.2. Residual Feedback


According to the explanation above, it is obvious that if we can enhance the differences between the untampered and tampered regions, the detection accuracy of the network will improve. In [8], the proposed method chose to use the SRM filter to strengthen the noise feature. The method has some effect, but it is useful only for RGB image tampering detection. In addition, when tampered and untampered areas come from cameras of the same brand, the SRM filter will not work well because of the consistent noise. Moreover, the mechanism of residual propagation alone is not enough for the network to learn more differences in the essential properties of images. In order to further strengthen the differences, this paper introduces the residual feedback [9] block, an auto-learning mechanism, to further improve the performance of the network. In addition, a simple and useful attention mechanism is used in the residual feedback block to pay more attention to the input feature map. This mechanism uses the properties of attention to avoid losing key information and to suppress redundant, non-crucial information. The weights acquired from the input feature map are used to enlarge the differences in the essential properties of the images between the tampered and untampered regions. In the component block, the residual feedback method is shown in Figure 4, and its definition is presented in Equation (2):

$$y_b = (s(W(y_f)) + 1) \times x, \quad (2)$$

where $x$ and $y_b$ represent the input of the component block and the enhanced input, respectively, and $y_f$ is the output of the residual propagation step, Equation (1). $W$ represents a linear mapping that changes the dimension of $y_f$, and $s$ represents the sigmoid activation function.
Unlike residual propagation, residual feedback is more interested in the difference between the untampered and tampered regions in the input feature map. It can suppress the weights of the untampered areas while amplifying the weights of the tampered areas. Based on this, our network can easily segment out the tampered areas. The combination of residual propagation and residual feedback not only makes the network perform better but also shortens its training time. In summary, residual feedback has a meaningful effect:
• It enhances positive labeling features while suppressing negative labeling features, making the differences in the image's essential properties between the tampered and untampered regions increasingly large.

Figure 4. Residual feedback.
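Continuing the sketch above, the residual feedback of Equation (2) can be written as follows; again, the class name and the 1 × 1 mapping are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ResidualFeedback(nn.Module):
    """A sketch of Equation (2): y_b = (s(W(y_f)) + 1) * x."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # W: linear mapping (1x1 conv) changing y_f's dimension back to x's.
        self.mapping = nn.Conv2d(out_ch, in_ch, kernel_size=1)

    def forward(self, x: torch.Tensor, y_f: torch.Tensor) -> torch.Tensor:
        # s is the sigmoid; the "+ 1" keeps the original signal while
        # amplifying the tampered-region responses.
        return (torch.sigmoid(self.mapping(y_f)) + 1.0) * x
```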

3.3. Attention Gate


In the above subsections, we introduced two modules, residual propagation and residual feedback, to improve the performance of our network. However, residual propagation and residual feedback only work between the component blocks, and each pair of corresponding decode and encode layers is otherwise combined by simple summation, which carries a great deal of redundant information. If this redundant information is not eliminated, the difference between the tampered and untampered areas cannot be well represented. To address this issue, we introduce the attention gate block [14] between the encode layers and decode layers at the same level; the details are shown in Figure 5.

Figure 5. The architecture of the attention gate.


In Figure 5, the inputs $g$ and $x_l$ are transformed by two convolutional kernels $W_g$ and $W_x$ to obtain A and B, respectively. A and B are then added and passed through a ReLU operation to get C, and a convolution with kernel $\Psi$ is applied to obtain E; finally, the attention coefficients $\alpha$ are obtained by a sigmoid activation function and resampling. $\alpha$ and $x_l$ are multiplied so that attention is put on the tampered regions.
This mechanism is similar to the human eye: in daily life, we pay more attention to the areas that interest us. In the forgery detection task, adding the attention mechanism makes our network more interested in tampered regions. In addition, this mechanism uses soft attention instead of hard attention. Soft attention is differentiable, and differentiable attention can be learned by neural networks that calculate gradients and learn the attention weights through forward propagation and backward feedback; this helps the network learn more important features. The mechanism is also useful for identifying small tampered regions, because the attention gate does not focus on the whole image but operates on small local tampered regions. In summary, after the input feature map is weighted by the attention coefficients, the tampered region weights are amplified and the untampered region weights become smaller, so the accuracy of the network is improved.
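A minimal PyTorch sketch of such an attention gate, following the Attention U-Net formulation [14], is given below. It assumes the gating signal g has already been resampled to the spatial size of x_l; the names and channel sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """A sketch of the attention gate in Figure 5, following Attention U-Net [14]."""
    def __init__(self, g_ch: int, x_ch: int, inter_ch: int):
        super().__init__()
        self.W_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)   # A = W_g applied to g
        self.W_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)   # B = W_x applied to x_l
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)      # Ψ: collapse to one map (E)

    def forward(self, g: torch.Tensor, x_l: torch.Tensor) -> torch.Tensor:
        c = F.relu(self.W_g(g) + self.W_x(x_l))   # C = ReLU(A + B)
        alpha = torch.sigmoid(self.psi(c))        # attention coefficients α
        return alpha * x_l                        # reweighted encoder features
```

In the decoder, the gated output replaces the raw skip connection before the features are concatenated or summed with the upsampled decoder features.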

3.4. Atrous Spatial Pyramid Pooling with CBAM


Atrous Spatial Pyramid Pooling (ASPP) was first proposed in DeepLab V2 [13] to
acquire multi-scale information and to better segment objects at different scales. Moreover,
CBAM pays more attention to identifying target regions. Inspired by these, we combine
the ASPP module with the CBAM module into the CBASPP module. Its exact structure is
shown in Figure 6.

Figure 6. The architecture of Atrous Spatial Pyramid Pooling with CBAM.

Dilated convolutions with dilation rates of 4, 8 and 12 and a kernel size of 3 × 3 are used. The local features of the previous layer are associated with a wider field of view to prevent small target features from being lost during information transfer. As shown in Figure 6, from top to bottom, the first branch is a convolution with a filter size of 1 × 1, which aims to maintain the original receptive field. The second to fourth branches are dilated convolutions with different dilation rates, which perform feature extraction under different receptive fields. The fifth branch is the global average pooling of the input to obtain global features. Finally, the feature maps of the five branches are stacked in the channel dimension, and the information at different scales is fused by a 1 × 1 convolution to get the new feature map $F$. The feature map $F \in \mathbb{R}^{H \times W \times C}$ is then processed by the Channel Attention Module (CAM) and Spatial Attention Module (SAM)
to obtain the final feature map $M_s$. The detailed process is shown in Figure 6, and the formula of the CAM is shown in Equation (3):

$$M_c(F) = \sigma(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))), \quad (3)$$

where $\sigma$ represents the sigmoid function, $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$, and $r$ is the reduction ratio. $W_0$ and $W_1$ are the weights of the MLP; they are shared for both inputs, and a ReLU activation function follows $W_0$. Moreover, the formula of the SAM is shown in Equation (4):

$$M_s(F) = \sigma(f^{7 \times 7}([F^s_{avg}; F^s_{max}])), \quad (4)$$

where $\sigma$ represents the sigmoid function, and $f^{7 \times 7}$ is a convolution operation with a kernel size of 7 × 7. $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$ are the 2D feature maps obtained by two pooling operations on the channel-refined feature map.
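A compact PyTorch sketch of Equations (3) and (4) follows; the reduction ratio r = 16 and the 7 × 7 spatial kernel match the CBAM paper [15], but the exact configuration used in CAU-Net should be treated as an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """A compact sketch of Equations (3) and (4) from CBAM [15]."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Shared MLP W_1(ReLU(W_0(.))) realized as 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)  # f^{7x7}

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention, Eq. (3): MLP over average- and max-pooled descriptors.
        m_c = torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1))
                            + self.mlp(F.adaptive_max_pool2d(x, 1)))
        x = m_c * x
        # Spatial attention, Eq. (4): 7x7 conv over channel-wise avg/max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.spatial(s)) * x
```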
Since the scale and location of the tampered areas differ in each forgery image, forgery detection is challenging. The CBASPP module enlarges the receptive field and enriches feature information by sampling in parallel with dilated convolutions at multiple rates, and the image-level features are efficient at capturing global contextual information. The module considers the relationships between contexts to avoid segmentation errors caused by getting caught in local features and to improve the accuracy of image forgery detection. Moreover, it aggregates contextual information at multiple scales and enhances the ability of the network to identify tampered regions of different sizes, as sketched below.
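The branch layout described above can be sketched as follows; this sketch reuses the CBAM class from the previous block, and the channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBASPP(nn.Module):
    """A sketch of the CBASPP module in Figure 6: a 1x1 branch, three dilated
    3x3 branches (rates 4, 8, 12) and a global-average-pooling branch, fused
    by a 1x1 convolution and refined by the CBAM sketch above."""
    def __init__(self, in_ch: int, out_ch: int, rates=(4, 8, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1)]           # keep original receptive field
            + [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
               for r in rates]                                   # dilated convolutions
        )
        self.gap = nn.Sequential(nn.AdaptiveAvgPool2d(1),        # global features
                                 nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.fuse = nn.Conv2d(out_ch * 5, out_ch, kernel_size=1) # stack 5 branches, fuse
        self.cbam = CBAM(out_ch)                                 # CBAM from the sketch above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = F.interpolate(self.gap(x), size=x.shape[2:],
                          mode='bilinear', align_corners=False)
        f = self.fuse(torch.cat([b(x) for b in self.branches] + [g], dim=1))  # feature map F
        return self.cbam(f)                                                   # refined output
```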

3.5. Efficient Channel Attention


ECA [18] is an efficient channel attention module that adopts a local cross-channel interaction strategy without dimensionality reduction, effectively avoiding the negative effect that dimensionality reduction has on learning channel attention. In this paper, we apply the ECA module to the residual propagation block to enhance channel features and improve the performance of the forgery detection network. The details are shown in Figure 7.

Figure 7. The residual propagation block with the Efficient Channel Attention module.

After the two convolution operations with a kernel size of 3 × 3, we obtain aggregated features of size 1 × 1 × C by global average pooling (GAP). The ECA module then produces the channel weights by performing a fast one-dimensional convolution of size k, where k is determined adaptively by a mapping of the channel dimension C, and σ represents the sigmoid activation function.
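A sketch of the ECA computation in PyTorch is shown below; the mapping from C to the kernel size k uses the defaults (γ = 2, b = 1) of the ECA paper [18].

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECA(nn.Module):
    """A sketch of the ECA module [18]: GAP, a fast 1D convolution of adaptive
    size k across channels, and a sigmoid producing the channel weights."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                       # kernel size k must be odd
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.adaptive_avg_pool2d(x, 1)                 # GAP -> (N, C, 1, 1)
        w = self.conv(w.squeeze(-1).transpose(1, 2))    # 1D conv across the channel axis
        w = torch.sigmoid(w.transpose(1, 2).unsqueeze(-1))
        return x * w                                    # reweighted channels
```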

4. Experiments
In the above sections, we have shown the specific structure and design ideas of our method and analyzed the rationale of its modules from a theoretical perspective. In order to further demonstrate its performance, we conducted comprehensive experiments. The goal of our work is to detect, at the pixel level, whether a target image has been tampered with and to locate the tampered regions. Below, we present the details of our experiments. We introduce the specific experimental setup, such as the dataset and evaluation metrics, in Section 4.1 and then present results compared with previous methods in Section 4.2. Next, we perform an ablation study to demonstrate the effectiveness of each module in the network in Section 4.3. Then, we perform a robustness experiment on our network in Section 4.4. In the end, we show visualizations of the detection results predicted by various methods in Section 4.5.

4.1. Experimental Settings


Experimental Dataset: There are now many tampering datasets on the internet. We chose the CASIA [17] and Nist Nimble 2016 (NIST16) [16] datasets as the experimental datasets. CASIA contains two tampering types, copy-and-paste forgery and splicing forgery, and all images in this dataset are artificially produced, complex, realistic and not easily judged by human eyes; using it as part of our dataset makes our work more practical and better demonstrates network performance. NIST16 includes copy-and-paste forgery, splicing forgery and removal forgery; the tampering in this dataset is post-processed to hide visible traces. CASIA has 5610 images with sizes ranging from 240 × 600 to 800 × 600. NIST16 has 564 images, and their ground-truth masks are available for evaluation. To guarantee the authenticity of the experiment, we randomly selected 10% of each dataset as the test set, 10% as the validation set and 80% as the training set. We resized all images to a uniform size of 256 × 384. Meanwhile, we kept both TIFF and JPG images in the dataset to make our network more widely applicable.
Experimental Metrics: Appropriate evaluation metrics reflect the performance of the model well. For image tampering detection, the crucial evaluation is the accuracy of locating the forgery areas at the pixel level. In this experiment, we follow the evaluation metrics used in previous related work [22], namely the F1 score and the Area Under Curve (AUC) score. The F1 score combines recall and precision to measure the performance of the network. The AUC is the area under the ROC curve; ROC curves are generally used to evaluate the classification performance of a classifier. In tampering detection, we classify each pixel of an image as tampered or untampered; thus, we also use AUC as an experimental metric.
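As an illustration, pixel-level F1 and AUC can be computed as in the following sketch; the prediction map and ground-truth mask here are hypothetical placeholders.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical prediction map and ground-truth mask for one 256 x 384 image.
pred_probs = np.random.rand(256, 384)        # per-pixel tampering probabilities
gt_mask = np.random.rand(256, 384) > 0.5     # binary ground-truth mask

# Pixel-level AUC over raw probabilities; F1 over thresholded predictions.
auc = roc_auc_score(gt_mask.ravel(), pred_probs.ravel())
f1 = f1_score(gt_mask.ravel(), (pred_probs > 0.5).ravel())
print(f"F1 = {f1:.3f}, AUC = {auc:.3f}")
```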
Compared Methods: In order to test the performance of the proposed method and to ensure the authenticity of the experiment, we chose several strong prior forgery detection approaches as compared methods: SPAN [20], RGB-N [8], RRU-Net [9], MCNL-Net [10], PSCCNet [21] and ObjectFormer [22]. We describe them briefly below.
SPAN models the relationships between image patches at multiple scales through a pyramid of local self-attention blocks.
RGB-N adopts two parallel streams, an RGB stream and a noise stream, to detect forgery features and noise inconsistency within an image, respectively.
RRU-Net is an end-to-end image segmentation network for image forgery detection. It uses residual propagation and residual feedback to enhance the capacity of feature extraction and detects tampered images without any pre-processing or post-processing.
The structure of MCNL-Net is similar to RRU-Net. It adds the BAM module and MaxBlurPool module on the basis of RRU-Net and uses convolution kernels of different sizes for better feature extraction.
PSCCNet performs image tampering localization in a step-by-step, coarse-to-fine manner.
ObjectFormer models visually inconsistent information at the object level for tampering detection, taking advantage of a vision transformer.
Implementation Details: All images are resized to 256 × 384 before entering network training. Our network and the other compared detection methods were run on a server with an NVIDIA GeForce RTX 2080 Ti GPU. RRU-Net, MCNL-Net and CAU-Net were implemented in PyTorch 1.8.2. The details of the training process of our network are as follows. We used stochastic gradient descent (SGD) as the optimizer with randomly initialized weights. The batch size was 8, the momentum was 0.9 and the weight decay was 0.0005. For the first 80 epochs, we set the learning rate to 0.01; up to epoch 120, we reduced it to 0.005; after epoch 120, we set it to 0.001.
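The optimizer and learning-rate schedule described above can be reproduced roughly as follows; the placeholder model stands in for CAU-Net, whose full definition is beyond this sketch.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # placeholder for the CAU-Net model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0005)

def lr_lambda(epoch: int) -> float:
    # 0.01 for the first 80 epochs, 0.005 up to epoch 120, 0.001 afterwards.
    if epoch < 80:
        return 1.0
    if epoch < 120:
        return 0.5
    return 0.1

# scheduler.step() is called once per epoch during training.
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```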

4.2. Compared Detection Methods


To evaluate the practical effects of the method proposed in this paper, we chose SPAN,
RGB-N, RRU-Net, MCNL-Net, PSCCNet and ObjectFormer as the compared methods in
this experiment. In the above section, we presented the experimental metrics. In order to
ensure the fairness of the experiments, the parameters of the above comparison experiments
have been tuned to be optimal, and the best detection results of various methods are taken
for comparison.
We report the F1 score and AUC (%) in Table 1, from which we can observe that our network performs best on the CASIA dataset, with a 58.3% F1 score and 88.4% AUC. On the NIST16 dataset, our method outperforms ObjectFormer by 11.9% in terms of the F1 score, although its AUC on NIST16 is slightly lower than those of PSCCNet and ObjectFormer. Overall, the proposed network performs better than the other related approaches.

Table 1. The results (%) of the proposed method and the other methods on the testing set.

Methods             CASIA F1 Score   CASIA AUC   NIST16 F1 Score   NIST16 AUC
SPAN [20]           38.2             83.8        58.2              96.1
RGB-N [8]           40.8             79.5        72.2              93.7
RRU-Net [9]         45.2             79.8        85.1              92.3
MCNL-Net [10]       52.4             81.9        90.6              96.7
PSCCNet [21]        55.4             87.5        81.9              99.6
ObjectFormer [22]   57.9             88.2        82.4              99.6
Ours                58.3             88.4        94.3              97.3

Bold text represents the best results.

4.3. Ablation Study


The attention gate module removes redundant information between the corresponding encode and decode layers, the CBASPP module enables the network to extract image information at multiple scales, and the ECA module uses a 1D convolution to achieve cross-channel information interaction. In the above sections, we analyzed the effectiveness of these three modules from a theoretical perspective. In order to verify the contribution of each module, the ablation study compared the performance of the model with the attention gate removed, with the CBASPP module removed, with the ECA module removed and with all of them removed, and we evaluated the forgery detection performance on CASIA.
The experimental results are listed in Table 2. It is easy to observe that without the attention gate, CBASPP and ECA modules, the F1 score and AUC decreased by 13.1% and 8.6%, respectively, on the CASIA dataset. Without only the attention gate module, the F1 score and AUC decreased by 5.1% and 4.1%; without the CBASPP module, by 6.8% and 7.1%; and without the ECA module, by 3.8% and 7.5%. The degradation of the detection results verifies that these modules effectively improve the performance of our network.
Table 2. Experimental data of ablation study results on the CASIA dataset.

Attention Gate   CBASPP   ECA   F1 Score   AUC
×                ×        ×     45.2       79.8
×                ✓        ✓     53.2       84.3
✓                ×        ✓     51.5       81.3
✓                ✓        ×     54.5       80.9
✓                ✓        ✓     58.3       88.4

4.4. Robustness Evaluation


In the above, we compared our work with other methods and performed an ablation study. To further evaluate the robustness of our network, we apply different image attacks to the original images of the NIST16 dataset. The attacks include resizing, JPEG compression with quality factor η and Gaussian blur with kernel size κ. The parameters and forgery detection performance (F1 score and AUC) are shown in Table 3, from which we can observe that our network is robust to these different image attacks.
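For reference, the attacks of Table 3 can be implemented roughly as in the following OpenCV sketch; the function names are our own, and the parameters follow Table 3.

```python
import cv2
import numpy as np

def resize_attack(img: np.ndarray, scale: float) -> np.ndarray:
    h, w = img.shape[:2]
    return cv2.resize(img, (int(w * scale), int(h * scale)))

def gaussian_blur_attack(img: np.ndarray, kappa: int) -> np.ndarray:
    return cv2.GaussianBlur(img, (kappa, kappa), 0)   # kernel size κ

def jpeg_compress_attack(img: np.ndarray, eta: int) -> np.ndarray:
    # Encode to JPEG with quality factor η, then decode back to an image.
    ok, buf = cv2.imencode('.jpg', img, [int(cv2.IMWRITE_JPEG_QUALITY), eta])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```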

Table 3. The forgery detection results (%) under various attacks on the NIST16 dataset. F1 score and AUC are reported.

Attack                    F1 Score   AUC
No attack                 94.3       97.3
Resize (0.8×)             90.8       94.5
Resize (0.6×)             86.7       92.4
GaussianBlur (κ = 3)      86.9       92.8
GaussianBlur (κ = 7)      80.7       87.6
JPEGCompress (η = 100)    92.8       96.4
JPEGCompress (η = 80)     91.6       95.9

4.5. Visualization Results


We provided quantitative results in Table 1; however, numbers alone are not intuitive, so we also visualize the tampering detection results of the various methods. Since the code of ObjectFormer [22] and PSCCNet [21] is not publicly available, their detection results are not shown. To ensure authenticity, we randomly chose four sets of data from the CASIA test set as examples for the visualization. We show the detection results of the various methods in Figure 8.

In Figure 8, each row has a different meaning: the first row shows the forgery images; the second row shows the real tampered regions, i.e., the ground-truth images; the third row shows the predictions of the RRU-Net method; the fourth row shows the predictions of the MCNL-Net method; and the fifth row shows the predictions of the method proposed in this paper. From these visualizations, it is easy to see that RRU-Net performs well on ordinary tampered areas and can locate them roughly, but there are still some false detections and missed areas, and the method does not perform well on small tampered regions. Compared with RRU-Net, MCNL-Net performs better, with fewer false and missed detections; however, its predictions show that it also does not handle tiny tampered areas well. From a subjective perspective, the predicted visualizations demonstrate that our model is the best and most stable of the three methods: it not only locates the forgery areas more accurately but also produces sharper boundaries.
Figure 8. Visualization of the forgery detection results predicted by various methods. From top to
bottom, we show the forgery image, the GT mask and predicted results of RRU-Net, MCNL-Net
and CAU-Net.

From the experimental metrics and the predicted visualization results, we can conclude that the proposed network performs better than the other compared methods.

5. Conclusions
In this paper, we proposed the Circular U-Net with Attention Gate (CAU-Net), an end-to-end image forgery detection network. The Atrous Spatial Pyramid Pooling with CBAM extracts image information at multiple scales to better detect tampered areas of different sizes. The attention gate module suppresses the untampered regions, so our network pays more attention to the tampered regions. In addition, Efficient Channel Attention enhances channel features to improve the network's forgery detection performance. We demonstrated the effectiveness of our network through theoretical analysis and comparative experiments with quantitative results and visualizations. Extensive experiments on the public CASIA and NIST16 datasets verify that our network performs better than state-of-the-art methods.

Author Contributions: The conception of the study, C.L. and J.P.; literature search, Y.L. and X.G.; figures, X.G.; data collection, J.P. and X.G.; data analysis, C.L. and J.P.; data interpretation, Y.L.; writing, J.P.; review and editing, C.L. and Y.L.; experiments, J.P. and X.G.; supervision, C.L. and Y.L.; funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Key Research and Development Program of
China (No. 2020YFB1712401), and Chinese Scholarship Council (No. 202007045007).
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Dhamo, H.; Farshad, A.; Laina, I. Semantic image manipulation using scene graphs. In Proceedings of the Computer Vision and
Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
2. Li, B.; Qi, X.; Lukasiewicz, T. Manigan: Text-guided image manipulation. In Proceedings of the Computer Vision and Pattern
Recognition, Seattle, WA, USA, 13–19 June 2020.
3. Park, T.; Zhu, J.-Y.; Wang, O. Swapping autoencoder for deep image manipulation. In Proceedings of the Neural Information
Processing Systems, Online, 6–12 December 2020.
4. Vinker, Y.; Horwitz, E.; Zabari, N. Deep single image manipulation. In Proceedings of the IEEE International Conference on
Computer Vision, Montreal, QC, Canada, 11–17 October 2021.
5. Ferrara, P.; Bianchi, T.; De Rosa, A. Image forgery localization via fine-grained analysis of CFA artifacts. IEEE Trans. Inf. Forensics
Secur. 2012, 7, 1566–1577. [CrossRef]
6. Pan, X.; Lyu, S. Region Duplication Detection Using Image Feature Matching. IEEE Trans. Inf. Forensics Secur. 2010, 5, 857–867.
[CrossRef]
7. Salloum, R.; Ren, Y.; Kuo, C.C.J. Image Splicing Localization Using a Multi-task Fully Convolutional Network (MFCN). J. Vis.
Commun. Image Represent. 2017, 51, 201–209. [CrossRef]
8. Zhou, P.; Han, X.T.; Morariu, V.I. Learning rich features for image manipulation detection. In Proceedings of the 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1053–1061.
9. Bi, X.; Wei, Y.; Xiao, B.; Li, W. The Ringed Residual U-Net for Image Splicing Forgery Detection. In Proceedings of the 2019
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019.
10. Wei, Y.; Wang, Z.; Xiao, B. Controlling Neural Learning Network with Multiple Scales for Image Splicing Forgery Detection. ACM
Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1–22. [CrossRef]
11. Chen, X.; Dong, C.; Ji, J.; Cao, J. Image manipulation detection by multi-view multi-scale supervision. In Proceedings of the 2021
IEEE/CVF Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
12. Sun, Y.; Ni, R. ET: Edge-enhanced Transformer for Image Splicing Detection. IEEE Signal Process. Lett. 2022, 29, 1232–1236.
[CrossRef]
13. Chen, L.C.; Papandreou, G.; Kokkinos, I. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous
Convolution, and Fully Connected CRFs. TPAMI 2018, 40, 834–848. [CrossRef] [PubMed]
14. Oktay, O.; Schlemper, J.; Folgoc, L.L. Attention U-Net: Learning where to Look for the Pancreas. In Proceedings of the International
Conference on Medical Imaging with Deep Learning, Amsterdam, The Netherlands, 4–6 July 2018.
15. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference
on Computer Vision, Munich, Germany, 8–14 September 2018.
16. Nist. Nimble 2016 Datasets. Available online: https://www.nist.gov/itl/iad/mig/nimble-challenge-2017-evaluation (accessed
on 5 February 2016).
17. Dong, J.; Wang, W.; Tan, T.N. CASIA image tampering detection evaluation database. In Proceedings of the 2013 IEEE China
Summit and International Conference on Signal and Information Processing, Beijing, China, 6–10 July 2013.
18. Wang, Q.; Wu, B.; Zhu, P. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
19. Fridrich, J.; Kodovsky, J. Rich models for steganalysis of digital images. IEEE Trans. Inf. Forensics Secur. 2012, 7, 868–882.
[CrossRef]
20. Hu, X.; Zhang, Z.; Jiang, Z. SPAN: Spatial pyramid attention network for image manipulation localization. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
21. Liu, X.; Liu, Y.; Chen, J.; Liu, X. PSCC-Net: Progressive spatio-channel correlation network for image manipulation detection and
localization. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7505–7517. [CrossRef]
22. Wang, J.; Wu, Z.; Chen, J. ObjectFormer for Image Manipulation Detection and Localization. In Proceedings of the 2022 IEEE
Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
23. He, K.; Zhang, X.; Ren, S. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.
