Communication
The Circular U-Net with Attention Gate for Image Splicing Forgery Detection
Jin Peng 1,†, Yinghao Li 1,†, Chengming Liu 1,* and Xiaomeng Gao 2
1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450001, China
2 International College, Zhengzhou University, Zhengzhou 450000, China
* Correspondence: cmliu@zzu.edu.cn; Tel.: +86-371-63886568
† These authors contributed equally to this work.
Abstract: With the advent and rapid development of image tampering technology, tampered images have become harmful to many aspects of our society, so image tampering detection has become increasingly important. Although current forgery detection methods have achieved some success, the scale of the tampered area differs from one forgery image to another, and previous methods do not take this into account. In this paper, we argue that the inability of the network to accommodate tampered regions of various sizes is the main reason for low precision. To address this problem, we propose a neural network architecture called CAU-Net, which adds residual propagation and feedback, an attention gate and Atrous Spatial Pyramid Pooling with CBAM to the U-Net. The Atrous Spatial Pyramid Pooling with CBAM captures information at multiple scales and adapts to differently sized target areas. In addition, CAU-Net alleviates the vanishing gradient issue and suppresses the weights of untampered regions; because CAU-Net is an end-to-end network without redundant image processing, it detects suspicious images quickly. Finally, we optimize the proposed network structure through an ablation study, and the experimental results and visualizations demonstrate that our network performs better on CASIA and NIST16 than state-of-the-art methods.
The convolutional neural network (CNN) has achieved great success in the field of computer vision. Its ability to extract image features adaptively has led increasing numbers of researchers to apply it to image tampering detection, so large numbers of CNN-based image forgery detection methods have been proposed in recent years [7–12]. Although these approaches have achieved great
performance, these methods pay more attention to learning the inconsistency between the
untampered areas and tampered areas while ignoring the different sizes of tampered areas
for different forgery images. Since targets of different sizes need to be in different receptive
fields to be relatively easy to separate from the background, in this paper, we believe that
the inability of the network to accommodate forgery regions of various sizes and capture
forgery features from multiple scales is the main reason for the low precision. Previous
work (e.g., DeepLab v2 [13]) proposed the Atrous Spatial Pyramid Pooling (ASPP) module,
which can get multi-scale information and is useful for segmenting objects of different
scales. Benefitting from this, we introduce the ASPP module into the forgery detection
network to improve the ability to segment tampered regions of different scales. However,
it is not enough simply to add an ASPP module.
Figure 1. The three images represent the host image, the forgery image and the ground-truth. (a) Host
image. (b) Forgery image. (c) Ground-truth.
In order to address the above issues, we propose a network called the Circular U-Net with Attention Gate (CAU-Net) in this paper. The network builds on [9] and incorporates the attention-gate design of [14]. Meanwhile, we introduce an ASPP incorporating the Convolutional Block Attention Module (CBAM) [15], called the CBASPP module, into our network to capture
information at multiple scales. We optimize the proposed network structure by ablation
study, and our network has a better performance on NIST16 [16] and CASIA [17] than the
previous methods. In summary, the key contributions of our work are as follows.
• We modified the ASPP module and applied the CBASPP module to enhance the
detection performance of CAU-Net. The CBASPP module samples the input feature
map in parallel with atrous (dilated) convolutions at different sampling rates and then
concatenates the results to expand the number of channels. Thus, adding the CBASPP
module can capture the context of an image at multiple scales well.
• We introduce an ingenious module between the corresponding encoder and decoder layers called an attention gate (a sketch of this gate follows the list). It enlarges the weights of tampered regions while shrinking those of untampered regions, which enables the neural network to obtain better results.
• We use ResNet with Efficient Channel Attention (ECA) [18] instead of the regular ResNet
to improve the performance of tampered detection without additional parameters.
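As a concrete reference, below is a minimal PyTorch sketch of an additive attention gate in the style of Attention U-Net [14], which CAU-Net places between corresponding encoder and decoder layers. The class and parameter names are ours, and the exact channel configuration and resampling details in CAU-Net may differ.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Additive attention gate sketch (after Attention U-Net [14]).
    x: encoder skip features; g: decoder gating signal.
    Assumes x and g have the same spatial size."""
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.theta_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)
        self.phi_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        a = torch.relu(self.theta_x(x) + self.phi_g(g))
        alpha = torch.sigmoid(self.psi(a))   # per-pixel weights in (0, 1)
        return x * alpha                     # suppress untampered regions
```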
2. Related Work
In the following, we briefly describe how our ideas were implemented and explain
the novelties accordingly. Since it was discovered that forgery features can be extracted
from the noise stream, an increasing number of forgery detection approaches chose to
get more information from both the noise stream and the RGB stream. Zhou et al. [8]
proposed a two-stream network, including the RGB stream and noise stream generated by
the SRM filter [19]. Hu et al. [20] also use a two-stream structure; in [20], feature fusion is adopted at an early stage, whereas [8] performs feature concatenation at a later stage. However, when splicing forgery images are taken from the same type of camera,
the noise information will be consistent, so the noise stream is useless. Several approaches
(e.g., MFCN [7], MVSS [11], ET [12]) are proposed to learn the forgery edge with deep
learning. However, there is a gradient degradation issue as the network gets increasingly
deep. Thus, the effect of edge learning on the deep network is weak. RRU-Net [9] and
MCNL-Net [10] proposed the ringed residual U-Net to solve the gradient degradation
issue. Recently, forgery detection networks such as PSCCNet [21] and ObjectFormer [22]
have been proposed to localize the tampered regions.
Although these methods achieve promising results, most of them overlook the variable
scales of the tampered area, which leads to low precision. Our work is closer to RRU-Net,
as we apply the CBASPP module to capture multi-scale information and introduce some
attention modules to suppress the weight of untampered regions. These modules are
described in detail in the following.
Figure 2. The overview of the proposed CAU-Net structure for the task of the location of tampered
regions.
As the network becomes deeper, the essential properties of the image fade away, and the minor differences between the tampered and untampered regions thus disappear. In order to address this problem, we introduce
a classical method called residual propagation mechanism [23] to each component block.
The component block is shown in Figure 3. It consists of two convolution blocks as well as
residual propagation. The residual propagation mechanism is defined in Equation (1):
$y_f = F(x, W_i) + M(x)$,  (1)
where $x$ and $y_f$ represent the input and output of the component block, respectively, as shown in Figure 3, and $W_i$ denotes the parameters of layer $i$. The function $F(x, W_i)$ is learnable; in this paper, it represents the feature map after two convolutional layers with ReLU activations. The function $M$ is a learnable linear mapping layer that alters the dimension of the input $x$ to match that of $F(x, W_i)$ so that the two can be added together. Compared to previous plain networks, the residual network introduces a skip connection. Benefiting from this, more
information from the previous residual block can flow unimpeded to the next residual
block, which can improve the information flow and allow the difference in the essential
properties of the images to propagate with the network. Moreover, this can avoid the
problem of a vanishing gradient because of the excessive depth of the network.
The mechanism is just like the human brain’s recall mechanism. When we learn a large amount of new knowledge, we may forget part of our previous knowledge. At this point, we need a recall mechanism to help us remember what we have learned before, and so does the network, which uses the residual propagation module to keep crucial information from being forgotten. Meanwhile, there is no doubt that if the differences in essential image properties between host images and copied images do not disappear, the performance of the network will be greatly improved.
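To make Equation (1) concrete, the following is a minimal PyTorch sketch of the component block, assuming two 3 × 3 convolutions with ReLU for $F$ and a 1 × 1 convolution for the mapping $M$; the exact layer sizes in Figure 3 may differ.

```python
import torch
import torch.nn as nn

class ResidualPropagationBlock(nn.Module):
    """Sketch of Equation (1): y_f = F(x, W_i) + M(x).
    F is two conv+ReLU layers; M is a learnable 1x1 mapping that
    matches the channel dimension of x to that of F(x, W_i)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(                  # F(x, W_i)
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.mapping = nn.Conv2d(in_ch, out_ch, 1)  # M(x), dimension match

    def forward(self, x):
        return self.body(x) + self.mapping(x)       # Equation (1)
```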
In the component block, the residual feedback method is shown in Figure 4, and its definition is presented in Equation (2):
$y_b = (s(W(y_f)) + 1) \times x$,  (2)
where $x$ and $y_b$ represent the input of the component block and the enhanced input, respectively, and $y_f$ is the output of the preceding residual propagation step in Equation (1). $W$ represents a linear mapping that changes the dimension of $y_f$, and $s$ represents the sigmoid activation function.
Unlike residual propagation, residual feedback is more concerned with the difference between the untampered and tampered regions in the input feature map. It suppresses the weights of the untampered areas while amplifying the weights of the tampered areas. Based on this, our network can easily segment out the tampered
areas. The combination of residual propagation and residual feedback can not only make
the network perform better but also shorten the training time of the network. In summary,
residual feedback has some meaningful effects:
• It enhances positively labeled features while suppressing negatively labeled ones, which makes the differences in the image’s essential properties between the tampered and untampered regions increasingly large (see the sketch after this list).
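To make Equation (2) concrete, here is a minimal PyTorch sketch of the residual feedback gate; using a 1 × 1 convolution as the linear mapping $W$ is an assumption on our part.

```python
class ResidualFeedback(nn.Module):
    """Sketch of Equation (2): y_b = (s(W(y_f)) + 1) * x.
    The sigmoid output lies in (0, 1), so the gate lies in (1, 2):
    responses associated with tampering are amplified more strongly
    than those of untampered regions."""
    def __init__(self, feat_ch, in_ch):
        super().__init__()
        self.w = nn.Conv2d(feat_ch, in_ch, kernel_size=1)  # linear mapping W

    def forward(self, x, y_f):
        gate = torch.sigmoid(self.w(y_f)) + 1.0            # s(W(y_f)) + 1
        return gate * x                                    # enhanced input y_b
```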
Dilated convolutions with dilation rates of 4, 8 and 12 and a kernel size of 3 × 3 are used. The local features of the previous layer are associated with a wider field of view to prevent small target features from being lost during information transfer. As shown above, from top to bottom, the first branch is a convolution with a filter size of 1 × 1, which aims to maintain the original receptive field. The second to fourth branches are dilated convolutions with different dilation rates, which extract features at different receptive fields. The fifth branch is the global average pooling of the input, which obtains global features. Finally, the feature maps of the five branches are stacked in the channel dimension, and the information at different scales is fused by convolution with
a 1 × 1 filter to get the new feature map $F$. The feature map $F$ ($H \times W \times C$) is then processed by the Channel Attention Module (CAM) and the Spatial Attention Module (SAM) to obtain the final feature map $M_s$. The detailed process is shown in Figure 6, and the formula of CAM is shown in Equation (3):

$M_c(F) = \sigma(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max})))$,  (3)

where $\sigma$ represents the sigmoid function, $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$, and $r$ is the reduction ratio. The MLP weights $W_0$ and $W_1$ are shared for both inputs, and a ReLU activation follows $W_0$. Moreover, the formula of SAM is shown in Equation (4):
$M_s(F) = \sigma(f^{7 \times 7}([F^s_{avg}; F^s_{max}]))$,  (4)

where $\sigma$ represents the sigmoid function, and $f^{7 \times 7}$ is a convolution operation with a kernel size of $7 \times 7$. $F^s_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^s_{max} \in \mathbb{R}^{1 \times H \times W}$ are the 2D feature maps obtained by two pooling operations on $F_c$.
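As a reference implementation of Equations (3) and (4), the following is a minimal PyTorch sketch of CBAM’s channel and spatial attention; the reduction ratio r = 16 and the class and layer names are assumptions.

```python
class ChannelAttention(nn.Module):
    """Equation (3): Mc(F) = sigmoid(W1(W0(F_avg^c)) + W1(W0(F_max^c)))."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W0, ReLU follows
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W1
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))               # from F_avg^c
        mx = self.mlp(x.amax(dim=(2, 3)))                # from F_max^c
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Equation (4): Ms(F) = sigmoid(f^{7x7}([F_avg^s; F_max^s]))."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)                # F_avg^s
        mx, _ = x.max(dim=1, keepdim=True)               # F_max^s
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.cam = ChannelAttention(channels, reduction)
        self.sam = SpatialAttention()

    def forward(self, x):
        x = x * self.cam(x)        # channel refinement first
        return x * self.sam(x)     # then spatial refinement
```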
Since the scale and location of the tampered areas differ from one forgery image to another, forgery detection is difficult. The CBASPP module above enlarges the receptive field and enriches the feature information by sampling in parallel with dilated convolutions at multiple rates, and its image-level features efficiently capture global contextual information. The module considers the relationships between contexts, which avoids the segmentation errors caused by becoming trapped in local features and improves the accuracy of image forgery detection. Moreover, the module aggregates contextual information at multiple scales and enhances the ability of the network to identify tampered regions of different sizes.
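Putting the pieces together, here is a hedged sketch of the CBASPP module as described above: five parallel branches (a 1 × 1 convolution; 3 × 3 dilated convolutions with rates 4, 8 and 12; global average pooling), channel-wise concatenation, 1 × 1 fusion into F, then CBAM refinement. It reuses the CBAM sketch above; the channel counts are assumptions.

```python
class CBASPP(nn.Module):
    """Sketch of the CBASPP module: ASPP-style parallel branches
    followed by CBAM attention (not the paper's exact configuration)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1)          # keep receptive field
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in (4, 8, 12)                              # dilated branches
        ])
        self.gap = nn.AdaptiveAvgPool2d(1)                   # global features
        self.gap_conv = nn.Conv2d(in_ch, out_ch, 1)
        self.fuse = nn.Conv2d(5 * out_ch, out_ch, 1)         # fuse five branches
        self.cbam = CBAM(out_ch)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.branch1(x)] + [b(x) for b in self.branches]
        g = nn.functional.interpolate(self.gap_conv(self.gap(x)), size=(h, w),
                                      mode="bilinear", align_corners=False)
        f = self.fuse(torch.cat(feats + [g], dim=1))         # new feature map F
        return self.cbam(f)                                  # refined by CAM and SAM
```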
After the two convolution operations with a kernel size of 3 × 3, we obtain aggregated features of size $1 \times 1 \times C$ by global average pooling (GAP). The ECA module produces the channel weights by performing a fast one-dimensional convolution of size $k$, where $k$ is determined adaptively by a mapping of the channel dimension $C$, and $\sigma$ represents the sigmoid activation function.
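A minimal sketch of the ECA module described above follows; the adaptive kernel-size mapping $k = |\log_2(C)/\gamma + b/\gamma|_{odd}$ with $\gamma = 2$ and $b = 1$ follows the ECA-Net paper [18], while the surrounding layer arrangement is an assumption.

```python
import math

class ECA(nn.Module):
    """Efficient Channel Attention sketch: GAP, then a 1D convolution
    of adaptive size k across channels, then a sigmoid gate."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1                      # nearest odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = x.mean(dim=(2, 3))                         # GAP -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)       # fast 1D conv over channels
        return x * torch.sigmoid(y).view(*y.shape, 1, 1)
```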
4. Experiments
In the above sections, we have shown the specific structure and design ideas of our
method, and we analyzed the possibility of these modules from the theoretical perspective.
In order to further demonstrate its performance, we have conducted comprehensive experiments. The goal of our work is to detect whether the target image has been tampered with at the pixel level and to locate the tampered regions. Below, we present the details of our experiments in several ways. We introduce the specific experimental setup, such as the experimental dataset and evaluation metrics, in Section 4.1 and then present results compared with
previous methods in Section 4.2. Next, we perform an ablation study to demonstrate the
effectiveness of each module in the network in Section 4.3. Then, we perform a robustness
experiment on our network in Section 4.4. In the end, we show the visualization of the
detection results predicted by various methods in Section 4.5.
the weight decay was 0.0005. For the first 80 epochs, we set the learning rate to 0.01; between epochs 80 and 120, we reduced it to 0.005; after epoch 120, we set it to 0.001.
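For reference, this schedule can be expressed as a piecewise-constant multiplier on the base learning rate. The optimizer choice (SGD), model and training loop below are placeholders, since this excerpt does not specify them.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Schedule from the text: lr = 0.01 for epochs [0, 80),
# 0.005 for [80, 120), and 0.001 afterwards.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.0005)
scheduler = LambdaLR(
    optimizer,
    lr_lambda=lambda e: 1.0 if e < 80 else (0.5 if e < 120 else 0.1),
)

for epoch in range(num_epochs):            # num_epochs: hypothetical
    train_one_epoch(model, optimizer)      # hypothetical training step
    scheduler.step()                       # advance the schedule per epoch
```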
Table 1. The results (%) of the proposed method and the other methods on the testing set.

Methods             CASIA                  NIST16
                    F1 Score    AUC        F1 Score    AUC
SPAN [20]           38.2        83.8       58.2        96.1
RGB-N [8]           40.8        79.5       72.2        93.7
RRU-Net [9]         45.2        79.8       85.1        92.3
MCNL-Net [10]       52.4        81.9       90.6        96.7
PSCCNet [21]        55.4        87.5       81.9        99.6
ObjectFormer [22]   57.9        88.2       82.4        99.6
Ours                58.3        88.4       94.3        97.3

Bold text represents the best results.
Table 3. The forgery detection results (%) under various attacks on the NIST16 dataset. F1 score and
AUC are reported.
Figure 8. Visualization of the forgery detection results predicted by various methods. From top to
bottom, we show the forgery image, the GT mask and predicted results of RRU-Net, MCNL-Net
and CAU-Net.
From the experimental metrics and predicted visualization results, we can easily con-
clude that the network we proposed has a better performance than other compared methods.
5. Conclusions
In this paper, we proposed the Circular U-Net with Attention Gate (CAU-Net), an end-to-end image forgery detection network. The Atrous Spatial Pyramid Pooling with CBAM extracts image information at multiple scales to better detect tampered areas of different sizes. The attention gate module suppresses the untampered regions, so our network pays more attention to the tampered regions. In addition, Efficient Channel Attention enhances channel features to improve the performance of the forgery detection network. We demonstrated the effectiveness of our network through theoretical analysis and through comparative experiments with experimental data and visualization results. Extensive experiments on the CASIA and NIST16 public datasets verify that our network performs better than state-of-the-art methods.
Author Contributions: Conception of the study, C.L. and J.P.; literature search, Y.L. and X.G.; figures, X.G.; data collection, J.P. and X.G.; data analysis, C.L. and J.P.; data interpretation, Y.L.; writing, J.P.; review and editing, C.L. and Y.L.; experiments, J.P. and X.G.; supervision, C.L. and Y.L.; funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Key Research and Development Program of
China (No. 2020YFB1712401), and Chinese Scholarship Council (No. 202007045007).
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Dhamo, H.; Farshad, A.; Laina, I. Semantic image manipulation using scene graphs. In Proceedings of the Computer Vision and
Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
2. Li, B.; Qi, X.; Lukasiewicz, T. Manigan: Text-guided image manipulation. In Proceedings of the Computer Vision and Pattern
Recognition, Seattle, WA, USA, 13–19 June 2020.
3. Park, T.; Zhu, J.-Y.; Wang, O. Swapping autoencoder for deep image manipulation. In Proceedings of the Neural Information
Processing Systems, Online, 6–12 December 2020.
4. Vinker, Y.; Horwitz, E.; Zabari, N. Deep single image manipulation. In Proceedings of the IEEE International Conference on
Computer Vision, Montreal, QC, Canada, 11–17 October 2021.
5. Ferrara, P.; Bianchi, T.; De Rosa, A. Image forgery localization via fine-grained analysis of CFA artifacts. IEEE Trans. Inf. Forensics
Secur. 2012, 7, 1566–1577. [CrossRef]
6. Pan, X.; Lyu, S. Region Duplication Detection Using Image Feature Matching. IEEE Trans. Inf. Forensics Secur. 2010, 5, 857–867.
[CrossRef]
7. Salloum, R.; Ren, Y.; Kuo, C.C.J. Image Splicing Localization Using a Multi-task Fully Convolutional Network (MFCN). J. Vis.
Commun. Image Represent. 2017, 51, 201–209. [CrossRef]
8. Zhou, P.; Han, X.T.; Morariu, V.I. Learning rich features for image manipulation detection. In Proceedings of the 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1053–1061.
9. Bi, X.; Wei, Y.; Xiao, B.; Li, W. The Ringed Residual U-Net for Image Splicing Forgery Detection. In Proceedings of the 2019
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019.
10. Wei, Y.; Wang, Z.; Xiao, B. Controlling Neural Learning Network with Multiple Scales for Image Splicing Forgery Detection. ACM
Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1–22. [CrossRef]
11. Chen, X.; Dong, C.; Ji, J.; Cao, J. Image manipulation detection by multi-view multi-scale supervision. In Proceedings of the 2021
IEEE/CVF Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
12. Sun, Y.; Ni, R. ET: Edge-enhanced Transformer for Image Splicing Detection. IEEE Signal Process. Lett. 2022, 29, 1232–1236.
[CrossRef]
13. Chen, L.C.; Papandreou, G.; Kokkinos, I. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous
Convolution, and Fully Connected CRFs. TPAMI 2018, 40, 834–848. [CrossRef] [PubMed]
14. Oktay, O.; Schlemper, J.; Folgoc, L.L. Attention U-Net: Learning where to Look for the Pancreas. In Proceedings of the International
Conference on Medical Imaging with Deep Learning, Amsterdam, The Netherlands, 4–6 July 2018.
15. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference
on Computer Vision, Munich, Germany, 8–14 September 2018.
16. NIST. Nimble 2016 Datasets. Available online: https://www.nist.gov/itl/iad/mig/nimble-challenge-2017-evaluation (accessed on 5 February 2016).
17. Dong, J.; Wang, W.; Tan, T.N. CASIA image tampering detection evaluation database. In Proceedings of the 2013 IEEE China
Summit and International Conference on Signal and Information Processing, Beijing, China, 6–10 July 2013.
18. Wang, Q.; Wu, B.; Zhu, P. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
19. Fridrich, J.; Kodovsky, J. Rich models for steganalysis of digital images. IEEE Trans. Inf. Forensics Secur. 2012, 7, 868–882.
[CrossRef]
20. Hu, X.; Zhang, Z.; Jiang, Z. SPAN: Spatial pyramid attention network for image manipulation localization. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020.
21. Liu, X.; Liu, Y.; Chen, J.; Liu, X. PSCC-Net: Progressive spatio-channel correlation network for image manipulation detection and
localization. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7505–7517. [CrossRef]
22. Wang, J.; Wu, Z.; Chen, J. ObjectFormer for Image Manipulation Detection and Localization. In Proceedings of the 2022 IEEE
Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
23. He, K.; Zhang, X.; Ren, S. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.