
AN IMAGE GENERATION SCHEME BASED ON FACE

SWAPPING AND DISTORTION GENERATION

A SEMINAR REPORT

submitted by

Abin Eldho
(KME20CS006)

to
the APJ Abdul Kalam Technological University
in partial fulfillment of the requirements for the award of the Degree
of
Bachelor of Technology
in
Computer Science & Engineering

Department of Computer Science & Engineering


KMEA Engineering College Edathala, Aluva
683 561
December 2023
DECLARATION

I, the undersigned, hereby declare that the seminar report “An Image Generation
Scheme Based on Face Swapping and Distortion Generation”, submitted in partial
fulfillment of the requirements for the award of the Degree of Bachelor of
Technology of the APJ Abdul Kalam Technological University, Kerala, is a bonafide
academic document prepared under the supervision of Assoc. Prof. Sheena Kurian K,
KMEA Engineering College, Cochin.
I have not submitted the matter presented in this seminar report anywhere for the
award of any other degree.

Signature of student : ......................................


Name of student : Abin Eldho

Place : ..........................
Date : ..........................

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
KMEA Engineering College Edathala, Aluva
683 561

CERTIFICATE

This is to certify that the report entitled An Image Generation Scheme
Based on Face Swapping and Distortion Generation submitted by
Abin Eldho to the APJ Abdul Kalam Technological University in partial fulfill-
ment of the requirements for the award of the Degree of Bachelor of Technology in
Computer Science & Engineering is a bonafide record of the seminar work carried
out by him/her under our guidance and supervision. This report in any form has not
been submitted to any other University or Institute for any purpose.

Seminar Guide
Name : Assoc. Prof. Sheena Kurian K
Signature : .......................

Seminar Coordinator
Name : Ms. Vidya Hari
Signature : .......................

Head of Department
Name : Prof. Rekha Lakshmanan
Signature : .......................
ACKNOWLEDGMENT

First and foremost, I would like to express my thanks to the Almighty for the
divine grace bestowed on me to complete this seminar successfully on time.
I would like to thank our respected Principal Dr. Amar Nishad T. M, the lead-
ing light of our institution, and Dr. Rekha Lakshmanan, Vice Principal and Head
of the Department of Computer Science and Engineering, for her suggestions and
support throughout my seminar. I also take this opportunity to express my profound
gratitude and deep regards to my Seminar Coordinator, Ms. Vidya Hari, for all
her effort, time, and patience in helping me complete the seminar successfully
with all her suggestions and ideas. A big thanks to my Seminar Guide, Assoc.
Prof. Sheena Kurian K of the Department of Computer Science & Engineering, for
leading me to the successful completion of the seminar. I also express my gratitude
to all teachers for their cooperation, and I gratefully thank the lab staff of the
Department of Computer Science and Engineering for their kind cooperation. Once
again, I convey my gratitude to all those who had a direct or indirect influence
on my seminar.

Abin Eldho
B. Tech. (Computer Science & Engineering)
Department of Computer Science & Engineering
KMEA Engineering College Edathala, Aluva

ABSTRACT

Nowadays, digital images are widely used in various services. The emer-
gence of more and more image editing algorithms has severely challenged image
forensic approaches. Driven by the emergence of forged images, more and more
image forensic methods are being proposed to evaluate the authenticity of
digital images. However, in some privacy-related image forensics areas, the scarcity
of data hinders their development. In this paper, we investigate a document image
generation scheme based on face swapping and distortion generation to generate
document image databases at a low cost. First, we propose IDNet for editing
face content and text content in digital document images. Second, we propose a
Distortion Simulation Network (DSNet) for simulating print-and-scan distortion,
and the generated data can be used to study a novel document attack type called
the recapture attack. Third, we use the database generated by our proposed method
to assist in training document image recapture forensic networks. A qualitative
comparison with existing methods illustrates that the image content generated by
our proposed method maintains more complete semantic information and higher image
quality. Quantitative results confirm that, with the addition of the generated
database, the performance of the model improves by about 8 percentage points
measured in AUC.

CONTENTS

ACKNOWLEDGMENT

ABSTRACT

LIST OF FIGURES

ABBREVIATIONS

Chapter 1. INTRODUCTION

Chapter 2. LITERATURE SURVEY

Chapter 3. METHODOLOGY
3.1 GENERATION OF ORIGINAL DOCUMENT IMAGES
3.1.1 THE FACE SWAPPING NETWORK
3.1.2 THE FACE RESTORATION NETWORK
3.2 SIMULATION OF PRINT AND SCAN DISTORTION

Chapter 4. DATA AND EXPERIMENT
4.1 DATA PREPARATION
4.2 EXPERIMENTS
4.2.1 INTRA-DATASET EXPERIMENT
4.2.2 CROSS-DATASET EXPERIMENT WITH ONLY ORIGINAL DATA
4.2.3 CROSS-DATASET EXPERIMENT WITH BOTH ORIGINAL DATA AND GENERATED DATA
4.2.4 EXTENDED EXPERIMENT ON DIFFERENT TYPES OF IMAGES

Chapter 5. ADVANTAGES

Chapter 6. CHALLENGES

Chapter 7. RESULTS AND DISCUSSION

Chapter 8. CONCLUSION

REFERENCES

LIST OF FIGURES

3.1 Overview of the Proposed System

3.2 Structure of DualGAN in our task

ABBREVIATIONS

Abbreviation   Expansion
GANs           Generative Adversarial Networks
CNN            Convolutional Neural Network
MSE            Mean Squared Error
PSNR           Peak Signal-to-Noise Ratio
SAPLC          Spatial Aggregation of Pixel-level Local Classifiers
FCN            Fully Convolutional Network
AUC            Area Under the Curve
EER            Equal Error Rate

Chapter 1

INTRODUCTION

The proposed document image generation scheme aims to address the need
for cost-effective and efficient methods to create document image databases for
training forensic networks. The method leverages face swapping and distortion
generation to generate realistic document images, which can be used for training
document image recapture forensic networks. The approach involves two key com-
ponents: IDNet for editing face and text content, and DSNet for simulating print-
and-scan distortion.
IDNet is a comprehensive network that encompasses face swapping, face
restoration, and text editing capabilities. This network is designed to generate orig-
inal document images by seamlessly integrating new faces and editing text content.
The generated images are then subjected to print-and-scan distortion simulation
using DSNet, which employs Generative Adversarial Networks (GANs) to realisti-
cally simulate the distortions that occur during the print-and-scan process.
The effectiveness of the proposed method is evaluated using Convolutional
Neural Network (CNN)-based recapture detection. The results demonstrate im-
proved performance, indicating that the generated database can effectively train
forensic networks for document recapture detection. This is a significant contribu-
tion as it addresses the limitations of existing data generation methods and provides
a cost-effective solution for creating realistic document image databases.
The proposed method also aims to overcome challenges related to obtaining
face information and privacy concerns. By utilizing face swapping and restora-
tion techniques, the method ensures that the generated document images maintain
privacy and security while still being realistic and suitable for training forensic net-
works.
It is important to note that the generation of original document images and
the simulation of print-and-scan distortion are treated as separate steps in the pro-
posed method. This separation allows for a more comprehensive and specialized
approach to each aspect of the document image generation process.
The method employs state-of-the-art techniques such as SimSwap for face
swapping, DFDNet for face restoration, and DualGAN for simulating print-and-
scan distortion. These advanced technologies contribute to the realism and effec-
tiveness of the generated document images, as demonstrated in evaluations that
use a large number of document images, both genuine and recaptured, to train
CNN-based recapture detectors.
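To make the overall flow concrete, the following minimal sketch wires the two steps together. It is only a skeleton: the four helper functions are hypothetical placeholders standing in for SimSwap, DFDNet, TENet, and the DualGAN-based DSNet, not the released implementations.

```python
from PIL import Image

def swap_face(doc, src_face):
    # Placeholder: SimSwap would inject src_face's identity into the ID photo.
    return doc

def restore_face(doc):
    # Placeholder: DFDNet would restore detail in the swapped face region.
    return doc

def edit_text(doc):
    # Placeholder: TENet (or a text editor) would rewrite the text fields.
    return doc

def simulate_distortion(doc):
    # Placeholder: a trained DualGAN generator (DSNet) would add
    # print-and-scan degradation.
    return doc

def generate_recaptured_document(template_path, face_path):
    doc = Image.open(template_path).convert("RGB")   # original template
    face = Image.open(face_path).convert("RGB")      # source identity
    doc = restore_face(swap_face(doc, face))         # step 1: IDNet
    doc = edit_text(doc)                             # step 1 (continued)
    return simulate_distortion(doc)                  # step 2: DSNet
```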
While the proposed method shows promising results, it is important to ac-
knowledge its limitations. For instance, the method may not be able to simulate all
print-and-scan distortions, and there is a need for improved generalization perfor-
mance. Future work will focus on enhancing image quality, improving generaliza-
tion performance, and exploring the application of the generated data to additional
forensic tasks.

Chapter 2

LITERATURE SURVEY

Recently, face swapping has become a hot topic in computer vision. The
methods can be mainly divided into two types: one is the source-oriented method,
which processes source faces at the image level, and the other is the target-oriented
method, which processes target faces at the feature level. Source-oriented methods
are sensitive to the source face, and face swapping does not work well when the
source face expression is exaggerated. Since our task is to generate ID photos, it
is required that the generated faces have uniform lighting conditions and that only
identity attributes come from the source face; other attributes, such as expressions,
come from the target face (the template ID photo). Therefore, source-oriented
methods cannot be applied to our data generation system.
For the distortion simulation methods based on neural networks, we need to
focus on the images before and after they are imaged once or twice. In other words,
we can treat the problem as an image-to-image translation between them. This
leads us to Generative Adversarial Networks (GANs).
For the problem of background and degradation, we can consider DualGAN.
Although it requires paired images for training, such pairs can be obtained from a
publicly shared dataset, where original document images and their corresponding
captured/recaptured images are readily available. More importantly, paired training
data ensure that what the network learns is the degradation of captured/recaptured
images relative to the original document images. Therefore, it does not suffer from
the background migration error, so we choose to employ DualGAN for our task.
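As a rough illustration of this choice, the sketch below shows a DualGAN-style generator objective, assuming G_ab and G_ba are PyTorch generator modules (original-to-captured and back) and D_a, D_b the corresponding critics. The WGAN-style adversarial terms plus L1 recovery losses follow the general DualGAN formulation, not the exact training code used in the paper.

```python
import torch.nn.functional as F

def generator_loss(G_ab, G_ba, D_a, D_b, x, y, lam=10.0):
    """x: batch of original documents, y: batch of captured/recaptured images."""
    fake_y = G_ab(x)                                # simulated capture of x
    fake_x = G_ba(y)                                # de-degraded version of y
    adv = -D_b(fake_y).mean() - D_a(fake_x).mean()  # WGAN-style critic terms
    rec = F.l1_loss(G_ba(fake_y), x) + F.l1_loss(G_ab(fake_x), y)
    return adv + lam * rec                          # adversarial + L1 recovery
```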
According to L. Zhao et al. [1], with the ongoing popularization of online ser-
vices, digital document images have been used in various applications. Mean-
while, some deep learning-based text editing algorithms have emerged which
alter the textual information of an image in an end-to-end fashion. In this work,
they present a low-cost document forgery algorithm that uses existing deep learning-
based technologies to edit practical document images. To achieve this goal, the
limitations of existing text editing algorithms towards complicated characters and
complex background are addressed by a set of network design strategies. First,
the unnecessary confusion in the supervision data is avoided by disentangling the
textual and background information in the source images. Second, to capture the
structure of some complicated components, the text skeleton is provided as aux-
iliary information and the continuity in texture is considered explicitly in the loss
function. Third, the forgery traces induced by the text editing operation are miti-
gated by some post-processing operations which consider the distortions from the
print-and-scan channel. Quantitative comparisons of the proposed method and the
existing approach have shown the advantages of the design, reducing the
reconstruction error measured in MSE by about two-thirds and improving the
reconstruction quality measured in PSNR and SSIM by 4 dB and 0.21, respectively.
Qualitative experiments have confirmed that the reconstruction results of the
proposed method are visually better than those of the existing approach for both
complicated characters and complex textures.
More importantly, we have demonstrated the performance of the proposed docu-
ment forgery algorithm under a practical scenario where an attacker is able to alter
the textual information in an identity document using only one sample in the target
domain. The forged-and-recaptured samples created by the proposed text editing
attack and recapturing operation have successfully fooled some existing document
authentication systems.

W. Sun et al. [2] emphasized the vulnerability of face verification systems to
various spoofing attacks, including those involving photos, videos, and 3D masks;
face anti-spoofing, also known as face liveness detection or face presentation attack
detection, is therefore an important task for securing face verification systems in
practice and presents many challenges. In their paper, a state-of-the-art face
spoofing detection method
based on a depth-based Fully Convolutional Network (FCN) is revisited. Different
supervision schemes, including global and local label supervisions, are comprehen-
sively investigated. A generic theoretical analysis and associated simulation are
provided to demonstrate that local label supervision is more suitable than global
label supervision for local tasks with insufficient training samples, such as the face
spoofing detection task. Based on the analysis, the Spatial Aggregation of Pixel-
level Local Classifiers (SAPLC), which is composed of an FCN part and an ag-
gregation part, is proposed. The FCN part predicts the pixel-level ternary labels,
which include the genuine foreground, the spoofed foreground, and the undeter-
mined background. Then, these labels are aggregated together to yield an accurate
image-level decision. Furthermore, to quantitatively evaluate the proposed SAPLC,
experiments are carried out on the CASIA-FASD, Replay-Attack, OULU-NPU, and
SiW datasets. The experiments show that the proposed SAPLC outperforms the
representative deep networks, including two globally supervised CNNs, one depth-
based FCN, two FCNs with binary labels, and two FCNs with ternary labels, and
achieves competitive performance, close to that of some state-of-the-art methods,
under various common protocols. Overall, the results empirically verify
the advantage of the proposed pixel-level local label supervision scheme.
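As a toy illustration of this pixel-to-image aggregation idea (not the authors' exact rule), per-pixel ternary predictions can be pooled over the non-background pixels as follows:

```python
import numpy as np

def image_level_score(probs):
    """probs: HxWx3 per-pixel softmax over (genuine, spoofed, background).
    Returns a spoof score in [0, 1] for the whole image."""
    foreground = probs[..., 2] < 0.5       # drop the undetermined background
    if not foreground.any():
        return 0.5                         # no usable foreground evidence
    genuine = probs[..., 0][foreground].mean()
    spoofed = probs[..., 1][foreground].mean()
    return float(spoofed / (spoofed + genuine + 1e-8))
```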
Y. Sun et al. [3] shift the focus in their paper from traditional image forensics
methods, which primarily concentrate on natural content images, to the detection
of forgeries in certificate images. Certificates, representing individuals’ rights and
interests, pose unique challenges due to variable tampered region scales and di-
verse manipulation types. To address these challenges, the paper introduces a novel
Multi-level Feature Attention Network (MFAN) that outperforms existing forensics
methods. The MFAN architecture follows an encoder–decoder network structure.
In the encoder, the authors employ Atrous Spatial Pyramid Pooling (ASPP) on the
final layer of a pre-trained residual network. This approach helps capture contex-
tual information at different scales, addressing the issue of variable tampered region
scales in certificate images. Additionally, low-level features are incorporated to
maintain sensitivity to small targets, preventing the loss of discriminative features
in the presence of pooling operations. One distinctive feature of fake certificate
images is the diversity of manipulation types. To adapt to these variations, the pro-
posed method introduces feature recalibration on channels. This recalibration pro-
cess enhances the network’s ability to identify and suppress irrelevant information,
particularly in tampered regions, facilitating the detection of diverse manipulation
traces. In the decoder module, attentive feature maps are convolved and upsampled
to generate an effective prediction mask. Experimental results demonstrate
the superior performance of MFAN compared to state-of-the-art forensics methods.
The method excels in identifying fake certificate images with varying tampered re-
gion scales and multiple manipulation types, showcasing its robustness in practical
scenarios. The significance of this research lies in extending image forensics tech-
nology beyond natural content images, addressing real-world concerns such as the
tampering of certificate images during events like the COVID-19 pandemic. By
presenting a comprehensive solution in the form of MFAN, the paper contributes
to the advancement of forensics techniques, particularly in the context of securing
individuals’ rights and interests.
P. Roy et al. [4] introduce STEFANN, a Scene Text Editor using a Font
Adaptive Neural Network. This method is designed to address the challenges as-
sociated with modifying text in images at the character level, particularly in gener-
ating unobserved characters with visual consistency. The approach consists of two
stages: first, generating the unobserved character from an observed character, and
then replacing the source character with the generated one while maintaining geo-
metric and visual consistency with neighboring characters. The effectiveness of the
method is demonstrated through qualitative and quantitative evaluations on COCO-
Text and ICDAR datasets. In addition to introducing the STEFANN method, the
paper also discusses related works and the methodology of the proposed approach.
This includes the selection of the source character, generation of the binary target
character, the Font Adaptive Neural Network (FANnet), and color transfer. The
paper highlights the contributions and potential applications of the method, em-
phasizing its ability to correct errors, restore text, improve image reusability, and
its potential use in font adaptive image-based machine translation and font synthe-
sis. Furthermore, the paper proposes a CNN-based architecture called Colornet for
transferring color from a source character to a target character. This architecture is
trained with synthetically generated image pairs and is capable of faithfully trans-
ferring color variations. The paper also discusses character placement and evaluates
the performance of the proposed FANnet and Colornet architectures. Comparisons
with other methods, such as MC-GAN and Project Naptha, are provided, demon-
strating the effectiveness of the proposed approach. Overall, the paper presents a
comprehensive study on character generation and color transfer in natural scene im-
ages. While STEFANN offers various capabilities such as correcting errors, restor-
ing text, and improving image reusability, it does have limitations. For instance, it
may struggle with extreme perspective distortion, high occlusion, and large rotation.
Additionally, the FANnet architecture, while effective in generating lower-case tar-
get characters, faces challenges in predicting the size of the target character. The
current methodology also has limitations in editing high-resolution text regions and
may require rigorous upsampling, which can introduce distortion. To address these
limitations, the integration of super-resolution and better character segmentation al-
gorithms is planned for the future. The paper also recommends the use of robust
image authentication and digital forensic techniques to minimize the risk of misuse
of realistic text editing in images. In conclusion, the paper presents a method called
STEFANN, which enables character-level text editing in images.
P. Zhuang et al. [5] address the escalating challenges associated with digi-
tal image tampering, which have been facilitated by powerful editing software. In
response to this growing issue, their paper introduces a targeted and innovative ap-
proach designed to identify and localize tampered regions within images. The focus
of their research is particularly directed towards addressing manipulations made us-
ing the widely utilized Photoshop tool. While existing research has made strides in
image tampering localization, practical applications continue to grapple with the
complexities of learning discriminative representations of tampering traces and the
scarcity of realistic training data. By honing in on the operations and tools char-
acteristic of Photoshop, the paper introduces a fully convolutional encoder-decoder
architecture, enriched with dense connections and dilated convolutions, designed to
capture nuanced tampering traces and achieve superior localization performance.
Notably, the dense connections amplify implicit supervision and encourage feature
reuse, while dilated convolutions enlarge the receptive field without compromising
feature map resolution. To overcome the challenge of insufficient realistic training
data, the authors propose an inventive training data generation strategy employ-
ing Photoshop scripting. This strategy emulates human manipulations, producing
large-scale tampered image samples efficiently. Extensive experiments across three
tampered image datasets demonstrate the method’s superiority over state-of-the-art
competitors, showcasing its robustness even when trained solely on generated im-
ages or fine-tuned with limited realistic tampered images. The paper’s contributions
lie in its focused approach to detecting Photoshop-related tampering traces, the in-
troduction of an effective architectural design, and the innovative data generation
strategy, collectively advancing the field of tampering localization. The significance
of the research extends to practical forensic applications, offering a promising solu-
tion in scenarios where realistic training data is scarce and emphasizing the nuanced
intricacies of specific editing tools and operations for improved image forensics.
In the work by Y. Gao et al[6], the authors delve into a pivotal aspect of
face editing, focusing their attention on the formidable task of preserving intricate
details in non-edited regions. The authors observe that existing generators tend to
hide information to meet cycle consistency constraints, resulting in a loss of cru-
cial details in the edited images. To address this, they propose a method named
HifaFace, which tackles the problem from two perspectives. First, they directly
feed the high-frequency information of the input image into the generator, relieving
it from the burden of synthesizing rich details. Second, an additional discrimi-
nator is introduced to encourage the generator to produce images with enhanced
details. Wavelet transformation is applied to capture high-frequency information,
and a novel module called Wavelet-based Skip Connection replaces the original
Skip Connection. The proposed method also introduces a novel attribute regression
loss for fine-grained and wider-range control over facial attributes during editing.
The authors demonstrate that this loss enables more precise control over attribute
changes, resulting in high-fidelity and arbitrary face editing. The proposed Hi-
faFace method outperforms state-of-the-art approaches, as demonstrated through
extensive qualitative and quantitative evaluations. Additionally, the paper sheds
light on the phenomenon of steganography, where generators encode rich details
as hidden signals, and proposes effective solutions to mitigate this issue. The pro-
posed framework not only enhances the fidelity of face editing but also allows for
the effective utilization of large amounts of unlabeled face images during training,
contributing to the robustness of the model in real-world scenarios. In summary,
the contributions of this work include the introduction of the HifaFace method,
a thorough analysis of steganography in cycle consistency, and the proposal of a
novel attribute regression loss, all of which collectively advance the field of face
editing by achieving high-fidelity results with fine-grained attribute control. The
effectiveness of the proposed framework is substantiated through comprehensive
experiments, establishing it as a noteworthy advancement in the domain of genera-
tive face editing.
Renwang Chen et al. [7] introduced SimSwap, a face-swapping framework
designed to deliver high-fidelity, lifelike face swaps. Traditional face-
swapping methods face challenges in achieving both generalization to arbitrary
faces and accurate preservation of attributes such as expressions and posture. Sim-
Swap addresses these difficulties through two key innovations. First, the Identity
Injection Module (IIM) transfers identity information from the source face to the
target face at the feature level, enabling the extension of an identity-specific face-
swapping algorithm to a more versatile framework for arbitrary face swapping.
Second, the Weak Feature Matching Loss is proposed to implicitly preserve fa-
cial attributes in an efficient manner. This loss aligns the generated result with the
input target at a high semantic level, contributing to improved attribute preservation
without the complexity of explicitly constraining each attribute. SimSwap demon-
strates competitive identity performance while surpassing previous state-of-the-art
methods in attribute preservation, as validated through extensive experiments on
diverse faces. The proposed framework addresses the challenges of generaliza-
tion and attribute preservation, making it a noteworthy advancement in the field of
face-swapping technology. The paper contributes valuable insights and provides a
significant improvement over existing methods, showcasing the potential for prac-
tical and high-quality face-swapping applications. The availability of the SimSwap
code on GitHub further enhances its accessibility and utility for researchers and
practitioners in the field.
Yuval Nirkin et al. [8] introduce a method for face swapping that extends its
applicability to unconstrained and arbitrarily paired face images, demonstrating
simplicity and efficacy in face swapping even under challenging conditions. The
contributions include
the use of a standard fully convolutional network (FCN) trained on a rich dataset
for face segmentation, surpassing previous results in accuracy and speed. The pro-
posed face swapping pipeline incorporates improved face segmentation and robust
system components, producing high-quality results in unconstrained settings. The
paper provides quantitative tests on the effect of intra- and inter-subject face swap-
ping on face verification, utilizing the Labeled Faces in the Wild (LFW) bench-
mark. The results reveal that intra-subject face swapping has minimal impact on
face verification accuracy, indicating the effectiveness of the method in preserving
subject identities without introducing artifacts. The paper also explores the percep-
tual phenomenon, quantitatively demonstrating that inter-subject face swapping,
which involves changes to facial appearance, makes the faces less recognizable in
machine face recognition. This quantitative analysis contributes valuable insights
to the field, addressing the lack of previous quantitative testing in face swapping
methods. The methodology involves semi-supervised labeling of face segmenta-
tion, motion cues, and 3D data augmentation for generating a rich image set with
face segmentation labels. Overall, the paper provides a comprehensive approach to
face swapping, combining technical innovations with extensive quantitative evalua-
tions, advancing the understanding of the impact of face swapping on machine face
recognition. The research not only contributes to face processing capabilities but
also has implications for privacy preservation, digital forensics, and data augmenta-
tion in applications with limited training data. The availability of code, deep mod-
els, and additional information on the project webpage enhances the reproducibility
and accessibility of the proposed method.

Chapter 3

METHODOLOGY

The proposed document image generation method is specifically designed
for captured/recaptured document images, employing a two-step process involving
document generation and distortion simulation. The method utilizes IDNet, which
comprises a face swapping network, a face restoration network, and a Text Editing
Network (TENet), for the generation of original document images. The document
image is divided into three regions: text region, image region, and background re-
gion. The image region undergoes a content replacement process where the original
content is substituted with source images using SimSwap, and subsequently, high-
resolution restoration is performed using DFDNet. The text region is edited by
TENet or a Text Editor. The generated regions are then combined to create a new
original document image, with the background region left unchanged. The next step
involves converting these new images into captured documents through GAN-based
distortion simulation using DualGAN. Finally, the output captured/recaptured im-
ages are subjected to CNN-based recapture detection to assess the method’s effec-
tiveness. The document generation process with IDNet addresses the difficulty of
obtaining face information, and the associated privacy concerns, by replacing the
face information in existing documents. Using a few original document images
containing faces, Sim-
Swap generates a large number of original document templates with various iden-
tities. The simulation of print-scan distortion, considered as the second step, is
essential for replicating the effects of the print-scan process. To mitigate the time-
consuming nature of print-scan simulation, a GAN-based method is employed to
efficiently introduce realistic distortions.
Figure 3.1: Overview of the Proposed System
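Because the three regions are processed independently and then recombined, the composition step itself reduces to masked copying. A minimal sketch, assuming HxWx3 uint8 numpy arrays and boolean region masks derived from the document layout:

```python
def compose_document(original, new_face, new_text, face_mask, text_mask):
    """original/new_face/new_text: HxWx3 uint8 arrays; masks: HxW booleans."""
    out = original.copy()                  # background region stays unchanged
    out[face_mask] = new_face[face_mask]   # SimSwap + DFDNet output
    out[text_mask] = new_text[text_mask]   # TENet / text-editor output
    return out
```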

3.1 GENERATION OF ORIGINAL DOCUMENT IMAGES

IDNet, a novel deep learning framework, is introduced for generating ID
photos by manipulating facial information in original document images. Compris-
ing two key components – the face swapping subnet (using SimSwap) and the face
restoration subnet (utilizing DFDNet) – IDNet produces 25 high-resolution origi-
nal document images with preserved semantic information. Five exemplary images
are showcased in Figure 3.1, highlighting their realistic resemblance to genuine ID
photos. This framework offers promising applications for subsequent studies and
practical uses due to its ability to generate authentic-looking ID images.

3.1.1 THE FACE SWAPPING NETWORK

The proposed method includes IDNet, a network designed for editing face
and text content in document images. IDNet encompasses several components, in-
cluding a face swapping network, a face restoration network, and a text editing
network. Here, we will focus on the face swapping network within IDNet. The
face swapping network in IDNet is a crucial component for generating original
document images. It is designed to seamlessly integrate new faces into document
images, allowing for the creation of realistic and diverse datasets for training foren-
sic networks. The face swapping process involves replacing the original faces in
the document images with different facial features while maintaining the overall
appearance and context of the document.
The face swapping network leverages advanced techniques such as Sim-
Swap, a state-of-the-art face swapping algorithm that utilizes deep learning and
computer vision methodologies. SimSwap is capable of accurately transferring fa-
cial attributes from a source face to a target face, ensuring that the swapped faces
appear natural and realistic within the context of the document image. The network
is trained on a large dataset of facial images to learn the intricate details and features
of different faces, enabling it to perform accurate and seamless face swapping. By
understanding the spatial relationships, contours, and expressions of faces, the net-
work can effectively replace the original faces in document images with new faces
while preserving the visual coherence and authenticity of the images.
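The sketch below illustrates the dataflow of feature-level identity injection in the spirit of SimSwap's Identity Injection Module; all four submodules are hypothetical stand-ins rather than the published architecture.

```python
import torch.nn as nn

class IdentityInjectionSwap(nn.Module):
    """Feature-level swap: identity comes from the source face, every other
    attribute (pose, expression, lighting) from the target face."""

    def __init__(self, id_encoder, feat_encoder, injector, decoder):
        super().__init__()
        self.id_encoder = id_encoder      # e.g. a face-recognition embedder
        self.feat_encoder = feat_encoder  # extracts non-identity attributes
        self.injector = injector          # fuses the identity into the features
        self.decoder = decoder            # renders the swapped face

    def forward(self, source_face, target_face):
        id_vec = self.id_encoder(source_face)    # identity attributes only
        feats = self.feat_encoder(target_face)   # pose/expression/lighting
        return self.decoder(self.injector(feats, id_vec))
```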
Furthermore, the face swapping network incorporates privacy and security
considerations by ensuring that the swapped faces do not compromise the identity
or privacy of individuals depicted in the document images. This is achieved through
careful handling of facial features and the application of privacy-preserving tech-
niques during the swapping process. The effectiveness of the face swapping net-
work is evaluated through qualitative and quantitative assessments, demonstrating
its ability to generate diverse and realistic document images suitable for training
forensic networks. The network’s performance is crucial in ensuring that the gen-
erated datasets accurately reflect real-world scenarios, including variations in facial
features, expressions, and demographics.
Overall, the face swapping network in IDNet plays a pivotal role in the doc-
ument image generation process, contributing to the creation of high-quality, di-
verse, and privacy-preserving datasets for training forensic networks. Its integration
within the broader IDNet framework showcases its significance in addressing the
challenges associated with obtaining face information and privacy concerns while
generating original document images for forensic applications.

3.1.2 THE FACE RESTORATION NETWORK

The face restoration network within the proposed document image genera-
tion method plays a crucial role in enhancing the quality and authenticity of the
generated document images. This network is a component of IDNet, which en-
compasses face swapping, face restoration, and text editing capabilities. The face
restoration network is designed to address the challenges associated with preserving
the visual coherence and natural appearance of faces within document images. It
aims to restore and enhance facial features, ensuring that the swapped faces seam-
lessly integrate into the context of the document while maintaining realism and
authenticity.
The network leverages advanced techniques such as DFDNet, a state-of-
the-art face restoration algorithm that utilizes deep learning and image processing
methodologies. DFDNet is trained on a diverse dataset of facial images to learn
the intricate details and features of different faces, enabling it to perform accu-
rate and high-quality restoration of facial attributes within document images. By
understanding the spatial relationships, textures, and structures of faces, the face
restoration network can effectively enhance the visual quality of the swapped faces,
addressing issues such as blurriness, distortion, or loss of details that may occur dur-
ing the face swapping process. This ensures that the restored faces exhibit natural
and realistic appearances, contributing to the overall authenticity of the generated
document images.
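In practice, the restoration step only needs to touch the face region: the swapped face is cropped, enhanced, and pasted back. A minimal sketch, assuming a numpy image, a face bounding box, and a hypothetical restore callable that wraps a pretrained DFDNet-style model and preserves the crop's size:

```python
def restore_face_region(document, box, restore):
    """document: HxWx3 numpy array; box: (x, y, w, h) face bounding box."""
    x, y, w, h = box
    crop = document[y:y + h, x:x + w]        # cut out the swapped face
    out = document.copy()
    out[y:y + h, x:x + w] = restore(crop)    # paste the enhanced face back
    return out
```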
Furthermore, the face restoration network incorporates privacy and security
considerations by ensuring that the restored faces do not compromise the identity
or privacy of individuals depicted in the document images. It employs privacy-
preserving techniques and ethical considerations to safeguard the integrity of fa-
cial features while enhancing their visual quality. The effectiveness of the face
restoration network is evaluated through qualitative and quantitative assessments,
demonstrating its ability to improve the overall quality and realism of the generated
document images. The network’s performance is crucial in ensuring that the gen-
erated datasets accurately reflect real-world scenarios, including variations in facial
features, expressions, and demographics, while maintaining privacy and authentic-
ity.

3.2 SIMULATION OF PRINT AND SCAN DISTORTION

The proposed method involves the use of DSNet for simulating print-and-
scan distortion. DSNet employs Generative Adversarial Networks (GANs) to re-
alistically simulate the distortions that occur during the print-and-scan process.
The effectiveness of the method is evaluated using CNN-based recapture detection,
demonstrating improved performance in training forensic networks for document
recapture detection. The method is evaluated using a large number of document
images, including genuine and recaptured images, for training CNN-based recap-
ture detection experiments. The evaluation shows promising results, indicating that
the generated database can effectively train forensic networks for document recap-
ture detection. The method is used to automatically generate databases for train-
ing forensic networks, and it is evaluated qualitatively and quantitatively, showing
effectiveness in generating realistic data and improving forensic network perfor-
mance. However, limitations include the inability to simulate all print-and-scan
distortions and the need for improved generalization performance.
The simulation of print-and-scan distortion is a critical component of the
proposed document image generation scheme. It involves the use of the Distor-
tion Simulation Network (DSNet) to realistically simulate the distortions that occur
during the print-and-scan process. DSNet is designed to address the challenges
associated with simulating print-and-scan distortions in a manner that is both effi-
cient and realistic. By employing a DualGAN-based structure, DSNet is capable of
simulating the distortions caused by devices such as printers, scanners, and phone
cameras. This simulation process is essential for generating captured or recaptured
images without the need for a large number of printing and scanning operations,
thereby reducing time and labor consumption during the construction of datasets.
The use of DSNet allows for the generation of near-realistic results, providing a
cost-effective and efficient alternative to traditional methods of dataset construc-
tion. By simulating the real physical channel of printing and scanning, DSNet en-
sures that the generated images accurately reflect the distortions that occur during
the document capture process.
The effectiveness of DSNet in simulating print-and-scan distortion is cru-
cial for generating diverse and realistic datasets for training forensic networks.
The simulated distortions contribute to the authenticity and visual coherence of
the generated document images, making them suitable for a wide range of forensic
applications, including document recapture detection. Overall, the simulation of
print-and-scan distortion through DSNet plays a pivotal role in the document image
generation process.
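Once such a generator has been trained, synthesizing a captured or recaptured version of each generated original is a single forward pass. A minimal inference sketch, assuming G_ab is a trained PyTorch generator with outputs in [0, 1] and PNG inputs:

```python
import torch
from PIL import Image
from torchvision import transforms

to_tensor = transforms.ToTensor()
to_image = transforms.ToPILImage()

@torch.no_grad()
def synthesize_recaptures(G_ab, paths, device="cpu"):
    """Writes a simulated capture next to each input image."""
    G_ab.eval().to(device)
    for p in paths:                                  # generated originals
        x = to_tensor(Image.open(p).convert("RGB")).unsqueeze(0).to(device)
        y = G_ab(x).clamp(0, 1).squeeze(0).cpu()     # simulated distortion
        to_image(y).save(p.replace(".png", "_recaptured.png"))
```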

Figure 3.2: Structure of DualGAN in our task

Chapter 4

DATA AND EXPERIMENT

In this study, the proposed method is employed to automate the generation
of a diverse set of document images, encompassing both authentic and recaptured
scenarios. Leveraging this extensive dataset, which also incorporates high-quality
data from previous works, a series of Convolutional Neural Network (CNN)
classification-based recapture detection experiments is conducted. The
performance of these experiments is systematically evaluated to assess the efficacy
of the proposed method in distinguishing between genuine and recaptured images.

4.1 DATA PREPARATION

In order to assess the effectiveness of our approach, the generation of image
data becomes imperative, with a primary focus on producing recaptured document
images for experimentation. This choice is driven by practical considerations, as
constructing recaptured images involves two print-and-capture processes, making
it more time-consuming for experts compared to captured images. Furthermore,
the scarcity of databases and deep learning methods for recapture attacks motivates
the use of our method to generate a substantial number of recaptured images. It
is important to clarify that the data generated by our method is versatile and not
confined solely to recapture detection; its applicability to specific scenarios needs
to be duly considered.
The utilized databases include the Student Card Image Datasets, denoted as
D1 and D2. D1 comprises 84 captured and 588 recaptured document images, while
D2 consists of 48 captured and 384 recaptured document images, with devices dif-
fering between the two datasets. To bolster baseline performance, these datasets
offer high-quality captured and recaptured images from various devices. The Gen-
erated Document Image Dataset, marked as D3, is crafted by our method. Lever-
aging student card templates from five universities, we employ SimSwap and Foxit
PDF Editor to replace face information and alter content, respectively. Using Du-
alGAN models, we induce distortion for each device or device-combination in D1,
resulting in the generation of 25 original student card images and 175 genuine im-
ages, along with 1225 recaptured images. These distortions simulate the character-
istics of seven devices, including four camera phones (Huawei P9, Apple iPhone6,
XiaoMi 8, RedMi Note 5) and three scanners (Brother DCP-1519, Benq K810, Ep-
son V330), none of which overlap between D1 and D2. This meticulously curated
dataset serves as a valuable resource for training neural networks and evaluating
recapture detection methods.

4.2 EXPERIMENTS

The experiments conducted in the research paper aim to evaluate the ef-
fectiveness of the proposed document image generation scheme. The experiments
focus on the role of different datasets in training neural networks for document
recapture detection.

4.2.1 INTRA-DATASET EXPERIMENT

Experiment I, known as the intra-dataset experiment, involves training and
testing the neural network using images from the same dataset (D1). This experi-
ment aims to assess the performance of the neural network when trained and tested
on images collected from the same devices. The results show that the CNN models
perform well in this scenario, with low Equal Error Rates (EERs) and high Area
Under the Curve (AUC) values.
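Both figures of merit can be computed directly from the detector's scores; a minimal sketch with scikit-learn, reading the EER off the ROC curve at the point where the false-positive and false-negative rates cross:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auc_and_eer(labels, scores):
    """labels: 1 = recaptured, 0 = genuine; scores: detector outputs."""
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    eer = fpr[np.nanargmin(np.abs(fpr - fnr))]   # where FPR and FNR cross
    return auc, eer
```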

4.2.2 CROSS-DATASET EXPERIMENT WITH ONLY ORIGINAL DATA

Experiment II, referred to as the cross-dataset experiment with only original
data, considers a more practical scenario where the devices used to collect the train-
ing set are different from those used for the testing set. This experiment evaluates
the performance of different approaches under cross-dataset conditions. The results
demonstrate that the performance of the CNN models is significantly degraded in
this scenario, with increased EERs and decreased AUC values.

4.2.3 CROSS-DATASET EXPERIMENT WITH BOTH ORIGINAL DATA AND GENERATED DATA

Experiment III, the cross-dataset experiment with both original data and gen-
erated data, aims to evaluate the role of the generated data (D3) in training the
models. The generated data is added to the original data from D1 and D2, and
the models are trained using this combined dataset. The results show that the data
from D3 has a positive effect on model training, with increased AUC values and
decreased EERs compared to the models trained only with original data.

4.2.4 EXTENDED EXPERIMENT ON DIFFERENT TYPES OF IMAGES

The experiments highlight the importance of data in training neural networks
for document recapture detection. The results demonstrate that the generated data,
which includes recaptured images, improves the performance of the models in de-
tecting document recapture. The experiments provide valuable insights into the
effectiveness of the proposed document image generation scheme and its impact on
forensic network training.

Chapter 5

ADVANTAGES

The paper presents a document image generation scheme that offers sev-
eral advantages in the field of digital document forensics and image manipulation.
The proposed method leverages innovative techniques to address the challenges of
generating diverse and realistic document image databases for training forensic net-
works. Some of the key advantages of the paper include:
1. Cost-Effective Data Generation: The proposed method provides a cost-effective
approach to generating document image databases. By utilizing face swapping, face
restoration, and text editing networks, the method enables the creation of original
document images without the need for extensive manual labor or expensive equip-
ment. This cost-effective approach is particularly beneficial for organizations and
researchers with limited resources.
2. Improved Forensic Network Training: The paper demonstrates the effectiveness
of the proposed method in improving the performance of forensic networks for
document recapture detection. By automatically generating databases for training
forensic networks, the method contributes to enhancing the capabilities of forensic
tools and techniques, ultimately aiding in the detection of document recapture and
related forensic tasks.
3. Realistic Data Generation: The method focuses on generating realistic document
images, including genuine and recaptured images, to train forensic networks. By
simulating print-and-scan distortion using DualGAN, the proposed approach en-
sures that the generated data closely mimics the distortions that occur during the
document capture process. This emphasis on realism is crucial for training forensic
networks to accurately detect document recapture and other forensic tasks.
4. Privacy Considerations: The paper acknowledges the difficulty of obtaining
face information and privacy concerns, highlighting the importance of safeguarding
the integrity of facial features while enhancing their visual quality. By addressing
privacy considerations, the proposed method demonstrates a commitment to ethical
data generation practices, ensuring that the privacy of individuals depicted in the
generated document images is protected.
5. Versatility and Generalization: The proposed method is versatile and can be ap-
plied to various types of document images, such as College English Test passes and
teaching certificates, by modifying text and face regions and simulating distortion
caused by devices. Additionally, the paper acknowledges the need for improved
generalization performance and highlights future work to enhance image quality
and generalization performance, indicating a commitment to advancing the versa-
tility and applicability of the proposed method.
6. Comprehensive Literature Review: The paper provides a comprehensive review
of research papers and conference proceedings related to digital document foren-
sics, face editing, and image manipulation. By covering topics such as face spoofing
detection, document authentication, image tampering, and face swapping, the paper
offers valuable insights into the latest developments in these areas.

Chapter 6

CHALLENGES

The paper faces several challenges that need to be addressed. One of the
main challenges is the need to ensure the realism and authenticity of the generated
document images. This includes addressing issues such as blurriness, distortion, or
loss of details that may occur during the face swapping and distortion simulation
processes. Ensuring that the restored faces seamlessly integrate into the context of
the document while maintaining realism and authenticity is crucial for the effec-
tiveness of the proposed method.
Another challenge is the privacy and ethical considerations associated with
the generation of document images. It is essential to ensure that the generated doc-
ument images do not compromise the identity or privacy of individuals depicted in
the images. This involves implementing privacy-preserving techniques and ethical
considerations to protect the privacy of individuals and maintain the ethical integrity
of the generated data.
Furthermore, the paper mentions limitations related to the inability to sim-
ulate all print-and-scan distortions and the need for improved generalization per-
formance. The proposed method aims to simulate print-and-scan distortions using
DSNet, but there is a need to address the challenges associated with accurately sim-
ulating the diverse range of distortions that can occur during the document capture
process. Additionally, improving the generalization performance of the generated
data is crucial for ensuring that the method can be effectively applied to a wide
range of forensic tasks beyond document recapture detection.
Chapter 7

RESULTS AND DISCUSSION

The paper presents results and discussions that highlight the effectiveness of
the proposed document image generation scheme and its implications for training
forensic networks.
The results demonstrate the positive impact of the generated data on training
forensic networks for document recapture detection. In the cross-database experi-
ment, the addition of the generated data leads to an average increase of 8.22% in
AUC and an average decrease of 8.57% in EER compared to the original model.
This indicates that the generated data significantly improves the performance of the
model in detecting document recapture. Furthermore, the paper discusses the ver-
satility of the proposed approach, noting that it can be used for generating various
types of document images, such as College English Test passes and teaching cer-
tificates, by modifying text and face regions and simulating distortion caused by
devices.
The discussions emphasize the importance of the generated database in ad-
dressing the scarcity of data for training forensic networks, particularly in privacy-
related image forensics areas. The paper acknowledges the challenges associated
with producing real data, especially for tasks involving privacy, and highlights the
need for convenient and low-cost methods to generate data for training forensic net-
works. The proposed document image generation scheme addresses this need by
providing a method to generate diverse and realistic datasets for training forensic
networks, thereby contributing to the advancement of image forensics tasks.
Overall, the results and discussions underscore the effectiveness of the pro-
posed method in generating realistic document images and its potential to improve
the performance of forensic networks for document recapture detection. The pa-
per provides valuable insights into the role of generated data in training forensic
networks and addresses the challenges associated with data scarcity and privacy
considerations in image forensics tasks.

Chapter 8

CONCLUSION

The paper proposes a document image generation scheme based on face
swapping and distortion generation to address the need for generating document
image databases at a low cost. The method leverages IDNet for editing face and text
content and DSNet for simulating print-and-scan distortion, aiming to improve the
performance of forensic networks for document recapture detection. The proposed
method, known as DIGNet, is evaluated qualitatively and quantitatively, demon-
strating its effectiveness in generating realistic data and enhancing forensic net-
work performance. However, limitations include the inability to simulate all print-
and-scan distortions and the need for improved generalization performance. Future
work will focus on improving image quality and generalization performance, as
well as applying the generated data to more forensic tasks. The paper also dis-
cusses various research papers and conference proceedings related to digital doc-
ument forensics, face editing, and image manipulation, providing insights into the
latest developments and techniques in these areas. Overall, the proposed docu-
ment image generation scheme shows promise in addressing the challenges of data
generation for forensic network training and has the potential to contribute to ad-
vancements in digital document forensics and related fields.
REFERENCES

[1] L. Zhao, C. Chen, and J. Huang, “Deep learning-based forgery attack on
document images,” IEEE Trans. Image Process., vol. 30, pp. 7964–7979,
2021.

[2] W. Sun, Y. Song, C. Chen, J. Huang, and A. C. Kot, “Face spoofing detec-
tion based on local ternary label supervision in fully convolutional networks,”
IEEE Trans. Inf. Forensics Security, vol. 15, pp. 3181–3196, 2020.

[3] Y. Sun, R. Ni, and Y. Zhao, “MFAN: Multi-level features attention network
for fake certificate image detection,” Entropy, vol. 24, no. 1, p. 118, Jan. 2022.

[4] P. Roy, S. Bhattacharya, S. Ghosh, and U. Pal, “STEFANN: Scene text
editor using font adaptive neural network,” in Proc. IEEE/CVF Conf. Comput.
Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13228–13237.

[5] P. Zhuang, H. Li, S. Tan, B. Li, and J. Huang, “Image tampering localiza-
tion using a dense fully convolutional network,” IEEE Trans. Inf. Forensics
Security, vol. 16, pp. 2986–2999, 2021.

[6] Y. Gao, F. Wei, J. Bao, S. Gu, D. Chen, F. Wen, and Z. Lian, “High-
fidelity and arbitrary face editing,” in Proc. IEEE/CVF Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jun. 2021, pp. 16115–16124.

[7] R. Chen, X. Chen, B. Ni, and Y. Ge, “SimSwap: An efficient framework for
high fidelity face swapping,” in Proc. 28th ACM Int. Conf. Multimedia, Oct.
2020, pp. 2003–2011.

[8] Y. Nirkin, I. Masi, A. Tran Tuan, T. Hassner, and G. Medioni, “On face
segmentation, face swapping, and face perception,” in Proc. 13th IEEE Int.
Conf. Autom. Face Gesture Recognit. (FG), May 2018, pp. 98–105.
