Deep Learning For Deepfakes Creation and Detection: A Survey


Thanh Thi Nguyen, Quoc Viet Hung Nguyen, Cuong M. Nguyen, Dung Nguyen, Duc Thanh Nguyen,
Saeid Nahavandi, Fellow, IEEE

arXiv:1909.11573v3 [cs.CV] 26 Apr 2021

Abstract—Deep learning has been successfully applied to solve various complex problems ranging from big data analytics to computer vision and human-level control. Deep learning advances, however, have also been employed to create software that can cause threats to privacy, democracy and national security. One of those deep learning-powered applications to emerge recently is the deepfake. Deepfake algorithms can create fake images and videos that humans cannot distinguish from authentic ones. The proposal of technologies that can automatically detect and assess the integrity of digital visual media is therefore indispensable. This paper presents a survey of algorithms used to create deepfakes and, more importantly, methods proposed to detect deepfakes in the literature to date. We present extensive discussions on challenges, research trends and directions related to deepfake technologies. By reviewing the background of deepfakes and state-of-the-art deepfake detection methods, this study provides a comprehensive overview of deepfake techniques and facilitates the development of new and more robust methods to deal with the increasingly challenging deepfakes.

Impact Statement—This survey provides a timely overview of deepfake creation and detection methods and presents a broad discussion of challenges, potential trends, and future directions. We conduct the survey with a different perspective and taxonomy compared to existing survey papers on the same topic. Informative graphics are provided for guiding readers through the latest developments in deepfake research. The methods surveyed are comprehensive and will be valuable to the artificial intelligence community in tackling the current challenges of deepfakes.

Keywords: deepfakes, face manipulation, artificial intelligence, deep learning, autoencoders, GAN, forensics, survey.

T. T. Nguyen and D. T. Nguyen are with the School of Information Technology, Deakin University, Victoria, Australia. Q. V. H. Nguyen is with the School of Information and Communication Technology, Griffith University, Queensland, Australia. C. M. Nguyen is with LAMIH UMR CNRS 8201, Université Polytechnique Hauts-de-France, 59313 Valenciennes, France. D. Nguyen is with the Faculty of Information Technology, Monash University, Victoria, Australia. S. Nahavandi is with the Institute for Intelligent Systems Research and Innovation, Deakin University, Victoria, Australia. Corresponding e-mail: thanh.nguyen@deakin.edu.au (T. T. Nguyen).

I. INTRODUCTION

In a narrow definition, deepfakes (stemming from “deep learning” and “fake”) are created by techniques that can superimpose face images of a target person onto a video of a source person to make a video of the target person doing or saying things the source person does. This constitutes a category of deepfakes, namely face-swap. In a broader definition, deepfakes are artificial intelligence-synthesized content that can also fall into two other categories, i.e., lip-sync and puppet-master. Lip-sync deepfakes refer to videos that are modified to make the mouth movements consistent with an audio recording. Puppet-master deepfakes include videos of a target person (puppet) who is animated following the facial expressions, eye and head movements of another person (master) sitting in front of a camera [1].

While some deepfakes can be created by traditional visual effects or computer-graphics approaches, the recent common underlying mechanism for deepfake creation is deep learning models such as autoencoders and generative adversarial networks, which have been applied widely in the computer vision domain [2]–[4]. These models are used to examine facial expressions and movements of a person and synthesize facial images of another person making analogous expressions and movements [5]. Deepfake methods normally require a large amount of image and video data to train models to create photo-realistic images and videos. As public figures such as celebrities and politicians may have a large number of videos and images available online, they are initial targets of deepfakes. Deepfakes were used to swap the faces of celebrities or politicians onto bodies in porn images and videos. The first deepfake video emerged in 2017, where the face of a celebrity was swapped onto the face of a porn actor. It is threatening to world security when deepfake methods can be employed to create videos of world leaders with fake speeches for falsification purposes [6]–[8]. Deepfakes therefore can be abused to cause political or religious tensions between countries, to fool the public and affect results in election campaigns, or to create chaos in financial markets by creating fake news

[9]–[11]. It can even be used to generate fake satellite images of the Earth containing objects that do not really exist to confuse military analysts, e.g., creating a fake bridge across a river although there is no such bridge in reality. This can mislead troops who have been guided to cross the bridge in a battle [12], [13].

The democratization of creating realistic digital humans has positive implications, and there are also positive uses of deepfakes such as their applications in visual effects, digital avatars, Snapchat filters, creating voices for those who have lost theirs, or updating episodes of movies without reshooting them [14]. However, the number of malicious uses of deepfakes largely dominates that of the positive ones. The development of advanced deep neural networks and the availability of large amounts of data have made forged images and videos almost indistinguishable to humans and even to sophisticated computer algorithms. The process of creating those manipulated images and videos is also much simpler today as it needs as little as an identity photo or a short video of a target individual. Less and less effort is required to produce stunningly convincing tampered footage. Recent advances can even create a deepfake from just a still image [15]. Deepfakes therefore can be a threat affecting not only public figures but also ordinary people. For example, a voice deepfake was used to scam a CEO out of $243,000 [16]. A recent release of software called DeepNude showed more disturbing threats as it can transform a person into non-consensual porn [17]. Likewise, the Chinese app Zao has gone viral lately as less-skilled users can swap their faces onto the bodies of movie stars and insert themselves into well-known movies and TV clips [18]. These forms of falsification create a huge threat to privacy and identity, and affect many aspects of human lives.

Finding the truth in the digital domain has therefore become increasingly critical. It is even more challenging when dealing with deepfakes as they are mostly used to serve malicious purposes and almost anyone can create deepfakes these days using existing deepfake tools. Thus far, numerous methods have been proposed to detect deepfakes [19]–[23]. Most of them are based on deep learning, and thus a battle between malicious and positive uses of deep learning methods has been arising. To address the threat of face-swapping technology, or deepfakes, the United States Defense Advanced Research Projects Agency (DARPA) initiated a research scheme in media forensics (named Media Forensics or MediFor) to accelerate the development of fake digital visual media detection methods [24]. Recently, Facebook Inc., teaming up with Microsoft Corp. and the Partnership on AI coalition, launched the Deepfake Detection Challenge to catalyse more research and development in detecting and preventing deepfakes from being used to mislead viewers [25]. Data obtained from https://app.dimensions.ai at the end of 2020 show that the number of deepfake papers has increased significantly in recent years (Fig. 1). Although the obtained numbers of deepfake papers may be lower than the actual numbers, the research trend of this topic is obviously increasing.

Fig. 1. Number of papers related to deepfakes in the years from 2016 to 2020, obtained from https://app.dimensions.ai at the end of 2020 with the search keyword “deepfake” applied to the full text of scholarly papers. The numbers of such papers in 2018, 2019 and 2020 are 64, 368 and 1268, respectively.

This paper presents a survey of methods for creating as well as detecting deepfakes. There have been existing survey papers on this topic in [26]–[28]; we however carry out the survey with a different perspective and taxonomy. In Section II, we present the principles of deepfake algorithms and how deep learning has been used to enable such disruptive technologies. Section III reviews different methods for detecting deepfakes as well as their advantages and disadvantages. We discuss challenges, research trends and directions on deepfake detection and multimedia forensics problems in Section IV.

II. DEEPFAKE CREATION

Deepfakes have become popular due to the quality of tampered videos and also the ease of use of their applications by a wide range of users with various computer skills, from professional to novice. These applications are mostly developed based on deep learning techniques. Deep learning is well known for its capability of representing complex and high-dimensional data. One variant of deep networks with that capability is the deep autoencoder, which has been widely applied for dimensionality reduction and image compression [29]–[31]. The first attempt at deepfake creation was

TABLE I
SUMMARY OF NOTABLE DEEPFAKE TOOLS

Faceswap (https://github.com/deepfakes/faceswap)
- Uses two encoder-decoder pairs.
- Parameters of the encoder are shared.

Faceswap-GAN (https://github.com/shaoanlu/faceswap-GAN)
- Adversarial loss and perceptual loss (VGGFace) are added to an autoencoder architecture.

Few-Shot Face Translation (https://github.com/shaoanlu/fewshot-face-translation-GAN)
- Uses a pre-trained face recognition model to extract latent embeddings for GAN processing.
- Incorporates semantic priors obtained by modules from FUNIT [42] and SPADE [43].

DeepFaceLab (https://github.com/iperov/DeepFaceLab)
- Expands the Faceswap method with new models, e.g. H64, H128, LIAEF128, SAE [44].
- Supports multiple face extraction modes, e.g. S3FD, MTCNN, dlib, or manual [44].

DFaker (https://github.com/dfaker/df)
- The DSSIM loss function [45] is used to reconstruct faces.
- Implemented based on the Keras library.

DeepFake_tf (https://github.com/StromWine/DeepFake_tf)
- Similar to DFaker but implemented based on TensorFlow.

AvatarMe (https://github.com/lattas/AvatarMe)
- Reconstructs 3D faces from arbitrary “in-the-wild” images.
- Can reconstruct authentic 4K-by-6K-resolution 3D faces from a single low-resolution image [46].

MarioNETte (https://hyperconnect.github.io/MarioNETte)
- A few-shot face reenactment framework that preserves the target identity.
- No additional fine-tuning phase is needed for identity adaptation [47].

DiscoFaceGAN (https://github.com/microsoft/DiscoFaceGAN)
- Generates face images of virtual people with independent latent variables of identity, expression, pose, and illumination.
- Embeds 3D priors into adversarial learning [48].

StyleRig (https://gvv.mpi-inf.mpg.de/projects/StyleRig)
- Creates portrait images of faces with rig-like control over a pretrained and fixed StyleGAN via 3D morphable face models.
- Self-supervised without manual annotations [49].

FaceShifter (https://lingzhili.com/FaceShifterPage)
- High-fidelity face swapping by exploiting and integrating the target attributes.
- Can be applied to any new face pairs without requiring subject-specific training [50].

FSGAN (https://github.com/YuvalNirkin/fsgan)
- A face swapping and reenactment model that can be applied to pairs of faces without requiring training on those faces.
- Adjusts to both pose and expression variations [51].

Transformable Bottleneck Networks (https://github.com/kyleolsz/TB-Networks)
- A method for fine-grained 3D manipulation of image content.
- Applies spatial transformations in CNN models using a transformable bottleneck framework [52].

“Do as I Do” Motion Transfer (github.com/carolineec/EverybodyDanceNow)
- Automatically transfers the motion from a source to a target person by learning a video-to-video translation.
- Can create a motion-synchronized dancing video with multiple subjects [53].

Neural Voice Puppetry (https://justusthies.github.io/posts/neural-voice-puppetry)
- A method for audio-driven facial video synthesis.
- Synthesizes videos of a talking head from an audio sequence of another person using a 3D face representation [54].

FakeApp, developed by a Reddit user using an autoencoder-decoder pairing structure [32], [33]. In that method, the autoencoder extracts latent features of face images and the decoder is used to reconstruct the face images. To swap faces between source images and target images, two encoder-decoder pairs are needed, where each pair is used to train on an image set, and the encoder’s parameters are shared between the two network pairs. In other words, the two pairs have the same encoder network. This strategy enables the common encoder to find and learn the similarity between two sets of face images, which is relatively unchallenging because faces normally have similar features such as eye, nose and mouth positions. Fig. 2 shows a deepfake creation process where the feature set of face A is connected with decoder B to reconstruct face B from the original face A. This approach is applied in several works such as DeepFaceLab [34], DFaker [35], and DeepFake_tf (TensorFlow-based deepfakes) [36].

Fig. 2. A deepfake creation model using two encoder-decoder pairs. Two networks use the same encoder but different decoders for the training process (top). An image of face A is encoded with the common encoder and decoded with decoder B to create a deepfake (bottom).
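The shared-encoder, two-decoder scheme described above can be sketched with plain linear maps. This is a toy numpy illustration under assumed dimensions, not the implementation of any particular tool; real systems use deep convolutional networks, and the training loop is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: flattened 8x8 grayscale faces, 16-D latent code.
FACE_DIM, LATENT_DIM = 64, 16

# One encoder shared by both subjects; one decoder per subject.
W_enc = rng.normal(scale=0.1, size=(LATENT_DIM, FACE_DIM))
W_dec_a = rng.normal(scale=0.1, size=(FACE_DIM, LATENT_DIM))
W_dec_b = rng.normal(scale=0.1, size=(FACE_DIM, LATENT_DIM))

def encode(face):
    # The shared encoder learns features common to both face sets.
    return W_enc @ face

def reconstruct(face, w_dec):
    # During training, pair A is fit on subject A's images and pair B on
    # subject B's, each minimizing its own reconstruction error while
    # W_enc stays shared between the two pairs.
    return w_dec @ encode(face)

def face_swap(face_a):
    # At swap time, a face of A is encoded with the shared encoder but
    # decoded with B's decoder, producing B's identity with A's expression.
    return W_dec_b @ encode(face_a)

fake_b = face_swap(rng.normal(size=FACE_DIM))
```

Training (not shown) would alternate gradient steps on the two reconstruction losses; only the decoders are subject-specific, which is what forces the latent code to capture expression and pose rather than identity.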

By adding adversarial loss and perceptual loss implemented in VGGFace [37] to the encoder-decoder architecture, an improved version of deepfakes based on the generative adversarial network (GAN) [38], i.e. faceswap-GAN, was proposed in [39]. The VGGFace perceptual loss is added to make eye movements more realistic and consistent with the input faces and to help smooth out artifacts in the segmentation mask, leading to higher quality output videos. This model facilitates the creation of outputs with 64x64, 128x128, and 256x256 resolutions. In addition, the multi-task convolutional neural network (CNN) from the FaceNet implementation [40] is introduced to make face detection more stable and face alignment more reliable. CycleGAN [41] is utilized for the generative network implementation. Popular deepfake tools and their typical features are summarized in Table I.

III. DEEPFAKE DETECTION

Deepfakes are increasingly detrimental to privacy, societal security and democracy [55]. Methods for detecting deepfakes have been proposed as soon as this threat was introduced. Early attempts were based on handcrafted features obtained from artifacts and inconsistencies of the fake video synthesis process. Recent methods, on the other hand, apply deep learning to automatically extract salient and discriminative features to detect deepfakes [56], [57].

Deepfake detection is normally deemed a binary classification problem where classifiers are used to distinguish between authentic videos and tampered ones. This kind of method requires a large database of real and fake videos to train classification models. The number of fake videos is increasingly available, but it is still limited in terms of setting a benchmark for validating various detection methods. To address this issue, Korshunov and Marcel [58] produced a notable deepfake dataset consisting of 620 videos based on the GAN model using the open source code Faceswap-GAN [39]. Videos from the publicly available VidTIMIT database [59] were used to generate low and high quality deepfake videos, which can effectively mimic facial expressions, mouth movements, and eye blinking. These videos were then used to test various deepfake detection methods. Test results show that the popular face recognition systems based on VGG [60] and FaceNet [40], [61] are unable to detect deepfakes effectively. Other methods such as lip-syncing approaches [62]–[64] and image quality metrics with a support vector machine (SVM) [65] produce very high error rates when applied to detect deepfake videos from this newly produced dataset. This raises concerns about the critical need for future development of more robust methods that can distinguish deepfakes from genuine content.

Fig. 3. Categories of reviewed papers relevant to deepfake detection methods where we divide papers into two major groups, i.e. fake image detection and fake video detection.

This section presents a survey of deepfake detection methods where we group them into two major categories: fake image detection methods and fake video detection ones (see Fig. 3). The latter is divided into two smaller groups: methods based on visual artifacts within a single video frame and methods based on temporal features across frames. Whilst most of the methods based on temporal features use deep learning recurrent classification models, the methods using visual artifacts within video frames can be implemented by either deep or shallow classifiers.

A. Fake Image Detection

Face swapping has a number of compelling applications in video compositing, transfiguration in portraits, and especially in identity protection as it can replace faces in photographs with ones from a collection of stock images. However, it is also one of the techniques that cyber attackers employ to penetrate identification or authentication systems to gain illegitimate access. The use of deep learning such as CNNs and GANs has made swapped face images more challenging for forensics models as it can preserve the pose, facial expression and lighting of the photographs [66]. Zhang et al. [67] used the bag-of-words method to extract a set of compact features and fed it into various classifiers such as SVM [68], random forest (RF) [69] and multi-layer perceptrons (MLP) [70] for discriminating swapped face images from genuine ones. Among deep learning-generated images, those synthesised by GAN models are probably most difficult to detect as they are realistic and high-quality, owing to GANs’ capability to learn the distribution of the complex input data and generate new outputs with a similar distribution.

Most works on the detection of GAN-generated images, however, do not consider the generalization capability

of the detection models, although the development of GANs is ongoing and many new extensions of GANs are frequently introduced. Xuan et al. [71] used an image preprocessing step, e.g. Gaussian blur and Gaussian noise, to remove low-level high-frequency clues of GAN images. This increases the pixel-level statistical similarity between real and fake images and requires the forensic classifier to learn more intrinsic and meaningful features, which have better generalization capability than previous image forensics methods [72], [73] or image steganalysis networks [74].

On the other hand, Agarwal and Varshney [75] cast GAN-based deepfake detection as a hypothesis testing problem where a statistical framework was introduced using the information-theoretic study of authentication [76]. The minimum distance between distributions of legitimate images and images generated by a particular GAN is defined, namely the oracle error. The analytic results show that this distance increases when the GAN is less accurate, and in this case, it is easier to detect deepfakes. In the case of high-resolution image inputs, an extremely accurate GAN is required to generate fake images that are hard to detect.

Recently, Hsu et al. [77] introduced a two-phase deep learning method for the detection of deepfake images. The first phase is a feature extractor based on the common fake feature network (CFFN), where the Siamese network architecture presented in [78] is used. The CFFN encompasses several dense units, with each unit including different numbers of dense blocks [79] to improve the representative capability for the fake images. The number of dense units is three or five depending on whether the validation data are face or general images, and the number of channels in each unit varies up to a few hundred. Discriminative features between fake and real images, i.e. pairwise information, are extracted through the CFFN learning process. These features are then fed into the second phase, a small CNN concatenated to the last convolutional layer of the CFFN, to distinguish deceptive images from genuine ones. The proposed method is validated for both fake face and fake general image detection. On the one hand, the face dataset is obtained from CelebA [80], containing 10,177 identities and 202,599 aligned face images with various poses and background clutter. Five GAN variants are used to generate fake images with size of 64x64, including deep convolutional GAN (DCGAN) [81], Wasserstein GAN (WGAN) [82], WGAN with gradient penalty (WGAN-GP) [83], least squares GAN [84], and progressive growing of GAN (PGGAN) [85]. A total of 385,198 training images and 10,000 test images of both real and fake kinds are obtained for validating the proposed method. On the other hand, the general dataset is extracted from ILSVRC12 [86]. The large scale GAN training model for high fidelity natural image synthesis (BigGAN) [87], self-attention GAN [88] and spectral normalization GAN [89] are used to generate fake images with size of 128x128. The training set consists of 600,000 fake and real images whilst the test set includes 10,000 images of both types. Experimental results show the superior performance of the proposed method against its competing methods, such as those introduced in [90]–[93].

B. Fake Video Detection

Most image detection methods cannot be used for videos because of the strong degradation of the frame data after video compression [94]. Furthermore, videos have temporal characteristics that vary among sets of frames and are thus challenging for methods designed to detect only still fake images. This subsection focuses on deepfake video detection methods and categorizes them into two smaller groups: methods that employ temporal features and those that explore visual artifacts within frames.

1) Temporal Features across Video Frames: Based on the observation that temporal coherence is not enforced effectively in the synthesis process of deepfakes, Sabir et al. [95] leveraged the use of spatio-temporal features of video streams to detect deepfakes. Video manipulation is carried out on a frame-by-frame basis, so that low-level artifacts produced by face manipulations are believed to further manifest themselves as temporal artifacts with inconsistencies across frames. A recurrent convolutional network (RCN) was proposed based on the integration of the convolutional network DenseNet [79] and gated recurrent unit cells [96] to exploit temporal discrepancies across frames (see Fig. 4). The proposed method is tested on the FaceForensics++ dataset, which includes 1,000 videos [97], and shows promising results.

Fig. 4. A two-step process for face manipulation detection where the preprocessing step aims to detect, crop and align faces on a sequence of frames and the second step distinguishes manipulated and authentic face images by combining a convolutional neural network (CNN) and a recurrent neural network (RNN) [95].

Likewise, Guera and Delp [98] highlighted that deepfake videos contain intra-frame inconsistencies and temporal inconsistencies between frames. They then proposed a temporal-aware pipeline method that uses

a CNN and long short term memory (LSTM) to detect deepfake videos. The CNN is employed to extract frame-level features, which are then fed into the LSTM to create a temporal sequence descriptor. A fully-connected network is finally used for classifying doctored videos from real ones based on the sequence descriptor, as illustrated in Fig. 5.

Fig. 5. A deepfake detection method using a convolutional neural network (CNN) and long short term memory (LSTM) to extract temporal features of a given video sequence, which are represented via the sequence descriptor. The detection network, consisting of fully-connected layers, takes the sequence descriptor as input and calculates probabilities of the frame sequence belonging to either the authentic or the deepfake class [98].

On the other hand, the use of a physiological signal, eye blinking, to detect deepfakes was proposed in [99], based on the observation that a person in deepfakes blinks much less frequently than in untampered videos. A healthy adult human would normally blink somewhere between every 2 to 10 seconds, and each blink would take 0.1 to 0.4 seconds. Deepfake algorithms, however, often use face images available online for training, which normally show people with open eyes, i.e. very few images published on the internet show people with closed eyes. Thus, without having access to images of people blinking, deepfake algorithms do not have the capability to generate fake faces that can blink normally. In other words, blinking rates in deepfakes are much lower than those in normal videos. To discriminate real and fake videos, Li et al. [99] first decompose the videos into frames, from which face regions and then eye areas are extracted based on six eye landmarks. After a few pre-processing steps such as aligning faces, extracting and scaling the bounding boxes of eye landmark points to create new sequences of frames, these cropped eye area sequences are fed into long-term recurrent convolutional networks (LRCN) [100] for dynamic state prediction. The LRCN consists of a feature extractor based on a CNN, a sequence learner based on long short term memory (LSTM), and a state predictor based on a fully connected layer to predict the probability of eye open and closed states. Eye blinking shows strong temporal dependencies and thus the implementation of LSTM helps to capture these temporal patterns effectively. The blinking rate is calculated from the prediction results, where a blink is defined as a peak above the threshold of 0.5 with a duration of less than 7 frames. This method is evaluated on a dataset collected from the web consisting of 49 interview and presentation videos and their corresponding fake videos generated by deepfake algorithms. The experimental results indicate promising performance of the proposed method in detecting fake videos, which can be further improved by considering the dynamic pattern of blinking, e.g. highly frequent blinking may also be a sign of tampering.

2) Visual Artifacts within Video Frames: As can be noticed in the previous subsection, the methods using temporal patterns across video frames are mostly based on deep recurrent network models to detect deepfake videos. This subsection investigates the other approach, which normally decomposes videos into frames and explores visual artifacts within single frames to obtain discriminant features. These features are then fed into either a deep or a shallow classifier to differentiate between fake and authentic videos. We thus group methods in this subsection based on the types of classifiers, i.e. either deep or shallow.

a) Deep classifiers: Deepfake videos are normally created with limited resolutions, which requires an affine face warping approach (i.e., scaling, rotation and shearing) to match the configuration of the original videos. Because of the resolution inconsistency between the warped face area and the surrounding context, this process leaves artifacts that can be detected by CNN models such as VGG16 [101], ResNet50, ResNet101 and ResNet152 [102]. A deep learning method to detect deepfakes based on the artifacts observed during the face warping step of the deepfake generation algorithms was proposed in [103]. The proposed method is evaluated on two deepfake datasets, namely UADFV and DeepfakeTIMIT. The UADFV dataset [104] contains 49 real videos and 49 fake videos with 32,752 frames in total. The DeepfakeTIMIT dataset [64] includes a set of low quality videos of 64x64 size and another set of high quality videos of 128x128, with in total 10,537 pristine images and 34,023 fabricated images extracted from 320 videos for each quality set. Performance of the proposed method is compared with other prevalent methods such as the two deepfake detection MesoNet methods, i.e. Meso-4 and MesoInception-4 [94], HeadPose [104], and the face tampering detection method two-stream NN [105]. An advantage of the proposed method is that it does not need deepfake videos to be generated as negative examples before training the detection models. Instead, the negative examples are generated dynamically by extracting the face region of the original image, aligning it into multiple scales, applying Gaussian blur to a randomly picked scaled image, and warping it back to the original image.
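The dynamic negative-example generation just described (extract the face region, rescale, blur, warp back) can be sketched as follows. This is a schematic numpy version, not the authors' code: a box filter stands in for Gaussian blur, nearest-neighbour resampling stands in for the affine warp, and the face box is assumed to be given rather than detected.

```python
import random
import numpy as np

def box_blur(img, k=3):
    """Crude stand-in for Gaussian blur: average over a k x k window."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def make_negative_example(frame, face_box, scales=(0.5, 0.75, 1.0)):
    """Simulate the resolution inconsistency of warped deepfake faces:
    downscale the face region, blur it, and paste it back into the frame."""
    y0, y1, x0, x1 = face_box
    face = frame[y0:y1, x0:x1].astype(float)
    h, w = face.shape
    # Pick a random scale and downscale by pixel striding.
    s = random.choice(scales)
    stride = max(1, int(round(1 / s)))
    small = box_blur(face[::stride, ::stride])
    # "Warp back": nearest-neighbour upsampling to the original box size.
    ys = np.arange(h) * small.shape[0] // h
    xs = np.arange(w) * small.shape[1] // w
    out = frame.astype(float).copy()
    out[y0:y1, x0:x1] = small[np.ix_(ys, xs)]
    return out

frame = np.random.rand(64, 64)          # stand-in for a pristine frame
fake = make_negative_example(frame, (16, 48, 16, 48))
```

The appeal of this trick is that every pristine frame yields a fresh "fake" on the fly, so no deepfake generator has to be run ahead of training.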

This reduces a large amount of time and computational resources compared to other methods, which require deepfakes to be generated in advance.

Nguyen et al. [106] proposed the use of capsule networks for detecting manipulated images and videos. The capsule network was initially introduced to address limitations of CNNs when applied to inverse graphics tasks, which aim to find the physical processes used to produce images of the world [107]. The recent development of the capsule network based on the dynamic routing algorithm [108] demonstrates its ability to describe the hierarchical pose relationships between object parts. This development is employed as a component in a pipeline for detecting fabricated images and videos, as demonstrated in Fig. 6. A dynamic routing algorithm is deployed to route the outputs of the three capsules to the output capsules through a number of iterations to separate fake from real images. The method is evaluated on four datasets covering a wide range of forged image and video attacks. They include the well-known Idiap Research Institute replay-attack dataset [109], the deepfake face swapping dataset created by Afchar et al. [94], the facial reenactment FaceForensics dataset [110], produced by the Face2Face method [111], and the fully computer-generated image dataset generated by Rahmouni et al. [112]. The proposed method yields the best performance compared to its competing methods on all of these datasets. This shows the potential of the capsule network in building a general detection system that can work effectively for various forged image and video attacks.

Fig. 6. The capsule network takes features obtained from the VGG-19 network [101] to distinguish fake images or videos from real ones (top). The pre-processing step detects the face region and scales it to the size of 128x128 before VGG-19 is used to extract latent features for the capsule network, which comprises three primary capsules and two output capsules, one for real and one for fake images (bottom). The statistical pooling constitutes an important part of the capsule network.

b) Shallow classifiers: Yang et al. [104] proposed a detection method by observing the differences between 3D head poses, comprising head orientation and position, which are estimated based on 68 facial landmarks of the central face region. The 3D head poses are examined because there is a shortcoming in the deepfake face generation pipeline. The extracted features are fed into an SVM classifier to obtain the detection results. Experiments on two datasets show the great performance of the proposed approach against its competing methods. The first dataset, namely UADFV, consists of 49 deepfake videos and their respective real videos [104]. The second dataset comprises 241 real images and 252 deepfake images, which is a subset of the data used in the DARPA MediFor GAN Image/Video Challenge [113]. Likewise, a method to exploit artifacts of deepfakes and face manipulations based on visual features of eyes, teeth and facial contours was studied in [114]. The visual artifacts arise from a lack of global consistency, wrong or imprecise estimation of the incident illumination, or imprecise estimation of the underlying geometry. For deepfake detection, missing reflections and missing details in the eye and teeth areas are exploited, as well as texture features extracted from the facial region based on facial landmarks. Accordingly, the eye feature vector, teeth feature vector and features extracted from the full-face crop are used. After extracting the features, two classifiers, logistic regression and a small neural network, are employed to classify deepfakes from real videos. Experiments carried out on a video dataset downloaded from YouTube show a best result of 0.851 in terms of the area under the receiver operating characteristic curve. The proposed method, however, has the disadvantage of requiring images that meet certain prerequisites such as open eyes or visible teeth.

The use of photo response non-uniformity (PRNU) analysis was proposed in [115] to detect deepfakes from authentic videos. PRNU is a component of sensor pattern noise, which is attributed to the manufacturing imperfection of silicon wafers and the inconsistent sensitivity of pixels to light because of the variation of the physical characteristics of the silicon wafers. When a photo is taken, the sensor imperfection is introduced into the high-frequency bands of the content in the form of invisible noise. Because the imperfection is not uniform across the silicon wafer, even sensors made from the same silicon wafer produce unique PRNU. Therefore, PRNU is often considered as the fingerprint of digital cameras left in
that deals with forgery detection [106]. the images by the cameras [116]. The analysis is widely
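The statistical pooling mentioned in the caption can be sketched as a per-channel reduction of the VGG-19 latent feature maps. This is an illustrative stand-in with hypothetical array shapes, not the implementation of [106]:

```python
import numpy as np

def statistical_pooling(feature_maps: np.ndarray) -> np.ndarray:
    """Reduce a (channels, height, width) stack of latent feature maps
    to per-channel means followed by per-channel variances."""
    flat = feature_maps.reshape(feature_maps.shape[0], -1)
    return np.concatenate([flat.mean(axis=1), flat.var(axis=1)])

# Toy example: 64 latent maps of size 8x8 for one face crop.
latent = np.random.default_rng(0).normal(size=(64, 8, 8))
print(statistical_pooling(latent).shape)  # prints (128,)
```

Pooling to fixed-size statistics is what lets each capsule summarise feature maps of arbitrary spatial size before the routing iterations.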
b) Shallow classifiers: Deepfake detection methods mostly rely on artifacts or inconsistencies of intrinsic features between fake and real images or videos. Yang et al. [104] extracted features from 68 landmarks of the face region to capture head pose inconsistencies introduced in the deepfake face generation pipeline. The extracted features are fed into an SVM classifier to obtain the detection results. Experiments on two datasets show that the proposed approach performs well against its competing methods. The first dataset, namely UADFV, consists of 49 deepfake videos and their respective real videos [104]. The second dataset comprises 241 real images and 252 deepfake images, which is a subset of the data used in the DARPA MediFor GAN Image/Video Challenge [113]. Likewise, a method to exploit artifacts of deepfakes and face manipulations based on visual features of the eyes, teeth and facial contours was studied in [114]. The visual artifacts arise from a lack of global consistency, wrong or imprecise estimation of the incident illumination, or imprecise estimation of the underlying geometry. For deepfake detection, missing reflections and missing details in the eye and teeth areas are exploited, as well as texture features extracted from the facial region based on facial landmarks. Accordingly, the eye feature vector, teeth feature vector and features extracted from the full-face crop are used. After extracting the features, two classifiers, i.e., logistic regression and a small neural network, are employed to distinguish deepfakes from real videos. Experiments carried out on a video dataset downloaded from YouTube show a best result of 0.851 in terms of the area under the receiver operating characteristic curve. The proposed method however has the disadvantage of requiring images that meet certain prerequisites, such as open eyes or visible teeth.
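The head pose cue of [104] can be illustrated with a deliberately simplified 2D sketch: estimate the rotation aligning the observed landmarks to a reference shape twice, once from all 68 points and once from a central-face subset, and use the gap as a feature for the SVM. The `rotation_angle` and `pose_inconsistency` helpers and the central index range are hypothetical, and the published method estimates head poses in 3D rather than with this toy 2D Procrustes fit:

```python
import numpy as np

def rotation_angle(src: np.ndarray, dst: np.ndarray) -> float:
    """Best 2D rotation aligning src to dst (Kabsch algorithm;
    reflection handling omitted for brevity)."""
    a = src - src.mean(axis=0)
    b = dst - dst.mean(axis=0)
    u, _, vt = np.linalg.svd(a.T @ b)
    r = vt.T @ u.T
    return float(np.arctan2(r[1, 0], r[0, 0]))

def pose_inconsistency(landmarks: np.ndarray, reference: np.ndarray,
                       central: slice = slice(27, 48)) -> float:
    """Gap between the pose estimated from all 68 landmarks and from a
    central-face subset only (index range is a hypothetical choice)."""
    return abs(rotation_angle(landmarks, reference)
               - rotation_angle(landmarks[central], reference[central]))

# A consistently rotated face yields a near-zero gap; a face whose
# central region was synthesized with a different pose does not.
rng = np.random.default_rng(1)
reference = rng.normal(size=(68, 2))
theta = 0.3
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta), np.cos(theta)]])
genuine = reference @ rot.T
print(pose_inconsistency(genuine, reference))  # close to 0
```

In a full pipeline, such gap features computed over many frames would be stacked into vectors and passed to an SVM for the real/fake decision.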
The use of photo response non-uniformity (PRNU) analysis was proposed in [115] to distinguish deepfakes from authentic videos. PRNU is a component of sensor pattern noise, which is attributed to the manufacturing imperfections of silicon wafers and the inconsistent sensitivity of pixels to light caused by variations in the physical characteristics of the silicon wafers. When a photo is taken, the sensor imperfection is introduced into the high-frequency bands of the content in the form of invisible noise. Because the imperfection is not uniform across the silicon wafer, even sensors made from the same silicon wafer produce unique PRNU. Therefore, PRNU is often considered the fingerprint that digital cameras leave in their images [116]. The analysis is widely used in image forensics [117]–[120] and is advocated for use in [115] because the swapped face is supposed to alter the local PRNU pattern in the facial area of video frames. The videos are converted into frames, which are cropped to the questioned facial region. The cropped frames are then separated sequentially into eight groups, and an average PRNU pattern is computed for each group. Normalised cross-correlation scores are calculated to compare the PRNU patterns among these groups. The authors in [115] created a test dataset consisting of 10 authentic videos and 16 manipulated videos, where the fake videos were produced from the genuine ones by the DeepFaceLab tool [34]. The analysis shows a significant statistical difference in mean normalised cross-correlation scores between the deepfakes and the genuine videos. This analysis therefore suggests that PRNU has potential in deepfake detection, although a larger dataset would need to be tested.
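The grouping-and-correlation procedure just described can be sketched as follows. This is a minimal illustration: the box-blur noise residual is a crude stand-in for the wavelet-based PRNU extraction used in real forensics pipelines, and the function names are ours, not those of [115]:

```python
import numpy as np
from itertools import combinations

def noise_residual(frame: np.ndarray) -> np.ndarray:
    """Crude noise residual: frame minus a 3x3 box-blurred copy."""
    h, w = frame.shape
    padded = np.pad(frame, 1, mode="edge")
    blurred = sum(padded[i:i + h, j:j + w]
                  for i in range(3) for j in range(3)) / 9.0
    return frame - blurred

def ncc(a: np.ndarray, b: np.ndarray) -> float:
    """Normalised cross-correlation of two equally sized patterns."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_group_ncc(frames, n_groups: int = 8) -> float:
    """Split frames sequentially into groups, average each group's
    residual, and return the mean pairwise NCC between group patterns."""
    groups = np.array_split(np.stack(frames), n_groups)
    patterns = [np.mean([noise_residual(f) for f in g], axis=0)
                for g in groups]
    return float(np.mean([ncc(p, q)
                          for p, q in combinations(patterns, 2)]))
```

Frames that share one sensor fingerprint produce a high mean score, while face regions pasted in from another source correlate weakly across groups, which is the statistical gap exploited in [115].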
When viewing a video or image with suspicion, users normally want to search for its origin. However, no such tool is currently available. Hasan and Salah [121] proposed the use of blockchain and smart contracts to help users detect deepfake videos, based on the assumption that videos are only real when their sources are traceable. Each video is associated with a smart contract that links to its parent video, and each parent video has a link to its child in a hierarchical structure. Through this chain, users can credibly trace back to the original smart contract associated with the pristine video even if the video has been copied multiple times. An important attribute of the smart contract is the unique hashes of the InterPlanetary File System (IPFS), which is used to store the video and its metadata in a decentralized and content-addressable manner [122]. The smart contract's key features and functionalities were tested against several common security challenges, such as distributed denial of service, replay and man-in-the-middle attacks, to ensure the solution meets security requirements. This approach is generic, and it can be extended to other types of digital content, e.g., images, audio and manuscripts.
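The parent-child tracing idea of [121] can be sketched minimally as below. `VideoContract`, `trace_to_origin` and the SHA-256 stand-in for IPFS content identifiers are hypothetical simplifications of the actual Ethereum/IPFS design:

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional

def content_hash(data: bytes) -> str:
    """Stand-in for an IPFS content identifier (CID)."""
    return hashlib.sha256(data).hexdigest()

@dataclass
class VideoContract:
    """Hypothetical smart-contract record: a video's content hash plus
    a link to the contract it was copied from (None for the original)."""
    cid: str
    parent_cid: Optional[str]

def trace_to_origin(cid: str, registry: Dict[str, VideoContract]) -> str:
    """Follow parent links up the chain back to the pristine video."""
    contract = registry[cid]
    while contract.parent_cid is not None:
        contract = registry[contract.parent_cid]
    return contract.cid

# Demo: a pristine video and a re-encoded copy registered in the chain.
original = VideoContract(cid=content_hash(b"pristine video bytes"),
                         parent_cid=None)
copy1 = VideoContract(cid=content_hash(b"re-encoded copy"),
                      parent_cid=original.cid)
registry = {c.cid: c for c in (original, copy1)}
print(trace_to_origin(copy1.cid, registry) == original.cid)  # True
```

A video whose chain cannot be walked back to a registered pristine contract would be treated as untrustworthy under this scheme.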
IV. DISCUSSIONS AND FUTURE RESEARCH DIRECTIONS

Deepfakes have begun to erode people's trust in media contents, as seeing them is no longer commensurate with believing in them. They could cause distress and negative effects to those targeted, heighten disinformation and hate speech, and even stimulate political tension, inflame the public, or provoke violence or war. This is especially critical nowadays as the technologies for creating deepfakes are increasingly approachable and social media platforms can spread such fake contents quickly [123]. Sometimes deepfakes do not need to be spread to a massive audience to cause detrimental effects. People who create deepfakes with malicious purposes only need to deliver them to target audiences as part of their sabotage strategy, without using social media. For example, this approach can be utilized by intelligence services trying to influence decisions made by important people such as politicians, leading to national and international security threats [124]. In response to this alarming problem, the research community has focused on developing deepfake detection algorithms, and numerous results have been reported. This paper has reviewed the state-of-the-art methods, and a summary of typical approaches is provided in Table II. It is noticeable that a battle is growing between those who use advanced machine learning to create deepfakes and those who make the effort to detect them.

The quality of deepfakes has been increasing, and the performance of detection methods needs to improve accordingly. The inspiration is that what AI has broken can be fixed by AI as well [125]. Detection methods are still in their early stage, and various methods have been proposed and evaluated, but on fragmented datasets. An approach to improving the performance of detection methods is to create a growing, regularly updated benchmark dataset of deepfakes to validate their ongoing development. This will facilitate the training process of detection models, especially those based on deep learning, which requires a large training set [126].

On the other hand, current detection methods mostly focus on drawbacks of the deepfake generation pipelines, i.e., finding weaknesses of the competitors to attack them. This kind of information and knowledge is not always available in adversarial environments, where attackers commonly attempt not to reveal their deepfake creation technologies. Recent works on adversarial perturbation attacks that fool DNN-based detectors make the deepfake detection task even more difficult [127]–[131]. These are real challenges for detection method development, and future research needs to focus on introducing more robust, scalable and generalizable methods.

Another research direction is to integrate detection methods into distribution platforms such as social media to increase their effectiveness in dealing with the widespread impact of deepfakes. A screening or filtering mechanism using effective detection methods can be implemented on these platforms to ease deepfake detection [124]. Legal requirements can be imposed on the tech companies that own these platforms to remove deepfakes quickly and reduce their impacts. In addition, watermarking tools can be integrated into the devices that people use to create digital contents, producing immutable metadata that stores originality details, such as the time and location of multimedia contents, as well as their untampered attestation [124]. This integration is difficult to implement, but a solution for this could be the use
of the disruptive blockchain technology. The blockchain has been used effectively in many areas, and there are very few studies so far addressing deepfake detection problems based on this technology. As it can create a chain of unique, unchangeable blocks of metadata, it is a great tool for digital provenance solutions. The integration of blockchain technologies into this problem has demonstrated certain results [121], but this research direction is far from mature.

Using detection methods to spot deepfakes is crucial, but understanding the real intent of people publishing deepfakes is even more important. This requires the judgement of users based on the social context in which a deepfake is discovered, e.g., who distributed it and what they said about it [132]. This is critical as deepfakes are getting more and more photorealistic, and it is highly anticipated that detection software will lag behind deepfake creation technology. A study on the social context
TABLE II
SUMMARY OF PROMINENT DEEPFAKE DETECTION METHODS

Methods | Classifiers/Techniques | Key Features | Dealing with | Datasets Used

Eye blinking [99] | LRCN | Use LRCN to learn the temporal patterns of eye blinking, based on the observation that the blinking frequency of deepfakes is much smaller than normal. | Videos | Consists of 49 interview and presentation videos, and their corresponding generated deepfakes.

Intra-frame and temporal inconsistencies [98] | CNN and LSTM | A CNN is employed to extract frame-level features, which are distributed to an LSTM to construct a sequence descriptor useful for classification. | Videos | A collection of 600 videos obtained from multiple websites.

Using face warping artifacts [103] | VGG16 [101], ResNet50, 101 or 152 [102] | Artifacts are discovered using CNN models based on resolution inconsistency between the warped face area and the surrounding context. | Videos | UADFV [104], containing 49 real videos and 49 fake videos with 32752 frames in total; DeepfakeTIMIT [64].

MesoNet [94] | CNN | Two deep networks, i.e. Meso-4 and MesoInception-4, are introduced to examine deepfake videos at the mesoscopic analysis level; accuracy obtained on the deepfake and FaceForensics datasets is 98% and 95% respectively. | Videos | Two datasets: a deepfake one constituted from online videos and the FaceForensics one created by the Face2Face approach [111].

Capsule-forensics [106] | Capsule networks | Latent features extracted by the VGG-19 network [101] are fed into the capsule network for classification; a dynamic routing algorithm [108] is used to route the outputs of three convolutional capsules to two output capsules, one for fake and another for real images, through a number of iterations. | Videos/Images | Four datasets: the Idiap Research Institute replay-attack dataset [109], the deepfake face swapping dataset of [94], the facial reenactment FaceForensics dataset [110], and a fully computer-generated image set using [112].

Head poses [104] | SVM | Features are extracted using 68 landmarks of the face region, and an SVM classifies based on the extracted features. | Videos/Images | UADFV, consisting of 49 deepfake videos and their respective real videos; 241 real images and 252 deepfake images from the DARPA MediFor GAN Image/Video Challenge.

Eye, teeth and facial texture [114] | Logistic regression and neural network | Exploits facial texture differences, and missing reflections and details in the eye and teeth areas of deepfakes; logistic regression and a neural network are used for classification. | Videos | A video dataset downloaded from YouTube.

Spatio-temporal features with RCN [95] | RCN | Temporal discrepancies across frames are explored using an RCN that integrates the convolutional network DenseNet [79] and gated recurrent unit cells [96]. | Videos | FaceForensics++ dataset, including 1,000 videos [97].

Spatio-temporal features with LSTM [140] | Convolutional bidirectional recurrent LSTM network | An XceptionNet CNN is used for facial feature extraction while audio embeddings are obtained by stacking multiple convolution modules; two loss functions, i.e. cross-entropy and Kullback-Leibler divergence, are used. | Videos | FaceForensics++ [97] and Celeb-DF (5,639 deepfake videos) [141] datasets and the ASVSpoof 2019 Logical Access audio dataset [142].

Analysis of PRNU [115] | PRNU | Analysis of the noise patterns of the light-sensitive sensors of digital cameras due to their factory defects; explores the differences in PRNU patterns between authentic and deepfake videos because face swapping is believed to alter the local PRNU patterns. | Videos | Created by the authors, including 10 authentic and 16 deepfake videos made using DeepFaceLab [34].

Phoneme-viseme mismatches [133] | CNN | Exploits mismatches between the dynamics of the mouth shape, i.e. visemes, and a spoken phoneme; focuses on sounds associated with the M, B and P phonemes as they require complete mouth closure, which deepfakes often incorrectly synthesize. | Videos | Four in-the-wild lip-sync deepfakes from Instagram and YouTube (www.instagram.com/bill_posters_uk and youtu.be/VWMEDacz3L4); others are created using synthesis techniques, i.e. Audio-to-Video (A2V) [63] and Text-to-Video (T2V) [134].
Using attribution-based confidence (ABC) metric [135] | ResNet50 model [102], pre-trained on VGGFace2 [136] | The ABC metric [137] is used to detect deepfake videos without access to training data; ABC values obtained for original videos are greater than 0.94 while those of deepfakes have low ABC values. | Videos | VidTIMIT and two other original datasets obtained from COHFACE [138] (https://www.idiap.ch/dataset/cohface) and from YouTube; the COHFACE and YouTube datasets are used to generate two deepfake datasets via the commercial website https://deepfakesweb.com, and another deepfake dataset is DeepfakeTIMIT [139].

Emotion audio-visual affective cues [143] | Siamese network [78] | Modality and emotion embedding vectors for the face and speech are extracted for deepfake detection. | Videos | DeepfakeTIMIT [139] and DFDC [144].

Using appearance and behaviour [145] | Rules defined based on facial and behavioural features | Temporal, behavioural biometrics based on facial expressions and head movements are learned using ResNet-101 [102] while static facial biometrics are obtained using VGG [60]. | Videos | The world leaders dataset [1], FaceForensics++ [97], Google/Jigsaw deepfake detection dataset [146], DFDC [144] and Celeb-DF [141].

FakeCatcher [147] | CNN | Extracts biological signals in portrait videos and uses them as an implicit descriptor of authenticity because they are not spatially and temporally well-preserved in deepfakes. | Videos | UADFV [104], FaceForensics [110], FaceForensics++ [97], Celeb-DF [141], and a new dataset of 142 videos, independent of the generative model, resolution, compression, content, and context.

Preprocessing combined with deep network [71] | DCGAN, WGAN-GP and PGGAN | Enhances the generalization ability of deep learning models to detect GAN-generated images; removes low-level features of fake images; forces deep networks to focus more on pixel-level similarity between fake and real images to improve generalization ability. | Images | Real dataset: CelebA-HQ [85], including high-quality face images of 1024x1024 resolution; fake datasets generated by DCGAN [81], WGAN-GP [83] and PGGAN [85].

Bag of words and shallow classifiers [67] | SVM, RF, MLP | Extracts discriminant features using the bag-of-words method and feeds these features into SVM, RF and MLP for binary classification: innocent vs fabricated. | Images | The well-known LFW face database [148], containing 13,223 images with resolution of 250x250.

Pairwise learning [77] | CNN concatenated to CFFN | Two-phase procedure: feature extraction using CFFN based on the Siamese network architecture [78] and classification using CNN. | Images | Face images: real ones from CelebA [80], and fake ones generated by DCGAN [81], WGAN [82], WGAN-GP [83], least squares GAN [84], and PGGAN [85]. General images: real ones from ILSVRC12 [86], and fake ones generated by BIGGAN [87], self-attention GAN [88] and spectral normalization GAN [89].

Defenses against adversarial perturbations in deepfakes [127] | VGG [60] and ResNet [102] | Introduces adversarial perturbations to enhance deepfakes and fool deepfake detectors; improves accuracy of deepfake detectors using Lipschitz regularization and deep image prior techniques. | Images | 5,000 real images from CelebA [80] and 5,000 fake images created by the "Few-Shot Face Translation GAN" method [149].

Analyzing convolutional traces [150] | KNN, SVM, and linear discriminant analysis | Uses an expectation-maximization algorithm to extract local features pertaining to the convolutional generative process of GAN-based image deepfake generators. | Images | Authentic images from CelebA and corresponding deepfakes created by five different GANs (group-wise deep whitening-and-coloring transformation GDWCT [151], StarGAN [152], AttGAN [153], StyleGAN [154], StyleGAN2 [155]).

Face X-ray [156] | CNN | Tries to locate the blending boundary between the target and original faces instead of capturing the synthesized artifacts of specific manipulations; can be trained without fake images. | Images | FaceForensics++ [97], DeepfakeDetection (DFD) [146], DFDC [144] and Celeb-DF [141].

Using common artifacts of CNN-generated images [157] | ResNet-50 [102] pre-trained with ImageNet [86] | Trains the classifier using a large number of fake images generated by a high-performing unconditional GAN model, i.e., ProGAN [85], and evaluates how well the classifier generalizes to other CNN-synthesized images. | Images | A new dataset of CNN-generated images, namely ForenSynths, consisting of synthesized images from 11 models such as StyleGAN [154], super-resolution methods [158] and FaceForensics++ [97].
of deepfakes to assist users in such judgement is thus worth performing.

Videos and photographs have been widely used as evidence in police investigations and justice cases. They may be introduced as evidence in a court of law by digital media forensics experts, who have backgrounds in computer science or law enforcement and experience in collecting, examining and analysing digital information. The development of machine learning and AI technologies might have been used to modify this digital content, and thus the experts' opinions may not be enough to authenticate such evidence, because even experts are unable to discern manipulated contents. This aspect needs to be taken into account in courtrooms nowadays when images and videos are used as evidence to convict perpetrators, because of the existence of a wide range of digital manipulation methods [159]. The digital media forensics results therefore must be proved valid and reliable before they can be used in courts. This
requires careful documentation of each step of the forensics process and how the results are reached. Machine learning and AI algorithms can be used to support the determination of the authenticity of digital media and have obtained accurate and reliable results, e.g., [160], [161], but most of these algorithms are unexplainable. This creates a huge hurdle for the application of AI to forensics problems because not only do forensics experts oftentimes lack expertise in computer algorithms, but computer professionals also cannot explain the results properly, as most of these algorithms are black-box models [162]. This is all the more critical as the most recent models with the most accurate results are based on deep learning methods consisting of many neural network parameters. Explainable AI in computer vision is therefore a research direction that is needed to promote and utilize the advances and advantages of AI and machine learning in digital media forensics.

REFERENCES

[1] Agarwal, S., Farid, H., Gu, Y., He, M., Nagano, K., and Li, H. (2019, June). Protecting world leaders against deep fakes. In Computer Vision and Pattern Recognition Workshops (pp. 38-45).
[2] Tewari, A., Zollhoefer, M., Bernard, F., Garrido, P., Kim, H., Perez, P., and Theobalt, C. (2020). High-fidelity monocular face reconstruction based on an unsupervised model-based face autoencoder. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2), 357-370.
[3] Lin, J., Li, Y., and Yang, G. (2021). FPGAN: Face de-identification method with generative adversarial networks for social robots. Neural Networks, 133, 132-147.
[4] Liu, M. Y., Huang, X., Yu, J., Wang, T. C., and Mallya, A. (2021). Generative adversarial networks for image and video synthesis: Algorithms and applications. Proceedings of the IEEE, DOI: 10.1109/JPROC.2021.3049196.
[5] Lyu, S. (2018, August 29). Detecting 'deepfake' videos in the blink of an eye. Available at http://theconversation.com/detecting-deepfake-videos-in-the-blink-of-an-eye-101072
[6] Bloomberg (2018, September 11). How faking videos became easy and why that's so scary. Available at https://fortune.com/2018/09/11/deep-fakes-obama-video/
[7] Chesney, R., and Citron, D. (2019). Deepfakes and the new disinformation war: The coming age of post-truth geopolitics. Foreign Affairs, 98, 147.
[8] Hwang, T. (2020). Deepfakes: A Grounded Threat Assessment. Centre for Security and Emerging Technologies, Georgetown University.
[9] Zhou, X., and Zafarani, R. (2020). A survey of fake news: fundamental theories, detection methods, and opportunities. ACM Computing Surveys (CSUR), DOI: https://doi.org/10.1145/3395046.
[10] Kaliyar, R. K., Goswami, A., and Narang, P. (2020). Deepfake: improving fake news detection using tensor decomposition based deep neural network. Journal of Supercomputing, DOI: https://doi.org/10.1007/s11227-020-03294-y.
[11] Guo, B., Ding, Y., Yao, L., Liang, Y., and Yu, Z. (2020). The future of false information detection on social media: new perspectives and trends. ACM Computing Surveys (CSUR), 53(4), 1-36.
[12] Tucker, P. (2019, March 31). The newest AI-enabled weapon: 'Deep-Faking' photos of the earth. Available at https://www.defenseone.com/technology/2019/03/next-phase-ai-deep-faking-whole-world-and-china-ahead/155944/
[13] Fish, T. (2019, April 4). Deep fakes: AI-manipulated media will be 'weaponised' to trick military. Available at https://www.express.co.uk/news/science/1109783/deep-fakes-ai-artificial-intelligence-photos-video-weaponised-china
[14] Marr, B. (2019, July 22). The best (and scariest) examples of AI-enabled deepfakes. Available at https://www.forbes.com/sites/bernardmarr/2019/07/22/the-best-and-scariest-examples-of-ai-enabled-deepfakes/
[15] Zakharov, E., Shysheya, A., Burkov, E., and Lempitsky, V. (2019). Few-shot adversarial learning of realistic neural talking head models. arXiv preprint arXiv:1905.08233.
[16] Damiani, J. (2019, September 3). A voice deepfake was used to scam a CEO out of $243,000. Available at https://www.forbes.com/sites/jessedamiani/2019/09/03/a-voice-deepfake-was-used-to-scam-a-ceo-out-of-243000/
[17] Samuel, S. (2019, June 27). A guy made a deepfake app to turn photos of women into nudes. It didn't go well. Available at https://www.vox.com/2019/6/27/18761639/ai-deepfake-deepnude-app-nude-women-porn
[18] The Guardian (2019, September 2). Chinese deepfake app Zao sparks privacy row after going viral. Available at https://www.theguardian.com/technology/2019/sep/02/chinese-face-swap-app-zao-triggers-privacy-fears-viral
[19] Lyu, S. (2020, July). Deepfake detection: current challenges and next steps. In IEEE International Conference on Multimedia and Expo Workshops (ICMEW) (pp. 1-6). IEEE.
[20] Guarnera, L., Giudice, O., Nastasi, C., and Battiato, S. (2020). Preliminary forensics analysis of deepfake images. arXiv preprint arXiv:2004.12626.
[21] Jafar, M. T., Ababneh, M., Al-Zoube, M., and Elhassan, A. (2020, April). Forensics and analysis of deepfake videos. In The 11th International Conference on Information and Communication Systems (ICICS) (pp. 053-058). IEEE.
[22] Trinh, L., Tsang, M., Rambhatla, S., and Liu, Y. (2020). Interpretable deepfake detection via dynamic prototypes. arXiv preprint arXiv:2006.15473.
[23] Younus, M. A., and Hasan, T. M. (2020, April). Effective and fast deepfake detection method based on Haar wavelet transform. In 2020 International Conference on Computer Science and Software Engineering (CSASE) (pp. 186-190). IEEE.
[24] Turek, M. (2019). Media Forensics (MediFor). Available at https://www.darpa.mil/program/media-forensics
[25] Schroepfer, M. (2019, September 5). Creating a data set and a challenge for deepfakes. Available at https://ai.facebook.com/blog/deepfake-detection-challenge
[26] Tolosana, R., Vera-Rodriguez, R., Fierrez, J., Morales, A., and Ortega-Garcia, J. (2020). Deepfakes and beyond: A survey of face manipulation and fake detection. Information Fusion, 64, 131-148.
[27] Verdoliva, L. (2020). Media forensics and deepfakes: an overview. IEEE Journal of Selected Topics in Signal Processing, 14(5), 910-932.
[28] Mirsky, Y., and Lee, W. (2021). The creation and detection of deepfakes: A survey. ACM Computing Surveys (CSUR), 54(1), 1-41.
[29] Punnappurath, A., and Brown, M. S. (2019). Learning raw image reconstruction-aware deep image compressors. IEEE
Transactions on Pattern Analysis and Machine Intelligence. DOI: 10.1109/TPAMI.2019.2903062.
[30] Cheng, Z., Sun, H., Takeuchi, M., and Katto, J. (2019). Energy compaction-based image compression using convolutional autoencoder. IEEE Transactions on Multimedia. DOI: 10.1109/TMM.2019.2938345.
[31] Chorowski, J., Weiss, R. J., Bengio, S., and Oord, A. V. D. (2019). Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(12), pp. 2041-2053.
[32] Faceswap: Deepfakes software for all. Available at https://github.com/deepfakes/faceswap
[33] FakeApp 2.2.0. Available at https://www.malavida.com/en/soft/fakeapp/
[34] DeepFaceLab. Available at https://github.com/iperov/DeepFaceLab
[35] DFaker. Available at https://github.com/dfaker/df
[36] DeepFake_tf: Deepfake based on tensorflow. Available at https://github.com/StromWine/DeepFake_tf
[37] Keras-VGGFace: VGGFace implementation with Keras framework. Available at https://github.com/rcmalli/keras-vggface
[38] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems (pp. 2672-2680).
[39] Faceswap-GAN. Available at https://github.com/shaoanlu/faceswap-GAN.
[40] FaceNet. Available at https://github.com/davidsandberg/facenet.
[41] CycleGAN. Available at https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix.
[42] Liu, M. Y., Huang, X., Mallya, A., Karras, T., Aila, T., Lehtinen, J., and Kautz, J. (2019). Few-shot unsupervised image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision (pp. 10551-10560).
[43] Park, T., Liu, M. Y., Wang, T. C., and Zhu, J. Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2337-2346).
[44] DeepFaceLab: Explained and usage tutorial. Available at https://mrdeepfakes.com/forums/thread-deepfacelab-explained-and-usage-tutorial.
[45] DSSIM. Available at https://github.com/keras-team/keras-contrib/blob/master/keras_contrib/losses/dssim.py.
[46] Lattas, A., Moschoglou, S., Gecer, B., Ploumpis, S., Triantafyllou, V., Ghosh, A., and Zafeiriou, S. (2020). AvatarMe: realistically renderable 3D facial reconstruction "in-the-wild". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 760-769).
[47] Ha, S., Kersner, M., Kim, B., Seo, S., and Kim, D. (2020, April). MarioNETte: few-shot face reenactment preserving identity of unseen targets. In Proceedings of the AAAI Conference on Artificial Intelligence (vol. 34, no. 07, pp. 10893-10900).
[48] Deng, Y., Yang, J., Chen, D., Wen, F., and Tong, X. (2020). Disentangled and controllable face image generation via 3D imitative-contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5154-5163).
[49] Tewari, A., Elgharib, M., Bharaj, G., Bernard, F., Seidel, H. P., Pérez, P., ... and Theobalt, C. (2020). StyleRig: Rigging StyleGAN for 3D control over portrait images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6142-6151).
[50] Li, L., Bao, J., Yang, H., Chen, D., and Wen, F. (2019). FaceShifter: Towards high fidelity and occlusion aware face swapping. arXiv preprint arXiv:1912.13457.
[51] Nirkin, Y., Keller, Y., and Hassner, T. (2019). FSGAN: subject agnostic face swapping and reenactment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7184-7193).
[52] Olszewski, K., Tulyakov, S., Woodford, O., Li, H., and Luo, L. (2019). Transformable bottleneck networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 7648-7657).
[53] Chan, C., Ginosar, S., Zhou, T., and Efros, A. A. (2019). Everybody dance now. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5933-5942).
[54] Thies, J., Elgharib, M., Tewari, A., Theobalt, C., and Nießner, M. (2020, August). Neural voice puppetry: Audio-driven facial reenactment. In European Conference on Computer Vision (pp. 716-731). Springer, Cham.
[55] Chesney, R., and Citron, D. K. (2018). Deep fakes: a looming challenge for privacy, democracy, and national security. https://dx.doi.org/10.2139/ssrn.3213954.
[56] de Lima, O., Franklin, S., Basu, S., Karwoski, B., and George, A. (2020). Deepfake detection using spatiotemporal convolutional networks. arXiv preprint arXiv:2006.14749.
[57] Amerini, I., and Caldelli, R. (2020, June). Exploiting prediction error inconsistencies through LSTM-based classifiers to detect deepfake videos. In Proceedings of the 2020 ACM Workshop on Information Hiding and Multimedia Security (pp. 97-102).
[58] Korshunov, P., and Marcel, S. (2019). Vulnerability assessment and detection of deepfake videos. In The 12th IAPR International Conference on Biometrics (ICB), pp. 1-6.
[59] VidTIMIT database. Available at http://conradsanderson.id.au/vidtimit/
[60] Parkhi, O. M., Vedaldi, A., and Zisserman, A. (2015, September). Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC) (pp. 41.1-41.12).
[61] Schroff, F., Kalenichenko, D., and Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 815-823).
[62] Chung, J. S., Senior, A., Vinyals, O., and Zisserman, A. (2017, July). Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3444-3453).
[63] Suwajanakorn, S., Seitz, S. M., and Kemelmacher-Shlizerman, I. (2017). Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4), 1-13.
[64] Korshunov, P., and Marcel, S. (2018, September). Speaker inconsistency detection in tampered video. In 2018 26th European Signal Processing Conference (EUSIPCO) (pp. 2375-2379). IEEE.
[65] Galbally, J., and Marcel, S. (2014, August). Face anti-spoofing based on general image quality assessment. In 2014 22nd International Conference on Pattern Recognition (pp. 1173-1178). IEEE.
[66] Korshunova, I., Shi, W., Dambre, J., and Theis, L. (2017). Fast face-swap using convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3677-3685).
[67] Zhang, Y., Zheng, L., and Thing, V. L. (2017, August). Automated face swapping and its detection. In 2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP) (pp. 15-19). IEEE.
[68] Wang, X., Thome, N., and Cord, M. (2017). Gaze latent support vector machine for image classification improved by weakly supervised region selection. Pattern Recognition, 72, 59-71.
[69] Bai, S. (2017). Growing random forest on deep convolutional neural networks for scene categorization. Expert Systems with Applications, 71, 279-287.
[70] Zheng, L., Duffner, S., Idrissi, K., Garcia, C., and Baskurt, A. (2016). Siamese multi-layer perceptrons for dimensionality reduction and face identification. Multimedia Tools and Applications, 75(9), 5055-5073.
[71] Xuan, X., Peng, B., Dong, J., and Wang, W. (2019). On the generalization of GAN image forensics. arXiv preprint arXiv:1902.11153.
[72] Yang, P., Ni, R., and Zhao, Y. (2016, September). Recapture image forensics based on Laplacian convolutional neural networks. In International Workshop on Digital Watermarking (pp. 119-128).
[73] Bayar, B., and Stamm, M. C. (2016, June). A deep learning approach to universal image manipulation detection using a new convolutional layer. In Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia Security (pp. 5-10). ACM.
[74] Qian, Y., Dong, J., Wang, W., and Tan, T. (2015, March). Deep learning for steganalysis via convolutional neural networks. In Media Watermarking, Security, and Forensics 2015 (Vol. 9409, p. 94090J).
[75] Agarwal, S., and Varshney, L. R. (2019). Limits of deepfake detection: A robust estimation viewpoint. arXiv preprint arXiv:1905.03493.
[76] Maurer, U. M. (2000). Authentication theory and hypothesis testing. IEEE Transactions on Information Theory, 46(4), 1350-1356.
[77] Hsu, C. C., Zhuang, Y. X., and Lee, C. Y. (2020). Deep fake image detection based on pairwise learning. Applied Sciences, 10(1), 370.
[78] Chopra, S. (2005). Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition (pp. 539-546).
[79] Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4700-4708).
[80] Liu, Z., Luo, P., Wang, X., and Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3730-3738).
[81] Radford, A., Metz, L., and Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
[82] Arjovsky, M., Chintala, S., and Bottou, L. (2017, July). Wasserstein generative adversarial networks. In International Conference on Machine Learning (pp. 214-223).
[83] Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems (pp. 5767-5777).
[84] Mao, X., Li, Q., Xie, H., Lau, R. Y., Wang, Z., and Paul Smolley, S. (2017). Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2794-2802).
[85] Karras, T., Aila, T., Laine, S., and Lehtinen, J. (2017). Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
[86] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... and Berg, A. C. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211-252.
[87] Brock, A., Donahue, J., and Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096.
[88] Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. (2018). Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318.
[89] Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
[90] Farid, H. (2009). Image forgery detection. IEEE Signal Processing Magazine, 26(2), 16-25.
[91] Mo, H., Chen, B., and Luo, W. (2018, June). Fake faces identification via convolutional neural network. In Proceedings of the 6th ACM Workshop on Information Hiding and Multimedia Security (pp. 43-47).
[92] Marra, F., Gragnaniello, D., Cozzolino, D., and Verdoliva, L. (2018, April). Detection of GAN-generated fake images over social networks. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR) (pp. 384-389). IEEE.
[93] Hsu, C. C., Lee, C. Y., and Zhuang, Y. X. (2018, December). Learning to detect fake face images in the wild. In 2018 International Symposium on Computer, Consumer and Control (IS3C) (pp. 388-391). IEEE.
[94] Afchar, D., Nozick, V., Yamagishi, J., and Echizen, I. (2018, December). MesoNet: a compact facial video forgery detection network. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS) (pp. 1-7). IEEE.
[95] Sabir, E., Cheng, J., Jaiswal, A., AbdAlmageed, W., Masi, I., and Natarajan, P. (2019). Recurrent convolutional strategies for face manipulation detection in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 80-87).
[96] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014, October). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1724-1734).
[97] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., and Nießner, M. (2019). FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1-11).
[98] Guera, D., and Delp, E. J. (2018, November). Deepfake video detection using recurrent neural networks. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) (pp. 1-6). IEEE.
[99] Li, Y., Chang, M. C., and Lyu, S. (2018, December). In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In 2018 IEEE International Workshop on Information Forensics and Security (WIFS) (pp. 1-7). IEEE.
[100] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2625-2634).
[101] Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[102] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770-778).
[103] Li, Y., and Lyu, S. (2019). Exposing deepfake videos by detecting face warping artifacts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (pp. 46-52).
[104] Yang, X., Li, Y., and Lyu, S. (2019, May). Exposing deep fakes using inconsistent head poses. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 8261-8265). IEEE.
[105] Zhou, P., Han, X., Morariu, V. I., and Davis, L. S. (2017, July). Two-stream neural networks for tampered face detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 1831-1839). IEEE.
[106] Nguyen, H. H., Yamagishi, J., and Echizen, I. (2019, May). Capsule-forensics: Using capsule networks to detect forged images and videos. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2307-2311). IEEE.
[107] Hinton, G. E., Krizhevsky, A., and Wang, S. D. (2011, June). Transforming auto-encoders. In International Conference on Artificial Neural Networks (pp. 44-51). Springer, Berlin, Heidelberg.
[108] Sabour, S., Frosst, N., and Hinton, G. E. (2017). Dynamic routing between capsules. In Advances in Neural Information Processing Systems (pp. 3856-3866).
[109] Chingovska, I., Anjos, A., and Marcel, S. (2012, September). On the effectiveness of local binary patterns in face anti-spoofing. In Proceedings of the International Conference of Biometrics Special Interest Group (BIOSIG) (pp. 1-7). IEEE.
[110] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., and Nießner, M. (2018). FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179.
[111] Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., and Nießner, M. (2016). Face2Face: Real-time face capture and reenactment of RGB videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2387-2395).
[112] Rahmouni, N., Nozick, V., Yamagishi, J., and Echizen, I. (2017, December). Distinguishing computer graphics from natural images using convolution neural networks. In 2017 IEEE Workshop on Information Forensics and Security (WIFS) (pp. 1-6). IEEE.
[113] Guan, H., Kozak, M., Robertson, E., Lee, Y., Yates, A. N., Delgado, A., ... and Fiscus, J. (2019, January). MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW) (pp. 63-72).
[114] Matern, F., Riess, C., and Stamminger, M. (2019, January). Exploiting visual artifacts to expose deepfakes and face manipulations. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW) (pp. 83-92). IEEE.
[115] Koopman, M., Rodriguez, A. M., and Geradts, Z. (2018). Detection of deepfake video manipulation. In The 20th Irish Machine Vision and Image Processing Conference (IMVIP) (pp. 133-136).
[116] Rosenfeld, K., and Sencar, H. T. (2009, February). A study of the robustness of PRNU-based camera identification. In Media Forensics and Security (Vol. 7254, p. 72540M). International Society for Optics and Photonics.
[117] Li, C. T., and Li, Y. (2012). Color-decoupled photo response non-uniformity for digital image forensics. IEEE Transactions on Circuits and Systems for Video Technology, 22(2), 260-271.
[118] Lin, X., and Li, C. T. (2017). Large-scale image clustering based on camera fingerprints. IEEE Transactions on Information Forensics and Security, 12(4), 793-808.
[119] Scherhag, U., Debiasi, L., Rathgeb, C., Busch, C., and Uhl, A. (2019). Detection of face morphing attacks based on PRNU analysis. IEEE Transactions on Biometrics, Behavior, and Identity Science, 1(4), 302-317.
[120] Phan, Q. T., Boato, G., and De Natale, F. G. (2019). Accurate and scalable image clustering based on sparse representation of camera fingerprint. IEEE Transactions on Information Forensics and Security, 14(7), 1902-1916.
[121] Hasan, H. R., and Salah, K. (2019). Combating deepfake videos using blockchain and smart contracts. IEEE Access, 7, 41596-41606.
[122] IPFS powers the Distributed Web. Available at https://ipfs.io/
[123] Zubiaga, A., Aker, A., Bontcheva, K., Liakata, M., and Procter, R. (2018). Detection and resolution of rumours in social media: A survey. ACM Computing Surveys (CSUR), 51(2), 1-36.
[124] Chesney, R., and Citron, D. K. (2018, October 16). Disinformation on steroids: The threat of deep fakes. Available at https://www.cfr.org/report/deep-fake-disinformation-steroids.
[125] Floridi, L. (2018). Artificial intelligence, deepfakes and a future of ectypes. Philosophy and Technology, 31(3), 317-321.
[126] Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., and Ferrer, C. C. (2020). The deepfake detection challenge dataset. arXiv preprint arXiv:2006.07397.
[127] Gandhi, A., and Jain, S. (2020). Adversarial perturbations fool deepfake detectors. arXiv preprint arXiv:2003.10596.
[128] Neekhara, P., Hussain, S., Jere, M., Koushanfar, F., and McAuley, J. (2020). Adversarial deepfakes: evaluating vulnerability of deepfake detectors to adversarial examples. arXiv preprint arXiv:2002.12749.
[129] Carlini, N., and Farid, H. (2020). Evading deepfake-image detectors with white- and black-box attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 658-659).
[130] Yang, C., Ding, L., Chen, Y., and Li, H. (2020). Defending against GAN-based deepfake attacks via transformation-aware adversarial faces. arXiv preprint arXiv:2006.07421.
[131] Yeh, C. Y., Chen, H. W., Tsai, S. L., and Wang, S. D. (2020). Disrupting image-translation-based deepfake algorithms with adversarial attacks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision Workshops (pp. 53-62).
[132] Read, M. (2019, June 27). Can you spot a deepfake? Does it matter? Available at http://nymag.com/intelligencer/2019/06/how-do-you-spot-a-deepfake-it-might-not-matter.html.
[133] Agarwal, S., Farid, H., Fried, O., and Agrawala, M. (2020). Detecting deep-fake videos from phoneme-viseme mismatches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 660-661).
[134] Fried, O., Tewari, A., Zollhöfer, M., Finkelstein, A., Shechtman, E., Goldman, D. B., ... and Agrawala, M. (2019). Text-based editing of talking-head video. ACM Transactions on Graphics (TOG), 38(4), 1-14.
[135] Fernandes, S., Raj, S., Ewetz, R., Singh Pannu, J., Kumar Jha, S., Ortiz, E., ... and Salter, M. (2020). Detecting deepfake videos using attribution-based confidence metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 308-309).
[136] Cao, Q., Shen, L., Xie, W., Parkhi, O. M., and Zisserman, A. (2018, May). VGGFace2: A dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition (pp. 67-74). IEEE.
[137] Jha, S., Raj, S., Fernandes, S., Jha, S. K., Jha, S., Jalaian, B., ... and Swami, A. (2019). Attribution-based confidence metric for deep neural networks. In Advances in Neural Information Processing Systems (pp. 11826-11837).
[138] Fernandes, S., Raj, S., Ortiz, E., Vintila, I., Salter, M., Urosevic, G., and Jha, S. (2019, October). Predicting heart rate variations of deepfake videos using neural ODE. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW) (pp. 1721-1729). IEEE.
[139] Korshunov, P., and Marcel, S. (2018). Deepfakes: a new threat to face recognition? assessment and detection. arXiv preprint arXiv:1812.08685.
[140] Chintha, A., Thai, B., Sohrawardi, S. J., Bhatt, K. M., Hickerson, A., Wright, M., and Ptucha, R. (2020). Recurrent convolutional structures for audio spoof and video deepfake detection. IEEE Journal of Selected Topics in Signal Processing, DOI: 10.1109/JSTSP.2020.2999185.
[141] Li, Y., Yang, X., Sun, P., Qi, H., and Lyu, S. (2020). Celeb-DF: A large-scale challenging dataset for deepfake forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3207-3216).
[142] Todisco, M., Wang, X., Vestman, V., Sahidullah, M., Delgado, H., Nautsch, A., ... and Lee, K. A. (2019). ASVspoof 2019: Future horizons in spoofed and fake audio detection. arXiv preprint arXiv:1904.05441.
[143] Mittal, T., Bhattacharya, U., Chandra, R., Bera, A., and Manocha, D. (2020). Emotions don't lie: A deepfake detection method using audio-visual affective cues. arXiv preprint arXiv:2003.06711.
[144] Dolhansky, B., Howes, R., Pflaum, B., Baram, N., and Ferrer, C. C. (2019). The deepfake detection challenge (DFDC) preview dataset. arXiv preprint arXiv:1910.08854.
[145] Agarwal, S., El-Gaaly, T., Farid, H., and Lim, S. N. (2020). Detecting deep-fake videos from appearance and behavior. arXiv preprint arXiv:2004.14491.
[146] Dufour, N., and Gully, A. (2019). Contributing Data to Deepfake Detection Research. Available at: https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html.
[147] Ciftci, U. A., Demir, I., and Yin, L. (2020). FakeCatcher: Detection of synthetic portrait videos using biological signals. IEEE Transactions on Pattern Analysis and Machine Intelligence, DOI: 10.1109/TPAMI.2020.3009287.
[148] Huang, G. B., Mattar, M., Berg, T., and Learned-Miller, E. (2007, October). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, http://vis-www.cs.umass.edu/lfw/.
[149] Shaoanlu's GitHub. (2019). Few-Shot Face Translation GAN. Available at https://github.com/shaoanlu/fewshot-face-translation-GAN.
[150] Guarnera, L., Giudice, O., and Battiato, S. (2020). Deepfake detection by analyzing convolutional traces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (pp. 666-667).
[151] Cho, W., Choi, S., Park, D. K., Shin, I., and Choo, J. (2019). Image-to-image translation via group-wise deep whitening-and-coloring transformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 10639-10647).
[152] Choi, Y., Choi, M., Kim, M., Ha, J. W., Kim, S., and Choo, J. (2018). StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8789-8797).
[153] He, Z., Zuo, W., Kan, M., Shan, S., and Chen, X. (2019). AttGAN: Facial attribute editing by only changing what you want. IEEE Transactions on Image Processing, 28(11), 5464-5478.
[154] Karras, T., Laine, S., and Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4401-4410).
[155] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. (2020). Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8110-8119).
[156] Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., and Guo, B. (2020). Face X-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5001-5010).
[157] Wang, S. Y., Wang, O., Zhang, R., Owens, A., and Efros, A. A. (2020). CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8695-8704).
[158] Dai, T., Cai, J., Zhang, Y., Xia, S. T., and Zhang, L. (2019). Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11065-11074).
[159] Maras, M. H., and Alexandrou, A. (2019). Determining authenticity of video evidence in the age of artificial intelligence and in the wake of deepfake videos. The International Journal of Evidence and Proof, 23(3), 255-262.
[160] Su, L., Li, C., Lai, Y., and Yang, J. (2017). A fast forgery detection algorithm based on exponential-Fourier moments for video region duplication. IEEE Transactions on Multimedia, 20(4), 825-840.
[161] Iuliani, M., Shullani, D., Fontani, M., Meucci, S., and Piva, A. (2018). A video forensic framework for the unsupervised analysis of MP4-like file container. IEEE Transactions on Information Forensics and Security, 14(3), 635-645.
[162] Malolan, B., Parekh, A., and Kazi, F. (2020, March). Explainable deep-fake detection using visual interpretability methods. In The 3rd International Conference on Information and Computer Technologies (ICICT) (pp. 289-293). IEEE.

Thanh Thi Nguyen was a Visiting Scholar with the Computer Science Department at Stanford University, California, USA in 2015 and the Edge Computing Lab, Harvard University, Massachusetts, USA in 2019. He received a European-Pacific Partnership for ICT Expert Exchange Program Award from the European Commission in 2018, and an Australia–India Strategic Research Fund Early- and Mid-Career Fellowship awarded by the Australian Academy of Science in 2020. Dr. Nguyen obtained a PhD in Mathematics and Statistics from Monash University, Australia and has expertise in various areas, including AI, deep learning, reinforcement learning, computer vision, cyber security, IoT, and data science. He is currently a Senior Lecturer in the School of Information Technology, Deakin University, Victoria, Australia.
Quoc Viet Hung Nguyen received a PhD degree from EPFL, Switzerland. He is currently a senior lecturer with Griffith University, Australia. He has published several articles in top-tier venues, such as SIGMOD, VLDB, SIGIR, KDD, AAAI, ICDE, IJCAI, JVLDB, TKDE, TOIS, and TIST. His research interests include data mining, data integration, data quality, information retrieval, trust management, recommender systems, machine learning, and big data visualization.

Cuong M. Nguyen received the B.Sc. and M.Sc. degrees in Mathematics from Vietnam National University, Hanoi, Vietnam. In 2017, he received the Ph.D. degree from the School of Engineering, Deakin University, Australia, where he worked as a postdoctoral researcher and sessional lecturer for several years. He is currently a postdoctoral researcher in Autonomous Vehicles at LAMIH UMR CNRS 8201, Université Polytechnique Hauts-de-France, France. His research interests lie in the areas of Optimization, Machine Learning, and Control Systems.

Dung Tien Nguyen received the B.Eng. and M.Eng. degrees from the People's Security Academy and the Vietnam National University, University of Engineering and Technology in 2006 and 2013, respectively. He received his PhD degree in the area of multimodal emotion recognition using deep learning techniques from Queensland University of Technology in Brisbane, Australia in 2019. He is currently working as a research fellow at Deakin University in computer vision, machine learning, deep learning, image processing, and affective computing.

Duc Thanh Nguyen was awarded a PhD in Computer Science from the University of Wollongong, Australia in 2012. Currently, he is a lecturer in the School of Information Technology, Deakin University, Australia. His research interests include computer vision and pattern recognition. He has published his work in highly ranked publication venues in computer vision and pattern recognition such as the Journal of Pattern Recognition, CVPR, ICCV, and ECCV. He has also served as a technical program committee member for many premium conferences such as CVPR, ICCV, ECCV, AAAI, ICIP, and PAKDD, and as a reviewer for the IEEE Trans. Intell. Transp. Syst., the IEEE Trans. Image Process., the IEEE Signal Processing Letters, Image and Vision Computing, Pattern Recognition, and Scientific Reports.

Saeid Nahavandi received a Ph.D. from Durham University, U.K. in 1991. He is an Alfred Deakin Professor, Pro Vice-Chancellor (Defence Technologies), Chair of Engineering, and the Director of the Institute for Intelligent Systems Research and Innovation at Deakin University, Victoria, Australia. His research interests include modelling of complex systems, machine learning, robotics and haptics. He is a Fellow of Engineers Australia (FIEAust), the Institution of Engineering and Technology (FIET) and IEEE (FIEEE). He is the Co-Editor-in-Chief of the IEEE Systems Journal, Associate Editor of the IEEE/ASME Transactions on Mechatronics, Associate Editor of the IEEE Transactions on Systems, Man and Cybernetics: Systems, and an IEEE Access Editorial Board member.