Professional Documents
Culture Documents
A Machine Learning Based Approach For Deepfake Detection in Social Media Through Key Video Frame Extraction
A Machine Learning Based Approach For Deepfake Detection in Social Media Through Key Video Frame Extraction
https://doi.org/10.1007/s42979-021-00495-x
ORIGINAL RESEARCH
Abstract
In the last few years, with the advent of deepfake videos, image forgery has become a serious threat. In a deepfake video,
a person’s face, emotion or speech are replaced by someone else’s face, different emotion or speech, using deep learning
technology. These videos are often so sophisticated that traces of manipulation are difficult to detect. They can have a heavy
social, political and emotional impact on individuals, as well as on the society. Social media are the most common and serious
targets as they are vulnerable platforms, susceptible to blackmailing or defaming a person. There are some existing works for
detecting deepfake videos but very few attempts have been made for videos in social media. The first step to preempt such
misleading deepfake videos from social media is to detect them. Our paper presents a novel neural network-based method
to detect fake videos. We applied a key video frame extraction technique to reduce the computation in detecting deepfake
videos. A model, consisting of a convolutional neural network (CNN) and a classifier network, is proposed along with the
algorithm. The Xception net has been chosen over two other structures—InceptionV3 and Resnet50—for pairing with our
classifier. Our model is a visual artifact-based detection technique. The feature vectors from the CNN module are used as the
input of the subsequent classifier network for classifying the video. We used the FaceForensics++ and Deepfake Detection
Challenge datasets to reach the best model. Our model detects highly compressed deepfake videos in social media with a
very high accuracy and lowered computational requirements. We achieved 98.5% accuracy with the FaceForensics++ dataset
and 92.33% accuracy with a combined dataset of FaceForensics++ and Deepfake Detection Challenge. Any autoencoder
generated video can be detected by our model. Our method has detected almost all fake videos if they possess more than
one key video frame. The accuracy reported here is for detecting fake videos when the number of key video frames is one.
The simplicity of the method will help people to check the authenticity of a video. Our work is focused, but not limited, to
addressing the social and economical issues due to fake videos in social media. In this paper, we achieve the high accuracy
without training the model with an enormous amount of data. The key video frame extraction method reduces the computa-
tions significantly, as compared to existing works.
Keywords Deepfake · Deep learning · Key video frame extraction · Depthwise separable convolution · Convolution neural
network (CNN) · Transfer learning · Social media · Compressed video
1
* Saraju P. Mohanty Department of Computer Science and Engineering,
saraju.mohanty@unt.edu University of North Texas, Denton, TX, USA
2
Alakananda Mitra School of Engineering and Informatics, National University
AlakanandaMitra@my.unt.edu of Ireland, Galway, Ireland
3
Peter Corcoran Department of Electrical Engineering, University of North
peter.corcoran@nuigalway.ie Texas, Denton, TX, USA
Elias Kougianos
elias.kougianos@unt.edu
SN Computer Science
Vol.:(0123456789)
98 Page 2 of 18 SN Computer Science (2021) 2:98
SN Computer Science
SN Computer Science (2021) 2:98 Page 3 of 18 98
Video
Feature
Or Classification
Extraction
Fig. 1 Deepfakes created by Facebook to fight against a disinforma- very sophisticated and are created using different deep learn-
tion disaster source Facebook
ing techniques—by autoencoder and generative adversarial
networks (GANs). It is practically impossible for a human
applied as an edge fake video detecting tool. Our classifier to distinguish between a real and a forged video when these
network applies mainly autoencoder generated videos. are uploaded in social media.
First we tried to detect each frame with our previously Social media videos are highly compressed. Our goal is
proposed Algorithm 1 [23]. Then, instead of detecting each to apply our proposed technique with lower computational
and every frame, we applied our newly proposed Algo- burden which can detect these deepfake videos at any com-
rithm 2 where a key video frame extraction technique has pression level in social media, and which are generated
been utilized. These two algorithms are discussed in detail in mostly by an autoencoder. Our main goal is to lower the
Sect. 7. As our detection method is based on finding changes computations as much as possible, so that in the future we
of visual artifacts in a frame due to deepfake forgery, we can apply the algorithm at edge devices. People can check
assume that key frames contain all the artifacts. This simple the authenticity of videos with a limited memory device,
assumption helps us to propose an algorithm with a good such as a smart phone. In our previous paper [23], we tested
accuracy and much lower computational cost. Lower com- our model only on the FF++ dataset [27]. In this paper we
putational cost means that the algorithm can be deployed extended our work by testing the network with an additional
at the edge since the limited memory in a smart phone will dataset, namely DFDC [28].
not be a barrier.
The proposed network consists of two modules: (1) a The Challenges in Solving the Problem
convolutional neural network (CNN) for frame feature
extraction, and (2) a classifier network consisting of Globa- In social media, uploaded images/videos are highly com-
lAveragePooling2D and fully connected layers. To choose pressed. People check their social media account from
the best CNN module we followed our previous work [23]. smartphones or tablets, along with computers. The existing
We experimented mainly with three networks-Xception solutions to detect deepfake images/videos are mostly for
[24], InceptionV3 [25], and Resnet50 [26], and finally uncompressed data and the models are not suitable for social
chose Xception network as the feature extractor. We chose media videos. As people check their social media account
only these three networks over other available CNN mod- from smart phones, the model should be of smaller size
ules because of their smaller size. During selection of the too. The challenge was to solve three problems simultane-
CNN module, we used only the FF++ dataset [27]. Then ously—detecting deepfake videos, and to develop a model
we applied the newly proposed technique over the chosen applicable to compressed video and also a lighter version of
CNN module with a larger dataset consisting of the FF++ it. In our previous paper [23] we tried to solve the first two
and DFDC (partial) datasets. A system level overview of problems. In this paper we address the third problem too by
the network is shown in Fig. 2. Our model works on the using a computer vision technique for lower computational
proposed algorithm. effort.
The Problem Addressed in the Current Paper The Solution Proposed in the Current Paper
The problem addressed in this paper lies in the very nature To address the above mentioned challenges we propose a
of how a deepfake video is created. Deepfake videos are novel technique of applying key video frame approach to our
SN Computer Science
98 Page 4 of 18 SN Computer Science (2021) 2:98
previously proposed neural network based model [23] which The Novelty of the Solution Proposed
can detect deepfake videos in social media at any compres-
sion level. We limit the data by extracting only key video Our goal is to obtain a model for detecting social media
frames from each video. It reduces the number of frames deepfake videos/images suitable for use at the edge. To
from videos to be checked for authenticity, largely without achieve that goal we apply a computer vision technique and
compromising accuracy significantly. During detection, a simple classifier with the Xception network. The proposed
instead of checking each frame for authenticity, we check algorithm reduces the computation during detection. By
the visual artifact changes only for the key video frames. applying the key video frame extraction during data process-
This reduces the computations. As the detection process ing we reduced the number of frames significantly and still
consists of fewer data computations, this approach is one received a high accuracy. We tried to reduce the data to be
step forward to apply our network at the edge. We propose a processed, as at any edge device memory is a limiting factor.
lightweight approach compared to the highly computation-
ally expensive existing works. Our main contributions are
as follows: Related Prior Works
– An algorithm of lower complexity to detect deepfake Identifying manipulated and falsified contents is techni-
videos. cally demanding and challenging. In the past 2 decades in
– Initially we used three different CNN networks, smaller media forensics, substantial work to detect image and video
in size, for feature extraction. They are (1) Xception, (2) forgery has been done. Most of the solutions proposed for
Inception V3, and (3) Resnet50. We compared them and video forensics are for easy manipulations, such as copy-
finally selected the Xception network [24] as our CNN move [32], dropped or duplicated frames [33], or varying
module. The detailed work has been discussed in our interpolation [34]. But use of auto-encoders or GANs has
previous work [23]. In Xception, introduced by Chollet made image/video forgery sophisticated. These computer
in 2016, depthwise separable convolution has been used generated forged videos are hard to detect with previously
which made it accurate and cheaper. existing detection techniques. Stacked auto-encoders, CNNs,
– The feature vector from the CNN is used as the input of long-short term memory (LSTM) networks, or GANS have
a GlobalAveragePooling2D layer with a dropout layer, been explored in detection models to detect video manipula-
followed by a fully connected (FC) with a dropout layer tion. Some of the existing works are summarized in Table 1.
and lastly a classification softmax layer. This classifier Among deep learning solutions, some are temporal fea-
has been used to detect video. ture based and some are based on visual artifacts. In visual-
– Our novelty here is to combine a well known method artifact based works, videos are processed frame-by-frame.
of computer vision to our simple classifier model for Each frame contains different features which generate vari-
detecting deepfake videos. The existing works have very ous inconsistencies in the manipulated region of an image.
complex structures [29–31] for detection. Our main goal These features are first extracted and then used as input to
is to reduce the computational cost of detection without a deep learning classifier as CNN models can detect these
excessively sacrificing accuracy. artifacts. The classifiers are ResNet152 [26], VGG16 [41],
– A novel technique of training without a very large train- Inception V3 [25], DenseNet etc. Certain works are associ-
ing dataset is reported. ated with detection techniques based on eye blinking rate
– For training and testing, we primarily used the FF++ and [37], noting the difference between head pose [42] of an
DFDC data sets. We used the compressed deepfake and original video and a fake video, and detecting the artifacts
original videos of the former. These compressed videos of eyes, teeth and face [40]. The human blinking pattern
have two different compression levels—one is low loss has also been used in another recent paper [43]. A general
and the other is high loss. The data sets are a good rep- capsule network based method has been proposed to detect
resentation of social media scenarios. To obtain a better manipulated images and videos [29]. A VGG19 [41] net-
generalized model, we added another dataset. But to limit work has been used for latent feature extraction along with a
the training time we used 1/3 of the DFDC dataset. We capsule network to detect different spoofs, replay attacks etc.
compressed the DFDC dataset at three compression lev- Two inception modules along with two classic convolution
els with the H264 video compressor. Finally we trained layers followed by maxpooling layers have been explored
our network with this mixed dataset. in [38]. This approach is at a mesoscopic level. Audio and
SN Computer Science
SN Computer Science (2021) 2:98 Page 5 of 18 98
Sabir et al. [35] FaceForensics++ Use spatio-temporal features of Not applicable to long video clips
video streams
Bidirectional Recurrent Neural Not trained on a large dataset
Network (RNN) + DenseNet/
ResNet50
Güera and Delp [36] Hollywood-2 Human Actions Temporal inconsistencies of deep- Didn’t take into account of com-
(HOHA) fake video is taken into account pressed videos
Inception-V3 + LSTM
Li et al. [37] Closed Eyes in the Wild (CEW) Used long term recurrent convolu- Applied to uncompressed videos
tional networks
Measured the eye blinking rate
VGG16 + LSTM + FC
Afchur et al. [38] Downloaded from internet and Mesonet structures-Meso-4 and Accuracy is lower for highly com-
processed MesoInception-4 used 2 inception pressed video + 2 FC layers
modules + 2 classic convolution
layers
Li et al. [39] UADFV and DeepfakeTIMIT Face warping artifacts. Used 4 Compression has not been con-
CNN models sidered
Measured resolution inconsistency
between the warped face area and
face
Matern et al. [40] A combination of various sources Facial texture difference, and miss- Not for compressed video
ing details in eye and teeth
Logistic regression model and
neural network
Nguyen et al. [29] Four major datasets VGG-19 + Capsule Network Accuracy is low for highly com-
pressed data
Hashmi et al. [30] DFDC whole dataset CNN + LSTM Computation complexity is high
Used facial landmarks and convolu- Minimum video length is 10 s.
tional features Works well for long videos
Kumar et al. [31] FaceForensics++ Triplet architecture For highly compressed video
Celeb-DF Metric learning approach
Previous work FaceForensics++ Face artifacts analysis For compressed video
by the authors [23] XceptionNet + Classifier network High accuracy
Proposed model and algorithm DFDC and FaceForensics++ Face artifacts analysis with key For any level of compressed video.
video frame approach Applicable to any video even
with only one key video frame
XceptionNet + Classifier Network Faster with less computation than
authors’ previous work
video parts of a video clip have been used in getting emotion was used in a different paper to classify forged videos [35].
embedding to detect fake videos [44]. A comparative study A DenseNet structure combined with recurrent neural net-
among different approaches has been provided in [45] where work (RNN) has been used. A blockchain based approach
the authors evaluated existing techniques. to detect forged videos has been proposed by Hasan and
Other than the visual artifact based works, there is another Salah [46]. Each video is linked to a smart contract and
parallel type of work that is prevalent. These are based on it has a hierarchical relation to its child video. A video is
temporal features of a video. A combined network of a CNN called pristine if the original smart contract is traced. Unique
and LSTM architectures has been explored [36]. The CNN hashes help to store the data to the InterPlanetary File Sys-
module extracts features from the input image sequence. The tem (IPFS) peer-to-peer network. This model claims to be
feature vector is fed into the LSTM network. Here, the long- extendable to audio or images. In [47], the ownership of a
short term memory generates a sequence detector from the video has been stated by detecting fake video and distin-
feature vector. Finally, a fully connected layer classifies the guishing it from real video. Spatio-temporal features has also
video as manipulated or real. Another combined network been used in detecting deepfake videos [48].
SN Computer Science
98 Page 6 of 18 SN Computer Science (2021) 2:98
of deepfake videos at different compression levels. The suc- (a) Training Phase
cess of XceptionNet on the FF++ dataset motivated us to
find a better deep neural model with higher accuracy and not Generation
limited to a particular dataset. We assume a video is called
manipulated when at least one frame of the video is forged.
This simple assumption along with our technique make our Encoder Decoder B
video classifier less computation intensive. Reconstructed
Original Latent Face A
Figure A Figure B
From Latent Face A
Why are Deepfakes Hard to Detect? (b) Generation Phase
SN Computer Science
SN Computer Science (2021) 2:98 Page 7 of 18 98
Fig. 5 The generated key video frames from a 20 s video Fig. 6 The proposed flow of video processing
SN Computer Science
98 Page 8 of 18 SN Computer Science (2021) 2:98
ReLU
SeparableConv 728, 3x3
MaxPooling 3x3,
stride=2x2
Feature 1 2 8 6
Vector 3 6 5 4
3 5.75
7 2 3 9
1 1 1 0 7 2.75 4.75
2 Global
Dropout Fully
3 Average
… (0.5) Connected
Pooling2D Feature Vector Real
or
Fake
…
…
In the previous Section, the proposed novel technique to There are three elements in a convolution operation:
detect deepfake videos in social media has been mentioned.
The theoretical perspective along with the algorithm have – Input image
been discussed in the current Section. – Feature detector or Kernel or Filter
– Feature map
SN Computer Science
SN Computer Science (2021) 2:98 Page 9 of 18 98
∫0 ∫ 0 (1)
(I ∗ h)(x, y) = I(x − i, y − j)h(i, j)didj, ( n
)2
1
(5)
∑
ED = t− 𝛿i wi Ii , .
2
where I is the input image and h is the kernel. i=1
The complexity of the convolution operation is expressed where 𝛿i ∼ Bernoulli(p) . The expectation of the gradient of
as N × D2G × D2K × M where DF × DF × M is the size of the the dropout network is expressed as:
input image and the filter size is DK × DK × M . M is the
number of channels in the input image. The size of the fea-
[ ]
𝜕ED
E = − tpi Ii + wi p2i Ii2 + wi Var(𝛿i )Ii2
ture matrix is DG × DG × M . 𝜕wi
The complexity is decreased in depthwise separable con- n (6)
volution. It divides the convolution operation in two parts:
∑
+ wj pi pj Ii Ij ,
(1) depthwise convolution–filtering stage, and (2) pointwise j=1,j≠1
convolution–combination stage. In depthwise convolution
the complexity is M × D2G × D2K while for pointwise convolu- 𝜕EN
tion it is N × D2G × D2K × M . The overall complexity is can = + wi pi (1 − pi )Ii2 . (7)
𝜕wi
be estimated as the following:
In the above expressions, w� = p ∗ w , so the expectation of
Total complexity = M × D2G × D2K +N× D2G the gradient with dropout becomes equal to the gradient of
(2)
× D2K ×M a regularized linear network:
)2
The relative complexities of the two convolutions is the
( n n
1
(8)
∑ ∑
following: ER = t− pi wi Ii + pi (1 − pi )w2i Ii2 .
2 i=1 i=1
Complexity Depthwise Separable Conv. 1 1
= + 2 (3) The above Eq. (8) results in a maximum for p = 0.5.
Complexity Standard Conv. N DK
It is evident from Eq. (3) that the complexity of standard Soft‑Max Layer
convolution is much higher than the depthwise separable con-
volution. It means that the Xception Network provides much To predict the class of the video—pristine or manipulated,
faster and cheaper convolution than standard convolution. a softmax layer is used at the end of the network. It takes
an M-dimensional vector and creates another vector of the
Global Average Pooling Layer same size but with values ranging from 0 to 1 making the
sum of the values to 1. If the probability distribution of two
It helps to reduce the number of parameters and eventually classes provided by the softmax layer is P(yk ) for input yk to
to minimize overfitting. It down-samples by computing the the softmax layer, the output ŷ from the softmax layer can
mean or average of the width and height dimensions of the be predicted by the following expression:
input. For the Global Average Pooling layer there are no
(9)
( )
ŷ = arg maxk=2 P yk .
parameters to learn. It takes the average of each feature map
for each category of a classification problem and returns a
vector which is directly fed into the next layer. It is more Training Loss
robust as it sums up the spatial information.
During training, we minimize the categorical cross entropy
Dropout Layer loss to get optimal parameters of the network to best predict
the class. It is the measure of performance of a classification
It is very common for a deep network to overfit. The drop- model whose output is a probability, and ranges from 0 to 1.
out layer stops the overfitting of a neural network. The least In binary classification like our case the cross entropy loss
square loss for a single layer linear network with activation is expressed as:
function f (x) = x is expressed as the following [51, 52]:
SN Computer Science
98 Page 10 of 18 SN Computer Science (2021) 2:98
Experimental Validation
Datasets
SN Computer Science
SN Computer Science (2021) 2:98 Page 11 of 18 98
For selecting the CNN module, we trained the net- Table 3 Dataset details for final work
work with only FF++ data at compression level c = 23, Dataset No. of original videos No. of manipulated
as shown in Table 2. But to make a generalized model, ( c = 15 , c = 23 , c = 40) videos ( c = 15 , c = 23 ,
during the final part of our work we trained our neural c = 40)
network model with mixed compression level dataset and
FF++ 2000 2000
we had to construct the dataset as in Table 3. To train our
DFDC 5773 5765
model we construct a mixed dataset consisting of DFDC
and FF++ dataset. We focused on the compressed dataset
as any image/video loses clarity or in other terms shows
losses as compression increases. FF++ has two different shows the key video frames from a 10 s fake video in DFDC
compression levels videos. It represents a realistic sce- dataset and Fig. 10b shows key video frames of a 24 s fake
nario for social media. We changed the compression level video in the FaceForensics++ dataset.
of DFDC dataset in making our dataset. The dataset details
are shown in Table 3. Experimental Setup
– FaceForensics++ (FF++) dataset For our work, we In this section, we discuss the implementation set up. The
used 2000 deepfake videos and 2000 original videos at whole work consists of two parts. In the first part, we chose
different compression levels. As the videos uploaded in our feature extractor and in the second part we introduced
social media are compressed, FaceForensics++ dataset a unique way to detect deepfake videos using a key frame
is a good representative of the social media scenario. extraction technique to reduce the computations. The first
We used two compression level video sets—one with part was trained on a smaller dataset.
quantization parameter or constant rate factor 23 and the Transfer learning We used transfer learning for better
other at 40. accuracy and to save training time. A pre-trained model
– Deepfake detection challenge (DFDC) dataset We used approach was taken. Initially Resnet50, InceptionV3 and
part of 470GB dataset—5765 manipulated videos and Xception net have been chosen as the feature extractor
5773 original videos. We changed the compression levels which are trained on Imagenet dataset. So they already have
to three levels c = 15 , c = 23 , and c = 40 . We kept the learned to detect basic and general features of images as they
number of videos for low level compression c = 15 mini- were trained on 1000 classes of 1,000,000 images. Lower
mum. As loss increases with the compression level, we level layers extract basic features like lines or edges whereas
wanted to train our network more on c = 23 and c = 40 middle or higher layers extract more complex and abstract
than c = 15 videos. We changed the compression levels features and features defining classification.
with an H.264 encoder using FFmpeg software [58].
SN Computer Science
98 Page 12 of 18 SN Computer Science (2021) 2:98
Trained on
Trained To train our network first we trained the classifier keeping
on our Tested on
Imagenet dataset Unseen Data the weights of feature extractor frozen and then fine tuned
the whole network from end-to-end. We repeated the whole
Load Replace Add Train
Fine Tune Predict process for our three CNN modules and finally chose the
End-to-
pretrained
network
Final
Layer
Classifier
Network
Classifier
Network
End
&
Calculate Xception network as our feature extractor. The overall work
Network Accuracy
flow is presented in Fig. 11. Figure 12 shows the steps fol-
lowed in our work for training and validation. Table 4 shows
Early layers
the number of frames used for training and validation pur-
Last layers learn :
learn : • fake & real poses our work.
• edges frames features
• corners Implementation details We implemented our proposed
• colors
framework in Keras with the TensorFlow backend. FFmpeg
is used to clip the videos and 68-landmarks in the dlib pack-
Fig. 11 End-to-end work flow age for face detection. For training we used a Tesla T4 GPU
with 64GB Memory. A GeForce RTX 2060 is used to evalu-
ate the model.
Training Results
Optimizer : Adam
Classifier Training : Learning Rate 5e-4 Figure 13a shows the accuracy vs different CNN networks.
Fine Tunning : Learning Rate 5e-5
It is clear that the Xception net performed better with our
classifier. Once the feature extractor was chosen, we moved
to the final work. Fig. 14b–d show the output of different
Testing layers of Xception net for the key video frame (Fig. 14a).
Test the model with unseen FF++ and DFDC data In our initial experiment we tested 200 videos of FF++
unseen data and chose our feature extractor. We obtained
96% accuracy for the videos with compression level c = 23 .
Then we tested the model with our newly proposed Algo-
Fig. 12 The specific steps followed for testing and validation pur-
rithm 2 and two sets of data—first with the same 200 videos
poses
from FF++ we used for testing initially and achieved 98.5%
accuracy. Then we tested the model with 600 mixed com-
Table 4 Frames details for training and validation pression videos from FF++ and DFDC test dataset. Most
Division No. of frames No. of frames
of them are highly compressed (c = 40). We were able to
During Part-1 During Part-2
achieve accuracy of 92.33% even with high loss videos. The
results are shown in Fig. 15.
Train 124,959 49,373
Valid 31,240 12,345
SN Computer Science
SN Computer Science (2021) 2:98 Page 13 of 18 98
96
93
88
86
80 80
2%
2%
TP
48% TN
48%
FP
FN
Analysis of Results
SN Computer Science
98 Page 14 of 18 SN Computer Science (2021) 2:98
20.0
Algo.1,c=23 Algo.2,Mixed c Algo.2,Mixed c Test accuracy for our model applying Algorithm 2 is
0.0
FF++ FF++ FF++ & DFDC shown in Fig. 16. Precision, Recall and F1-score for our
(a) Test Accuracy Plot model are defined by the following expressions:
TP
( )
Precision = (12)
600 600 TP + FP
FF++ & DFDC
500 Mixed c (
TP
)
(13)
Number of Videos
200 200
Recall =
400 TP + FN
FF++ FF++
300 c=23 Mixed c ( )
200 2
F1-score = 1 1 (14)
+
100 Precision Recall
0
Algorithm.1 Algorithm.2 Algorithm.2 For our classification model, the above metrics are calcu-
lated using Table 6 and are shown in Fig. 17 and Table 8.
(b) Number of Videos for the Experiments Our Algorithm 1 is applicable to any length video as
we process frames at 24 fps. In most fake videos, since
Fig. 15 Comparison of results between two experiments only the face is changed the number of key frames is low.
Our Algorithm 2 can detect fake videos even if only one
key frame is extracted from the testing video but its accu-
racy increases enormously (almost all results were correct)
if the video has more than one key frame. Our reported
accuracy for Algorithm 2 considers all cases even if the
video contains only one key frame. That is why we report
accuracy as 92.33%.
If the video is very hazy, our model might not produce
accurate results. The number of FN is mostly contributed
by the hazy fake videos. The number of hazy videos in
the training and validation dataset was not sufficient for
our model to learn and resulted in uncertain prediction for
those videos.
The accuracy of our model was lower when we added
DFDC dataset with FF++ dataset because we train our
model with only partial DFDC dataset. We believe that, as
Fig. 16 Test accuracy calculation for final experiment
SN Computer Science
SN Computer Science (2021) 2:98 Page 15 of 18 98
Table 6 Data to calculate test Algorithm + compression + test videos CNN + proposed Number of Test Data
accuracy classifier
TP TN FP FN
Algorithm 1 ResNet50 80 96 04 20
+ c=23 InceptionV3 84 88 12 16
+ 200 (FF++) videos Xception 96 96 04 04
Algorithm 2
+ Mixed c Xception 98 99 01 02
+ 200 (FF++) videos
Algorithm 2
+ Mixed c Xception 276 278 22 24
+ 600 (FF++ & DFDC) videos
Table 7 Test accuracy DFDC dataset is a vast dataset, training our model with the
Testing data No. of videos Algorithm Compres- Accuracy
entire dataset would have resulted in better accuracy.
sion levels
Comparisons
FF++ 200 Algorithm 1 c = 23 96.00
FF++ 200 Algorithm 2 c ≤ 40 98.50
We compared our results with existing results [27, 30, 31].
FF++ DFDC 600 Algorithm 2 c ≤ 40 92.33 The detailed comparison is shown in Table 9. Figure 18
shows a comparative view between our and other existing
works. We achieved 98.5% and 92.33% accuracy with our
proposed algorithm for two different test dataset videos.
We believe incorporating more manipulated videos with
different performers, and different light and noise condi-
tion in the training dataset will increase the performance
of our model.
Fig. 17 Metric calculation
200 FF++ Videos Algorithm 1 ResNet50 + Classifier 88.00 0.95 0.80 0.87
InceptionV3 + Classifier 86.00 0.89 0.88 0.88
Xception + Classifier 96.00 0.96 0.96 0.96
200 FF++ videos Algorithm 2 Xception + Classifier 98.50 0.99 0.98 0.98
600 FF++ and DFDC videos Algorithm 2 Xception + Classifier 92.33 0.93 0.92 0.92
SN Computer Science
98 Page 16 of 18 SN Computer Science (2021) 2:98
Test Accuracy % We evaluated with a much bigger unseen dataset and were
100
98.5
96.0 95.73 able to achieve good accuracy for highly compressed high
92.33
90
86.74 loss data.
81.0
In this paper:
80 75.1
70
– We proposed a deep neural method for detecting social
60
media deepfake videos.
50 – Introduced an algorithm which cuts down the computa-
40
tional burden significantly.
30
– We avoided training with enormous amounts of data even
though we accommodated a large number of videos.
– We achieved high accuracy even for highly compressed
20
10
video.
0
Proposed Model, Proposed Proposed Xcepon in Xcepon in 68-landmarks Triplets in
Xception + Proposed Classifier + Algo.2 + 68 landmarks FF++ DFDC Any (checked upto c = 40) 92.33
Xception + Proposed Classifier + Algo.2 + 68 landmarks FF++ Any (checked upto c = 40) 98.50
Xception + Proposed Classifier + Algo.1 + 68 landmarks [23] FF++ c = 23 96.00
Xception in FF++ Paper [27] FF++ c = 40 81.00
c = 23 95.73
CNN + LSTM + 68-landmarks [30] DFDC NA 75.10
Triplets (semi-hard) [31] FF++ c = 40 86.74
SN Computer Science
SN Computer Science (2021) 2:98 Page 17 of 18 98
Acknowledgements This article is an extended version of our previous 18. Chesney R, Citron D. Deepfakes: a looming crisis for national
conference paper presented at [23]. security, democracy and privacy? Web. 2018. https://www.lawfa
reblog.com/deepfakes-looming-crisis-national-secur ity-democ
racy-and-privacy
Compliance with Ethical Standards 19. Kietzmann J, Lee LW, McCarthy IP, Kietzmann TC. Deepfakes:
trick or treat? Bus Horiz. 2020;63(2):135–46.
Conflict of interest The authors declare that they have no conflict of 20. Reid S. The deepfake dilemma: reconciling privacy and first
interest and there was no human or animal testing or participation in- amendment protections. Univ Pa J Const Law. 2020. https://ssrn.
volved in this research. All data were obtained from public domain com/abstract=3636464 (Forthcoming).
sources. 21. Gerstner E. Face/off: “DeepFake” face swaps and privacy laws.
Def Couns J. 2020;87:1.
22. Westerlund M. The emergence of Deepfake technology: a
review. Technol Innov Manag Rev. 2019;9:39–52. https://doi.
References org/10.22215/timreview/1282. Accessed 19 Jan 2021.
23. Mitra A, Mohanty SP, Corcoran P, Kougianos E. A novel machine
1. Zhu J, Park T, Isola P, Efros AA. Unpaired image-to-image trans- learning based method for deepfake video detection in social
lation using cycle-consistent adversarial networks. In: Proceedings media. In: Proceedings of the 6th IEEE international symposium
IEEE international conference on computer vision (ICCV). 2017. on smart electronic systems (iSES). 2020. (In Press).
pp. 2242–2251. 24. Chollet F. Xception: deep learning with depthwise separable con-
2. Karras T, Laine S, Aittala M, Hellsten J, Lehtinen J, Aila T. Ana- volutions. CoRR. 2016. arxiv: 1610.02357. Accessed 19 Jan 2021.
lyzing and improving the image quality of stylegan. In: Proceed- 25. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethink-
ings of IEEE/CVF conference on computer vision and pattern ing the inception architecture for computer vision. 2015. arXiv
recognition (CVPR); 2020. p. 8107–8116. :1512.00567. Accessed 19 Jan 2021.
3. Zhao Y, Chen C. Unpaired image-to-image translation via latent 26. He K, Zhang X, Ren S, Sun J. Deep residual learning for image
energy transport. 2020. arXiv:2012.00649. recognition. In: Proceedings of IEEE conference on computer
4. Faceswap. Web. https: //github .com/deepfa kes/facesw ap. Accessed vision and pattern recognition (CVPR). 2016. p. 770–778.
19 Jan 2021. Accessed 19 Jan 2021.
5. DFaker. Web. https: //github .com/dfaker /df. Accessed 19 Jan 2021. 27. Rossler A, Cozzolino D, Verdoliva L, Riess C, Thies J, Niess-
6. Chou HTG, Edge N. They are happier and having better lives than ner M. FaceForensics++: learning to detect manipulated facial
i am: the impact of using Facebook on perceptions of others’ lives. images. In: Proceedings of IEEE international conference on com-
Cyberpsychol Behav Soc Netw. 2012;15(2):117–21. puter vision. 2019. p. 1–11. Accessed 19 Jan 2021.
7. Baccarella C, Wagner T, Kietzmann J, McCarthy I. Social media? 28. Dolhansky B, Bitton J, Pflaum B, Lu J, Howes R, Wang M, Ferrer
It’s serious! understanding the dark side of social media. Eur. CC. The deepfake detection challenge dataset. 2020. https://githu
Manag. J. 2018;36:431–8. b.com/deepfakes/faceswap0
8. Allcott H, Gentzkow M. Social media and fake news in the 2016 29. Nguyen HH, Yamagishi J, Echizen I. Capsule-forensics: Using
election. J. Econ. Perspect. 2017;31(2):211–36. capsule networks to detect forged images and videos. In: Proceed-
9. Faceswap. Web. https://faceswap.dev/. Accessed 19 Jan 2021. ings of IEEE international conference on acoustics, speech and
10. DeepFaceLab. Web. https://github.com/iperov/DeepFaceLab. signal processing (ICASSP); 2019. p. 2307–2311.
Accessed 19 Jan 2021. 30. Hashmi MF, Ashish BKK, Keskar AG, Bokde ND, Yoon JH,
11. Deepfake Video. Web. https://edition.cnn.com/2019/06/11/tech/ Geem ZW. An exploratory analysis on visual counterfeits using
zuckerberg-deepfake/index.html. Accessed 19 Jan 2021. conv-lstm hybrid architecture. IEEE Access. 2020;8:101293–308.
12. Bregler C, Covell M, Slaney M. Video rewrite: driving visual Accessed 19 Jan 2021.
speech with audio. In: SIGGRAPH’97: Proceedings of the 24th 31. Kumar A, Bhavsar A, Verma R. Detecting deepfakes with metric
annual conference on computer graphics and interactive tech- learning. In: 2020 8th international workshop on biometrics and
niques; 1997. p. 353–360. forensics (IWBF). 2020. p. 1–6.
13. Garrido P, Valgaerts L, Rehmsen O, Thormaehlen T, Perez P, 32. D’Amiano L, Cozzolino D, Poggi G, Verdoliva L. A Patch-
Theobalt C. Automatic Face Reenactment. In: 2014 IEEE Con- Match-based dense-field algorithm for video copy-move detec-
ference on Computer Vision and Pattern Recognition. Columbus, tion and localization. IEEE Trans Circuits Syst Video Technol.
OH, 2014. p. 4217–4224. 2019;29(3):669–82. Accessed 19 Jan 2021.
14. Thies J, Zollhofer M, Niessner M, Valgaerts L, Stamminger M, 33. Gironi A, Fontani M, Bianchi T, Piva A, Barni M. A video foren-
Theobalt C. Realtime expression transfer for facial reenactment. sic technique for detecting frame deletion and insertion. In: Pro-
In: Proceedings of ACM SIGGRAPH Asia 2015 ACM transac- ceedings of IEEE international conference on acoustics, speech
tions on graphics (TOG), Art No. 183, vol. 34; 2015. Accessed and signal processing (ICASSP); 2014. p. 6226–6230.
19 Jan 2021. 34. Ding X, Yang G, Li R, Zhang L, Li Y, Sun X. Identifica-
15. Suwajanakorn S, Seitz SM, Kemelmacher-Shlizerman I. Synthe- tion of motion-compensated frame rate up-conversion based
sizing Obama: learning lip sync from audio. ACM Trans Graph on residual signals. IEEE Trans Circuits Syst Video Technol.
(TOG). 2017;36:4. 2018;28(7):1497–512.
16. Tewari A, Zollhofer M, Kim H, Garrido P, Bernard F, Perez P, 35. Sabir E, Cheng J, Jaiswal A, AbdAlmageed W, Masi I, Natarajan
Theobalt C. MoFA: model-based deep convolutional face autoen- P. Recurrent convolutional strategies for face manipulation detec-
coder for unsupervised monocular reconstruction. In: Proceed- tion in videos. In: Proceedings of the IEEE conference on com-
ings of IEEE international conference on computer vision (ICCV). puter vision and pattern recognition workshops; 2019. p. 80–87.
2017. pp. 3735–3744. Accessed 19 Jan 2021.
17. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, 36. Güera D, Delp EJ. Deepfake video detection using recurrent neu-
Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: ral networks. In: Proceedings of 15th IEEE international confer-
Proceedings of advances in neural information processing sys- ence on advanced video and signal based surveillance; 2018. p.
tems. 2014. p. 2672–2680 1–6. Accessed 19 Jan 2021.
SN Computer Science
98 Page 18 of 18 SN Computer Science (2021) 2:98
37. Li Y, Chang M, Lyu S. In Ictu Oculi: Exposing AI created fake 49. Li X, Lang Y, Chen Y, Mao X, He Y, Wang S, Xue H, Lu Q.
videos by detecting eye blinking. In: Proceedings of IEEE inter- Sharp multiple instance learning for deepfake video detection. In:
national workshop on information forensics and security. 2018. Proceedings of the 28th ACM international conference on multi-
p. 1–7. Accessed 19 Jan 2021. media. 2020. p. 1864–1872.
38. Afchar D, Nozick V, Yamagishi J, Echizen I. MesoNet: a compact 50. Charitidis P, Kordopatis-Zilos G, Papadopoulos S, Kompatsiaris I.
facial video forgery detection network. In: Proceedings of IEEE Investigating the impact of pre-processing and prediction aggrega-
international workshop on information forensics and security tion on the deepfake detection task; 2020. arXiv:2006.07084
(WIFS). 2018. p. 1–7. 51. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov
39. Li Y, Lyu S. Exposing deepfake videos by detecting face warping R. Dropout: a simple way to prevent neural networks from overfit-
artifacts. CoRR. 2018. https://github.com/deepfakes/faceswap. ting. J Mach Learn Res. 2014;15(1):1929–58.
Accessed 19 Jan 2021. 52. Baldi P, Sadowski PJ. Understanding dropout. In: Burges CJC,
40. Matern F, Riess C, Stamminger M. Exploiting visual artifacts Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors.
to expose deepfakes and face manipulations. In: Proceedings of Advances in neural information processing systems, vol. 26; 2013.
IEEE winter applications of computer vision workshops; 2019. p. p. 2814–22. Accessed 19 Jan 2021.
83–92. Accessed 19 Jan 2021. 53. Gloe T, Böhme R. The “Dresden image database” for benchmark-
41. Simonyan K, Zisserman A. Very deep convolutional networks for ing digital image forensics. In: Proceedings of the ACM sympo-
large-scale image recognition. 2014. arXiv:1409.1556. Accessed sium on applied computing; 2010. p. 1584–1590. Accessed 19 Jan
19 Jan 2021. 2021.
42. Yang X, Li Y, Lyu S. Exposing deep fakes using inconsistent head 54. Amerini I, Ballan L, Caldelli R, Del Bimbo A, Serra G. A
poses. In: Proceedings of IEEE international conference on acous- sift-based forensic method for copy-move attack detection and
tics, speech and signal processing (ICASSP). 2019. p. 8261–8265. transformation recovery. IEEE Trans. Inf. Forensics Secur.
Accessed 19 Jan 2021. 2011;6(3):1099–110.
43. Jung T, Kim S, Kim K. DeepVision: deepfakes detection using 55. Zampoglou M, Papadopoulos S, Kompatsiaris Y. Detecting image
human eye blinking pattern. IEEE Access. 2020;8:83144–54. splicing in the wild (web). In: Proceedings of IEEE international
https://github.com/deepfakes/faceswap. Accessed 19 Jan 2021. conference on multimedia expo workshops (ICMEW). 2015. p.
44. Mittal T, Bhattacharya U, Chandra R, Bera A, Manocha D. Emo- 1–6. Accessed 19 Jan 2021.
tions don’t lie: an audio-visual deepfake detection method using 56. Dolhansky B, Howes R, Pflaum B, Baram N, Canton-Ferrer C.
affective cues. In: Proceedings of the 28th ACM international The deepfake detection challenge (dfdc) preview dataset;2019.
conference on multimedia; 2020. p. 2823–2832. arXiv:abs/1910.08854. Accessed 19 Jan 2021.
45. Korshunov P, Marcel S. Deepfakes: a new threat to face recogni- 57. Rössler A, Cozzolino D, Verdoliva L, Riess C, Thies J, Nießner
tion? Assessment and detection. 2018. Accessed 19 Jan 2021. M. Faceforensics: a large-scale video dataset for forgery detection
46. Hasan HR, Salah K. Combating deepfake videos using blockchain in human faces. 2018. arXiv:abs/1803.09179.
and smart contracts. IEEE Access. 2019;7:41596–606. Accessed 58. Bellard F, Niedermayer M. Ffmpeg. Web (2012). https://githu
19 Jan 2021. b.com/deepfakes/faceswap3. Accessed 19 Jan 2021.
47. Sethi L, Dave A, Bhagwani R, Biwalkar A. Video security against
deepfakes and other forgeries. J Discrete Math Sci Cryptogr. Publisher’s Note Springer Nature remains neutral with regard to
2020;23:349–63. jurisdictional claims in published maps and institutional affiliations.
48. Ganiyusufoglu I, Ngo LM, Savov N, Karaoglu S, Gevers T. Spa-
tio-temporal features for generalized detection of deepfake videos.
2020.
SN Computer Science