
Arabian Journal for Science and Engineering (2022) 47:3447–3458

https://doi.org/10.1007/s13369-021-06297-w

RESEARCH ARTICLE-ELECTRICAL ENGINEERING

A Deep Learning Framework for Audio Deepfake Detection


Janavi Khochare1 · Chaitali Joshi1 · Bakul Yenarkar1 · Shraddha Suratkar1 · Faruk Kazi1

Received: 6 June 2021 / Accepted: 3 October 2021 / Published online: 8 November 2021
© King Fahd University of Petroleum & Minerals 2021

Abstract
Audio deepfakes have been increasingly emerging as a potential source of deceit with the development of avant-garde methods of synthetic speech generation. Differentiating fake audio from real audio is becoming ever more difficult owing to the increasing accuracy of text-to-speech models, posing a serious threat to speaker verification systems. Within the domain of audio deepfake detection, a majority of experiments have been based on the ASVSpoof or the AVSpoof dataset using various machine learning and deep learning approaches. In this work, experiments were performed on a more recent dataset, the Fake or Real (FoR) dataset, which contains data generated using some of the best text-to-speech models. Two approaches have been adopted to solve the problem: a feature-based approach and an image-based approach. The feature-based approach involves converting the audio data into a dataset consisting of various spectral features of the audio samples, which are fed to machine learning algorithms for the classification of audio as fake or real. In the image-based approach, audio samples are converted into melspectrograms, which are input to deep learning algorithms, namely the Temporal Convolutional Network (TCN) and the Spatial Transformer Network (STN). TCN has been implemented because it is a sequential model and has been shown to give good results on sequential data. A comparison between the performances of both approaches has been made, and it is observed that the deep learning algorithms, particularly TCN, outperform the machine learning algorithms by a significant margin, with a 92 percent test accuracy. This solution presents a model for audio deepfake classification whose accuracy is comparable to that of traditional CNN models like VGG16, XceptionNet, etc.

Keywords Audio deepfakes · Feature-based classification · Image-based classification · Temporal convolutional networks ·
Spatial transformer networks

Shraddha Suratkar (corresponding author)
sssuratkar@ce.vjti.ac.in

Janavi Khochare
jskhochare_b17@ee.vjti.ac.in

Chaitali Joshi
cpjoshi_b17@ee.vjti.ac.in

Bakul Yenarkar
byyenarkar_b17@ee.vjti.ac.in

Faruk Kazi
fskazi@el.vjti.ac.in

1 Veermata Jijabai Technological Institute, Mumbai, India

1 Introduction

A steep increase in the number of social media platforms has facilitated the rapid spread of information, bringing with it several advantages by connecting people across the globe. But these media are also being used as weapons of mass misinformation, causing widespread unrest. One of the technologies which possesses great potential to transform human lives, but which can be a source of great damage when misused, is deepfakes. A technology that was conceived as an innovative research area now threatens to serve as a tool to spread misinformation.

Deepfake, an amalgamation of the words deep learning and fake, refers to synthetic data generated using deep learning techniques such as Generative Adversarial Networks (GAN) and Recurrent Neural Networks (RNN). Deepfakes have been used as a tool for committing forgery, allowing a person to impersonate someone else. Their development has been bolstered by the increasing influx of information over the internet, which makes it possible to train deep learning networks on large amounts of data and renders the forged material indistinguishable from the real thing for humans.

Earlier studies on audio deepfake detection have primarily used the AVSpoof or ASVSpoof datasets. However, one major disadvantage of these datasets is that they do not include audio created by the most up-to-date text-to-speech algorithms, audio which sounds more like human speech
and may be indistinguishable to the human ear. Distinguishing such audio from actual human-generated audio is a more complicated task, necessitating the development of robust solutions.

The experiments described in this work are based on the Fake or Real (FoR) dataset [1]. This dataset has been chosen to train the models described in this work because it contains samples of audio generated by the latest text-to-speech models as well as natural utterances, allowing the development of models that can distinguish between fake and real audio more robustly.

This work proposes a framework for the task of audio deepfake detection. To classify between real and fake audio, two approaches have been adopted: a feature-based approach using machine learning algorithms and an image-based approach using deep learning algorithms. In the feature-based classification approach, the audio file is converted into a feature-based dataset consisting of various spectral features of the samples; these features are then input to machine learning algorithms to classify audio samples as real or fake. The machine learning algorithms utilized for this purpose are Support Vector Machines (SVM) [2], Light Gradient Boosting Machines (LGBM) [3], Extreme Gradient Boosting (XGBoost) [4], K-Nearest Neighbors (KNN) [5], and Random Forest (RF) [6]. The image-based classification approach represents the audio sample in the form of melspectrograms, which are fed as input to deep learning algorithms that classify whether the given audio instance is real or fake. The deep learning architectures used include Convolutional Neural Networks (CNN), the Spatial Transformer Network (STN), and the Temporal Convolutional Network (TCN). TCN, being a sequential network, is better able to capture the features of data which is also sequential in nature and hence presents a model more suited to the task of audio deepfake detection. The deep learning architectures have been used to classify melspectrograms generated from the incoming audio, which presents a case similar to image classification. Thus, the major contributions of this work include adopting two approaches, viz. an image-based approach and a feature-based approach, for audio deepfake detection on the FoR dataset, comparing the performance of the models implemented under each approach, and proposing a comparatively simpler deep learning model for the task of audio deepfake detection whose accuracy is comparable to that of traditional deep learning algorithms like VGG19, XceptionNet, etc.

The rest of the paper is organized as follows: Sect. 2 describes the work which has been done previously in the domain of audio processing and audio deepfake detection. Section 3 introduces and explains the most important concepts and terminologies that have been implemented in this work. Section 4 explains the exact methodology and experiments performed to obtain the results described in Sect. 5, which also presents a comparison between the various models implemented in this work as well as the methods employed in a previously published work. Section 6 gives the closing remarks of the paper and highlights future directions of the work.

2 Related Work

Audio signals form an unregulated natural world with a variety of content, such as music, speech, and ambient sounds. Most research involving audio as an input over the last two decades has used methods that differ primarily in the types of audio features used or the classification techniques employed in various scenarios to classify audio [7-11]. With the advancements in text-to-speech technologies, cutting-edge systems can now generate good-quality, near realistic-sounding speech from a limited quantity of speech data from target speakers [12]. On the other hand, this also poses a substantial threat to automatic speaker verification (ASV) systems, since the faked speech generated by such techniques can readily attack the system [13,14]. The protection of ASV systems, therefore, calls for deepfake speech detection.

A majority of the experiments in the domain of audio deepfake detection have been based on the ASVSpoof dataset. These works provide an early insight into the methods and models used for detecting forged speech. Both machine learning and deep learning models have been implemented for the said task. [15-17] propose the implementation of Deep Neural Networks (DNNs) for the task of synthetic speech detection, involve the use of variants of Convolutional Neural Networks to study the suitability of these models, and suggest that end-to-end approaches are more suited for detecting fake audio. [18] proposes the use of Relative Phase Shift to detect synthetic speech, together with a Gaussian Mixture Model (GMM) and Support Vector Machine (SVM), which are shown to reduce the vulnerability of speaker verification systems. A comparative study between DNN- and Hidden Markov Model (HMM)-based approaches, in terms of their capability to identify spoofed speech, has been made in [19]; it shows that DNN outperforms HMM, implying that DNN is better able to capture the patterns characteristic of spoofed audio. [20] demonstrates the usage of spectrograms as a form of image input to CNNs in audio processing scenarios and thus forms the basis for image-based audio processing techniques. The paper [21] presents a DNN classifier for detection, also highlighting that Human Log-Likelihoods (HLL) is a more appropriate scoring metric than the traditional log-likelihood ratio (LLR). The paper also implements different cepstral coefficients to train classifiers. The research paper [22] examines various robust audio features like the Mel Frequency Cepstral Coefficient (MFCC), spectrogram, etc., and their effect on the accuracy of a GMM-UBM classifier model, and shows that using a combination of various audio features gives the best result in terms of Equal Error Rate (EER).
[23,24] propose the use of CNNs as classifiers for sound classification scenarios. An exhaustive description and comparison of the various deep learning models used for audio deepfake detection is made in [25], which shows that a CNN-RNN model achieves the best result among all the models considered. [24,26] describe how short-term spectral features can be effectively used for synthetic speech detection, further showing that MFCCs outperform other spectral features when used as feature inputs to the model. Finally, [27] discusses the various limitations of spoofing detection mechanisms.

CNNs are considered to be robust and sturdy models; however, they are limited by the lack of an ability to be spatially invariant to the input data in a parameter- and computation-efficient manner. The paper [28] introduces a state-of-the-art architecture called STN which permits the spatial manipulation of data within the network. STNs engender models that learn invariance to scale, rotation, translation, and generic warping, providing more efficient results.

TCN [29] has been shown in experiments to outperform traditional RNNs and LSTMs across a wide range of tasks. [24] serves as a starting point for understanding the concept of TCN and its building blocks, as well as outlining the various applications to which this model can be applied. TCN has already been used in language processing [30], activity detection [31,32], and event detection tasks. [33] describes the use of TCN for audio spoof detection; it compares TCN and Multilayer Perceptron (MLP) performance and shows that TCN is better than MLP. [34] describes a TCN architecture for time series forecasting and outlines how the network can efficiently identify temporal dependencies in data. It can be concluded from the experiments described in these papers that TCN captures temporal dependence in data effectively.

With synthetic speech generation making advancements each day, the requirement for detection of synthetic speech increases, and so does the requirement for an up-to-date dataset of natural and computer-generated speech. The paper "FoR: A Dataset for Synthetic Speech Detection" [1] introduces a new dataset, the FoR dataset, which has around 198,000 utterances including both real and synthetic speech generated using the latest algorithms. It also focuses on analyzing how synthetic speech is generated and on the performance of various deep learning models that classify this synthetic speech. It also presents various experiments that demonstrate the versatility and usefulness of the newly introduced FoR dataset, both for improving the quality of synthetic audio generated using various methods and for the detection of computer-generated speech, as it is possible to train deep learning-based classifiers using this dataset. That paper shows that the VGG16 model performs the best among all the other models used. A major drawback of these works is that TCN as a model, and its suitability for sequential input data, has not been considered for the task of audio deepfake detection; this gap is addressed in the present work.

3 Preliminaries

1. TCN
Temporal Convolutional Networks or TCN (Fig. 1) [24,29] are a family of CNNs with two distinctive features:

(a) The output is identical in length to the input.
(b) Data are not leaked from the future to the past.

Fig. 1 TCN

To satisfy the first condition, a 1D Fully-Convolutional Network (FCN) is used wherein the length of all hidden layers is kept equal to the length of the input layer, along with zero padding of length (kernel size - 1). To meet the second condition, causal convolutions are used, meaning that the output at a certain time is convolved only with components from that time and earlier times in the previous layer.

Hence, a TCN can be seen as a combination of a 1D-FCN and causal convolutions. This type of architecture can capture long-term dependencies using dilated convolutions. Dilated convolutions allow the TCN to have a wider receptive field, meaning that a wider section of the input data can be used to obtain the output. Here, for a 1D sequence input $y \in \mathbb{R}^n$ and a filter $f : \{0, \ldots, m-1\} \rightarrow \mathbb{R}$, the dilated convolution operation $F$ on element $u$ can be defined as:

$$F(u) = (y *_b f)(u) = \sum_{j=0}^{m-1} f(j)\, y_{u-bj} \tag{1}$$

Here in Eq. (1), $b$ represents the dilation factor, $m$ is the filter size, and $u - bj$ accounts for the direction of the past.

The receptive field size can be changed by changing the kernel size or the dilation rate. Furthermore, skip connections and residual blocks within the architecture allow the gradients to pass through easily:

$$o = \mathrm{Activation}(y + F(y)) \tag{2}$$

As given in Eq. (2), the output of the series of transformations $F$ is added to the input $y$ of the block.
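The residual, dilated causal convolution described by Eqs. (1) and (2) can be sketched in a few lines of Keras code. This is a minimal illustration, not the authors' implementation; the filter count, kernel size, dilation rate, and input length are assumed values.

```python
# Minimal sketch of a TCN-style residual block (assumed hyperparameters, not the
# authors' code): a causal, dilated 1D convolution F(y) added back to its input y,
# as in o = Activation(y + F(y)).
import tensorflow as tf
from tensorflow.keras import layers

def residual_tcn_block(x, filters=64, kernel_size=3, dilation_rate=2):
    # "causal" padding means the output at time t sees only inputs at times <= t,
    # which is condition (b): no leakage from the future to the past.
    y = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation_rate, activation="relu")(x)
    y = layers.SpatialDropout1D(0.1)(y)
    if x.shape[-1] != filters:
        # 1x1 convolution so the residual addition of Eq. (2) has matching channels
        x = layers.Conv1D(filters, 1, padding="same")(x)
    return layers.Activation("relu")(layers.Add()([x, y]))

inputs = tf.keras.Input(shape=(256, 1))        # a length-256 univariate sequence
outputs = residual_tcn_block(inputs)
model = tf.keras.Model(inputs, outputs)         # output length equals input length
```

Stacking such blocks with increasing dilation rates (1, 2, 4, ...) grows the receptive field exponentially while keeping the output length equal to the input length.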

2. STN
CNNs are considered to be extremely powerful models for image-based tasks, but they are hindered by the lack of an ability to be spatially invariant to the input data in a computationally and parameter-efficient manner. So, a spatial transformer was introduced, which notably allows spatial manipulation of the data within the network. A spatial transformer module is a dynamic mechanism that can perform a spatial transformation on an image or a feature map. A generalization of differentiable attention to any spatial transformation is called an STN [28] (Fig. 2). An STN applies an affine transformation followed by bilinear interpolation, which in turn helps to remove spatial variance from the images.

Fig. 2 A spatial transformer

The three components that make up the spatial transformer mechanism are the localization network, the grid generator, and the sampler. The input feature maps are passed through several hidden layers by the localization network to produce the output parameters theta for the spatial transformer. The size of theta varies depending on the type of transformation that is parameterized. The localization network can be defined as:

– Input Input feature map U of shape (H, W, C)
– Output A transformation matrix Θ of shape (6,)
– Structure Either a ConvNet or an FCN

The grid generator creates a sampling grid using the predicted transformation parameters. The task allocated to the grid generator is to calculate the coordinates in the source feature map of every target pixel. This can be performed by an affine transformation [35]:

$$\Theta = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \tag{3}$$

Equation (3) represents the affine transformation matrix Θ [35].

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \Theta \begin{bmatrix} x_i^t \\ y_i^t \\ 1 \end{bmatrix} \tag{4}$$

Here, $(x_i^t, y_i^t)$ are the coordinates in the target (output) feature map and $(x_i^s, y_i^s)$ are the coordinates in the source (input) feature map. Equation (4) relates the source column vector to the target column vector, with Θ being the affine transformation matrix.

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = T_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \tag{5}$$

The calculation of the affine transformation is carried out using Eq. (5) [35].

The feature map and the sampling grid are given as input to the sampler, which generates an output map sampled from the input at the grid points. The pixel intensity of every target coordinate is generated by bilinear interpolation performed by the sampler:

$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, \max(0, 1 - |x_i^s - m|)\, \max(0, 1 - |y_i^s - n|) \tag{6}$$

Pixel intensities via bilinear interpolation are calculated using Eq. (6) [35].
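The grid-generator and sampler operations of Eqs. (4)-(6) can be illustrated with a small NumPy sketch. This is only an illustration of the coordinate mapping and bilinear sampling, not the trainable network; the use of coordinates normalized to [-1, 1] is an assumption borrowed from common STN implementations.

```python
# Illustrative NumPy sketch of the STN grid generator (Eqs. (4)-(5)) and the
# bilinear sampler (Eq. (6)); not the learnable localization network itself.
import numpy as np

def affine_grid(theta, H, W):
    # target (output) grid coordinates, normalized to [-1, 1]
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    tgt = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # (H, W, 3)
    return tgt @ theta.T                                   # (H, W, 2): source coords

def bilinear_sample(U, src):
    # Eq. (6): each target pixel is a weighted sum of its four source neighbours
    H, W = U.shape
    x = (src[..., 0] + 1) * (W - 1) / 2                    # back to pixel indices
    y = (src[..., 1] + 1) * (H - 1) / 2
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    x0c, x1c = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    y0c, y1c = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)
    wa = (x1 - x) * (y1 - y); wb = (x1 - x) * (y - y0)
    wc = (x - x0) * (y1 - y); wd = (x - x0) * (y - y0)
    return wa * U[y0c, x0c] + wb * U[y1c, x0c] + wc * U[y0c, x1c] + wd * U[y1c, x1c]

theta = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])       # identity transform
U = np.random.rand(256, 256)                               # e.g. a melspectrogram "image"
V = bilinear_sample(U, affine_grid(theta, 256, 256))
assert np.allclose(U, V, atol=1e-6)
```

With Θ set to the identity transform, the sampled output reproduces the input feature map, which is a convenient sanity check.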
3. Melspectrograms
A spectrogram is a graphic representation of signal strength in terms of the signal's intensity, showing the frequency variations of acoustic signals with respect to time. It is generated by performing the Fast Fourier Transform (FFT) on overlapping windowed sections of the signal. Often, the frequency is shown on the y-axis and the time on the x-axis, while the color of the plot encodes the intensity. A melspectrogram [21,28] is a type of spectrogram in which the frequency is mapped and presented on the Mel scale, i.e., a scale of pitches. Figure 4 shows a melspectrogram of an audio instance in the FoR dataset.

Fig. 3 Conversion of an acoustic signal to a melspectrogram
Fig. 4 Melspectrogram

The steps performed are as follows (a short librosa sketch follows these steps):
(i) The first step in generating a melspectrogram involves framing and windowing of the input audio signal. The goal is to divide the signal into a series of short overlapping frames in order to ensure that they are stationary; a stationary signal represents the genuine statistical and temporal properties. Windowing is frequently done with tapered windows, such as the Hamming window, which reduce the distortion at the edges of the segments.
(ii) The signals are converted from the time domain to the frequency domain using the Fourier transform, which allows them to be represented statistically and spectrally.
(iii) The spectrogram envelope is generated using the Mel filter banks [36].
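A minimal librosa sketch of steps (i)-(iii) is given below; the file name and the FFT, hop, and mel-band settings here are illustrative assumptions rather than the paper's exact values.

```python
# Illustrative librosa/NumPy sketch of steps (i)-(iii) above (assumed parameters).
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=16000)           # hypothetical input file

# (i) framing + windowing and (ii) FFT are both handled by the STFT
stft = librosa.stft(y, n_fft=1024, hop_length=256, window="hamming")
power = np.abs(stft) ** 2                              # power spectrogram

# (iii) apply a bank of Mel filters to obtain the melspectrogram envelope
mel_fb = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=128)
melspec = mel_fb @ power

# convert to decibels, as is common when plotting spectrograms
melspec_db = librosa.power_to_db(melspec, ref=np.max)
```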

4 Methodology

1. Dataset Description
(a) Fake or Real (FoR) Dataset The FoR dataset [1] is a set of around 195,000 utterances from real humans and computer-generated speech. The computer-generated samples are created using the latest methodologies in speech synthesis. The real utterances include samples recorded using different microphones, covering a variety of accents and all genders. Classifiers can be trained using this dataset to identify synthetic speech. There are four versions of this dataset: for-orig, for-norm, for-2sec, and for-rerec:
(i) For-original It consists of a collection of original files from various speakers. This dataset has an unequal number of samples across genders and classes. It has around 195,541 samples from various sources, and it was published so that the raw data can be used with various pre-processing techniques.
(ii) For-norm It has files similar to those of for-original, with an equal number of samples across genders and classes. It is normalized in terms of volume, sample rate, and number of channels. The for-norm version consists of around 69,400 samples.
(iii) For-2sec This dataset consists of files trimmed at 2 s. It has an equal number of samples across genders and classes, just like the normalized version (for-norm), and has a total of 17,870 samples from these sources.
(iv) For-rerec This version of the dataset includes rerecorded files from the for-2sec dataset. This is done to simulate a situation in which an audio clip is sent by an attacker through a phone call, voice message, or similar voice channel.
(b) Generated audios This dataset consists of a collection of audios generated using real-time voice cloning [37]. The input given to this model consists of an audio sample of a person extracted from YouTube and sample text. The output is audio generated using real-time voice cloning and includes audios from speakers of all genders, ages, and accents. This collection of audios is included in the dataset to assess how well the model can be generalized.

2. Preprocessing
The dataset consists of data in the form of audio, which cannot be directly given as input to the classification models. These data need to be converted to a suitable format before starting the actual classification task, and so preprocessing of the data was necessary. In this project, two classification-based approaches have been adopted, namely feature-based classification and image-based classification. The process flow diagram of deepfake audio detection is shown in Fig. 5.
(a) Feature-based classification:
In the feature-based classification approach, the audio file is converted into a feature-based dataset consisting of various features of the audio samples [22], as shown in Fig. 6.

Fig. 5 Process flow diagram for deepfake audio detection
Fig. 6 Feature extraction

First, the individual audio file is taken as input and loaded using the librosa library. This outputs a time series and the sampling rate, which represent the digital form of the audio file. Using librosa library functions, the features mentioned below are extracted (a short librosa-based extraction sketch follows the feature list). Each feature computation gives an array and, to obtain a single value per feature, the mean of this array is calculated, which gives the final value of the feature for that particular audio file. These features are then fed to the machine learning algorithms for the classification of audio samples as real or fake. Each audio sample in the dataset has been represented as a vector of 37 audio features. This was done using a sliding window technique, and thus the features are time-varying sequences. The features that have been extracted include:

(i) Root Mean Square Energy Given a signal x(n), the Root Mean Square Energy (RMSE) is given by Eq. (7):

$$x_{\mathrm{rms}} = \sqrt{\frac{1}{n}\left(x_1^2 + x_2^2 + \cdots + x_n^2\right)} \tag{7}$$

where $n$ is the number of samples and $x_i$ is the $i$th sample.
For a general signal, the RMSE is calculated as a single value from all the values present in the signal. However, due to the non-stationary nature of audio signals, it is preferable to compute the energy over short intervals to capture the variation over time. Different types of speech may have different energy content, making it an important feature to consider for audio classification.
(ii) Chroma Features The chroma features represent the frequency spectrum as 12 pitch classes. The entire frequency range is divided into 12 bins denoting the 12 chromas present in the musical octave. The chromagram of a sample audio signal is presented in Fig. 7.
(iii) Spectral Centroid The spectral centroid is indicative of the nature of the frequencies prevalent in the signal. For instance, a higher spectral centroid value indicates a concentration in the high-frequency region of the spectrum. It is computed for the ith audio frame as shown in Eq. (8):

$$\mu = \frac{\sum_{k=1}^{N} f(k)\, m(k)}{\sum_{k=1}^{N} m(k)} \tag{8}$$

where $m(k)$ is the magnitude at the $k$th frequency bin and $f(k)$ is the center frequency of the $k$th frequency bin. The spectral centroid plot of the audio sample is shown in Fig. 8.
(iv) Spectral Bandwidth The range of frequencies in the audio signal spectrum which corresponds to one-half of the peak magnitude of the spectrum is known as the spectral bandwidth of the audio signal. The pth order spectral bandwidth is given by Eq. (9):

$$\left(\sum_{k} m(k)\,\bigl(f(k) - \mu\bigr)^{p}\right)^{\frac{1}{p}} \tag{9}$$

where $m(k)$ is the magnitude at the $k$th frequency bin, $f(k)$ is the center frequency of the $k$th frequency bin, and $\mu$ is the spectral centroid. The spectral bandwidth plot of the sample audio signal is shown in Fig. 9.
(v) Spectral Rolloff The spectral rolloff is the frequency below which 85 percent of the energy present in the spectrum resides. This captures the general shape of the spectrum by ignoring the outlier higher frequencies and focusing on the portions of the spectrum where most of the energy is present. It is given by Eq. (10):

$$\underset{f_r \in \{1,\ldots,N\}}{\arg\max} \; \sum_{k=1}^{f_r} m(k) \ge 0.85 \sum_{k=1}^{N} m(k) \tag{10}$$

where $f_r$ is the rolloff frequency and $m(k)$ is the magnitude at the $k$th frequency bin. The spectral rolloff plot of the sample audio signal is shown in Fig. 10.
(vi) Zero Crossing Rate The rate of sign changes in a signal is referred to as the zero-crossing rate. Lower frequency signals have a lower zero-crossing rate due to fewer oscillations per second as compared to higher frequency signals. It is given by Eq. (11):

$$\frac{1}{W_L} \sum_{n=1}^{W_L} \bigl|\mathrm{sgn}[x(n)] - \mathrm{sgn}[x(n-1)]\bigr| \tag{11}$$

where $x(n)$ is the audio signal, $W_L$ is the length of the window, and $\mathrm{sgn}$ is the signum function. The zero crossing rate plot of the audio sample is presented in Fig. 11.
(vii) Mel Frequency Cepstral Coefficients (MFCCs) MFCCs reflect the enclosed envelope of the power spectrum that depicts the characteristics of the vocal tract and the human voice. The Mel-frequency cepstrum formed by these coefficients is used to identify periodic components of a time-domain signal as peaks in a new domain called the "quefrency" domain [38]. The MFCCs are obtained by transforming the signal from the time domain to the frequency domain, and then from the frequency domain to the quefrency domain, using a series of mathematical transformations. The process for generating MFCCs is similar to that for generating a melspectrogram, with the following additions [39]:
A. The magnitudes of the powers from the Mel filters are converted to a logarithmic scale.
B. The Discrete Cosine Transform (DCT) is then applied to the results produced in the preceding stage, which yields the cepstral coefficients.
The MFCC plot of the sample audio signal is shown in Fig. 12.
In this work, 20 MFCCs have been considered.

Fig. 7 Chromagram of sample audio signal
Fig. 8 Spectral centroid plot of sample audio signal
Fig. 9 Spectral bandwidth plot of sample audio signal
Fig. 10 Spectral rolloff plot of sample audio signal
Fig. 11 Zero crossing rate plot of sample audio signal
Fig. 12 MFCC plot of sample audio signal
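The per-file extraction described above can be sketched with librosa as follows. The exact composition of the 37-dimensional vector is not spelled out in the text, so the breakdown used here (12 chroma bins, RMS energy, spectral centroid, bandwidth, rolloff, zero-crossing rate, and 20 MFCCs) is an assumption that happens to total 37 values; the file path is hypothetical.

```python
# Hedged sketch of per-file feature extraction: each feature is computed frame-wise
# with librosa and reduced to its mean, giving one value per feature. The exact
# 37-feature composition below is an assumption.
import librosa
import numpy as np

def extract_features(path):
    y, sr = librosa.load(path, sr=16000)
    feats = []
    feats.extend(np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1))     # 12 chroma bins
    feats.append(np.mean(librosa.feature.rms(y=y)))                            # RMS energy
    feats.append(np.mean(librosa.feature.spectral_centroid(y=y, sr=sr)))
    feats.append(np.mean(librosa.feature.spectral_bandwidth(y=y, sr=sr)))
    feats.append(np.mean(librosa.feature.spectral_rolloff(y=y, sr=sr)))
    feats.append(np.mean(librosa.feature.zero_crossing_rate(y)))
    feats.extend(np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20), axis=1))  # 20 MFCCs
    return np.array(feats)                                                      # shape (37,)

vector = extract_features("sample.wav")   # hypothetical file path
```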
(b) Image-based classification
The image-based classification approach represents the audio sample in the form of melspectrograms, which are then fed as input to deep learning algorithms that classify whether the given audio instance is real or fake. Each audio sample in the dataset has been represented as a melspectrogram of the audio data. Melspectrograms have been generated using the librosa library. The number of mel levels for representing the frequency axis was selected as 256. The human ear recognizes frequencies in the range of 20 Hz to 20 kHz. As the lowest threshold frequency perceptible to the human ear is approximately 20 Hz, frequencies below 20 Hz are not shown in the spectrogram. The sampling rate of all audio files is 16 kHz, so according to the Nyquist sampling theorem, the audio samples should contain frequencies only up to 8 kHz. So the
frequency range of the spectrogram is between 20 Hz and 8 kHz. The Hann window function is used, with a window length of 1024. The hop length of the window is taken as 256. The melspectrogram generated as an image of the audio sample has a height of 256, corresponding to the number of mel levels, with the width depending on the length of the audio. As most of the deep learning algorithms implemented for the classification task require the dataset to be uniform in terms of the size of the melspectrogram image, the width of the image has been set to 256. For this task, the for-2sec dataset has been used, so each spectrogram corresponds to 2 s of audio.
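A sketch of the melspectrogram generation with the parameters stated above (16 kHz audio, 256 mel bands, a 1024-sample Hann window, hop length 256, and a 20 Hz-8 kHz range) is shown below; the padding/cropping to a fixed width of 256 frames and the image-saving step are assumptions about how the uniform image size is obtained.

```python
# Hedged sketch of melspectrogram image generation with the parameters given in the
# text; the fixed-width padding/cropping and the PNG export are assumptions.
import librosa
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load("sample_2sec.wav", sr=16000)      # hypothetical 2 s clip
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=256, n_fft=1024,
                                     hop_length=256, window="hann",
                                     fmin=20, fmax=8000)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Height is 256 (mel bands); pad or crop the time axis to 256 frames so that all
# images have the same 256x256 size.
width = 256
if mel_db.shape[1] < width:
    mel_db = np.pad(mel_db, ((0, 0), (0, width - mel_db.shape[1])), mode="minimum")
else:
    mel_db = mel_db[:, :width]

plt.imsave("sample_melspec.png", mel_db, origin="lower", cmap="magma")
```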
3. Implementation
(a) Feature-based approach
The feature-based classification task involves classifying the given audio instance as real or fake. The data obtained by preprocessing the audio samples, i.e., the vectors of 37 audio features, were fed as input to the machine learning algorithms. For this task, five machine learning algorithms, namely SVM [2], LGBM [3], XGBoost [4], KNN [5], and RF [6], have been implemented, and the performance of each of these has been compared. The hyperparameters were optimized using GridSearchCV. The output of these models is whether the given instance of audio is real or fake.
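Hyperparameter search with GridSearchCV can be sketched as below for one of the five classifiers (SVM). The parameter grid, the synthetic stand-in data, and the scaling step are illustrative assumptions, not the authors' settings.

```python
# Hedged sketch of GridSearchCV-based tuning on a 37-feature table; the grid and the
# synthetic data are placeholders for the real FoR feature matrix.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=37, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
grid = GridSearchCV(pipe,
                    param_grid={"svm__C": [0.1, 1, 10],
                                "svm__kernel": ["rbf", "linear"]},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```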
(b) Image-based approach:
In the image-based classification task, melspectrograms obtained after preprocessing of the audio samples are given as input to the models. In this approach, two deep learning architectures have been used as audio classifiers: STN [28] and TCN [29]. The main purpose of using these networks is that the data presented in a melspectrogram is sequential in nature, and a sequential network performs well on sequential input data. The entire dataset is segregated into training, testing, and evaluation sets according to the ratio 60:20:20. The output of the models is whether the given sample is real or fake.
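The 60:20:20 split can be obtained with two successive calls to train_test_split; the file list and labels below are placeholders.

```python
# Hedged sketch of a 60:20:20 train/test/evaluation split; file names and labels are
# placeholders for the melspectrogram images and their real/fake labels.
from sklearn.model_selection import train_test_split

files = [f"melspec_{i:05d}.png" for i in range(1000)]   # hypothetical image paths
labels = [i % 2 for i in range(1000)]                   # 0 = real, 1 = fake (assumed)

train_f, rest_f, train_y, rest_y = train_test_split(
    files, labels, test_size=0.4, stratify=labels, random_state=42)
test_f, eval_f, test_y, eval_y = train_test_split(
    rest_f, rest_y, test_size=0.5, stratify=rest_y, random_state=42)
# -> 60% training, 20% testing, 20% evaluation
```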
The detailed explanation of the architectures of the models implemented is as follows:

(i) STN The architecture consists of a 2D-FCN. The optimizer used for this model is Adam, with binary cross-entropy as the loss function. The model is implemented according to the architecture explained in the Preliminaries and shown in Fig. 2. The architecture includes the localization network, the grid generator, and the sampler. The model was trained on an Nvidia DGX-1 for 15 epochs, and the training and evaluation process took around half an hour to complete. To avoid overfitting, early stopping has been implemented. The model is trained on both classes, real and fake.
(ii) TCN The architecture consists of a 1D-FCN where the convolutions are causal. This is accomplished by setting the "padding" parameter to causal (see Fig. 13a). The optimizer chosen is Adam, with binary cross-entropy as the loss function. The entire model can be described as follows: the basic building block of the TCN is made up of 1D convolution layers with the necessary dilation factor, each followed by a spatial dropout layer. A residual connection connects the input to the output to allow for the flow of gradients. This structure is repeated several times. The input is given to three such TCN stacks with varying kernel sizes, and the output from the three blocks is concatenated, followed by a batch normalization and a dropout layer, to finally give the desired output. The overall architecture is shown in Fig. 13b. The TCN model was trained and tested on an Nvidia DGX-1 for 25 epochs with the use of early stopping to prevent overfitting. The entire process of training and testing took around half an hour to complete.

Fig. 13 TCN

For both STN and TCN, after training, the model's performance has been validated using a validation dataset, and finally the models are applied on a test dataset which includes data similar to that encountered in the training dataset as well as data from the custom-built dataset. The overall flow of the experiments carried out is shown in Fig. 5.
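The described model can be sketched in Keras as below: three causal, dilated convolutional branches with different kernel sizes whose outputs are concatenated and passed through batch normalization, dropout, and a sigmoid output. The branch widths, dilation rates, pooling, and the reading of a 256x256 melspectrogram as 256 time steps of 256 mel bins are assumptions, not the authors' exact configuration.

```python
# Hedged Keras sketch of a TCN-based classifier in the spirit of the description
# above; hyperparameters and the input interpretation are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def tcn_branch(x, kernel_size):
    for d in (1, 2, 4):                                    # stacked dilated causal convs
        y = layers.Conv1D(64, kernel_size, padding="causal",
                          dilation_rate=d, activation="relu")(x)
        y = layers.SpatialDropout1D(0.1)(y)
        if x.shape[-1] != 64:
            x = layers.Conv1D(64, 1, padding="same")(x)    # match channels for the residual
        x = layers.Add()([x, y])                           # residual connection
    return layers.GlobalAveragePooling1D()(x)

inputs = tf.keras.Input(shape=(256, 256))                  # (time frames, mel bins)
branches = [tcn_branch(inputs, k) for k in (2, 3, 5)]      # varying kernel sizes
x = layers.Concatenate()(branches)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)         # real vs. fake

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
# model.fit(train_ds, validation_data=val_ds, epochs=25, callbacks=[early_stop])
```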

5 Result

This section describes the results obtained from the experiments conducted as described in the previous sections for audio deepfake detection. The metrics used for evaluating the performance of the various models are accuracy, precision, and recall. Table 1 presents the results observed from the machine learning models for the feature-based audio classification task in this experiment. It can be observed that all the machine learning models have almost similar performance and saturate after a certain value. Accuracies of the models were in the range of 0.60-0.70. The SVM model performed the best among the machine learning models. Table 2 shows the results for image-based classification using the deep learning models. It can be observed that the TCN performed the best, giving a test accuracy of 92 percent. STN performed better than the machine learning models, giving a test accuracy of 80 percent. Overall, image-based classification provided better results than feature-based classification on this audio deepfake detection task. This can be attributed to the fact that TCN, being a sequential network, is able to capture the features of sequential data more effectively than the other models that have been implemented. Hence, it can be observed that employing an image-based approach with TCN as the classifier gives better results than the machine learning models with a feature-based approach.

Moreover, experiments have been conducted earlier on the same dataset, as described in [1]. The results obtained in those experiments are shown in Table 3. It can be observed that the best-performing model has an accuracy of about 96%. However, those models are computationally expensive, whereas the model implemented in this paper, i.e., TCN, is comparatively less complex and achieves an accuracy that does not lag behind by a significant amount. Hence, implementing TCN can be computationally efficient while providing equivalent results.

Table 1 Feature-based audio classification results

Model           Validation accuracy   Test accuracy   Precision   Recall   F-score
SVM             0.85                  0.67            0.69        0.67     0.65
Random Forest   0.80                  0.62            0.60        0.59     0.57
KNN             0.75                  0.62            0.61        0.61     0.61
XGBoost         0.70                  0.59            0.59        0.58     0.57
LGBM            0.75                  0.60            0.60        0.59     0.58

Table 2 Image-based classification results

Model   Validation accuracy   Test accuracy
TCN     0.98                  0.92
STN     0.89                  0.80

Table 3 Test accuracy values

Reference                       Algorithm                 Test accuracy
Algorithms implemented in [1]   4-layer fully connected   0.46
                                2-Layer CNN (+2FC)        0.41
                                3-Layer CNN (+2FC)        0.38
                                VGG16                     0.95
                                VGG19                     0.96
                                InceptionV3               0.79
                                ResNet                    0.78
                                MobileNet                 0.94
                                XceptionNet               0.74
Proposed system                 SVM                       0.67
                                Random Forest             0.62
                                KNN                       0.62
                                XGBoost                   0.59
                                LGBM                      0.60
                                TCN                       0.92
                                STN                       0.80
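The metrics reported in Tables 1-3 can be computed from model predictions with scikit-learn; a small illustrative sketch follows (the label arrays are made up).

```python
# Illustrative computation of accuracy, precision, recall, and F-score with
# scikit-learn; y_true and y_pred are made-up stand-ins for real predictions.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 1, 0, 1, 0, 1, 1]          # ground truth: 0 = real, 1 = fake (assumed)
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]          # model predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, fscore, _ = precision_recall_fscore_support(y_true, y_pred,
                                                               average="weighted")
print(accuracy, precision, recall, fscore)
```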

6 Conclusion

Deepfakes have been a major cause of concern in recent times. This material poses a danger to the whole world in the form of manipulated audio-visual content, as it can be the source of mass misinformation or even harassment. It is, therefore, necessary to use more rigorous methods to combat the destructive uses of deepfakes. This work focuses mainly on the task of deepfake audio detection. Two approaches have been adopted to classify between real and fake audio: a feature-based approach using machine learning algorithms and an image-based approach using deep learning algorithms; a comparative analysis of both these approaches has then been made. From the results mentioned above, the deep learning algorithms surpassed the machine learning algorithms. The TCN model gave the highest test accuracy of 92 percent. The reason behind TCN performing well may be attributed to the fact that the input data are sequential, and TCN, being a sequential network, performs well on such data. Also, the architecture in the proposed system is simpler compared to the models used in paper [1] and achieves similar accuracy to those models.

This work can be expanded in the future by introducing new features to the feature-based methodology, which will help improve the results of the machine learning models. Further, amplitude-based classification can be implemented using deep learning models. One limitation of this work is that, in the image-based approach, STFT and MFCC have not been used as inputs. The image-based approach can be extended to use STFT and MFCC as inputs to the deep learning models along with the melspectrogram, and a comparative study can be performed to see which feature gives the best result with a given classifier. Also, variants of TCN can be implemented as classifiers and a comparison can be drawn between their performances to ascertain which variant performs best. According to WaveNet [40], it is feasible to provide raw audio as input to neural networks without converting it to spectrograms. Raw audio has not been utilized much for synthetic speech recognition. This might improve classification accuracy while also cutting down on pre-processing time, as spectrograms need not be generated. Hence, the development of raw audio classifiers can be studied.

Acknowledgements The authors acknowledge the Centre of Excellence in Complex and Nonlinear Dynamical Systems (CoE-CNDS) laboratory for providing support and a platform for research.

Declarations

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Reimao, R.; Tzerpos, V.: FoR: a dataset for synthetic speech detection. In: 2019 International Conference on Speech Technology and Human–Computer Dialogue (SpeD), pp. 1–10. IEEE (2019)
2. Evgeniou, T.; Pontil, M.: Support vector machines: theory and applications. In: Advanced Course on Artificial Intelligence, pp. 249–257. Springer, Berlin (1999)
3. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y.: LightGBM: a highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 30, 3146–3154 (2017)
4. Chen, T.; Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794 (2016)
5. Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K.: KNN model-based approach in classification. In: OTM Confederated International Conferences on the Move to Meaningful Internet Systems, pp. 986–996. Springer, Berlin (2003)
6. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
7. Wold, E.; Blum, T.; Keislar, D.; Wheaten, J.: Content-based classification, search, and retrieval of audio. IEEE Multimedia 3(3), 27–36 (1996)
8. Nanni, L.; Costa, Y.M.G.; Lucio, D.R.; Silla, C.N., Jr.; Brahnam, S.: Combining visual and acoustic features for audio classification tasks. Pattern Recogn. Lett. 88, 49–56 (2017)
9. Lie, L.; Zhang, H.-J.; Jiang, H.: Content analysis for audio classification and segmentation. IEEE Trans. Speech Audio Process. 10(7), 504–516 (2002)
10. Zhao, J.; Mao, X.; Chen, L.: Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 47, 312–323 (2019)
11. Carey, M.J.; Parris, E.S.; Lloyd-Thomas, H.: A comparison of features for speech, music discrimination. In: 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), vol. 1, pp. 149–152. IEEE, London (1999)
12. Stylianou, Y.: Voice transformation: a survey. In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3585–3588. IEEE, New York (2009)
13. Wu, Z.; Evans, N.; Kinnunen, T.; Yamagishi, J.; Alegre, F.; Li, H.: Spoofing and countermeasures for speaker verification: a survey. Speech Commun. 66, 130–153 (2015)
14. Wu, Z.; De Leon, P.L.; Demiroglu, C.; Khodabakhsh, A.; King, S.; Ling, Z.-H.; Saito, D.; Stewart, B.; Toda, T.; Wester, M.; et al.: Anti-spoofing for text-independent speaker verification: an initial database, comparison of countermeasures, and human performance. IEEE/ACM Trans. Audio Speech Lang. Process. 24(4), 768–783 (2016)
15. Reimao, R.A.M.: Synthetic speech detection using deep neural networks. Thesis, York University, Toronto, Ontario (2019)
16. Muckenhirn, H.; Magimai-Doss, M.; Marcel, S.: End-to-end convolutional neural network-based voice presentation attack detection. In: 2017 IEEE International Joint Conference on Biometrics (IJCB), pp. 335–341. IEEE, New York (2017)
17. Dinkel, H.; Qian, Y.; Kai, Yu.: Investigating raw wave deep neural networks for end-to-end speaker spoofing detection. IEEE/ACM Trans. Audio Speech Lang. Process. 26(11), 2002–2014 (2018)
18. De Leon, P.L.; Hernaez, I.; Saratxaga, I.; Pucher, M.; Yamagishi, J.: Detection of synthetic speech for the problem of imposture. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4844–4847. IEEE, New York (2011)
19. Ze, H.; Senior, A.; Schuster, M.: Statistical parametric speech synthesis using deep neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7962–7966. IEEE, London (2013)
20. Dörfler, M.; Bammer, R.; Grill, T.: Inside the spectrogram: convolutional neural networks in audio processing. In: 2017 International Conference on Sampling Theory and Applications (SampTA), pp. 152–155. IEEE, New York (2017)
21. Hong, Yu.; Tan, Z.-H.; Ma, Z.; Martin, R.; Guo, J.: Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features. IEEE Trans. Neural Netw. Learn. Syst. 29(10), 4633–4644 (2017)
22. Balamurali, B.T.; Lin, K.E.; Lui, S.; Chen, J.-M.; Herremans, D.: Toward robust audio spoofing detection: a detailed comparison of traditional and learned features. IEEE Access 7, 84229–84241 (2019)
23. Maccagno, A.; Mastropietro, A.; Mazziotta, U.; Scarpiniti, M.; Lee, Y.-C.; Uncini, A.: A CNN approach for audio classification in construction sites. In: Progresses in Artificial Intelligence and Neural Systems, pp. 371–381. Springer, Berlin (2019)
24. Bai, S.; Kolter, J.Z.; Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling (2018). arXiv preprint arXiv:1803.01271
25. Zhang, C.; Yu, C.; Hansen, J.H.L.: An investigation of deep-learning frameworks for speaker verification antispoofing. IEEE J. Select. Top. Signal Process. 11(4), 684–694 (2017)
26. Paul, D.; Pal, M.; Saha, G.: Spectral features for synthetic speech detection. IEEE J. Select. Top. Signal Process. 11(4), 605–617 (2017)
27. Kinnunen, T.; Sahidullah, M.; Delgado, H.; Todisco, M.; Evans, N.; Yamagishi, J.; Lee, K.A.: The ASVspoof 2017 challenge: assessing the limits of replay spoofing attack detection (2017)
28. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K.: Spatial transformer networks. Adv. Neural Inf. Process. Syst. 28, 2017–2025 (2015)
29. Lea, C.; Vidal, R.; Reiter, A.; Hager, G.D.: Temporal convolutional networks: a unified approach to action segmentation. In: European Conference on Computer Vision, pp. 47–54. Springer, Berlin (2016)
30. Alqahtani, S.; Mishra, A.; Diab, M.: Efficient convolutional neural networks for diacritic restoration (2019). arXiv preprint arXiv:1912.06900
31. Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
32. Farha, Y.A.; Gall, J.: MS-TCN: multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019)
33. Tian, X.; Xiao, X.; Chng, E.S.; Li, H.: Spoofing speech detection using temporal convolutional neural network. In: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1–6. IEEE, London (2016)
34. Chen, Y.; Kang, Y.; Chen, Y.; Wang, Z.: Probabilistic forecasting with temporal convolutional neural network. Neurocomputing 399, 491–501 (2020)
35. Danilyuk, K.: ConvNets series: spatial transformer networks. Towards Data Science (2017)
36. Nagarajan, S.; Nettimi, S.S.S.; Kumar, L.S.; Nath, M.K.; Kanhe, A.: Speech emotion recognition using cepstral features extracted with novel triangular filter banks based on bark and ERB frequency scales. Digit. Signal Process. 104, 102763 (2020)
37. Jia, Y.; Zhang, Y.; Weiss, R.J.; Wang, Q.; Shen, J.; Ren, F.; Chen, Z.; Nguyen, P.; Pang, R.; Moreno, I.L.; et al.: Transfer learning from speaker verification to multispeaker text-to-speech synthesis (2018). arXiv preprint arXiv:1806.04558
38. Dash, T.K.; Mishra, S.; Panda, G.; Satapathy, S.C.: Detection of COVID-19 from speech signal using bio-inspired based cepstral features. Pattern Recogn. 117, 107999 (2021)
39. Zheng, F.; Zhang, G.; Song, Z.: Comparison of different implementations of MFCC. J. Comput. Sci. Technol. 16(6), 582–589 (2001)
40. van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K.: WaveNet: a generative model for raw audio (2016). arXiv preprint arXiv:1609.03499
