A Deep Learning Framework For Audio Deepfake Detection
https://doi.org/10.1007/s13369-021-06297-w
Received: 6 June 2021 / Accepted: 3 October 2021 / Published online: 8 November 2021
© King Fahd University of Petroleum & Minerals 2021
Abstract
Audio deepfakes have increasingly emerged as a potential source of deceit with the development of avant-garde methods of synthetic speech generation. Differentiating fake audio from real audio is becoming ever more difficult owing to the increasing accuracy of text-to-speech models, posing a serious threat to speaker verification systems. Within the domain of audio deepfake detection, a majority of experiments have been based on the ASVspoof or the AVSpoof dataset using various machine learning and deep learning approaches. In this work, experiments were performed on a more recent dataset, the Fake or Real (FoR) dataset, which contains data generated using some of the best text-to-speech models. Two approaches have been adopted to solve the problem: a feature-based approach and an image-based approach. The feature-based approach involves converting the audio data into a dataset consisting of various spectral features of the audio samples, which are fed to machine learning algorithms for the classification of audio as fake or real. In the image-based approach, audio samples are converted into melspectrograms, which are input to deep learning algorithms, namely the Temporal Convolutional Network (TCN) and the Spatial Transformer Network (STN). TCN has been implemented because it is a sequential model and has been shown to give good results on sequential data. A comparison between the performances of both approaches has been made, and it is observed that the deep learning algorithms, particularly TCN, outperform the machine learning algorithms by a significant margin, with a 92 percent test accuracy. This solution presents a model for audio deepfake classification with an accuracy comparable to traditional CNN models like VGG16, XceptionNet, etc.
Keywords Audio deepfakes · Feature-based classification · Image-based classification · Temporal convolutional networks ·
Spatial transformer networks
Shraddha Suratkar (corresponding author): sssuratkar@ce.vjti.ac.in
Janavi Khochare: jskhochare_b17@ee.vjti.ac.in
Chaitali Joshi: cpjoshi_b17@ee.vjti.ac.in
Bakul Yenarkar: byyenarkar_b17@ee.vjti.ac.in
Faruk Kazi: fskazi@el.vjti.ac.in
Veermata Jijabai Technological Institute, Mumbai, India

1 Introduction

A steep increase in the number of social media platforms has facilitated the rapid spread of information, bringing with it several advantages by connecting people across the globe. But these media are also being used as weapons of mass misinformation, causing widespread unrest. One of the technologies which possesses great potential to transform human lives but, when misused, can be a source of great damage is deepfakes. A technology that was conceived as an innovative research area now threatens to become a tool to spread misinformation.

Deepfake, an amalgamation of the words deep learning and fake, refers to synthetic data generated using deep learning techniques such as Generative Adversarial Networks (GAN) and Recurrent Neural Networks (RNN). Deepfakes have been used as a tool for committing forgeries, allowing a person to impersonate someone else. Their development has been bolstered by an increasing influx of information over the internet, which makes it possible to train deep learning networks on large sets of data, rendering the forged material indistinguishable from the real for humans.

Earlier studies into audio deepfake detection tasks have primarily used the AVSpoof or ASVspoof datasets. However, one major disadvantage of using these datasets is that they do not include audio created by the most up-to-date text-to-speech algorithms, which sound more like human speech
and may be indistinguishable to the human ear. Classifying such audio against actual human-generated audio is thus a more complicated task, necessitating the development of robust solutions.

The experiments described in this work are based on the Fake or Real (FoR) dataset [1]. This dataset has been chosen to train the models described in this work because it contains samples of audio generated by the latest text-to-speech models as well as natural utterances, allowing the development of models that can distinguish between fake and real audio more robustly.

This work proposes a framework for the task of audio deepfake detection. To classify between real and fake audio, two approaches have been adopted: a feature-based approach using machine learning algorithms and an image-based approach using deep learning algorithms. In the feature-based classification approach, the audio file is converted into a feature-based dataset consisting of various spectral features of the samples; these features are then input to machine learning algorithms to classify audio samples as real or fake. The machine learning algorithms utilized for this purpose are Support Vector Machines (SVM) [2], Light Gradient Boosting Machines (LGBM) [3], Extreme Gradient Boosting (XGBoost) [4], K-Nearest Neighbors (KNN) [5], and Random Forest (RF) [6].
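To make the pipeline concrete, the following is a minimal sketch of such a feature-based classifier, assuming librosa for feature extraction and scikit-learn for the SVM; the exact feature set and hyperparameters of this work are not reproduced here, and the `paths`/`labels` variables are hypothetical placeholders for the FoR file list.

```python
# Minimal sketch of the feature-based approach: summarize each clip as a
# fixed-length vector of spectral features, then train a classifier.
# `paths` and `labels` (0 = real, 1 = fake) are hypothetical placeholders
# for the FoR dataset file list; the feature set is illustrative.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def spectral_features(path, sr=16000):
    """Load one audio file and return a fixed-length spectral summary."""
    y, sr = librosa.load(path, sr=sr)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Averaging over time collapses variable-length audio to one vector.
    return np.concatenate([
        centroid.mean(axis=1), rolloff.mean(axis=1),
        zcr.mean(axis=1), mfcc.mean(axis=1),
    ])

X = np.stack([spectral_features(p) for p in paths])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2,
                                          random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```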
The image-based classification approach represents each audio sample in the form of a melspectrogram, which is then fed as input to the deep learning algorithms that classify whether the given audio instance is real or fake. The deep learning architectures used include Convolutional Neural Networks (CNN), the Spatial Transformer Network (STN), and the Temporal Convolutional Network (TCN). TCN, being a sequential network, is better able to capture the features of data that is itself sequential in nature and hence presents a model more suited to the task of audio deepfake detection. The deep learning architectures classify melspectrograms generated from the incoming audio, which presents a case similar to image classification.
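As an illustration of this conversion step (a minimal sketch assuming librosa; the mel and figure parameters below are illustrative rather than the exact settings used in the experiments, and "sample.wav" is a hypothetical file):

```python
# Minimal sketch of the image-based input: render a melspectrogram that a
# CNN/STN/TCN can consume. "sample.wav" and the mel/figure parameters are
# illustrative, not the exact settings used in the experiments.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("sample.wav", sr=16000)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)    # log-compress for display

plt.figure(figsize=(3, 3))
librosa.display.specshow(S_db, sr=sr)
plt.axis("off")
plt.savefig("sample_mel.png", bbox_inches="tight", pad_inches=0)
plt.close()
```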
Thus, the major contributions of this work include adopting two approaches, viz., an image-based approach and a feature-based approach, for audio deepfake detection on the FoR dataset; comparing the performance of the models implemented under each approach; and proposing a comparatively simpler deep learning model for the task of audio deepfake detection whose accuracy is comparable to that of traditional deep learning algorithms like VGG19, XceptionNet, etc.

The rest of the paper is organized as follows: Sect. 2 describes the work which has been done previously in the domain of audio processing and audio deepfake detection. Section 3 introduces and explains the most important concepts and terminologies used in this work. Section 4 explains the exact methodology and experiments performed to obtain the results described in Sect. 5, which also presents a comparison between the various models implemented in this work as well as the methods employed in a previously published work. Section 6 gives the closing remarks of the paper and highlights future directions of the work.

2 Related Work

Audio signals form an unregulated natural world with a variety of content, such as music, speech, and ambient sounds. Most research involving audio as an input over the last two decades has used methods that differ primarily in the types of audio features used or the classification techniques employed in various scenarios to classify audio [7–11]. With the advancements in text-to-speech technologies, cutting-edge systems can now generate good-quality, near realistic-sounding speech from a limited quantity of speech data from target speakers [12]. On the other hand, this also poses a substantial threat to automatic speaker verification (ASV) systems, since the faked speech generated by such techniques can readily attack the system [13,14]. The protection of ASV systems therefore calls for deepfake speech detection.

A majority of the experiments in the domain of audio deepfake detection have been based on the ASVspoof dataset. These works provide an early insight into the methods and models used for detecting forged speech. Both machine learning and deep learning models have been implemented for this task. [15–17] propose the implementation of Deep Neural Networks (DNNs) for the task of synthetic speech detection, involve the use of variants of Convolutional Neural Networks to study the suitability of these models, and suggest that end-to-end approaches are better suited for detecting fake audio. [18] proposes using the Relative Phase Shift to detect synthetic speech, together with a Gaussian Mixture Model (GMM) and a Support Vector Machine (SVM), which are shown to reduce the vulnerability of speaker verification systems. A comparative study between DNN- and Hidden Markov Model (HMM)-based approaches, judged on their capability to identify spoofed speech, has been made in [19]; it shows that DNN outperforms HMM, implying that DNN is better able to capture the patterns that are characteristic of spoofed audio. [20] demonstrates the usage of spectrograms as a form of image input to CNNs in audio processing scenarios and thus forms the basis for image-based audio processing techniques. The paper [21] presents a DNN classifier for detection, also highlighting that Human Log-Likelihoods (HLL) as a scoring metric is more appropriate compared to traditional log-likelihood ratios (LLR). The paper also implements different cepstral coefficients to train classifiers. The research paper [22] examines various robust audio features like the Mel Frequency Cepstral
Coefficient (MFCC), spectrogram, etc., and their effect on the accuracy of GMM-UBM as the classifier model, showing that a combination of various audio features gives the best result in terms of Equal Error Rate (EER). [23,24] propose the use of CNNs as classifiers for sound classification scenarios. An exhaustive description and comparison of the various deep learning models used for audio deepfake detection has been made in [25], which shows that the CNN-RNN model achieves the best result among all the models used. [24,26] describe how short-term spectral features can be effectively used for synthetic speech detection, further showing that MFCCs outperform other spectral features when used as feature inputs to the model. Finally, [27] discusses the various limitations of spoofing detection mechanisms.

CNNs are considered to be robust and sturdy models; however, they are restricted to being spatially invariant to the input data in a parameter-effective and computationally efficient manner. The paper [28] introduces a state-of-the-art architecture called the STN, which permits the spatial manipulation of data within the network. STNs engender models that learn invariance to scale, rotation, translation, and generic warping, providing more efficient results.

TCN [29] has been shown in experiments to outperform traditional RNNs and LSTMs across a wide range of tasks. [24] serves as a starting point for understanding the concept of TCN and its building blocks, as well as outlining the various applications to which this model can be applied. TCN has already been used in language processing [30], activity detection [31,32], and event detection tasks. [33] describes the use of TCN for audio spoof detection; this paper compares TCN and Multilayer Perceptron (MLP) performance and shows that TCN is better than MLP. [34] describes a TCN architecture for time series forecasting and outlines how the network can efficiently identify temporal dependencies in data. It can be concluded from the experiments described in these papers that TCN captures temporal dependence in data effectively.

With synthetic speech generation making advancements each day, the requirement for detection of synthetic speech increases, and so does the requirement for an up-to-date dataset of natural and computer-generated speech. The paper "FoR: A Dataset for Synthetic Speech Detection" [1] introduces a new dataset, the FoR dataset, which has around 198,000 utterances including both real speech and synthetic speech generated using the latest algorithms. It also focuses on analyzing how synthetic speech is generated and the performance of various deep learning models that classify this synthetic speech. It also presents various experiments that demonstrate the versatility and usefulness of the newly introduced FoR dataset, both for improving the quality of synthetic audio generated using various methods and for the detection of computer-generated speech, as it is possible to train deep learning-based classifiers using this dataset. This paper shows that the VGG16 model performs the best among all the other models used. A major drawback of these works is that TCN as a model, and its suitability for sequential input data, has not been considered for the task of audio deepfake detection; this gap is addressed in this work.

3 Preliminaries

1. TCN
Temporal Convolutional Networks, or TCN (Fig. 1) [24,29], are a family of CNNs with two distinctive features:

(a) The output is identical in length to the input.
(b) Data are not leaked to the past from the future.

To satisfy the first condition, a 1D Fully-Convolutional Network (FCN) is used, wherein the length of all hidden layers is kept equal to the length of the input layer, along with zero padding of length (kernel size − 1). Further, to meet the second condition, causal convolutions are used, meaning that the output at a certain time is convolved only with components from that time and earlier times in the previous layer.

Hence, TCN can be seen as a combination of 1D-FCN and causal convolutions. This type of architecture is able to capture long-term dependencies using dilated convolutions. Dilated convolutions allow the TCN to have a wider receptive field, meaning that a wider section of the input data can be used to obtain the output. For a 1D sequence input y ∈ R^n and a filter f : {0, . . . , m − 1} → R, the dilated convolution operation F on element u can be defined as:

F(u) = (y \ast_b f)(u) = \sum_{j=0}^{m-1} f(j)\, y_{u-bj}    (1)

where, in Eq. (1), b represents the dilation factor, m is the filter size, and the index u − bj reaches back into the past.
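A minimal sketch of the dilated causal convolution of Eq. (1), assuming PyTorch (the function name and toy shapes are illustrative): left-padding the input by (m − 1) · b keeps the output length equal to the input length (condition a) while preventing future samples from leaking into the output (condition b).

```python
# Sketch of the dilated causal convolution of Eq. (1) (PyTorch assumed;
# names and toy shapes are illustrative). Left-padding by (m - 1) * b keeps
# the output as long as the input and hides all future samples.
import torch
import torch.nn.functional as F

def dilated_causal_conv(y, f, b):
    """y: (batch, in_ch, n) input; f: (out_ch, in_ch, m) filter; b: dilation."""
    m = f.shape[-1]
    y_padded = F.pad(y, ((m - 1) * b, 0))   # pad on the left only (causal)
    return F.conv1d(y_padded, f, dilation=b)

y = torch.randn(1, 1, 16)                   # toy 1D sequence, n = 16
f = torch.randn(1, 1, 3)                    # filter size m = 3
out = dilated_causal_conv(y, f, b=2)
print(out.shape)                            # torch.Size([1, 1, 16])
```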
The receptive field size can be changed by changing the kernel size or the dilation rate. Furthermore, skip connections and residual blocks within the architecture allow the gradients to pass through easily.
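For concreteness, a standard way of computing the resulting receptive field (a textbook formula for stacked dilated convolutions, not taken from this paper) for L layers with kernel size k and dilations d_i:

```latex
% Receptive field R of L stacked dilated convolutions, kernel size k,
% dilations d_i (a standard formula, not taken from this paper):
R = 1 + \sum_{i=1}^{L} (k - 1)\, d_i
% With the common doubling schedule d_i = 2^{i-1}:
R = 1 + (k - 1)\,(2^{L} - 1)
% e.g. k = 3, L = 5 gives R = 1 + 2 \cdot 31 = 63 input samples.
```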
o = Activation(y + F(y))    (2)

As given in Eq. (2), the output of the series of transformations F is added to the input y of the block.
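A minimal sketch of such a residual block, again assuming PyTorch (the channel count and choice of ReLU are illustrative; the exact block in this work may add components such as weight normalization or dropout):

```python
# Sketch of the TCN residual block of Eq. (2) (PyTorch assumed; channel
# count and ReLU are illustrative -- real blocks often add weight norm
# and dropout). The transformed signal F(y) is added back to the input y.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation   # causal left padding
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation)
        self.act = nn.ReLU()

    def forward(self, y):
        out = F.pad(y, (self.pad, 0))             # left-pad only (causal)
        out = self.conv(out)                      # F(y)
        return self.act(y + out)                  # o = Activation(y + F(y))

block = ResidualBlock(channels=8, dilation=2)
x = torch.randn(1, 8, 32)
print(block(x).shape)                             # torch.Size([1, 8, 32])
```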
2. STN
CNNs are considered to be extremely powerful models
Fig. 13 TCN
The machine learning models have almost similar performance and saturate after a certain value. Accuracies of the models were in the range of 0.60–0.70. The SVM model performed the best among the machine learning models. Table 2 shows the results for image-based classification using the deep learning models. It can be observed that TCN performed the best, giving a test accuracy of 92 percent. STN performed better than the machine learning models, giving a test accuracy of 80 percent. Overall, image-based classification provided better results than feature-based classification on this audio deepfake detection task. This can be attributed to the fact that TCN, being a sequential network, is able to capture the features of sequential data more effectively compared to the other models implemented. Hence, it can be observed that employing an image-based approach with TCN as the classifier gives better results than the machine learning models with a feature-based approach.

Moreover, experiments have been conducted earlier on the same dataset, as described in [1]. The results obtained in those experiments are shown in Table 3. It can be observed that the best-performing model has an accuracy of about 96%. However, these models are computationally expensive, whereas the model implemented in this paper, i.e., TCN, is comparatively less complex and achieves an accuracy that does not lag behind by a significant amount. Hence, implementing TCN can be computationally efficient while providing equivalent results.

Table 2 Image-based classification results

Model   Validation accuracy   Test accuracy
TCN     0.98                  0.92
STN     0.89                  0.80

Table 3 Test accuracy values

Reference                  Algorithm                 Test accuracy
Algorithms implemented     4-layer fully connected   0.46
in [1]                     2-layer CNN (+2 FC)       0.41
                           3-layer CNN (+2 FC)       0.38
                           VGG16                     0.95
                           VGG19                     0.96
                           InceptionV3               0.79
                           ResNet                    0.78
                           MobileNet                 0.94
                           XceptionNet               0.74
Proposed system            SVM                       0.67
                           Random Forest             0.62
                           KNN                       0.62
                           XGBoost                   0.59
                           LGBM                      0.60
                           TCN                       0.92
                           STN                       0.80
A comparison between the performances of both approaches has been made. From the results mentioned above, the deep learning algorithms surpassed the machine learning algorithms. The TCN model gave the highest test accuracy of 92 percent. The reason behind TCN performing well may be attributed to the fact that the input data is sequential and TCN, being a sequential network, performs well on such data. Also, the architecture in the proposed system is simpler compared to the models used in paper [1] and achieves similar accuracy to those models.

This work can be expanded in the future by introducing new features into the feature-based methodology that will help improve the results of the machine learning models. Further, amplitude-based classification can be implemented using deep learning models. One limitation of this work is that in the image-based approach, STFT and MFCC have not been used as input. The image-based approach can be extended to use STFT and MFCC as inputs to the deep learning models along with the melspectrogram, and a comparative study can be performed to see which feature gives the best result with a given classifier. Also, variants of TCN can be implemented as classifiers and a comparison drawn between their performances to ascertain which variant performs best. According to WaveNet [40], it is feasible to provide raw audio as input to neural networks without converting it to spectrograms. Raw audio has not been much utilized as input for synthetic speech recognition classifiers. This might improve classification accuracy while also cutting down on pre-processing time, as spectrograms need not be generated. Hence, the development of raw audio classifiers can be studied.
ment of raw audio classifiers can be studied. volutional neural network-based voice presentation attack detec-
tion. In: 2017 IEEE International Joint Conference on Biometrics
Acknowledgements Authors acknowledge Centre of Excellence in (IJCB), pp. 335–341. IEEE, New York (2017)
Complex and Nonlinear Dynamical Systems (CoE-CNDS) laboratory 17. Dinkel, H.; Qian, Y.; Kai, Yu.: Investigating raw wave deep neural
for providing support and platform for research. networks for end-to-end speaker spoofing detection. IEEE/ ACM
Trans. Audio Speech Lang. Process. 26(11), 2002–2014 (2018)
18. De Leon, P.L.; Hernaez, I.; Saratxaga, I.; Pucher, M.; Yamagishi,
Declarations J.: Detection of synthetic speech for the problem of imposture. In:
2011 IEEE International Conference on Acoustics, Speech and Sig-
nal Processing (ICASSP), pp. 4844–4847. IEEE, New York (2011)
19. Ze, H.; Senior, A.; Schuster, M.: Statistical parametric speech syn-
Conflict of interest The authors declare that they no conflict of interest. thesis using deep neural networks. In: 2013 IEEE International
Conference on Acoustics, Speech and Signal Processing, pp. 7962–
7966. IEEE, London (2013)
20. Dörfler, M.; Bammer, R.; Grill, T.: Inside the spectrogram: Convo-
References lutional neural networks in audio processing. In: 2017 International
Conference on Sampling Theory and Applications (SampTA), pp.
1. Reimao, R.; Tzerpos, V.: For: a dataset for synthetic speech detec- 152–155. IEEE, New York (2017)
tion. In: 2019 International Conference on Speech Technology and 21. Hong, Yu.; Tan, Z.-H.; Ma, Z.; Martin, R.; Guo, J.: Spoofing
Human–Computer Dialogue (SpeD), pp. 1–10. IEEE (2019) detection in automatic speaker verification systems using DNN
2. Evgeniou, T.; Pontil, M.: Support vector machines: theory and classifiers and dynamic acoustic features. IEEE Trans. Neural
applications. In: Advanced Course on Artificial Intelligence, pp. Netw. Learn. Syst. 29(10), 4633–4644 (2017)
249–257. Springer, Berlin (1999) 22. Balamurali, B.T.; Lin, K.E.; Lui, S.; Chen, J.-M.; Herremans, D.:
3. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Toward robust audio spoofing detection: a detailed comparison
Liu, T.-Y.: Lightgbm: a highly efficient gradient boosting decision of traditional and learned features. IEEE Access 7, 84229–84241
tree. Adv. Neural. Inf. Process. Syst. 30, 3146–3154 (2017) (2019)
4. Chen, T.; Guestrin, C.: Xgboost: a scalable tree boosting system. In: 23. Maccagno, A.; Mastropietro, A.; Mazziotta, U.; Scarpiniti, M.; Lee,
Proceedings of the 22nd ACM SIGKDD International Conference Y.-C.; Uncini, A.: A CNN approach for audio classification in con-
on Knowledge Discovery and Data Mining, pp. 785–794 (2016) struction sites. In: Progresses in Artificial Intelligence and Neural
5. Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K.: KNN model-based Systems, pp. 371–381. Springer, Berlin (2019)
approach in classification. In: OTM Confederated International
123
24. Bai, S.; Kolter, J.Z.; Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling (2018). arXiv preprint arXiv:1803.01271
25. Zhang, C.; Yu, C.; Hansen, J.H.L.: An investigation of deep-learning frameworks for speaker verification antispoofing. IEEE J. Select. Top. Signal Process. 11(4), 684–694 (2017)
26. Paul, D.; Pal, M.; Saha, G.: Spectral features for synthetic speech detection. IEEE J. Select. Top. Signal Process. 11(4), 605–617 (2017)
27. Kinnunen, T.; Sahidullah, M.; Delgado, H.; Todisco, M.; Evans, N.; Yamagishi, J.; Lee, K.A.: The ASVspoof 2017 challenge: assessing the limits of replay spoofing attack detection (2017)
28. Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K.: Spatial transformer networks. Adv. Neural Inform. Process. Syst. 28, 2017–2025 (2015)
29. Lea, C.; Vidal, R.; Reiter, A.; Hager, G.D.: Temporal convolutional networks: a unified approach to action segmentation. In: European Conference on Computer Vision, pp. 47–54. Springer, Berlin (2016)
30. Alqahtani, S.; Mishra, A.; Diab, M.: Efficient convolutional neural networks for diacritic restoration (2019). arXiv preprint arXiv:1912.06900
31. Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 156–165 (2017)
32. Farha, Y.A.; Gall, J.: MS-TCN: multi-stage temporal convolutional network for action segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3575–3584 (2019)
33. Tian, X.; Xiao, X.; Chng, E.S.; Li, H.: Spoofing speech detection using temporal convolutional neural network. In: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1–6. IEEE, London (2016)
34. Chen, Y.; Kang, Y.; Chen, Y.; Wang, Z.: Probabilistic forecasting with temporal convolutional neural network. Neurocomputing 399, 491–501 (2020)
35. Danilyuk, K.: ConvNets series: spatial transformer networks. Towards Data Science (2017)
36. Nagarajan, S.; Nettimi, S.S.S.; Kumar, L.S.; Nath, M.K.; Kanhe, A.: Speech emotion recognition using cepstral features extracted with novel triangular filter banks based on bark and ERB frequency scales. Digit. Signal Process. 104, 102763 (2020)
37. Jia, Y.; Zhang, Y.; Weiss, R.J.; Wang, Q.; Shen, J.; Ren, F.; Chen, Z.; Nguyen, P.; Pang, R.; Moreno, I.L.; et al.: Transfer learning from speaker verification to multispeaker text-to-speech synthesis (2018). arXiv preprint arXiv:1806.04558
38. Dash, T.K.; Mishra, S.; Panda, G.; Satapathy, S.C.: Detection of COVID-19 from speech signal using bio-inspired based cepstral features. Pattern Recogn. 117, 107999 (2021)
39. Zheng, F.; Zhang, G.; Song, Z.: Comparison of different implementations of MFCC. J. Comput. Sci. Technol. 16(6), 582–589 (2001)
40. van den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K.: WaveNet: a generative model for raw audio (2016). arXiv preprint arXiv:1609.03499