Article
Transformer-Based Multilingual Speech Emotion Recognition
Using Data Augmentation and Feature Fusion
Badriyya B. Al-onazi 1 , Muhammad Asif Nauman 2 , Rashid Jahangir 3 , Muhammad Mohsin Malik 4 ,
Eman H. Alkhammash 5, * and Ahmed M. Elshewey 6

1 Department of Language Preparation, Arabic Language Teaching Institute, Princess Nourah bint
Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
2 Department of Computer Science, University of Engineering and Technology, Lahore 54890, Pakistan
3 Department of Computer Science, COMSATS University Islamabad, Vehari Campus, Vehari 61100, Pakistan
4 Department of Interdisciplinary Field, National University of Medical Sciences, Rawalpindi 46000, Pakistan
5 Department of Computer Science, College of Computers and Information Technology, Taif University,
P.O. Box 11099, Taif 21944, Saudi Arabia
6 Department of Computer Science, Faculty of Computers and Information, Suez University, Suez, Egypt
* Correspondence: eman.kms@tu.edu.sa

Abstract: In recent years data science has been applied in a variety of real-life applications such as human-computer interaction, computer gaming, mobile services, and emotion evaluation. Among this wide range of applications, speech emotion recognition (SER) is an emerging and challenging research topic. For SER, earlier studies used handcrafted features that provide the best results in simple settings but fail to deliver accuracy in complex scenarios. Later, deep learning techniques were used for SER to detect features from speech signals automatically. Deep learning-based SER techniques overcome the issues of accuracy, yet there are still significant gaps in the reported methods; studies using lightweight CNNs failed to learn optimal features from composite acoustic signals. This study proposed a novel SER model to overcome the limitations mentioned earlier. We focused on Arabic vocal emotions in particular because they have received relatively little attention in research. The proposed model performs data augmentation before feature extraction. The 273 derived features were fed as input to the transformer model for emotion recognition. The model was applied to four datasets named BAVED, EMO-DB, SAVEE, and EMOVO. The experimental findings demonstrated the robust performance of the proposed model compared to existing techniques. The proposed SER model achieved 95.2%, 93.4%, 85.1%, and 91.7% accuracy on the BAVED, EMO-DB, SAVEE, and EMOVO datasets, respectively. The highest accuracy was obtained using the BAVED dataset, indicating that the proposed model is well suited to Arabic vocal emotions.

Keywords: multilingual; transformer; SER; speech emotion recognition; Arabic vocal emotion; artificial intelligence

1. Introduction
Humans are able to produce a wide range of sounds in conversations, pertaining to a variety of multilingual words [1,2]. The understanding of these words among humans communicating with each other is obvious, but when it comes to human–machine emulation, it requires considerable speech signal processing. Human speech signals reveal exploitable information about the language, overarching message content, intonation, trait, accent, gender, and other distinct properties of the speaker. Making machines capable of inferring emotional information from speakers' utterances by processing and classifying the speech signals is referred to as Speech Emotion Recognition (SER) [3].

Researchers have recently paid a lot of attention to the creation of an efficient human-
technology communication interface, which falls under the category of human-computer
interaction (HCI). In a broader perspective, emotion recognition from speech signals has
been used in a variety of applications, including speech synthesis, robotics, automobile
industry, multimedia technologies, healthcare, education, forensics, customer service, and
entertainment [4]. The taxonomy of text-to-speech aids in the development of screen
readers for the blind [5]. Nowadays, robotic technologies increasingly rely on sensor data from internal and external environmental factors, and emotion recognition influences their decision-making. Experimental paradigms that identify emotions from publicly available datasets must be carefully evaluated since they may fail when used in a real-world context [6]. In the automobile industry, advanced image/video and voice-based signal processing methods are employed to identify drowsiness and lassitude in drivers [7–9].
The abstraction of hidden features of an individual or a system extracted from het-
erogeneous multimedia content for recommendation systems has received the attention
of SER researchers [10]. Intensity detection of virus attacks through patients’ voice sig-
nals has opened a new challenge area in medical science since the current COVID-19
spread [11]. Speech recognition techniques have recently been incorporated into online
learning platforms for emotional analysis. Positive emotions consequently influence stu-
dents’ inspiration, dedication, and self-learning abilities. On the other hand, negative
emotions leave an adverse impact on students’ performance [12]. A recent study has
discovered several significant facts in human brain images, demonstrating how they do
cognitive-based speech synthesis, which has given rise to new research trends in automatic
speech recognition (ASR) and forensic systems [11–13]. In order to provide high-quality
customer care services, SER techniques are being applied to estimate the emotions of
telemarketing agents and customers from their conversations.
The desired emotion detection can be achieved directly from speech signals. A classic SER system includes signal pre-processing, feature extraction, feature selection, and a classifier that maps the features to emotional categories such as happy, sad, fear, surprise, and anger. Each phase of the SER process is detailed as follows.
Speech Signal Processing: Speech signals data collected from speakers’ utterances are
generally uneven and cumbersome, which requires apt pre-processing and post-processing
for extraction of appropriate acoustic features [13]. Pre-emphasis, speech segmentation
or block framing, window weighing function, voice activity detection (VAD), normal-
ization, noise reduction, and feature selection are examples of commonly used speech
processing techniques.
Feature Extraction: The success ratio of SER classifiers mainly depends on the ex-
traction of discriminative and relevant feature sets. There is currently no set of traits
for SER that is accepted globally, according to experimental studies. Therefore, whether
to use global or local features for better classification outcomes relies on the research
methodology. In addition, inappropriate features can add overhead for the classifier.
The continuous and variable length speech signals enclose rich information about the
emotions. Global features signify overall statistical facts like minimum or maximum
values, mean, median, and standard deviation. These are also known as long-term or
suprasegmental features. Local features, on the other hand (also known as short-term
or segmental features) [14] describe the chronologically changing aspects in which a
stationary state is estimated [15].
In recent research work, handcrafted feature extraction remained immensely under
consideration, however, it required sufficient domain knowledge for the experts. The
four main categories listed below [16,17] can be used to analyze speech features in SER
systems [18,19].

1. Prosodic Features: Prosodic or para-linguistic features have yielded some of the most distinguishing properties of emotional information for SER systems.
Prosodic features like intonation and rhythm are extracted from large units of human
utterances, thus categorized in long-term features. Widely employed characteristics
on which typical prosodic features are based are pitch, energy, length, and frequency.
2. Spectral Features: Humanoid vocal tract is a kind of filter that controls the shape of
sound produced during speech utterance. Vocal tract properties are better explained
when presented in the frequency domain. The process of transformation of time
domain signals into equivalent frequency domain signals through Fourier transform
contributes to spectral features extraction. The segmented acoustic data of variable
length (usually 20 to 30 ms) obtained from window weighting function is used as
input in spectral filters. Various types of spectral parameters in speech have been
explored using manifold dimensional reduction methods, such as Linear Predictive
Coding (LPC), Linear Prediction Cepstral Coefficients (LPCC), Log Frequency Power
Coefficient (LFPC), Gamma tone Frequency Cepstral Coefficients (GFCC) and Mel-
Frequency Cepstral Coefficients (MFCC). Among all the above-mentioned filter banks, MFCC is the most widely implemented in SER because of its robustness and resistance to noisy backgrounds. After segmentation of the utterances, MFCC features are obtained that describe
the short-term power spectrum in acoustic signals. Frequencies of the acoustic tone of
vocal tract are known as Formants, which define phonetic features of vowels.
3. Voice Quality Features: Voice Quality features define the physical characteristics
of the glottal source. And there exists a strong correlation between the quality of
voice and the emotional data of the speech. The set of features associated with
voice quality includes jitter (i.e., per cycle frequency variation parameter), shimmer
(defined as amplitude variation parameter for soundwave), fundamental frequency
and harmonics to noise ratio (HNR) [20].
4. TEO Based Features: Teager Energy Operator (TEO) was presented by Teager [21]
for the detection of stress rate in speech signals. In a stressful state, due to muscular
tension in the glottal of a speaker, changes may occur in airflow while producing
sound. The non-linear TEO was mathematically formulated by Kaiser as shown
in Equation (1), where Ψ represents the TEO and x(n) depicts the acoustic signal under consideration (a minimal computation sketch follows this list).

Ψ[x(n)] = x²(n) − x(n + 1) x(n − 1)    (1)
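As a brief illustration (not from the original paper), Equation (1) can be computed directly with NumPy; the function name and the synthetic test signal below are our own.

```python
import numpy as np

def teager_energy(x):
    # Discrete Teager Energy Operator for interior samples:
    # psi[x(n)] = x(n)^2 - x(n+1) * x(n-1)
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[2:] * x[:-2]

# Example on a synthetic tone; stressed speech would show larger TEO fluctuations.
signal = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 100))
print(teager_energy(signal)[:5])
```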

SER Classification: Emotion classification in the SER system is accomplished by a variety


of classifiers trained based on the master feature vector. The traditional Machine Learning
(ML) classifiers [4], specifically used in SER systems include K-Nearest Neighbours (k-
NN), Support Vector Machine (SVM), Hidden Markov Model, Gaussian Mixture Model
(GMM) and Artificial Neural Network (ANN). Additionally, instead of using an individual classifier, different classifiers have also been used in combination to improve the efficacy of the emotion detection process [22].
There are a few limitations of classic ML-based techniques when applied to speech emotion detection [23]. First, there is no definite ML-based algorithm capable of extracting discriminative and accurate features for multilingual speech databases; moreover, the frequency and time domain acoustic features alone are unable to train a model for multilingual SER. Second, features extracted manually by knowledgeable domain experts are domain dependent and cannot be reused in other, even similar, domains. Third, efficient modelling of machine learning classifiers necessitates a massive number of labelled handcrafted features to attain maximum performance for speech emotion detection; when constructing a database for a language such as Arabic, collecting a large amount of labelled data may require a soundproof, echo-free recording environment. Fourth, because of the generic nature of ML-based SER classifiers, the classification performance may drastically decrease for variant linguistic content, speakers, and acoustic settings.

In contrast to the longstanding ML techniques, Deep Learning (DL) methods reduce the manual effort needed to extract discriminative features from acoustic signals. Deep Learning comprises a multi-layered architecture of artificial neural networks that learns features in a hierarchical manner. The most widely used DL-based speech emotion detection classifiers are Deep Neural Networks (DNN), restricted Boltzmann machines (RBM), Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM) and Recurrent Neural Networks (RNN). Additionally, some enhanced DL methods such as Deep Autoencoders (AE), Multitask Learning, Attention Mechanisms, Transfer Learning and Adversarial Training have also been employed for SER. Most importantly, multiple DL classification methods can be employed collectively in different layers for model construction, which is robust and improves overall performance and flexibility. Thus, in this study, a novel transformer-based method was proposed to improve the recognition rate of emotion detection and reduce the model training time under limited computation resources. The performance of the proposed transformer is compared with baseline deep learning methods in terms of accuracy. In addition, this study also investigates the optimal set of features and data augmentation methods to reduce the overfitting problem caused by inadequate data.
The rest of this paper is structured as follows. Section 2 presents the literature review
on SER techniques. Section 3 describes the datasets employed to perform experiments,
features extraction, transformer model adopted for SER and evaluation measures. Section 4
presents the experimental results and the significance of the observed findings. The conclu-
sion and future research directions are provided in Section 5.

2. Literature Review
Numerous works have been done for emotion detection from speech signals using both machine and deep learning techniques. However, to a great extent, the performance and efficacy of a proposed model depend on the dataset used for model training. This section is divided into two parts: the first analyses a few pre-constructed speech datasets for SER along with their shortcomings, and the second reviews some of the latest research contributions toward the development of effective SER classification models using DL techniques.
An Urdu database for speech emotion recognition has been introduced for the training of ML models. The data collection comprises recorded audio for seven emotions with 734 Urdu and 701 Sindhi utterances [24]. Moreover, the authors performed baseline classification using the OpenSmile feature extraction toolkit, reported in terms of UAR. The results show that the regression models applied on ComParE perform better classification, achieving 56.03% accuracy for Urdu and Sindhi SER. The Urdu-Sindhi corpus encompasses only 10 scripted sentences for every type of emotion. The major limitation of such recordings is that a specific scripted sentence utterance may represent more than one emotion; moreover, in such cases the speaker is not representing a real-world situation. In another acted Urdu SER database, ref. [25] collected utterances of 25 words from 50 speakers (25 male and 25 female) based on three emotions. Nonetheless, employing such small-scale Urdu databases with DL models would be inappropriate and may not be useful for the development of real-time SER applications. Ref. [26] formulated the first custom URDU dataset for the Urdu language. In the URDU dataset, 400 audio records were collected from different Urdu television talk shows representing 4 basic emotions. Though the collected data contained unscripted spontaneous conversations of anchors with their guests, the representation of actual emotions in TV talk shows is quite unusual because the most dominant emotion in such conversations is neutral or aggressive. Since the training of deep learning models requires a huge amount of data, the small URDU dataset may lead to overfitting or underfitting problems.
In one study, ref. [27] used a multi-learning trick (MLT) based on a 1D enhanced CNN model for automatic extraction of local and global emotional features from acoustic signals. To improve the recognition rate, the proposed solution extracted discriminative features using a dynamic fusion framework. The proposed multi-learning model evaluated both short- and long-term relative dependencies over two benchmark SER databases, IEMOCAP and EMO-DB, with 73% and 90% accuracy rates, respectively. However, the method takes relatively more time to train and test on real-time speech signals compared to other models. Exploring two datasets, RECOLA (for regression) and IEMOCAP (for the classification task), ref. [28] detected emotions in speech using a novel end-to-end DNN algorithm. The authors found that SER performance in the simulation results was optimal when the proposed method was applied with RMS aggregation and context stacking. The proposed DiCCOSER-CS model improved the arousal CCC by 9.5% and the valence CCC by 12.7% in the regression task compared with CNN-LSTM.
Recurrent Neural Network models have also been implemented with slight variations for feature learning in SER systems. A similar study [29] illustrates a deep RNN that learns frame-level categorization as well as temporal aggregation over longer time intervals. The authors also presented a weighted time-pooling scheme to identify emotionally prominent fragments of the speech signals using a simple attention mechanism. The model achieved 5.7% higher weighted and 3.1% higher unweighted classification accuracy on the IEMOCAP dataset when compared with a conventional SVM-based SER model. Although the model introduced a novel technique for emotion recognition, the experiments were conducted using only one database.
For imbalanced SER, ref. [30] introduced an Attention-Integrated Convolutional RNN. The experimental results on EMO-DB and IEMOCAP demonstrate better performance for imbalanced speech emotions. In another related work, a self-attention mechanism [31] was added to Bidirectional Long Short-Term Memory (BLSTM) based classification. The proposed solution not only accomplished correlation of speech signals in the statements but also improved the diverse information using directional self-attention (DSA). The algorithm performed well on the IEMOCAP (comprising complete, scripted, and spontaneous subsets) and EMO-DB databases.
There are few works in the domain of Arabic spoken speech emotion recognition. Therefore, this study focused on the recognition of Arabic speech emotion. According to [32], the first study on emotion recognition in spoken Arabic data was proposed in 2018. The speech corpus was collected from Arabic TV shows, and the videos were labeled into three categories: happy, angry, or surprised. The study used 35 classification methods; the best classification performance was provided by the Sequential Minimal Optimization (SMO) classifier, which achieved 95.5% accuracy.
Another study on Arabic speech emotion was introduced in [33]. It provides a dataset called "Semi-natural Egyptian Arabic speech emotion (EYASE)" and focuses on Egyptian sentences. The study adopted SVM and KNN classifiers, and the highest accuracy achieved was 95% using SVM, obtained for male emotion recognition. Happiness was the most difficult emotion to detect, while anger was the easiest.
A study by [34] built a dataset based on real-world Arabic speech dialogs for detecting anger in Arabic conversations. The results revealed that acoustic sound features such as fundamental frequency, energy, and formants are more suitable for detecting anger. The experimental findings demonstrated that support vector machine classifiers can identify anger in real time at a detection rate of more than 77%.

3. Proposed Methodology
This section discusses the detailed research methodology adopted for emotion recognition from audio files. In the proposed research methodology, augmentation methods (white noise, time stretch) were applied to each file to enhance the training data. Secondly, a variety of features including MFCCs, Mel-spectrograms, chroma, tonnetz, and spectral contrast are extracted from each file and fed as input to train the transformer model for SER. Finally, the proposed method is evaluated on four datasets in terms of robustness, accuracy, and time. The proposed transformer model has been implemented using the Keras library, which helps to perform simulations easily. Moreover, the Librosa library is used to perform the experiments and evaluate the sound signals; this library helps to implement the data augmentation and feature extraction methods. All the experiments are performed on a laptop with a 2.5 GHz Dual-Core Intel® Core i5 processor, 12 GB memory and a 512 GB SSD drive. A detailed description of each subsection is presented below.

3.1. Datasets
As the initial process of the proposed methodology, four publicly available datasets
are collected. In this work, Basic Arabic Vocal-Emotions Dataset (BAVED) is selected to
evaluate the proposed method. This dataset is a collection of Arabic audio (WAV) recordings of seven words ("like", "dislike", "this", "film", "fabulous", "good", "bad"), labelled 0 to 6 respectively, used to assess the emotions. Furthermore, each of these words was recorded at three higher-level emotion classes, labelled 0, 1 and 2: level 0 denotes low emotion (tired or exhausted), level 1 denotes neutral emotion, and level 2 represents a high level of either positive or negative emotion, such as happiness or anger. The complete BAVED dataset contains 1935 recordings collected from 61 individuals (45 men and 16 women) on different occasions.
We also used the following datasets to evaluate the proposed model: EMO-DB, SAVEE,
and EMOVO. EMO-DB [35] is employed as one of the datasets for our experiments as it is widely used by researchers in the field of SER. This German dataset comprises 535 audio files of distinct duration belonging to 5 male and 5 female speakers. The files are classified into seven classes: disgust, anger, boredom, happiness, fear, neutral, and sadness. As voice quality depends on the sample rate, encoding method, bit rate, and file format, all files are in WAV format, mono-channel, with a 16 kHz sample rate and 16 bits per sample.
Surrey Audio-Visual Expressed Emotion (SAVEE) is another well-known multimodal dataset, which contains seven types of emotions recorded from 4 male actors. The dataset contains a total of 480 utterances created in British English. Ten evaluators examined the audio, video, and combined audio-visual conditions to assess the quality of the recordings. The emotion labels used in our study are neutral, happiness, sadness, anger, surprise, fear, and disgust. The standard TIMIT corpus script was used to record the statements.
The EMOVO dataset contains seven emotional states: disgust, fear, anger, joy, surprise, sadness, and neutral. EMOVO is the first dataset that contains Italian language utterances. It has recordings of 6 actors speaking the content of 14 sentences. All 588 utterances in the dataset were further annotated by two groups of 24 annotators each. The Fondazione Ugo Bordoni laboratories were used to record all utterances.

3.2. Data Augmentation

Data augmentation is a technique that is commonly used to prevent overfitting while applying machine learning based techniques for the classification of speech emotions. Common techniques such as the addition of Gaussian noise, stretching time, or shifting the pitch are applied to the training data. Besides reducing overfitting, data augmentation also serves other purposes such as better generalization, robust performance, and a more even data distribution with less variance. To enhance the performance of the proposed transformer model, this study used the above-mentioned data augmentation techniques. The Gaussian noise technique was used to increase the training data, with the amplitude of the noise denoted as σ. It is important to select the value of σ carefully, because a value that is too high or too low can affect the performance of the model: if it is too low it decreases the performance, while a higher value of σ increases the difficulty of model optimization. The pitch shift technique was used to generate new speech signals whose pitch was shifted by a series of up to n steps; pitch shifting does not affect the duration of the signal. Finally, the time stretching approach was used, which altered the tempo and pitch of the signal. These three data augmentation techniques increase the training data to improve the performance of the proposed model and avoid overfitting. The results of the three augmentation methods applied in our proposed model are shown in Figure 1.

Figure 1. Audio features extraction after data augmentation in the form of spectrograms: (a) MFCC, (b) Chromagram, (c) Spectral Contrast, (d) Mel spectrogram, (e) Tonnetz.
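A minimal sketch of the three augmentation methods using Librosa and NumPy is shown below; the parameter values (σ, pitch steps, stretch rate) and the file name are illustrative assumptions, not the exact settings used in this work.

```python
import numpy as np
import librosa

def augment(y, sr, sigma=0.005, n_steps=2, rate=0.9):
    """Return three augmented copies of the waveform y sampled at sr Hz."""
    # Additive Gaussian (white) noise; sigma is the noise amplitude discussed above
    noisy = y + sigma * np.random.randn(len(y))
    # Pitch shift by n_steps semitones; the signal duration is preserved
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    # Time stretch by the given rate, altering the tempo of the signal
    stretched = librosa.effects.time_stretch(y, rate=rate)
    return noisy, shifted, stretched

# Usage (hypothetical file): y, sr = librosa.load("speech.wav", sr=16000)
# noisy, shifted, stretched = augment(y, sr)
```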

3.3. Feature Extraction

For speech emotion classification, feature extraction can help to diminish computational errors, computational time, and model complexity. However, it is necessary to extract acoustic features that provide valid information about the emotion. For the proposed model, Mel-spectrogram, Chromagram, MFCCs, delta MFCCs, delta-delta MFCCs, Tonnetz representation and Spectral Contrast features were extracted from the BAVED, EMO-DB, SAVEE and EMOVO datasets. The following list provides some details about the extracted features.
• Mel-spectrogram features: to extract these features, audio signals are broken down into frames and the Fast Fourier transform is applied to each frame. The frequency spectrum is separated into equally spaced frequency bands to obtain the Mel-scale of the signal frame.
• Chromagram features: 12 distinct pitch classes are used to extract Chromagram features using the STFT and a binning method. These features help to differentiate harmony and pitch classes.
• MFCCs, delta MFCCs, delta-delta MFCCs based features: MFCC features represent the short-term power spectrum, while the Mel-frequency cepstrum has equally spaced frequency bands and provides insights for better classification by modelling the capacity of the human ear to perceive the audio signal. There are 40 coefficients each of MFCCs, delta MFCCs and delta-delta MFCCs in the three extracted Mel-cepstrum feature sets. All audio signals were divided into equal-length frames to extract the MFCC features. These frames were revised to remove the silence portions at both the start and the end of the frame by applying windowing operations. Furthermore, the time domain signal is converted to the frequency domain using the Fast Fourier Transform (FFT), and a Mel-scale filter bank is applied to the FFT-computed frequencies. The equation of the Mel-scale filter bank is

Mel(f) = 2595 × log10 (1 + f/700)    (2)

After the computation of the frequency values, power logs are further computed for each Mel frequency. At the end, the Discrete Cosine Transform (DCT) is used to transform all log-Mel spectrums into the time domain, and the extracted amplitudes are known as MFCCs.
• Tonnetz based features: the relation between the fall and rise of the speech signal in a harmonic network is described by a six-dimensional pitch space. The tonal features of audio frames play an important role in distinguishing environmental sounds.
• Spectral contrast-based features: the root mean square difference between the spectral peak and valley of the signal frames is computed to obtain the spectral contrast features.
The total number of features extracted for this study is 273, which includes 128 + 12 + 40 + 40 + 40 + 6 + 7 features of Mel-spectrogram, chromagram, MFCC, delta-MFCC, delta-delta MFCC, tonnetz, and spectral contrast respectively. These extracted features are further used for the training of the proposed model.
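The 273-dimensional feature fusion described above can be sketched with Librosa as follows. This is an assumption-laden illustration: the frame parameters are Librosa defaults, and the per-frame features are averaged over time here, whereas the proposed model consumes them as a sequence (see Table 1).

```python
import numpy as np
import librosa

def extract_features(path):
    """Build the 273-dimensional fused feature vector
    (128 Mel + 12 chroma + 40 MFCC + 40 delta + 40 delta-delta + 6 tonnetz + 7 spectral contrast),
    averaging each feature over the time frames."""
    y, sr = librosa.load(path, sr=None)
    stft = np.abs(librosa.stft(y))

    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)            # (128, T)
    chroma = librosa.feature.chroma_stft(S=stft, sr=sr)                      # (12, T)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)                       # (40, T)
    d_mfcc = librosa.feature.delta(mfcc)                                     # (40, T)
    dd_mfcc = librosa.feature.delta(mfcc, order=2)                           # (40, T)
    tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)  # (6, T)
    contrast = librosa.feature.spectral_contrast(S=stft, sr=sr)              # (7, T)

    feats = [mel, chroma, mfcc, d_mfcc, dd_mfcc, tonnetz, contrast]
    return np.hstack([f.mean(axis=1) for f in feats])                        # shape (273,)
```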

3.4. Evaluation Metrics


Different evaluation metrics, namely accuracy, precision, recall, and F1-score, were applied to evaluate each SER model. For each prediction, a confusion matrix was used to detect true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
Accuracy: this metric is used to calculate the frequency of sound classes that can be accurately determined from the entire speech signal. The following equation is used to calculate the accuracy of the results:

Accuracy = (1/N) Σ_{i=1}^{N} (TP + TN)_i / (TP + TN + FP + FN)_i    (3)

Recall: the number of positive instances that are accurately detected by the proposed model is checked by recall using the following equation:

Recall = (1/N) Σ_{i=1}^{N} (TP)_i / (TP + FN)_i    (4)

Precision: the proportion of detected utterances that are actually correct is checked by precision using the following equation:

Precision = (1/N) Σ_{i=1}^{N} (TP)_i / (TP + FP)_i    (5)

The evaluation metrics used to evaluate the proposed transformer model are widely used to measure the performance of models used for detection, classification, and prediction systems.
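A small sketch of Equations (3)–(5), computed from a multi-class confusion matrix with NumPy, is given below; the macro-averaged F1 shown is our assumption of how the F1-score was aggregated.

```python
import numpy as np

def per_class_metrics(cm):
    """Accuracy, precision, recall (Equations (3)-(5)) and F1 from an
    N x N confusion matrix cm (rows = true labels, columns = predictions)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn

    accuracy = np.mean((tp + tn) / (tp + tn + fp + fn))
    recall = np.mean(tp / (tp + fn))        # assumes every class occurs at least once
    precision = np.mean(tp / (tp + fp))     # assumes every class is predicted at least once
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```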

3.5. Transformer Model


Initially, the transformer model was developed for machine translation. However, it later became popular in Natural Language Processing (NLP) and replaced recurrent neural networks (RNN) owing to its efficient performance [36,37]. The transformer model uses a different internal attention mechanism, also known as the self-attention mechanism, which eliminates the recurrence of any type of processing. It uses a linear transformation to generate significant features for a given statement or utterance.
Exploring its structure, it consists of a block having sub-blocks of encoders and decoders. Both sub-blocks perform different activities: the encoder transforms the audio signals into a sequence of intermediate representations, and these representations are then passed to the decoder block to produce the desired output. The encoder encodes the feature vectors into X = (x1, . . . , xT) and the decoder decodes the encoded representations into W = (w1, . . . , wM). Because of the autoregressive nature of the transformer model, each stage takes as input the output of the previous stage, and each block has multiple self-attention layers that are connected with each other. The encoder and decoder blocks work individually.
Traditional speech recognition models used attention mechanisms built from simple encoder and decoder blocks. First, the sound signals given to the encoder are encoded into alternative representations, which are then decoded by the decoder, which predicts the output labels conditioned on the sequence provided by the encoder. The attention mechanism is used to predict the output and recognize speech by highlighting the significant parts of the utterance. Compared to these traditional models, the transformer model has multiple encoders and decoders, each block has an internal self-attention mechanism, and these attention layers are interconnected.
According to the literature, the encoder of a transformer model contains six coder layers placed in a top-to-bottom sequence. All coders have the same structure, although the number of coders is not fixed and can vary between models. The structure of the coders is similar within a block, but the weights can differ. Usually, the
encoder of the transformer model takes extracted features as input that were extracted
by frequency coefficients or convolutional neural networks. Each encoder in a sequence
performs a similar task and transforms extracted features into an alternative signal using
self-attention and feed-forward transformed vectors to the next encoder using a simple
Artificial neural network. The transformed vectors are transmitted to the decoder block
by the last encoder.
The decoder block has the same numbers of decoders as in the encoder block
arranged in a sequence. Each decoder in the block has a similar two-layered structure
to encoders where the first layer takes input and transforms features into vectors and
the second layer feeds these encoded vectors to the subsidiary encoder. However, the
difference in the decoder layer is that there is an additional attention layer that helps
the decoder to focus on a significant part of given frames of encoded feature vectors.
Here the self-attention layer considers the prior data to decode the feature vectors and
predict the incoming sentences. Then the second layer outputs the posterior probabilities
of decoded words or characters and the same procedure is applied for each word of
the statement.
The self-attention mechanism makes the transformer model successful in speech recognition; however, this mechanism can be further extended to improve the performance of the model by expanding it into a multi-head self-attention mechanism. The input sequence is split into different chunks that are projected into multiple dimensions and subspaces, each chunk of input features is passed through an individual and independent attention mechanism, and the resulting encodings of the chunks are concatenated before the output projection. This multi-head self-attention mechanism can be described as follows:

MHA(Q, K, V) = [e1, · · ·, eH] WO, where ei = Attention(Q WiQ, K WiK, V WiV)

In the above equation, H denotes the total number of heads, din is the dimension of the input sequence, dk = din/H, and ei is the encoding generated by head i, with WO ∈ R^(din×din), WiQ ∈ R^(din×dk), WiK ∈ R^(din×dk) and WiV ∈ R^(din×dv).
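Since the model was implemented in Keras, a minimal sketch of a transformer-encoder classifier over 273-dimensional feature frames is given below. The number of blocks, heads, key dimension, feed-forward width and the global-average pooling are our assumptions; Table 1 only fixes the 273-dimensional input, an encoder of roughly 2.66 M parameters, and the 7-way dense output.

```python
import tensorflow as tf
from tensorflow.keras import layers

def transformer_encoder_block(x, num_heads=4, key_dim=64, ff_dim=512, dropout=0.1):
    # Multi-head self-attention with residual connection and layer normalization
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    x = layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(dropout)(attn))
    # Position-wise feed-forward network, also with a residual connection
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(x.shape[-1])(ff)
    return layers.LayerNormalization(epsilon=1e-6)(x + layers.Dropout(dropout)(ff))

def build_ser_transformer(num_features=273, num_classes=7, num_blocks=2):
    inputs = layers.Input(shape=(None, num_features))   # (time frames, 273 features)
    x = inputs
    for _ in range(num_blocks):
        x = transformer_encoder_block(x)
    x = layers.GlobalAveragePooling1D()(x)               # pool over time frames
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_ser_transformer()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```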
Transformer models provide higher dimensional representations because of the multi-
head self-attention mechanism from multiple subspaces. The structure of the proposed
transformer model is presented in Table 1.

Table 1. Structure of proposed model.

Layer (Type)                        Output Shape           Param #      Connected to
inputs (InputLayer)                 [(None, None, 273)]    0            []
tf.math.reduce_sum_6 (TFOpLambda)   (None, None)           0            ['inputs[0][0]']
tf.cast_6 (TFOpLambda)              (None, None)           0            ['tf.math.reduce_sum_6[0][0]']
enc_padding_mask (Lambda)           (None, 1, 1, None)     0            ['tf.cast_6[0][0]']
encoder (Functional)                (None, None, 273)      2,656,905    ['inputs[0][0]', 'enc_padding_mask[0][0]']
tf.reshape_6 (TFOpLambda)           (None, 273)            0            ['encoder[0][0]']
outputs (Dense)                     (None, 7)              1918         ['tf.reshape_6[0][0]']

4. Experiments

The four previously described datasets were used to run speaker-independent experiments to evaluate the proposed SER. A percentage-split technique was applied that split the data into two sets: the first set, including 80% of the data, was used to train the SER models, and the remaining 20% of the data was used to perform testing. Prior studies concluded that it is better to train the model using 70–80% of the dataset, and the remaining 20–30% should be used for testing purposes. This dataset splitting technique can help achieve improved results and prevents the SER models from overestimating and overlapping results. Figure 2 presents the performance of the proposed transformer model applied to the datasets mentioned above, using a feature set of 273 to train the model. The figure provides information about the accuracy of the proposed model applied to the selected datasets.

Figure 2. The training and validation accuracies along with losses of SER models for the (a) BAVED, (b) EMO-DB, (c) SAVEE, and (d) EMOVO.
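As a sketch of the percentage-split protocol described above (the stratification, random seed, epochs, batch size and file names are our assumptions; the paper only specifies an 80/20 split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# X: extracted feature vectors (e.g., shape (num_files, 273)), y: integer emotion labels.
# "features.npy" and "labels.npy" are hypothetical files produced by the feature step.
X = np.load("features.npy")
y = np.load("labels.npy")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# For the transformer sketch in Section 3.5, add a time axis first:
# model.fit(X_train[:, None, :], y_train,
#           validation_data=(X_test[:, None, :], y_test),
#           epochs=50, batch_size=32)
```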

4.1. Models Prediction Performance

The proposed model uses four datasets in this study. Tables 2–5 present the evaluation and classification results of the BAVED, EMO-DB, SAVEE, and EMOVO datasets, respectively. The metrics used for evaluation are precision, recall, F1-measure, and accuracy for each emotion type. The tabulated results describe the efficiency of the proposed model compared with baseline techniques for SER [38–42].

Table 2. Performance of the proposed model for the BAVED dataset.

Emotion            Accuracy (%)   Precision (%)   Recall (%)   F1-Score (%)
Tired/Exhausted    94             93              94           94
Neutral            93             93              93           93
Happy/Sad/Angry    96             98              96           97
Average            95             95              95           95
Table 3. Performance of the proposed model for the EMO-DB dataset.

Emotion    Accuracy (%)   Precision (%)   Recall (%)   F1-Score (%)
Anger      95             93              95           94
Boredom    94             94              92           93
Disgust    100            92              100          96
Fear       94             87              94           90
Happy      78             91              78           84
Neutral    98             94              98           96
Sad        97             100             97           99
Average    93             93              93           93

Table 4. Performance of the proposed model for the SAVEE dataset.

Emotion Accuracy (%) Precision (%) Recall (%) F1-Score (%)


Anger 79 97 79 87
Disgust 81 89 81 85
Fear 79 78 79 78
Happy 77 70 77 73
Neutral 93 90 93 91
Sad 96 87 96 91
Surprise 83 85 83 84
Average 84 85 84 84

Table 5. Performance of the proposed model for the EMOVO dataset.

Emotion Accuracy (%) Precision (%) Recall (%) F1-Score (%)


Anger 97 90 97 94
Disgust 83 79 83 81
Fear 96 100 96 98
Joy 93 97 93 95
Neutral 87 100 87 93
Sad 85 84 85 84
Surprise 99 92 99 95
Average 91 92 91 91

The highest predicted accuracy is shown in the confusion matrix (Figure 3a) of the
BAVED dataset. It achieved 96% for anger/sad/happy emotion, 94% for tired/exhausted
emotion, and 93% for neutral. The overall predicted accuracy of the proposed model
also outperformed the existing SER models applied on BAVED [43]. The proposed model
achieved 93% accuracy for EMO-DB and outperforms the baseline EMO-DB models. The
proposed model predicts better accuracy because of discriminative features for all types of
emotions. Figure 3b describes the accuracy and confusion of inter-emotions for the SER
model. The diagonal values show the predicted values in the confusion matrix, while the
values shown in equivalent rows represent the inter-emotion confusion.
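To make the reading of Figure 3 concrete, the per-emotion accuracies on the diagonal can be obtained from a confusion matrix as in the toy sketch below (the labels and predictions are invented, not the paper's results).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

emotions = ["anger", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
y_true = np.array([0, 0, 1, 2, 3, 4, 5, 6, 6, 2])   # toy ground-truth labels
y_pred = np.array([0, 3, 1, 2, 3, 4, 5, 6, 3, 2])   # toy model predictions

cm = confusion_matrix(y_true, y_pred, labels=range(len(emotions)))
# Diagonal entries count correctly recognized emotions; the other entries in each
# row show how often that true emotion is confused with another one.
per_emotion_accuracy = cm.diagonal() / cm.sum(axis=1)
for name, acc in zip(emotions, per_emotion_accuracy):
    print(f"{name}: {acc:.0%}")
```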
Figure 3. Confusion matrixes of the BAVED, EMO-DB, SAVEE, and EMOVO. (a) The BAVED; (b) the EMO-DB; (c) the SAVEE; (d) the EMOVO. The y-axes illustrate the true/actual labels, and the x-axis illustrates the predicted labels.

Figure 3c shows the SAVEE dataset confusion matrix that represents the accuracies for all emotions. It presents 96% accuracy for the sad emotion, 93% for neutral, 83% for surprise, 81% for disgust, and 79% for the anger emotion. However, the fear emotion achieved the lowest accuracy. The possible reason behind this misclassification might be that the frequency content of this emotion's signals is very similar to other emotions, and hence it is challenging for the proposed model to recognize the emotion accurately. Finally, Figure 3d shows the confusion matrix of the EMOVO dataset. The average accuracy of the EMOVO SER model in emotion prediction is 91%. However, the individual emotions also achieved high accuracy: the surprise emotion achieved 99% accuracy, 97% for the anger emotion, and 96% for the fear emotion. For the F1 score of BAVED, the neutral emotion achieved the lowest score of 93%, while the anger/sad/happy emotion achieved a 96% F1 score, and the tired/exhausted emotion achieved 94%. The proposed SER model was low in memory size and computational time, which made it possible for the transformer model to improve the prediction performance. Another factor that makes the proposed SER model robust is the usage of discriminative emotion features and a plain transformer model design.

4.2. Comparison with Baseline Techniques

Four publicly available datasets, BAVED, EMO-DB, SAVEE and EMOVO, are used
to train the proposed transformer model for an effective SER system. The results demon-
strated the effectiveness of the proposed transformer model in predicting emotions from
audio signals. To achieve higher accuracy for SER models the proposed model was
trained with highly effective features. To avoid overfitting the raw data was augmented
with a data augmentation technique and then fed to the transformer model for training.
Many techniques applied for SER are based on deep learning and machine learning algorithms that are trained on high-level features, yet they are not competent enough to achieve higher accuracy while also reducing computational time and model size [44]. As a way to address these problems, a comparatively lighter transformer model was proposed that takes 273 acoustic features as input and can learn higher-level representations of the input. The proposed model achieves higher accuracy for all datasets used to train it, while its lighter architecture also reduces computational time. To prove that the proposed transformer model provides robust results in terms of accuracy, computational cost, and model size, the results were compared with standard techniques used for speech emotion detection.
The performance of the proposed model was compared to the baseline techniques
using the same datasets for training and testing the model to evaluate its significance
and strength. Table 6 comprehensively summarizes the comparative analysis of baseline
techniques and the proposed model. The results in the table evidence the robustness and
significance of the proposed SER model compared to baseline methods. However, the
proposed model loses its performance a bit for some instances of individual emotions.
For example, the proposed model obtained 98% accuracy for neutral emotion, but
in [38] it is 100% in the EMO-DB SER model. On the other hand, the proposed model achieved the highest overall prediction accuracy compared to the baseline methods. The proposed model is suitable for real-life applications because
of its lightweight structure, low computational cost, and ability to predict accuracy for
individual instances. To conclude, it can be stated that the proposed SER model is more
reliable, generic, and accurate.

Table 6. Comparison of the proposed model and the baseline models using the BAVED, EMO-DB, SAVEE, and EMOVO datasets. Columns A–U report the accuracy (%) of each emotion; Avg is the average accuracy.

Study | Features | Model | Dataset | A | B | C | D | F | H | N | S | U | Avg
[38] | Melspectrogram + MFCC + Chroma + tonnetz + SC | CNN | EMO-DB | 100 | 61 | × | 67 | 67 | 100 | 100 | 87 | × | 86
[39] | Features extracted using AlexNet | DCNN | EMO-DB | 92 | 88 | × | 99 | 92 | 92 | 90 | 93 | × | 90
[39] | Features extracted using AlexNet | DCNN | SAVEE | 90 | × | × | 48 | 50 | 47 | 82 | 58 | 53 | 67
[40] | ZCR + MFCC + F0 + Energy + HNR | Fuzzy RF | EMO-DB | 77 | 71 | × | 92 | 57 | 82 | 69 | 81 | × | 81
[41] | MFCC, MFM, LPCC, LFPC | SVM | EMOVO | 79 | × | × | 74 | 65 | 68 | 76 | 85 | 67 | 73
[42] | Features extracted using CNN | CNN | EMO-DB | 91 | 90 | × | 87 | 92 | 66 | 85 | 88 | × | 85
Our model | SC + MFCC + ∆MFCC + ∆∆MFCC + Melspectrogram + tonnetz + chroma | Transformer | BAVED | 96 | × | × | 94 | × | 96 | 93 | 96 | × | 95
Our model | (as above) | Transformer | EMO-DB | 95 | 94 | × | 100 | 94 | 78 | 98 | 97 | × | 93
Our model | (as above) | Transformer | SAVEE | 79 | × | × | 81 | 79 | 77 | 93 | 96 | 83 | 84
Our model | (as above) | Transformer | EMOVO | 97 | × | × | 83 | 96 | 93 | 87 | 85 | 99 | 91

A = Anger, B = Boredom, C = Calm, D = Disgust, F = Fear, H = Happy, N = Neutral, S = Sad, U = Surprise, RF = Random Forest.

5. Conclusions
Significant feature extraction and accurate classification are the main reasons that make
SER more challenging. This study presented a novel transformer model based on the fusion of 273 acoustic features derived from seven feature types. To address the issue of computational cost, the proposed model
employed a smaller number of layers. This study concentrated on Arabic vocal emotions
since they got less attention in the research community. We used the transformer model on
the “Basic Arabic Vocal Emotions” Dataset named BAVED, which resulted in a 95% accuracy.
We further tested the transformer model on three more datasets: EMO-DB, SAVEE, and
EMOVO, achieving accuracy of 93.4%, 85.1%, and 91.7%, respectively. The cross comparison
of results with baseline SER techniques showed the robustness and significance of the proposed transformer model. However, the accuracy levels could be further improved by enhancing the proposed model. The results also show that the proposed transformer model is particularly
well suited to Arabic vocal emotions recognition. The context or aspect of the emotion will
be extracted in future work. Furthermore, implementing recurrent neural networks while
using optimal speech features with the proposed transformer can increase the performance
of the SER system since it can offer high-level features more accurately. A comparative
analysis of SER models based on deep learning techniques applied to other databases is
also planned for future work. Moreover, the accuracy of gender-based speech emotion
recognition will be compared, and the emotion that is most easily recognized should
be identified.

Author Contributions: Conceptualization, E.H.A., B.B.A.-o. and M.A.N.; methodology, E.H.A. and
R.J.; software, R.J.; validation, B.B.A.-o., M.A.N., E.H.A. and M.M.M.; formal analysis, E.H.A., R.J. and
A.M.E.; investigation, R.J.; resources, E.H.A., A.M.E. and A.M.E.; data curation, R.J.; writing—original
draft preparation, R.J. and A.M.E.; writing—review and editing, E.H.A. and A.M.E.; visualization,
A.M.E.; supervision, E.H.A.; project administration, E.H.A. and B.B.A.-o.; funding acquisition, E.H.A.
and B.B.A.-o. All authors have read and agreed to the published version of the manuscript.
Funding: This work is supported by Princess Nourah bint Abdulrahman University Researchers
Supporting Project number (PNURSP2022R263), Princess Nourah bint Abdulrahman University,
Riyadh, Saudi Arabia.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Acknowledgments: The authors would like to acknowledge Princess Nourah bint Abdulrahman
University Researchers Supporting Project number (PNURSP2022R263), Princess Nourah bint Abdul-
rahman University, Riyadh, Saudi Arabia.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Mahlberg, M. Lexical cohesion: Corpus linguistic theory and its application in English language teaching. Int. J. Corpus Linguist.
2006, 11, 363–383. [CrossRef]
2. Stenström, A.B.; Aijmer, K. Discourse patterns in Spoken and Written Corpora; Pragmatics and Beyond New Series; John Benjamins
Publishing Company: Amsterdam, The Netherlands, 2004; pp. 1–14.
3. Huijuan, Z.; Ning, Y.; Ruchuan, W. Coarse-to-Fine Speech Emotion Recognition Based on Multi-Task Learning. J. Signal Process.
Syst. 2020, 93, 299–308. [CrossRef]
4. Koduru, A.; Valiveti, H.B.; Budati, A.K. Feature extraction algorithms to improve the speech emotion recognition rate. Int. J.
Speech Technol. 2020, 23, 45–55. [CrossRef]
5. Taylor, P. Text-to-Speech Synthesis; Cambridge University Press: Cambridge, UK, 2009.
6. Cavallo, F.; Semeraro, F.; Fiorini, L.; Magyar, G.; Sinčák, P.; Dario, P. Emotion Modelling for Social Robotics Applications: A
Review. J. Bionic Eng. 2018, 15, 185–203. [CrossRef]
7. De Naurois, C.J.; Bourdin, C.; Stratulat, A.; Diaz, E.; Vercher, J.-L. Detection and prediction of driver drowsiness using artificial
neural network models. Accid. Anal. Prev. 2019, 126, 95–104. [CrossRef] [PubMed]

8. Katsis, C.D.; Rigas, G.; Goletsis, Y.; Fotiadis, D.I. Emotion Recognition in Car Industry. In Emotion Recognition: A Pattern Analysis
Approach; Konar, A., Chakraborty, A., Eds.; Wiley Online Library: Hoboken, NJ, USA, 2015; pp. 515–544. [CrossRef]
9. Jahangir, R.; Teh, Y.W.; Mujtaba, G.; Alroobaea, R.; Shaikh, Z.H.; Ali, I. Convolutional neural network-based cross-corpus speech
emotion recognition with data augmentation and features fusion. Mach. Vis. Appl. 2022, 33, 1–16. [CrossRef]
10. Jiang, J.; Wang, H.H. Application intelligent search and recommendation system based on speech recognition technology. Int. J.
Speech Technol. 2021, 24, 23–30. [CrossRef]
11. Brown, C.; Chauhan, J.; Grammenos, A.; Han, J.; Hasthanasombat, A.; Spathis, D.; Xia, T.; Cicuta, P.; Mascolo, C. Exploring auto-
matic diagnosis of covid-19 from crowdsourced respiratory sound data. In Proceedings of the 26th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 6–10 July 2020.
12. El Hammoumi, O.; Benmarrakchi, F.; Ouherrou, N.; El Kafi, J.; El Hore, A. Emotion Recognition in E-learning Systems. In
Proceedings of the 2018 6th International Conference on Multimedia Computing and Systems (ICMCS), Rabat, Morocco, 10–12
May 2018.
13. Alim, S.A.; Rashid, N.K.A. Some Commonly Used Speech Feature Extraction Algorithms; IntechOpen: London, UK, 2018. [CrossRef]
14. Borrelli, C.; Bestagini, P.; Antonacci, F.; Sarti, A.; Tubaro, S. Synthetic speech detection through short-term and long-term
prediction traces. EURASIP J. Inf. Secur. 2021, 2021, 1–14. [CrossRef]
15. Chougule, S.V. Analyzing Vocal Tract Parameters of Speech. In Advances in Signal and Data Processing; Springer: Berlin/Heidelberg,
Germany, 2021; pp. 369–376. [CrossRef]
16. Glittas, A.X.; Gopalakrishnan, L. A low latency modular-level deeply integrated MFCC feature extraction architecture for speech
recognition. Integration 2020, 76, 69–75. [CrossRef]
17. Jahangir, R.; Teh, Y.W.; Nweke, H.F.; Mujtaba, G.; Al-Garadi, M.A.; Ali, I. Speaker identification through artificial intelligence
techniques: A comprehensive review and research challenges. Expert Syst. Appl. 2021, 171, 114591. [CrossRef]
18. Akçay, M.B.; Oğuz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting
modalities, and classifiers. Speech Commun. 2019, 116, 56–76. [CrossRef]
19. Wani, T.M.; Gunawan, T.S.; Qadri, S.A.A.; Kartiwi, M.; Ambikairajah, E. A Comprehensive Review of Speech Emotion Recognition
Systems. IEEE Access 2021, 9, 47795–47814. [CrossRef]
20. Teixeira, J.P.; Oliveira, C.; Lopes, C.J.P.T. Vocal acoustic analysis–jitter, shimmer and hnr parameters. Procedia Technol. 2013, 9,
1112–1122. [CrossRef]
21. Teager, H.; Teager, S. Evidence for nonlinear sound production mechanisms in the vocal tract. In Speech Production and Speech
Modelling; Springer: Berlin/Heidelberg, Germany, 1990; pp. 241–261.
22. Bharti, D.; Kukana, P. A Hybrid Machine Learning Model for Emotion Recognition From Speech Signals. In Proceedings of the
2020 International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 10–12 September 2020.
23. Dargan, S.; Kumar, M.; Ayyagari, M.R.; Kumar, G. A Survey of Deep Learning and Its Applications: A New Paradigm to Machine
Learning. Arch. Comput. Methods Eng. 2019, 27, 1071–1092. [CrossRef]
24. Syed, Z.S.; Ali, S.; Shehram, M.; Shah, M.S. Introducing the Urdu-Sindhi Speech Emotion Corpus: A Novel Dataset of Speech
Recordings for Emotion Recognition for Two Low-Resource Languages. Int. J. Adv. Comput. Sci. Appl. 2020, 11. [CrossRef]
25. Peerzade, G.; Deshmukh, R.R.; Waghmare, S.D.; Jans, P.V. Design and Development of Emotion Recognition System for Urdu
Language. Int. J. Sci. Res. Comput. Sci. Appl. Manag. Stud. 2018, 7.
26. Latif, S.; Qayyum, A.; Usman, M.; Qadir, J. Cross lingual speech emotion recognition: Urdu vs. western languages. In Proceedings
of the 2018 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 17–19 December 2018.
27. Kwon, S. MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach. Expert Syst.
Appl. 2021, 167, 114177.
28. Tang, D.; Kuppens, P.; Geurts, L.; van Waterschoot, T. End-to-end speech emotion recognition using a novel context-stacking
dilated convolution neural network. EURASIP J. Audio Speech Music Process. 2021, 2021, 1–16. [CrossRef]
29. Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention.
In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans,
LA, USA, 5–9 March 2017; pp. 2227–2231.
30. Ai, X.; Sheng, V.S.; Fang, W.; Ling, C.X.; Li, C. Ensemble Learning With Attention-Integrated Convolutional Recurrent Neural
Network for Imbalanced Speech Emotion Recognition. IEEE Access 2020, 8, 199909–199919. [CrossRef]
31. Li, D.; Liu, J.; Yang, Z.; Sun, L.; Wang, Z. Speech emotion recognition using recurrent neural networks with directional self-
attention. Expert Syst. Appl. 2021, 173, 114683. [CrossRef]
32. Klaylat, S.; Osman, Z.; Hamandi, L.; Zantout, R. Emotion recognition in Arabic speech. Analog. Integr. Circuits Signal Processing
2018, 96, 337–351. [CrossRef]
33. Abdel-Hamid, L. Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features. Speech Commun.
2020, 122, 19–30. [CrossRef]
34. Khalil, A.; Al-Khatib, W.; El-Alfy, E.S.; Cheded, L. Anger detection in arabic speech dialogs. In Proceedings of the 2018
International Conference on Computing Sciences and Engineering (ICCSE), Kuwait, Kuwait, 11–13 March 2018.
35. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the
Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005.

36. Liu, M.; Ren, S.; Ma, S.; Jiao, J.; Chen, Y.; Wang, Z.; Song, W. Gated transformer networks for multivariate time series classification.
arXiv 2021, arXiv:2103.14438.
37. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Adv. Neural Inf. Processing Syst. 2015, 28.
38. Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process.
Control 2020, 59, 101894. [CrossRef]
39. Farooq, M.; Hussain, F.; Baloch, N.K.; Raja, F.R.; Yu, H.; Bin Zikria, Y. Impact of Feature Selection Algorithm on Speech Emotion
Recognition Using Deep Convolutional Neural Network. Sensors 2020, 20, 6008. [CrossRef]
40. Chen, L.; Su, W.; Feng, Y.; Wu, M.; She, J.; Hirota, K. Two-layer fuzzy multiple random forest for speech emotion recognition in
human-robot interaction. Inf. Sci. 2019, 509, 150–163. [CrossRef]
41. Ancilin, J.; Milton, A. Improved speech emotion recognition with Mel frequency magnitude coefficient. Appl. Acoust. 2021,
179, 108046. [CrossRef]
42. Sajjad, M.; Kwon, S. Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE
Access 2020, 8, 79861–79875.
43. Mohamed, O.; Aly, S.A. Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset.
arXiv 2021, arXiv:2110.04425.
44. Mustaqeem; Kwon, S. Att-Net: Enhanced emotion recognition system using lightweight self-attention module. Appl. Soft Comput.
2021, 102, 107101. [CrossRef]
