
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3376379

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.202x.0xxx000

XEmoAccent: Embracing Diversity in Cross-Accent Emotion Recognition using Deep Learning
RAHEEL AHMAD1, ARSHAD IQBAL1 (Member, IEEE), MUHAMMAD MOHSIN JADOON1 (Member, IEEE), NAVEED AHMAD2 (Senior Member, IEEE), AND YASIR JAVED2 (Senior Member, IEEE)
1 Sino-Pak Center for Artificial Intelligence (SPCAI), Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology (PAF-IAST), Mang, Haripur 22620, Pakistan
2 Department of Computer Science, Prince Sultan University, Riyadh, Saudi Arabia
Corresponding author: Arshad Iqbal (e-mail: arshad.iqbal@spcai.paf-iast.edu.pk).
The authors would like to thank Prince Sultan University for paying the Article Processing Charges (APC) of this publication. They would also like to thank Prince Sultan University for their support.

ABSTRACT Speech is a powerful means of expressing thoughts, emotions, and perspectives. However, accurately determining the emotions conveyed through speech remains a challenging task. Existing manual methods for analyzing speech to recognize emotions are prone to errors, limiting our understanding of and response to individuals' emotional states. To address diverse accents, an automated system capable of real-time emotion prediction from human speech is needed. This paper introduces a speech emotion recognition (SER) system that leverages supervised learning techniques to tackle cross-accent diversity. Distinctively, the system extracts a comprehensive set of nine speech features (Zero Crossing Rate, Mel Spectrum, Pitch, Root Mean Square values, Mel-Frequency Cepstral Coefficients, chroma-STFT, and three spectral features: Centroid, Contrast, and Roll-off) for refined speech signal processing and recognition. Seven machine learning models are employed (Random Forest, Logistic Regression, Decision Tree, Support Vector Machines, Gaussian Naive Bayes, K-Nearest Neighbors, and an ensemble voting classifier), along with four individual and hybrid deep learning models based on Long Short-Term Memory (LSTM) and 1-Dimensional Convolutional Neural Network (1D-CNN) architectures with stratified cross-validation. Audio samples from diverse English-speaking regions are combined to train the models. The performance evaluation results indicate that, among the conventional machine learning models, the Random Forest-based feature selection model achieves the highest accuracy of up to 76%, while the 1D-CNN model with stratified cross-validation reaches up to 99% accuracy. The proposed framework improves upon prior cross-accent emotion recognition accuracies of 86.3%, 89.87%, 90.27%, and 84.96% by margins of 14.71%, 10.15%, 9.6%, and 16.52%, respectively.

INDEX TERMS Machine learning, Deep learning, Speech Emotion Recognition (SER), Random Forest
(RF), Logistic Regression (LR), Decision Tree (DT), Support Vector Machines (SVM), K-Nearest Neighbors
(KNN), 1-Dimensional Convolutional Neural Network (1D-CNN)

I. INTRODUCTION
EMOTIONS are crucial in influencing our conversations and decisions in our daily lives. There is enormous potential for improving Human-Computer Interface (HCI) systems [1], which depend on the capacity to recognize and understand emotions effectively. Speech is a powerful expression that can reveal much about a person's emotional condition [2]. As a result, automatic emotion detection and classification from voice signals emerges as a promising technology, capitalizing on developments in speech signal processing, deep learning, and language processing [1]. The primary goal of speech emotion recognition (SER) is to recognize and interpret the emotional information conveyed through speech accurately [3]. These systems endeavor to fill the interaction gap between humans and technology by recording and analyzing linguistic patterns, prosodic cues, and acoustic characteristics of voice signals. A wide range of applications can stem from such systems [4]. HCI systems can acquire these qualities when computers are trained to recognize and interact with human feelings. For example, consider a social robot that modifies its actions in response to the mood of the person it is talking with, or a virtual assistant that can pick up on a user's irritation and offer sympathetic advice [5].
HCI through emotions conveyed via speech may lead to better user experiences, more efficient workflows, and deeper connections between humans and machines. In addition, there is much hope for the future of speech-based emotion recognition in mental health. Insights into a person's mental health can be acquired by observing how their emotional state deviates from normal [6]. By studying hidden speech patterns and retrieving feelings, researchers and doctors can obtain reliable data to aid in identifying, treating, and tracking emotional illnesses such as depression, anxiety, and bipolar disorder [7]. Speech-based emotion identification can support the prompt identification of, and intervention in, mental health issues, leading to better patient outcomes and more effective, tailored mental health care [8]. Furthermore, speech-based emotion identification has the potential to completely transform interactions with customers. Contact centres and automated response systems can adopt emotion detection algorithms to assess callers' moods based on their tone of voice [9].

More positive customer experiences and higher levels of satisfaction can follow when customer service representatives use this data to respond with empathy and understanding. Businesses can better address customers' needs by considering their preferences and emotions [10]. On the other hand, SER systems can help create more tailored lessons in the classroom. Teachers can learn about their students' emotional involvement in class through their vocal emotions. Real-time input on students' emotional states can be very helpful in providing more efficient and suitable learning environments [11]. Furthermore, educational technologies such as intelligent tutoring systems can use speech-based emotion recognition to create adaptive and psychologically responsive learning experiences for their students [12]. Generally, there is a significant emotional difference between the two exam formats: students who take computer-based exams report experiencing more positive emotions, such as pleasure, hope, and pride, and fewer negative ones, such as anger, anxiety, and distress. Since computer-based testing elicits a more positive emotional reaction from students, it highlights the potential advantages of switching to it in higher education settings [13].

Even though there have been improvements in speech-based mood recognition, some challenges remain. One of them is understanding complicated emotional states that vary from person to person [14]. Emotions are complex and can be affected by many factors, such as cultural differences, personal speaking styles, and the situation [15]. More accurate models are still needed to learn and understand the hidden content of speech signals. It is also essential for training and testing emotion recognition systems to have diverse and well-annotated speech samples. Another problem is that, even though labelled datasets have been created [16], issues such as a lack of data, biased representations, and a lack of standards in how emotions are labelled make it hard to build accurate models that can be used in many different situations. Because of these restrictions, producing accurate and generalized models is challenging.

Previous research on recognizing speech emotions has many issues when tested on benchmark datasets. These issues include limited prediction accuracy [17], [21], [22], [23] and the extraction of only a few speech features [18], [19], [20], such as Mel-Frequency Cepstral Coefficients (MFCC) [24], [25], Zero Crossing Rate (ZCR), and chroma-STFT [21], which leads the models to excel only in specific applications. Issues such as the emotion class imbalance problem and the overfitting phenomenon degrade a model's performance on novel data, making the overall results of emotion recognition less reliable. As a result, there is a pressing need to overcome these obstacles and advance current SER models. These models require cross-cultural datasets, together with feature extraction strategies that mitigate emotion class imbalance and reinforce model generalizability, so that the original nature of emotional states in speech signals is captured accurately. By resolving these issues, the precision and consistency of speech emotion detection systems can be augmented, facilitating their widespread use in various applications and domains.

FIGURE 1: Traditional SER System

The proposed scheme attempts to resolve the current constraints in the field and to enhance the model's accuracy, fidelity, and generalization by merging datasets from diverse regions. Integrating datasets improves the model's ability to generalize to new data and provides a more thorough understanding of emotional expressions. Conventional SER systems, illustrated in Figure 1, comprise a few essential components. The procedure starts with receiving an audio speech signal, which is subsequently processed by a speech processing module. In feature extraction, the module converts the unprocessed audio input into a set of feature vectors. After that, the vectors are put through a feature selection process, which helps determine which features are most important for the emotion recognition task. Following the selection of these features, machine learning or deep learning models are used to classify the corresponding emotion category. Finally, the emotion recognition stage determines what the user is feeling based on the results of the classification step and then outputs that emotion. The main contributions are listed as follows:
• An ensemble method of speech datasets is proposed to enable cross-accent speech recognition by reducing language biases and enhancing model generalization irrespective of cultural inclination.
• An enhanced approach for cross-accent speech recognition is introduced. It involves extracting up to nine speech features and eight emotion classes from the cross-accent speech emotion dataset.
• A deep learning framework is presented to learn the temporal and spectral dependencies, enabling the discernment of deep voice patterns and thereby enhancing the performance of the cross-accent speech recognition framework.

The subsequent sections of this work are structured in the following manner. Section 2 summarizes the existing literature on SER, covering feature extraction and classification methods. Section 3 discusses the background for datasets, models, and features. The fourth section discusses the proposed methodology, encompassing in-depth feature extraction, model classification, evaluation, and discussion. The experimental outcomes and subsequent analysis are presented in Section 5, while Section 6 concludes the paper.

II. RELATED WORKS
Several research studies have been conducted on extracting emotions from speech data. In conventional machine learning methodologies, the SER system comprises two fundamental procedures: traditional feature extraction and multi-class emotion classification. This section examines current methodologies that utilize different datasets for real-time SER, together with the feature extraction techniques documented in the scholarly literature, and discusses several existing models related to real-time SER.

Alluhaidan et al. [26] aim to enhance SER using hybrid features in conjunction with Convolutional Neural Networks (CNNs). CNNs are advanced deep-learning models that excel in voice interpretation. They use automatic and adaptive learning to extract spatial hierarchies of characteristics from raw input data. The convolutional layer is a key component of CNNs, applying filters that scan the input data and perform convolution, which helps identify patterns such as edges and corners. The basic structure of the CNN model is shown in Figure 2, where the max-pooling layer down-samples the input, reducing complexity and overfitting by selecting the highest value in each sliding window, and the dense layer, placed after the convolutional and pooling layers, performs classification on the extracted features. To classify seven emotional states, the authors deploy three popular datasets, namely the Emotion Database (Emo-DB), Surrey Audio-Visual Expressed Emotion (SAVEE), and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). They improve performance by extracting MFCCs and MFCC time-domain features (MFCCT) during feature extraction. Among the machine learning models tested, the 1D-CNN model outperformed the others, achieving an accuracy of 96.6% for EMO-DB, 92.6% for SAVEE, and 91.4% for RAVDESS. Their proposed strategy leads to an accuracy improvement of 10% for EMO-DB, 26% for SAVEE, and 21% for RAVDESS. They suggest comparatively examining SER methods grounded in deep learning strategies using different datasets.

FIGURE 2: Architecture of CNN

Similarly, Huang et al. investigate SER using the fractional Fourier transform (FrFT) [27]. They extract the MFCC feature from the RAVDESS dataset using the FrFT. The best possible setting for the 'p' parameter of the FrFT, determined by the ambiguity function and the MFCC, is obtained for each speech signal frame. They detect all eight emotional states by deploying an LSTM network, and the results are enhanced by up to 79.86% compared to the ordinary Fourier Transform (FT) method. Since they did not obtain satisfying findings for the neutral, happy, and calm emotions, they recommended improving their results in the future.

Automatic SER is highlighted in the study [28], which emphasizes the deployment of parallel network training on the RAVDESS dataset. The researchers benchmarked various architectures, including standalone CNN designs (VGG-16, ResNet-50), attention-based networks (LSTM + Attention, Transformer), and hybrid architectures (Time-distributed CNN + Transformer). Their proposed parallel networks, namely CNN + Transformer and CNN + Bi-LSTM-Attention modules, aim to encapsulate spatial and temporal features. By converting raw audio into Mel-spectrograms and implementing data augmentation techniques, the parallel CNN-Transformer achieved an accuracy of 89.33%, while the CNN + Bi-LSTM-Attention network reached 85.67%. The Bi-LSTM-Attention module is a fusion of a Bi-directional Long Short-Term Memory (Bi-LSTM) network and an Attention mechanism.

The Bi-LSTM is a variant of recurrent neural networks that effectively incorporates information from both past and future contexts by processing data in both directions. The Attention mechanism empowers the model to concentrate on particular aspects of the input, assigning varying degrees of significance to these parts while generating the output. The combined Bi-LSTM-Attention module is therefore able to handle data in both the forward and backward directions while also concentrating selectively on particular portions of the input. The block diagram of the Bi-LSTM with the attention module is given in Figure 3. Overall, the results highlight the usefulness of parallel architectures for extracting emotional aspects from speech data, and they aid in creating effective and efficient SER models.
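To make the generic CNN building blocks referenced in this section (Figure 2: convolution, max pooling, and a dense classifier) concrete, the following minimal Keras sketch stacks those layers for a one-dimensional input. It is purely illustrative and is not the architecture of any cited work; the input length of 171 and the eight output classes are assumptions that simply mirror the feature-vector size and emotion count used later in this paper.

```python
# Minimal illustration of the Conv -> MaxPooling -> Dense pipeline of Figure 2.
# Illustrative only; layer sizes are assumptions, not a cited architecture.
from tensorflow.keras import layers, models

def build_toy_cnn(input_length=171, num_classes=8):
    model = models.Sequential([
        layers.Input(shape=(input_length, 1)),                # one feature vector as a 1D sequence
        layers.Conv1D(32, kernel_size=3, activation="relu"),  # filters scan the input for local patterns
        layers.MaxPooling1D(pool_size=2),                     # keep the strongest response per window
        layers.Flatten(),
        layers.Dense(64, activation="relu"),                  # dense layer combines the extracted features
        layers.Dense(num_classes, activation="softmax"),      # output layer assigns class probabilities
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

build_toy_cnn().summary()
```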
For the future direction, they propose more complex data augmentation techniques using deep learning methods to enhance the prediction results.

Aayushi et al. use facial expressions and speech signals to implement emotion recognition systems for medical applications [29]. They employ a Gabor filter with SVM and CNN model architectures to retrieve features from images, and they extract MFCC speech features from the speech signals. They use the RAVDESS, Toronto Emotional Speech Set (TESS), Crowd-sourced Emotional Multimodal Actors (CREMA-D), and SAVEE datasets for speech signals. Simultaneously, they select the Japanese Female Facial Expression (JAFFE) dataset, the Kaggle face expression recognition dataset, and Emotions in Context (EMOTIC) for image emotions. They achieve encouraging prediction results and expect to work on enhanced sentiment classification versions encompassing audio and images using multimodal emotion identification.

FIGURE 3: Bi-LSTM with Attention Module

Mohanty et al. implement a Deep Convolutional Neural Network (D-CNN) for identifying emotions [30]. This method accurately categorizes seven different human sentiments by utilizing the spectral and prosodic characteristics of spoken language. To examine the efficacy of their proposed method, the authors use a variety of datasets, namely RAVDESS, SAVEE, TESS, and CREMA-D. The findings demonstrate outstanding accuracy rates of 96.54%, 92.38%, 99.42%, and 87.90% for these datasets, respectively. Real-time testing is performed on the combined datasets, and the results show an accuracy rate of 90.27%. The findings provide prospects for improving user experiences in interactive systems and contribute to advancing the field of emotion recognition.

The study [31] applies the 1D-CNN classifier model to the RAVDESS and TESS datasets and reaches accuracies of 90.48% and 95.79%, respectively. They introduce a model with enhanced emotion categorization outcomes that could be implemented in smart home assistants to identify a person's emotions.

Alenezi et al. [32] exploit a hybrid approach to monitor people's emotions from Arabic tweets, involving lexicon- and rule-based data labelling together with Long Short-Term Memory (LSTM) neural network techniques, to comprehensively examine a dataset of 5.5 million Arabic tweets collected between January and August 2020. The rule-based technique demonstrates a commendable F1-score of 83% and is crucial in annotating a significant proportion of the data. Following this, the automatic annotation approach based on neural networks substantially increases the annotated dataset, improving emotion categorization accuracy by up to 5.9%. The meticulous methodology employed in this study not only establishes a solid basis for understanding the complexities of emotional dynamics in the context of a pandemic but also carries significant implications for enhancing the accuracy and flexibility of emotion detection systems in capturing emotional situations that occur in real-world scenarios.

In [33], various classifiers are applied to a merged dataset and to the RAVDESS and TESS datasets analyzed individually. A comparative performance analysis shows that Gradient Boosting performs better than the other classifiers when applied to the combined dataset, achieving up to 84.96% accuracy. Additionally, when compared to the other classifiers, the MLP classifier achieves superior results across all three datasets.

Shifted Linear Discriminant Analysis (S-LDA) is proposed in paper [34] to derive dynamic attributes from static low-level variables such as MFCC and pitch. These adjusted features are fed into a 1D-CNN to extract high-level features for automatic emotion recognition (AER). Three databases, SAVEE, eNTERFACE, and Berlin, are used to assess the suggested method's performance in a classification test. Their results demonstrate that the highest levels of accuracy for AER are 96.41% for eNTERFACE, 99.59% for Berlin, and 99.57% for SAVEE.

The findings presented in [35] acknowledge that former investigators do not succeed in capturing a voice signal's global, long-term context, since they only recover local hidden facets. Due to limited dataset availability and inadequate feature portrayals, they exhibit poor recognition performance. The authors propose ensemble approaches, combining the predictive performance of three model architectures, 1D-CNN, LSTM, and Gated Recurrent Unit (GRU), with a Fully Connected (FC) layer at the end to address these contemporary problems. For each audio file in the TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D datasets, they extract both local and global features, including MFCC, Log-mel spectrum, ZCR, chromogram, and RMS value. The ensemble method achieves exceptional weighted average accuracy results of 99.46% for TESS, 95.42% for EMO-DB, 95.62% for RAVDESS, 93.22% for SAVEE, and 90.47% for CREMA-D.

The problem of distinguishing between positive and negative emotions in audio recordings is also tackled using deep learning methods. The research in [36] uses five publicly available emotion speech datasets to build a 1D-CNN model. The experimental outcomes prove the model's efficacy in identifying positive and negative emotional speech data. Dealing with ambiguous or noisy emotion labels in the datasets is just one of the feature extraction and classification challenges that needs to be addressed. The study demonstrates that the classification accuracy for voice emotion recognition increases when many models and datasets are used. These results show the potential of applying deep learning methods to this area of study and help advance the field.
Sharifa et al. [37] examine the impact of cultural acceptance on eliciting and recognizing emotions within an Arab culture. The researchers employed standardized English and introductory Arabic stimuli to elicit six universally recognized emotions and assessed the physiological and behavioural reactions of a sample of 29 people. Notably, the clip's origin and the language used do not directly impact the elicitation of emotions; however, cultural acceptance is contingent upon religious and cultural ideals. The findings indicate that the multiclass Support Vector Machine (SVM) classifier consistently displays higher accuracy in recognizing emotions in Arabic clips (60% on average) compared to English clips (52% on average). This suggests that using culturally relevant clips boosts the emotional response in the classification process.

Studies on SER systems between 2000 and 2017 are analyzed in 2018 by [38] from three points of view: the database, feature extraction, and classifiers. However, only classical machine learning methods are evaluated as potential classification tools, and the authors note that neural networks (NN) and deep learning approaches still await investigation. The paper contains a substantial section on databases and feature extraction. One year later, Khalil et al. studied discrete methods in SER using deep learning [39]. Several deep learning methods, such as auto-encoders, deep neural networks (DNNs), CNNs, and recurrent neural networks (RNNs), are discussed, along with their benefits and drawbacks. Nevertheless, that research does not focus on readily available strategies to overcome the identified limitations. Similarly, a subsequent survey in 2021 examines deep learning algorithms for SER with accessible datasets, followed by traditional machine learning techniques for SER. The authors provide a multi-faceted analysis of the differences and similarities between various practical neural network methods for voice emotion identification [40].

III. MACHINE LEARNING MODELS AND FEATURES COMPREHENSION
Machine learning models depend on the comprehension of features within datasets for their effective utilization. Features are used by the model to make informed decisions. Therefore, the selection of relevant features is crucial for machine learning models to learn the required patterns in the data. A well-curated dataset that covers a diversity of features is vital for effective and generalized model performance. The limitation of data reflecting the features, due to data scarcity, can be mitigated by data augmentation techniques, which artificially expand the size of the dataset through data transformations.

A. MACHINE LEARNING MODELS
1) Logistic Regression (LR)
It is a supervised machine learning algorithm used for binary classification, aiming to predict the likelihood of an instance in the range of 0 to 1 [48]. For this purpose, the logistic function (or sigmoid function) is used to model the association between the input attributes and the likelihood of belonging to a specific class. The sigmoid function converts the input to a number between 0 and 1 to calculate the likelihood that an instance belongs to the positive class. To extend it to multiple classes, SoftMax activation is used to adapt Logistic Regression for multiclass SER.

2) Decision Trees (DT)
A supervised learning model used for classification and regression analysis. In the case of classification, the Gini index and entropy are used to measure the impurity and disorder in the dataset, respectively, while the mean squared error metric is mostly used for decision tree regression problems. This model can effectively identify nonlinear relationships in data and is interpretable by nature. Decision transparency depends on the Gini index, entropy, and squared error, and the model can deal with both numerical and categorical features [49]. A decision tree can help in emotion classification, where the contributing features can be identified by following its conditional rules.

3) RF
The RF algorithm is an ensemble learning technique integrating numerous decision trees to generate predictions [50]. The training of each decision tree involves a stochastic selection of a subset of the available data, followed by a computation of the final prediction using a voting or averaging procedure. The RF algorithm is extensively used in multi-class SER because it manages high-dimensional data, captures complex feature connections, and reduces over-fitting. The combination of multiple decision trees in RF serves to augment the model's robustness and capacity for generalization.

4) SVM
SVM is a robust supervised learning method for high-dimensional feature spaces and complex decision boundaries. The method improves separation and generalization by increasing the margin between classes [51]. SVMs are well suited for multiclass classification because of their resistance to noise and outliers.

5) Gaussian Naive Bayes (GNB)
Bayes' theorem [52] forms the basis of the GNB classifier, which assumes that attributes are conditionally independent given the class. It requires few computations and performs well on small datasets. Since it assumes feature independence, it can be used for either statistical or text-based approaches to SER. It works well when the class-conditional distributions follow a Gaussian distribution.

6) KNN
This method uses only the immediate proximity of data points to one another to find correlations between them [53]. In the context of multiclass speech emotion classification, k-NN is effective when samples of comparable emotions cluster close together, allowing it to capture the local structure of the data. It works well when the boundaries between the possible classes are fuzzy or nonlinear. By polling nearby points, it is able to perform multiclass classification.
7) Ensemble Method (Voting Classifier)
Due to their capacity to combine many base models and increase overall classification performance, ensemble approaches such as the voting classifier are commonly utilized in multiclass speech emotion identification. The voting classifier takes the predictions from several models and uses the majority vote to determine the final classification. This reduces the impact of inaccuracies and biases in the individual models, resulting in more reliable classifications. Better predictions can be made with the help of many models working together to minimize bias and variance [54].

8) LSTM
The LSTM model analyzes sequential data in tasks such as sentiment analysis, machine translation, and speech emotion recognition. LSTM networks possess a distinctive structure comprising memory cells and gating mechanisms, allowing them to selectively retain and discard information across extended sequences. This characteristic makes them highly suitable for tasks that require capturing temporal dependencies and modeling sequential patterns. Within the domain of SER, the LSTM model accepts acoustic features in the form of sequences, such as MFCCs or spectrograms, and subsequently passes them through a series of LSTM layers for processing. The LSTM layer comprises memory cells and input, output, and forget gates. The gates regulate the information flow and determine which information is retained, discarded, or transmitted. Utilizing a SoftMax activation function enables the LSTM model to effectively manage multiclass classification tasks by allocating probabilities to individual emotion classes [55].

9) Bi-LSTM
It is an advanced version of the LSTM model. Bi-LSTM differs from traditional LSTM models in that it simultaneously considers past and future data. It does this by using two LSTM layers: one processes the input sequence from beginning to end, and the other processes it backwards. By merging the information collected from both directions, Bi-LSTM obtains a complete picture of the input sequence, which makes it better at capturing dependencies. This two-way method makes it easier for the model to make precise predictions and leads to better results in several sequential data applications [28].

10) 1D-CNN Network
It is a variation of the CNN developed expressly to manage one-dimensional sequential data, such as text or time series. It applies convolutional operations along the time axis of the input sequence to learn representations and extract features from the data. The convolution is carried out in a 1D-CNN by sliding a filter over the input sequence and calculating the dot product between the kernel and the local region of the input; this is known as the sliding window method. Applying this operation at each position along the sequence yields the output feature map.

B. SPEECH FEATURES OVERVIEW
1) MFCC
The Mel-Frequency Cepstral Coefficients (MFCC) are a commonly used set of spectral features in emotional Speech Recognition (SR). These coefficients comprise a collection of values that convey relevant information regarding the configuration of the speech signal's spectrum. MFCC is applied in emotion classification, speaker identification, and SR systems. The underlying Mel scale can be represented as

Mel(f) = 2595 × log10(1 + f / 700),   (1)

where f is the frequency in Hz and Mel(f) is the corresponding frequency perceived by the human ear on the Mel scale [45].

2) ZCR
The ZCR measures how frequently an audio signal transitions from positive to negative values or vice versa. It provides insights into the temporal characteristics and the presence of high-frequency components in the speech signal. ZCR is applied in speech activity recognition and music information retrieval, and it produces a 1D array of values. The mathematical representation of ZCR is given below:

ZCR = (1 / T) Σ_{t=1}^{T} |s(t) − s(t − 1)|,   (2)

where T is the total number of signal frames, s(t) is the signal value at time t, and s(t − 1) is the signal value at the previous time step.

3) CHROMA-STFT
The chroma (STFT(t, f)) feature represents the distribution of pitch classes in the audio signal. It captures tonal information and helps identify the musical and emotional aspects of the speech signal. Chroma is applied in music information retrieval tasks such as chord recognition, melody extraction, and music genre classification. The mathematical expression for chroma-STFT is

STFT(t, f) = Σ_n x(n) · w(n − t) · e^(−j2πfn),   (3)

where x(n) is the signal value at time n, w(n − t) is the windowing function value at time (n − t), and the exponential term generates complex sinusoids at frequency f.
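As a concrete illustration of the three descriptors above, the short Python sketch below computes the Mel mapping of Eq. (1) directly and then uses librosa to extract frame-wise ZCR, chroma-STFT, and MFCC values from one audio file. The file path is a placeholder, and the frame settings are librosa defaults rather than values specified by the authors.

```python
# Sketch: Eq. (1) plus librosa-based ZCR, chroma-STFT, and MFCC extraction.
# "speech.wav" is a placeholder path; frame/hop settings are librosa defaults.
import numpy as np
import librosa

def hz_to_mel(f_hz):
    """Mel mapping from Eq. (1)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

y, sr = librosa.load("speech.wav", sr=None)               # keep the native sampling rate

zcr    = librosa.feature.zero_crossing_rate(y)             # shape (1, n_frames)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)           # shape (12, n_frames)
mfcc   = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)       # shape (20, n_frames)

print(hz_to_mel(1000.0))                                   # 1000 Hz maps to roughly 1000 mel
print(zcr.shape, chroma.shape, mfcc.shape)
```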
4) Spectral Centroid
The spectral centroid (S_centroid) can be considered the point at which the frequency distribution is in equilibrium [46]. It provides information about the sound's typical pitch or tone. As an illustration, a sound with a high pitch will have a higher spectral centroid, whereas a sound with a low pitch will have a lower spectral centroid. This feature helps in understanding the distinct emotions communicated in speech, differentiating between speakers, categorizing music genres, and other similar tasks. The mathematical equation for the spectral centroid is given below:

S_centroid = Σ(f · A) / Σ(A),   (4)

where A is the amplitude or power at frequency f, and f corresponds to the frequency of a particular bin.

5) Spectral Contrast
The amplitude difference between the peaks and troughs in a spectrogram is the spectral contrast (S_contrast). It is crucial for differentiating between sounds and emotions since it captures the spectral variations. Numerous audio processing tasks use the spectral contrast feature, such as voice recognition, speaker recognition, and music genre categorization. It helps capture crucial spectral fluctuations in the signal to better distinguish and characterize sounds and emotions. The mathematical representation of spectral contrast is

S_contrast = (P_i − T_i) / (P_i + T_i + ε),   (5)

where i is the index representing a specific frequency bin, P_i and T_i represent the peak and trough amplitudes at frequency bin i, and ε is a constant to avoid division by zero [47].

6) Spectral Roll-off
The spectral roll-off (S_roll-off) technique facilitates the identification of the frequency threshold below which a specific proportion (e.g., 85%) of the overall acoustic energy is concentrated. The calculation involves summing the energy across frequencies in ascending order until the predetermined percentage is attained. This information can prove advantageous in various tasks such as classifying music genres, segmenting audio into distinct components, and identifying events or auditory signals.

7) Root Mean Square
The root-mean-square value quantifies the typical amount of power the voice signal carries. It offers information regarding the amplitude and loudness of the entire signal:

x_rms = sqrt((1 / n) Σ_{i=1}^{n} x_i²),   (6)

where n represents the total number of samples in the speech signal and x_i is the signal value at sample i.

8) Mel-Spectrogram
The Mel-spectrogram is a graphical depiction of the frequency characteristics of the speech signal on the Mel scale. It captures spectral characteristics that are valuable for representing melodic and timbral information. The dimensionality of the Mel-spectrogram feature used here is always (128,) for any input signal.

9) Pitch
The perceived pitch of a voice corresponds to the fundamental frequency of the underlying speech signal. Applications include voice analysis, prosody modelling, and emotion recognition. A speaker's tone of voice, as well as the emphasis and cadence of their words, can be gleaned from their pitch.

C. KEY DATASETS
1) TESS
It is an English-language dataset containing 2800 audio recordings of trained individuals exhibiting various emotional expressions. It comprises seven emotions: anger, disgust, fear, happiness, surprise, sadness, and neutrality. TESS was developed to support research in emotion recognition and related fields by giving access to a comprehensive collection of speech samples covering various emotions [41].

FIGURE 4: The Roadmap of SER using Proposed Approach
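The remaining descriptors of Section III-B (spectral centroid, contrast, roll-off, RMS, Mel-spectrogram, and pitch) can also be obtained with librosa; a minimal sketch is shown below. The path is a placeholder, the 85% roll-off point follows the text above, and the pitch estimate via piptrack is one reasonable choice rather than necessarily the exact routine the authors used.

```python
# Sketch: librosa routines for features 4)-9) of Section III-B.
# "speech.wav" is a placeholder; default frame settings are assumed.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)

centroid = librosa.feature.spectral_centroid(y=y, sr=sr)                    # (1, n_frames), Eq. (4)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)                    # (7, n_frames), Eq. (5) per band
rolloff  = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)  # (1, n_frames)
rms      = librosa.feature.rms(y=y)                                         # (1, n_frames), Eq. (6) per frame
melspec  = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)           # (128, n_frames)

# One simple pitch estimate: the strongest piptrack frequency per frame, averaged.
pitches, mags = librosa.piptrack(y=y, sr=sr)
frame_pitch = pitches[mags.argmax(axis=0), range(pitches.shape[1])]
mean_pitch = float(frame_pitch[frame_pitch > 0].mean()) if np.any(frame_pitch > 0) else 0.0

print(centroid.shape, contrast.shape, rolloff.shape, rms.shape, melspec.shape, mean_pitch)
```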
2) CREMA-D
CREMA-D is an audio-visual English-language dataset consisting of 7442 audio and video recordings featuring male and female performers reciting 12 scripted sentences with various facial expressions and vocal inflections. It comprises a varied collection of six distinct sentiments: anger, disgust, fear, happiness, neutrality, and sadness. Because it provides a comprehensive audio and video data collection, it is helpful for emotion identification and multi-modal analysis [42].

3) SAVEE
This dataset includes 480 audio files of performers expressing various emotional states, such as neutral, pleased, sad, angry, afraid, disgusted, and surprised [43]. It is a helpful resource for studying emotion identification, particularly for examining emotions communicated through male voices.

4) RAVDESS
It is also an English-language dataset, with 1440 audio and video recordings of performers pronouncing scripted dialogue and singing songs with various facial expressions and vocal nuances. There are twenty-four actors (12 men and 12 women), covering a range of emotions including calm, joyful, sad, furious, afraid, surprised, and disgusted. RAVDESS finds application in emotion recognition and offers a large variety of emotional speech and song samples [44].

D. DATA AUGMENTATION TECHNIQUES
Data augmentation is used to strengthen and generalize speech datasets. It addresses imbalanced class problems and increases model prediction accuracy, precision, and recall.

1) Noise Injection
The addition of artificial noise to the dataset enlarges the data available for learning features without drowning out the natural speech signal. This strategy helps in reducing overfitting and boosts the model's capacity to handle noisy, real-world speech input. It also helps reduce the time it takes to train the model.

2) Pitch Alteration
Pitch alteration simulates various voices and speech styles by introducing changes in the pitch contour. It captures various vocal expressions and increases the model's capacity to identify emotions across a broader range of pitch values.

3) Time Stretching
The time-stretching methodology alters the temporal dimension of a speech signal, mimicking fluctuations in speech velocity and length. Incorporating such time-related trends and dependencies enhances the model's ability to capture and process speech data across different speaking rates, thereby increasing its robustness.

FIGURE 5: Count of Total Emotions

FIGURE 6: Proposed Cross-Accent Emotion Recognition Framework

IV. PROPOSED ACCENT EMOTION RECOGNITION FRAMEWORK
The proposed framework targets four different stages for cross-accent emotion recognition, as depicted in Figure 4. First, it aims to develop an extensive database by merging diverse English datasets: TESS, CREMA-D, SAVEE, and RAVDESS. Second, it extracts the necessary speech characteristics, such as MFCC, ZCR, RMS, and others, from this extensive database using the Python-based Librosa library. The third stage employs conventional machine learning and deep learning models, with hyperparameter tuning, to suggest the most appropriate ones for multi-class emotion classification. The final, fourth stage focuses on the evaluation of the machine learning models using metrics such as accuracy, recall, and F1 scores on holdout data for multi-class emotion classification.
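The three augmentation operations of Section III-D can be sketched with numpy and librosa as follows. The noise scale, pitch-shift amount, and stretch rate below are illustrative placeholders; Section IV-A quotes the authors' own settings (35% white noise, 70% pitch enhancement, and 80% time stretching), whose exact mapping onto these function arguments is not specified in the text.

```python
# Sketch of the augmentation operations in Section III-D (noise injection,
# pitch alteration, time stretching). Parameter values are illustrative only.
import numpy as np
import librosa

def add_noise(y, noise_factor=0.035):
    """Inject white noise scaled relative to the signal's peak amplitude."""
    noise = np.random.randn(len(y))
    return y + noise_factor * np.max(np.abs(y)) * noise

def shift_pitch(y, sr, n_steps=0.7):
    """Shift the pitch contour by a (fractional) number of semitones."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def stretch_time(y, rate=0.8):
    """Slow down (rate < 1) or speed up (rate > 1) the utterance."""
    return librosa.effects.time_stretch(y, rate=rate)

y, sr = librosa.load("speech.wav", sr=None)    # placeholder path
augmented = [add_noise(y), shift_pitch(y, sr), stretch_time(y)]
```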
A. CROSS-ACCENT DATASET ENSEMBLE METHOD AND DATA AUGMENTATION
This part of the proposed framework requires identifying potentially diverse resources that can provide crucial speech data. Therefore, four distinct datasets, TESS from Toronto, SAVEE from the United Kingdom, RAVDESS from North America, and CREMA-D from diverse ethnic backgrounds including African American, Asian, Caucasian, and Hispanic speakers, are utilized to create a diverse dataset. The speech data in each dataset is carefully chosen, and the number of distinct emotions in each dataset is visualized to ensure its relevance to the proposed research problem. All four datasets are merged to create a comprehensive database containing emotions from diverse sources, as depicted in the Cross-Accent Emotions Ensembling block in Figure 6. In the figure, the expression 'e' represents the unique identifier for the four datasets: TESS as e=1, SAVEE as e=2, CREMA-D as e=3, and RAVDESS as e=4. The index 'j = 1 to 8' expresses the total of eight emotions. Notably, emotions such as Happy, Sad, Neutral, Disgust, Fear, and Angry are common to all four datasets ('e=1, 2, 3, 4'), while Surprise is present in CREMA-D and RAVDESS ('e=3, 4'), and Calm is unique to RAVDESS ('e=4'). The combination of all these datasets gives approximately 12162 emotion-labelled files, as shown in Figure 5 with the original labels. This newly developed dataset covers a wide range of emotional expressions and enhances the speech emotion recognition generalization capabilities of the proposed framework. The data collection phase encompasses the careful selection of well-annotated voice datasets, considering their cultural backgrounds and contextual variations. Preprocessing is applied to maintain data quality and address any ethical considerations. These merged and preprocessed datasets form the foundation for the subsequent steps in our work, including feature extraction, model training, and evaluation.

After the data collection phase, two data augmentation techniques are applied to mitigate class imbalance issues and overfitting challenges. Firstly, adding 35% white noise to each data file enhances data diversity and robustness; this reflects real-world scenarios where spoken language is often interrupted by ambient noise, and it ultimately makes the classification model capable of classifying emotions in noisy environments, making it more practical for real-world use. Secondly, the pitch is enhanced by 70% to ensure the dataset's expressiveness, with 80% time stretching applied to capture different temporal aspects of speech patterns. Overall, data augmentation increases model prediction accuracy, precision, and recall. Subsequently, the augmented data is used for the succeeding phases of feature extraction, model training, and model evaluation.

B. SPEECH CHARACTERISTICS
In order to establish a practical approach for training machine learning models on speech signals, the second stage involves extracting the key speech features. In the proposed database, most speakers begin speaking after a threshold of 0.5 seconds and continue until 2.5 seconds, with a maximum file duration of 3 seconds, as shown in Figure 7. Features from each emotion file are therefore extracted from 0.6 to 2.5 seconds to ensure that only the relevant part of the speech is included in the feature extraction process, eliminating unnecessary noise. A total of nine speech features, namely ZCR, chroma-STFT, MFCC, RMS, Mel-spectrogram, pitch, and the three spectral features centroid, contrast, and roll-off, are extracted from each speech file, as represented in the signal-processing-framework-based feature selection block of Figure 6. Each of these speech characteristics produces a distinct array of values or coefficients. For instance, ZCR produces a single scalar value, indexed at (0-1) in the final speech feature list, indicating how quickly the speech signal changes; chroma-STFT gives a series of coefficients at indices (2-12), highlighting the presence of tonal content in the speech signal; and MFCC contributes 20 coefficients at indices (13-33), describing various aspects of the speech signal's spectrum.

FIGURE 7: Input Speech Signal at different Energy Levels

Furthermore, the spectral centroid yields a single scalar value at index (34), indicating where most of the audio signal's energy lies in frequency, and spectral contrast, with seven scalar values at indices (35-41), represents the difference in amplitude between peaks and troughs in the audio spectrum. The third spectral feature, roll-off, at index (42), is a single value marking the frequency below which a specific percentage (e.g., 85%) of the total spectral energy lies. At index (43), RMS gives a single scalar value representing the audio signal's overall energy level or loudness. The Mel-spectrogram contributes 128 values, each giving the energy of a particular frequency band over the analysed duration. Finally, at index (172), the pitch estimate provides a scalar value that signifies the fundamental frequency. This process yields 172 features for each input file, as listed in Table 1, resulting in a feature matrix of dimensions (36486, 172). All extracted features undergo scaling and normalization to ensure dataset uniformity. All of the above features are extracted using appropriate signal-processing techniques; each captures specific aspects of the speech signal related to emotions and is used as input for the subsequent stages of model training and emotion recognition.
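Putting the pieces together, a per-file extraction routine consistent with Table 1 (1 + 12 + 20 + 1 + 7 + 1 + 1 + 128 + 1 = 172 values) might look like the sketch below. It assumes, as the per-feature dimensionalities suggest, that frame-wise librosa outputs are averaged over time; the trimming window (0.6-2.5 s) follows the text, while the exact aggregation the authors used is not stated.

```python
# Sketch: build one 172-dimensional feature vector per file (Table 1 layout).
# Assumption: frame-wise features are averaged over time; aggregation choice is ours.
import numpy as np
import librosa

def extract_features(path, start=0.6, end=2.5):
    y, sr = librosa.load(path, sr=None, offset=start, duration=end - start)
    pitches, mags = librosa.piptrack(y=y, sr=sr)
    pitch = pitches[mags.argmax(axis=0), range(pitches.shape[1])]
    pitch = pitch[pitch > 0].mean() if np.any(pitch > 0) else 0.0

    parts = [
        librosa.feature.zero_crossing_rate(y).mean(axis=1),                   # 1
        librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1),                 # 12
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1),             # 20
        librosa.feature.spectral_centroid(y=y, sr=sr).mean(axis=1),           # 1
        librosa.feature.spectral_contrast(y=y, sr=sr).mean(axis=1),           # 7
        librosa.feature.spectral_rolloff(y=y, sr=sr).mean(axis=1),            # 1
        librosa.feature.rms(y=y).mean(axis=1),                                # 1
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128).mean(axis=1),  # 128
        np.array([pitch]),                                                    # 1
    ]
    return np.concatenate(parts)           # shape: (172,)

# vec = extract_features("speech.wav")    # placeholder path
```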
TABLE 1: List of Total Extracted Features
ZCR: 1
CHROMA-STFT: 12
MFCC: 20
Spectral Centroid: 1
Spectral Contrast: 7
Spectral Roll-off: 1
RMS: 1
Mel-Spectrogram: 128
Pitch: 1
Total: 172

C. MODEL TRAINING
The newly created dataset, with dimensions (36486, 172), is split into segments to train seven conventional machine learning models and four deep learning models. For the machine learning models, 95% of the data is allocated for training, while the remaining 5% is reserved for testing. This large training allocation aims to optimize model performance by enabling the capture of intricate patterns while mitigating overfitting risks. Similarly, for the deep learning models, 80% of the data is assigned for training and the remaining 20% is reserved for testing the trained models, as shown in Figure 6.

1) Conventional Machine Learning Models
Seven machine learning models, LR, DT, RF, SVM, KNN, GNB, and the ensemble, are employed for multiclass speech emotion classification. These models are first trained using default parameters, and performance is evaluated based on prediction accuracy, detailed in Table 5 for case I. After this, the GNB, DT, and RF models undergo fine-tuning by changing hyperparameters, including max-depth, n-estimators, min-samples-split, and min-samples-leaf, to optimize the predictive results. In cases where fine-tuning does not yield improved results, an RF-based feature selection technique is activated. This technique uses Gini impurity or information gain to assign an importance value between 0 and 1 to each feature, identifying the features with strong correlations to the dependent variable for further analysis. The RF feature selection results are evaluated at various threshold levels, including 0.004, 0.005, 0.007, and 0.008, to select the most pertinent features from the initial 172. Table 5, cases II, III, IV, and V, presents the testing accuracy results of all machine learning models at these threshold levels. The features selected at the final threshold level of 0.008, yielding the highest prediction accuracy, contribute to the classification report in Table 6 and the confusion matrix in Figure 10.

2) Deep Learning Models
The proposed deep learning framework mainly employs four models: LSTM, LSTM + 1D-CNN, Bi-LSTM + 1D-CNN, and 1D-CNN. Each model uses a distinct method to improve results, such as increasing or decreasing the number of nodes, modifying the number of layers, and adding regularizer and dropout layers. For example, the KBest feature selection or random feature selection method is used for the 1D-CNN architecture. Further details of each model and its proposed architecture are elaborated below.

FIGURE 8: Visual Representation of LSTM Network with Optimizing Layers (BN = Batch Normalization)

In the deep learning phase, an initial approach uses a simple LSTM model, as illustrated in Figure 8. This model includes three LSTM layers with decreasing units, three batch normalization layers, and a dropout rate of 30%. L2 regularization with a coefficient of 0.001 is employed for each LSTM layer to enhance model generalization and reduce overfitting.

TABLE 2: Proposed Layer Type LSTM Architecture with Hyper-parameters
Layer 1: LSTM, Neurons = 128, Input Shape = (171, 1), Regularizer L2 = 0.001
Layer 2, 3: BatchNormalization, Dropout = 0.3
Layer 4: LSTM, Neurons = 64, Regularizer L2 = 0.001
Layer 5, 6: BatchNormalization, Dropout = 0.3
Layer 7: LSTM, Neurons = 32, Regularizer L2 = 0.001
Layer 8, 9: BatchNormalization, Dropout = 0.4
Layer 10: Fully Connected Layer, Neurons = 8, Activation Function = Softmax
Layer 11: Optimizer = Adam, Loss = Categorical Cross-Entropy
Layer 12: Batch Size = 16, Epochs = 200

Also, optimization is performed using the Adam optimizer, categorical cross-entropy is used as the loss function, and accuracy is the chosen evaluation metric. The detailed architecture of the LSTM model, with its layers and hyperparameters, is given in Table 2.
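A Keras sketch of the Table 2 configuration is given below. It follows the listed hyper-parameters (three LSTM layers of 128/64/32 units with L2 = 0.001, batch normalization and dropout after each, a softmax output over eight classes, Adam with categorical cross-entropy, batch size 16, 200 epochs); details not fixed by the table, such as the return_sequences settings and the training arrays X_train/y_train referenced in the commented usage line, are assumptions.

```python
# Sketch of the LSTM architecture in Table 2; unspecified details are assumptions.
from tensorflow.keras import layers, models, regularizers

def build_lstm(input_len=171, num_classes=8, l2=0.001):
    model = models.Sequential([
        layers.Input(shape=(input_len, 1)),
        layers.LSTM(128, return_sequences=True, kernel_regularizer=regularizers.l2(l2)),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.LSTM(64, return_sequences=True, kernel_regularizer=regularizers.l2(l2)),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.LSTM(32, kernel_regularizer=regularizers.l2(l2)),
        layers.BatchNormalization(),
        layers.Dropout(0.4),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# model = build_lstm()
# model.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=16, epochs=200)
```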
Also, optimization is performed using the Adam optimizer, categorical cross-entropy operates as the loss function, and accuracy is the chosen evaluation metric. The architecture of the LSTM model, with its layers and hyperparameters, is given in Table 2. To address the issues with the LSTM model's performance, two hybrid models, 1D-CNN + LSTM and 1D-CNN + Bi-LSTM, are employed with almost identical architectures, given in Table 3. These hybrid architectures mitigate the rising validation loss and stagnant validation accuracy faced by the LSTM. The hybrid models are designed by combining three layers of 1D-CNN, with filter sizes ranging from 32 to 128, the same kernel size of 3, and the ReLU activation function. Additionally, two LSTM layers are incorporated into the model, with LSTM units decreasing from 64 to 32, along with three dense layers whose units decrease from 128 to 8. The model performance is evaluated with the categorical cross-entropy loss, and early stopping is implemented. While the integration of the two architectures produces promising results compared to the LSTM alone, further optimization is required beyond the capabilities of this architecture.

TABLE 3: Proposed Layer Type (1D-CNN + Bi-LSTM) Architecture with Hyper-parameters
Layer Type      Architecture with Hyper-parameters
Layer 1         Conv1D, Filters = 64, Kernel_size = 3, Activation = ReLU, Input_Shape = (171, 1)
Layer 2, 4, 6   MaxPooling1D (pool_size = 2)
Layer 3         Conv1D, Filters = 64, Kernel_size = 3, Activation_Function = ReLU
Layer 5         Conv1D, Filters = 128, Kernel_size = 3, Activation_Function = ReLU
Layer 7         Bi-LSTM, Neurons = 64
Layer 8, 10     Dropout = 0.5
Layer 9         Bi-LSTM, Neurons = 32
Layer 11        Fully Connected Layer 1, Neurons = 128, Activation_Function = ReLU
Layer 12, 14    Dropout = 0.5
Layer 13        Fully Connected Layer 2, Neurons = 64, Activation_Function = ReLU
Layer 15        Fully Connected Layer 3, Neurons = 8, Activation_Function = Softmax
Layer 16        Optimizer = Adam, Loss = Categorical_cross_entropy
Layer 17        Batch_size = 32, Epochs = 50
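To make Table 3 concrete, the sketch below is one reading of the 1D-CNN + Bi-LSTM stack in TensorFlow/Keras. It follows the listed layer order and hyper-parameters; it is an illustration, not the authors' released code, and the 171-dimensional input and eight output classes come directly from the table.

```python
# Sketch of the Table 3 hybrid 1D-CNN + Bi-LSTM model (assumes TensorFlow/Keras).
from tensorflow.keras import layers, models

def build_cnn_bilstm(input_len=171, num_classes=8):
    model = models.Sequential([
        layers.Input(shape=(input_len, 1)),
        layers.Conv1D(64, kernel_size=3, activation="relu"),            # Layer 1
        layers.MaxPooling1D(pool_size=2),                                # Layer 2
        layers.Conv1D(64, kernel_size=3, activation="relu"),            # Layer 3
        layers.MaxPooling1D(pool_size=2),                                # Layer 4
        layers.Conv1D(128, kernel_size=3, activation="relu"),           # Layer 5
        layers.MaxPooling1D(pool_size=2),                                # Layer 6
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),   # Layer 7
        layers.Dropout(0.5),                                             # Layer 8
        layers.Bidirectional(layers.LSTM(32)),                           # Layer 9
        layers.Dropout(0.5),                                             # Layer 10
        layers.Dense(128, activation="relu"),                            # Layer 11
        layers.Dropout(0.5),                                             # Layer 12
        layers.Dense(64, activation="relu"),                             # Layer 13
        layers.Dropout(0.5),                                             # Layer 14
        layers.Dense(num_classes, activation="softmax"),                 # Layer 15
    ])
    # Layer 16 of the table: Adam optimizer with categorical cross-entropy loss.
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

With this construction, batch_size = 32 and epochs = 50 (Layer 17) are supplied at fit time, and the early stopping mentioned in the text can be attached as a Keras callback.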
Finally, the stratified K-Fold cross-validation technique is
Then, 162 features are randomly selected, and the 1D-CNN model is trained on them to assess their impact on validation accuracy and loss. In case five, the same 1D-CNN model is used with the KBest feature selection technique, leveraging the Analysis of Variance (ANOVA) F-value. This method discerns the correlation between the input features and the target variable in classification tasks, highlighting the most relevant features for classification. This systematic approach eliminates less significant features, ultimately identifying the top 40 and top 20 features that most influence classification accuracy. The results for each model are detailed in Table 7.

In the proposed framework, two models, a simple 1D-CNN and a 1D-CNN with stratified cross-validation, are employed with identical architectures; the simple model operates without cross-validation, while the other incorporates validation, as shown in Table 4. These architectures encompass multiple layers, including nine Conv1D (1D convolutional) layers, five MaxPooling1D layers, and two Dense (fully connected) layers, with categorical cross-entropy as the loss function. The Conv1D layers are designed to extract relevant features from the input data, while the MaxPooling1D layers down-sample the feature maps to retain the pertinent features. Given the multiclass nature of the classification task, the categorical cross-entropy loss function is used. A callback function is implemented to adjust the learning rate dynamically during training and to halt training when no improvement in validation loss is observed, saving computational resources and time and ensuring efficient model training.

FIGURE 9: Proposed 1D-CNN Network Architecture
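For the KBest step described above, a minimal scikit-learn sketch follows. X_train, y_train, and X_test are placeholders for the feature matrices and labels; k = 40 mirrors the top-40 subset mentioned in the text, and k = 20 gives the smaller subset.

```python
# ANOVA F-value (f_classif) based KBest feature selection, as described above.
from sklearn.feature_selection import SelectKBest, f_classif

def select_top_k(X_train, y_train, X_test, k=40):
    selector = SelectKBest(score_func=f_classif, k=k)   # rank features by ANOVA F-value
    X_train_k = selector.fit_transform(X_train, y_train)
    X_test_k = selector.transform(X_test)                # apply the same mask to held-out data
    return X_train_k, X_test_k, selector.get_support(indices=True)
```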

TABLE 4: Layer Type and CNN Architecture with Hyper-parameters


Layer Type CNN Architecture with Hyper-parameters
0 Input shape = (36486, 171)
1,2 Conv1D, Neurons = 1024, Kernel size = 5, Strides = 1, Activation Function = ReLU, Padding = Same
4,5 Conv1D, Neurons = 512, Kernel size = 5, Strides = 1, Activation Function = ReLU, Padding = Same
3,6 MaxPooling1D 2, Pooling size = 5, Strides = 2, Padding = Same
7,8 Conv1D, Neurons = 256, Kernel size = 3, Strides = 1, Activation Function = ReLU, Padding = Same
10,11 Conv1D, Neurons = 128, Kernel size = 3, Strides = 1, Activation Function = ReLU, Padding = Same
9,12 MaxPooling1D 4, Pooling size = 5, Strides = 2, Padding = Same
13 Conv1D 9, Neurons = 64, Kernel size = 3, Strides = 1, Activation Function = ReLU, Padding = Same
14 MaxPooling1D 5, Pooling size = 3, Strides = 2, Padding = Same
15 Flatten Layer
16 Fully Connected Layer 1, Neurons = 512, Activation Function = ReLU, Dropout = 0.5
17 Fully Connected Layer 2, Neurons = 8, Activation Function = Softmax
18 Loss Function = Categorical cross-entropy, Optimizer = RMSprop, Learning rate = 0.001, Reduce Learning Rate, Early Stopping, Epochs = 50, Batch Size = 128
19 Stratified Cross Validation = 5
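The following is a hedged Keras sketch of the Table 4 stack, together with the learning-rate reduction and early-stopping callbacks described in the text. The layer grouping follows the row indices in the table (two Conv1D blocks per filter size, with pooling at rows 3, 6, 9, 12, and 14); the callback patience and reduction factor are illustrative assumptions, and each 171-dimensional feature vector is treated as a one-channel sequence.

```python
# Sketch of the Table 4 deep 1D-CNN (assumes TensorFlow/Keras).
# Table 4 lists the data shape as (36486, 171); here each sample is treated as a
# length-171 sequence with a single channel, which is one plausible reading.
from tensorflow.keras import layers, models, optimizers, callbacks

def build_deep_cnn(input_len=171, num_classes=8):
    def conv(filters, kernel):
        return layers.Conv1D(filters, kernel, strides=1, padding="same", activation="relu")

    model = models.Sequential([
        layers.Input(shape=(input_len, 1)),
        conv(1024, 5), conv(1024, 5),                                     # rows 1-2
        layers.MaxPooling1D(pool_size=5, strides=2, padding="same"),      # row 3
        conv(512, 5), conv(512, 5),                                       # rows 4-5
        layers.MaxPooling1D(pool_size=5, strides=2, padding="same"),      # row 6
        conv(256, 3), conv(256, 3),                                       # rows 7-8
        layers.MaxPooling1D(pool_size=5, strides=2, padding="same"),      # row 9
        conv(128, 3), conv(128, 3),                                       # rows 10-11
        layers.MaxPooling1D(pool_size=5, strides=2, padding="same"),      # row 12
        conv(64, 3),                                                      # row 13
        layers.MaxPooling1D(pool_size=3, strides=2, padding="same"),      # row 14
        layers.Flatten(),                                                 # row 15
        layers.Dense(512, activation="relu"),                             # row 16
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),                  # row 17
    ])
    model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Learning-rate reduction and early stopping, as described in the text (row 18);
# factor and patience values are assumptions for illustration.
fit_callbacks = [
    callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
    callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
]
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=50, batch_size=128, callbacks=fit_callbacks)
```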

Finally, the stratified K-Fold cross-validation technique is employed, which partitions the data into training, validation, and test sets. This technique enhances the model's robustness and inference capabilities. The model's efficacy on the test set is assessed through the classification report in Table 8 and the confusion matrix in Figure 11. The finalized architecture of the proposed deep learning framework is illustrated in Figure 9. Overall, hyperparameter tuning is explored for each deep learning model, for instance, by varying the activation functions, loss functions, and learning rates, as well as by modifying the number of layers and neurons. While these adjustments are evaluated, the potential remains for further enhancement of the models' performance on cross-accent emotion recognition tasks.

D. APPLICATIONS OF PROPOSED ACCENT EMOTION RECOGNITION FRAMEWORK
The potential applications of this framework are in online education [62], family household robot assistants [63], noise detection for human robots [64], stress identification for air traffic controllers [65], financial distress management [66], and remote health care [67]. For instance, in an online education system, the proposed system can be used to enhance teaching quality given the cross-cultural backgrounds of students and teachers [62]. An intelligent family household robot can employ the proposed SER framework to accurately estimate the sentiments of users from a global perspective, which enables friendly interaction between robots and human beings [63]. Human listeners can often identify noisy speech, but machines cannot perform this task without specific filters; therefore, this framework can be used to detect the noisy signal, extract the sentiment, and assist in taking the required action [64]. Similarly, air traffic controllers interact with pilots and people worldwide, which leads to diverse accent-related emotion variations, and the proposed cross-accent SER can enhance their inter-communications [65]. Furthermore, cross-accent SER can improve emotion estimation in diverse settings such as financial distress management, social interaction, and healthcare [66], [67].

V. PERFORMANCE EVALUATION AND DISCUSSION
Several machine learning and deep learning models for voice emotion recognition are developed and evaluated, utilizing the combined speech datasets. Table 5 presents the comparative prediction accuracy of the machine learning models for SER. In the scenario where all 172 features are selected (Case I), the models display diverse levels of accuracy, with nearly all models showing under-fitting and less than 35% training accuracy. These results suggest that the models struggle when all nine feature groups, temporal and spectral alike, are used together, making the testing accuracy unreliable. However, when the RF-based feature selection technique is employed to capture only the crucial features, each feature receives an importance value, denoted by alpha, ranging between 0 and 0.1. Applying the RF-derived feature selection at varying threshold levels within 0 <= alpha <= 0.1 (Cases II to V) significantly boosts the testing accuracy. At the threshold alpha > 0.008 (Case V), the RF model achieves a peak accuracy of 0.76, outperforming all other models. Similarly, the Decision Tree accuracy increases substantially, reaching 0.61.

The effectiveness of the Ensemble Voting technique in aggregating the predictions of multiple models is demonstrated by its accuracy of 0.64 in Case II. The findings thus highlight the significance of feature selection in enhancing the efficacy of machine learning algorithms for recognizing emotions in speech. This investigation identifies the RF model as the most precise model. After assessing multiple models, the RF algorithm emerges as the most suitable choice for the emotion classification task, achieving a commendable accuracy of 76% on the testing data. The final RF model is therefore used to produce the classification report on the testing dataset given in Table 6.
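The RF-based importance thresholding described above can be sketched in a few lines of scikit-learn; X_train and y_train are placeholders for the training split, and alpha = 0.008 corresponds to Case V.

```python
# RF feature selection: keep features whose Gini importance exceeds a threshold alpha.
from sklearn.ensemble import RandomForestClassifier

def rf_select(X_train, y_train, alpha=0.008):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    mask = rf.feature_importances_ > alpha   # importances sum to 1 over all 172 features
    return mask                              # boolean mask: True = keep this feature

# Example: build the Case V subset and retrain any model on it.
# mask = rf_select(X_train, y_train, alpha=0.008)
# X_train_sel, X_test_sel = X_train[:, mask], X_test[:, mask]
```

Repeating this for alpha in {0.004, 0.005, 0.007, 0.008} yields the Case II to Case V feature subsets evaluated in Table 5.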
TABLE 5: Comparative Analysis of Different Machine Learning Models in Terms of Prediction Accuracy
Models                      Case I (alpha > 0)   Case II (alpha > 0.004)   Case III (alpha > 0.005)   Case IV (alpha > 0.007)   Case V (alpha > 0.008)
Logistic Regression         0.24                 0.50                      0.50                       0.49                      0.34
Decision Tree               0.31                 0.57                      0.56                       0.50                      0.61
Random Forest               0.31                 0.74                      0.72                       0.61                      0.76
Support Vector Classifier   0.23                 0.53                      0.51                       0.51                      0.37
K-Nearest Neighbor          0.25                 0.56                      0.55                       0.54                      0.43
Gaussian Naïve Bayes        0.24                 0.23                      0.23                       0.26                      0.22
Ensemble                    ...                  0.64                      0.62                       ...                       0.60
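The per-class figures reported in Tables 6 and 8, and the confusion matrices in Figures 10 and 11, can be generated with scikit-learn as sketched below; clf, X_test, and y_test are placeholders for a fitted classifier and the held-out split.

```python
# Per-class precision, recall, and F1 on the held-out test set, plus the confusion matrix.
from sklearn.metrics import classification_report, confusion_matrix

y_pred = clf.predict(X_test)                      # any fitted classifier, e.g. the final RF
print(classification_report(y_test, y_pred, digits=2))
print(confusion_matrix(y_test, y_pred))
```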

As proposed, the RF model shows favourable performance across diverse emotion categories, as evidenced by the classification report. From Table 6, the precision values of the model range from 0.72 to 0.91, indicating accurate classification of instances for each emotion class. Meanwhile, recall values between 0.65 and 0.93 indicate that the RF model captures a significant proportion of the instances for most emotions. The F1-scores, which vary between 0.71 and 0.89, demonstrate a good balance between precision and recall, successfully capturing each emotional category's characteristics. The RF model demonstrates consistent performance across various emotions, even when facing varying degrees of support, which highlights the model's consistency in the face of imbalanced datasets. The model generalizes well, revealing high precision, recall, and F1-scores for emotions such as anger, surprise, and neutral, while the other emotions exhibit slightly lower but still admirable performance. Despite its imperfections, the outcome represents a significant achievement in precisely categorizing human emotions. The RF model's effectiveness, generalization capacity, and consistency make it a practical choice for recognizing emotions, especially in imbalanced data situations. However, it requires additional scrutiny and assessment on unobserved or heterogeneous datasets.

TABLE 6: Classification Report of Proposed RF Model
Class      Precision   Recall   F1-Score   Support
Angry      0.77        0.86     0.81       281
Calm       0.72        0.93     0.81       28
Disgust    0.73        0.69     0.71       281
Fear       0.83        0.65     0.73       281
Happy      0.79        0.68     0.73       295
Neutral    0.73        0.82     0.77       253
Sad        0.72        0.82     0.77       311
Surprise   0.91        0.86     0.89       95

The comparative analysis of the different deep learning models for SER, as depicted in Table 7, provides valuable insights. Challenges arise with specific models throughout the experimentation process, leading to the exploration of alternative approaches to improve validation loss and accuracy. Initially, the LSTM model, a popular choice for sequence modelling tasks, given in Figure 8, is employed. Despite its potential, the LSTM model exhibits limited effectiveness in capturing the complex patterns and structures of speech data, with a validation accuracy of only 0.1568, which falls short of expectations. To address the LSTM limitations, a combination of 1D-CNN and LSTM is explored, exploiting the strengths of both architectures. This fusion approach improves the accuracy to 0.5748 by effectively capturing local and temporal dependencies in the speech data. The 1D-CNN + Bi-LSTM model further builds upon this success by leveraging bidirectional LSTM layers, achieving an even higher accuracy of 0.6203. Notably, the importance of feature selection is observed throughout the experiments. Subsequently, further improving the results, the 1D-CNN model using 162 chosen features demonstrates the highest accuracy so far, 0.6785. This result emphasizes the significance of a comprehensive feature set in accurately capturing and recognizing emotional cues in speech data. However, a slight decline in accuracy is observed when using the KBest method to reduce the feature space. This highlights the delicate balance between dimensionality reduction and retaining crucial features, emphasizing the need for careful consideration when selecting a feature reduction technique to avoid sacrificing essential information. The 1D-CNN Simple model and the 1D-CNN with stratified cross-validation utilize all 172 features. The 1D-CNN Simple model achieves a validation loss of 3.112 and an accuracy of 0.6411, prioritizing simplicity in its architecture at the expense of accuracy compared to the previous models.

TABLE 7: Proposed Deep Learning Models Analysis
Models                                 Epochs   Loss    Accuracy
LSTM                                   20       1.947   0.157
1D-CNN + LSTM                          20       1.070   0.575
1D-CNN + Bi-LSTM                       20       1.048   0.6203
1D-CNN-(162)                           50       0.842   0.678
1D-CNN-KBest-(40)                      40       1.069   0.585
1D-CNN-KBest-(20)                      30       0.236   0.5129
1D-CNN-(172)                           25       3.112   0.6411
1D-CNN-Cross Validation (Proposed)     50       0.032   0.99

On the other hand, the 1D-CNN model with stratified cross-validation, trained for 50 epochs, improves performance significantly, with a validation loss of 0.032 and an accuracy of 0.99. This technique effectively addresses class imbalance and yields noteworthy improvements in loss and accuracy. The classification results for the 1D-CNN framework using the stratified cross-validation method are displayed in Table 8. The outcomes show the model's ability to correctly categorize emotions into the several classes. The precision values, which range from 0.99 to 1.00, indicate the model's accuracy in identifying occurrences of each emotion class. Consistently high recall values suggest that the model captures a sizable subset of the true positive examples for each class. Excellent performance across all emotion classes is also indicated by the F1-score values, which combine precision and recall. These findings confirm the comparative analysis, suggesting that the 1D-CNN architecture with stratified cross-validation is viable for speech emotion identification tasks due to its high validation accuracy and low loss. Apart from the comparative analysis between the deep learning models, the confusion matrices computed on the held-out datasets corresponding to Tables 6 and 8 provide additional insight into the efficacy of these models in accurately classifying speech emotions, as shown in Figures 10 and 11.
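A sketch of the 5-fold stratified protocol (Table 4, row 19) is given below. It assumes integer class labels, one-hot encodes them per fold, and reuses the hypothetical build_deep_cnn constructor from the sketch following Table 4; it illustrates the mechanics rather than reproducing the authors' exact training loop.

```python
# Stratified 5-fold cross-validation: each fold preserves the class proportions.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.utils import to_categorical

def run_stratified_cv(X, y, n_splits=5, num_classes=8):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = build_deep_cnn(input_len=X.shape[1], num_classes=num_classes)
        model.fit(X[train_idx][..., np.newaxis], to_categorical(y[train_idx], num_classes),
                  validation_data=(X[val_idx][..., np.newaxis], to_categorical(y[val_idx], num_classes)),
                  epochs=50, batch_size=128, verbose=0)
        _, acc = model.evaluate(X[val_idx][..., np.newaxis],
                                to_categorical(y[val_idx], num_classes), verbose=0)
        scores.append(acc)                      # per-fold validation accuracy
    return float(np.mean(scores)), scores
```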
TABLE 8: Classification Report of Proposed 1D-CNN Model with Stratified Cross Validation
Class      Precision   Recall   F1-Score   Support
Angry      0.99        1.00     1.00       1150
Calm       1.00        0.98     0.99       117
Disgust    0.99        0.99     0.99       1153
Fear       0.99        1.00     0.99       1143
Happy      0.99        0.99     0.99       1156
Neutral    0.99        0.99     0.99       1018
Sad        0.99        0.99     0.99       1162
Surprise   1.00        1.00     1.00       390

FIGURE 10: Confusion Matrix of Proposed RF Model

FIGURE 11: Confusion Matrix of Proposed 1D-CNN Model

The proposed research notes that both the machine learning and deep learning methodologies demonstrate instances of misclassification, specifically for the "calm" and "surprise" emotional categories. The observed misclassification pattern aligns with the imbalanced distribution of the emotional classes in the dataset, which can make it difficult for the models to recognize these emotions effectively and precisely. The results are consistent with the tabulated data, demonstrating that the precision, recall, and F1-score metrics for the "calm" and "surprise" categories exhibit a slight decrease compared to the remaining emotional categories. Recognizing the impact of class imbalance on the models' accuracy is crucial, as it can introduce biases and reduce the overall precision of emotion recognition.

A. DISCUSSION
This research focuses on improving SER by creating cross-accent emotion recognition that reduces cultural and geographical disparities. To this end, four large datasets reflective of diverse English regions are amalgamated to form a multi-corpus dataset. From the multi-corpus dataset, nine speech features encompassing MFCC, ZCR, spectral features, chroma, and others are meticulously extracted. These features serve as the foundation for training seven conventional machine learning models, including RF, DT, LR, and others, and four deep learning models, namely LSTM, LSTM + 1D-CNN, Bi-LSTM + 1D-CNN, and 1D-CNN. From the results obtained, implementing these frameworks provides a substantial leap in emotion recognition performance, with the proposed machine learning models, especially Random Forest, achieving 76% classification accuracy. In comparison, among the proposed deep learning models, the 1D-CNN model with stratified cross-validation reaches 99% testing accuracy. The precision of the 1D-CNN model proves its competency in capturing the intricate speech characteristics associated with the different emotion classes. The cornerstone of this research lies in a careful feature selection process that ultimately enhances the signal-to-noise ratio and prediction quality [68]. The RF model exhibits notable improvement in emotion classification by focusing on the most informative features from the list of all extracted speech features. Meanwhile, the 1D-CNN model shows exceptional performance in attaining high classification accuracy, setting new benchmarks in the field.
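As a concrete illustration of the feature pipeline summarized above, the sketch below extracts the nine feature families with librosa and mean-pools each over time. The frame settings, 20 MFCCs, and the 50-500 Hz pitch range are illustrative choices rather than values reported by the authors; incidentally, with these defaults the pooled vector has 172 dimensions, matching the feature count mentioned in the text, although the authors' exact configuration may differ.

```python
# Extract the nine speech feature families named above and pool them over time.
import numpy as np
import librosa

def extract_features(path, sr=22050):
    y, sr = librosa.load(path, sr=sr)
    feats = [
        librosa.feature.zero_crossing_rate(y),               # ZCR
        librosa.feature.rms(y=y),                            # RMS energy
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),         # MFCCs
        librosa.feature.chroma_stft(y=y, sr=sr),             # chroma-stft
        librosa.feature.melspectrogram(y=y, sr=sr),          # Mel spectrum
        librosa.feature.spectral_centroid(y=y, sr=sr),       # spectral centroid
        librosa.feature.spectral_contrast(y=y, sr=sr),       # spectral contrast
        librosa.feature.spectral_rolloff(y=y, sr=sr),        # spectral roll-off
        librosa.yin(y, fmin=50, fmax=500)[np.newaxis, :],    # pitch (F0) track
    ]
    # Mean-pool each feature over frames and concatenate into one vector per clip.
    return np.concatenate([f.mean(axis=1) for f in feats])
```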
The comprehensive benchmark studies in Table 9, with their respective feature subsets, datasets, and accuracy scores, provide a snapshot of the broader research landscape in SER. Several multi-feature speech emotion recognition systems have been proposed using distinct machine learning models such as the voting classifier [19], [61], the attention-based multi-learning model (ABMD) [23], the 1D-CNN [26], and MViTv2 [60]. However, these multi-feature emotion recognition systems target the accent of a particular region. To bridge this gap, a multi-cultural accent emotion recognition system is required that recognizes cross-accent emotions across diverse cultures. This challenging task requires an intelligent framework trained on an inclusive dataset of emotional speech spanning multiple cultural accents. Efforts have been made to cope with this challenge; accordingly, solutions based on multi-cultural feature datasets have been proposed [57]-[59].
TABLE 9: Performance Evaluation of Proposed Scheme with Existing Schemes

Author (Year) | Techniques | Features | Database | Accuracy (%)
Shah et al. [17] (2023) | Random Forest | MFCC | RAVDESS + TESS + SAVEE | 86.3
Waleed Alsabhan [19] (2023) | 2D-CNN, (CNN + LSTM + Attention) | ZCR + RMSE + MFCC | SAVEE | 51.30, 97.13
Kakuba et al. [23] (2022) | ABMD | MFCC + Mel spectrograms + Chromagrams | RAVDESS, SAVEE | 85.89, 93.75
Alluhaidan et al. [26] (2023) | 1D-CNN | MFCC + MFCCTs | EMO-DB, SAVEE, RAVDESS | 96.6, 92.6, 91.4
Aayushi et al. [29] (2023) | CNN | MFCC | RAVDESS + SAVEE | 89.87
Mohanty et al. [30] (2023) | D-CNN | Spectral + prosodic | RAVDESS + SAVEE + TESS + CREMA-D | 90.27
Nasim et al. [33] (2021) | Gradient Boosting | MFCC + Chroma + Mel-spectrogram | RAVDESS + TESS | 84.96
Ahmed et al. [35] (2023) | GRU + LSTM + 1D-CNN | MFCC + Log-Mel spectrum + ZCR + Chromagram + RMS | TESS, RAVDESS, SAVEE, CREMA-D | 99.46, 95.62, 93.22, 90.47
Li et al. [57] (2023) | MDSA | Mel-spectrogram + MFCC + Jitter + Shimmer + F0 | Berlin + IEMOCAP + CVE + EMOVO + TESS | 64.58
Kexin et al. [58] (2023) | TEDFSL, CNN + Bi-LSTM, DNN | Mel-spectrogram + MFCC + Jitter + Shimmer + F0 | IEMOCAP + YouTube + MOUD | approx. 91.31
Latif et al. [59] (2022) | ARRi, sARDi | Mel Filter Banks (MFBs) | IEMOCAP + MSP-IMPROV + RECOLA + EMODB, FAU-AIBO + LibriSpeech | 48.8, 50.6
Ong et al. [60] (2023) | MViTv2 | Mel-spectrogram + Mel-STFT | RAVDESS | 81.75
Zhang et al. [61] (2023) | Voting deep learning | Pitch + Energy + ZCR + Short-time energy + RMS + MFCC + Mel + Chroma + Formants + Jitter + Shimmer + Spectral (contrast, centroid, flatness, amplitude) | RAVDESS | 82.3
Proposed Framework (Machine Learning Models) | LR, DT, RF, SVC, KNN, GNB, Ensemble | MFCC + ZCR + Chroma-STFT + RMS + Mel-spectrogram + Spectral (Centroid + Contrast + Roll-off) + Pitch | TESS + SAVEE + CREMA-D + RAVDESS | 34.0, 61.0, 76.0, 37.0, 43.0, 22.0, 60.0
Proposed Framework (Deep Learning Models) | LSTM, 1D-CNN + LSTM, 1D-CNN + Bi-LSTM, 1D-CNN, 1D-CNN stratified cross-validation | MFCC + ZCR + Chroma-STFT + RMS + Mel-spectrogram + Spectral (Centroid + Contrast + Roll-off) + Pitch | TESS + SAVEE + CREMA-D + RAVDESS | 15.7, 57.7, 62.03, 64.11, 99.0

In these solutions, the accuracy reaches up to 91.31%, leaving a significant margin for improvement.

Furthermore, Shah et al. [17] and Aayushi et al. [29] integrate multiple datasets, thereby achieving accuracies of up to 86.3% and 89.87%. However, these systems are limited to a single speech feature, namely MFCC. Mohanty et al. [30] integrate four datasets and employ a D-CNN model to raise the accuracy to 90.27%, underscoring the efficacy of their model selection and optimization techniques. Nasim et al. [33] integrate two datasets and employ gradient boosting, which raises the classification accuracy to 84.96%. To address this challenge, we propose a multi-cultural, cross-accent emotion recognition system that combines a rich set of speech features with both conventional machine learning and deep learning frameworks. The proposed scheme shows improvements of 14.71%, 10.15%, 9.6%, and 16.52% compared to the conventional schemes [17], [29], [30], [33].
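These margins appear to be the relative gains of the 99% cross-validated accuracy over the baseline accuracies listed in Table 9; the quick check below reproduces them up to rounding. This reading is an interpretation, since the formula is not spelled out in the text.

```python
# Relative improvement of the proposed 99% accuracy over each baseline accuracy.
baselines = {"[17]": 86.3, "[29]": 89.87, "[30]": 90.27, "[33]": 84.96}
for ref, acc in baselines.items():
    print(ref, round((99.0 - acc) / acc * 100, 2))   # roughly 14.7, 10.2, 9.7, and 16.5 percent
```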
Overall, the comparison with previous SER studies and the detailed analysis demonstrate the efficiency of the proposed approach in terms of accuracy and comprehensiveness. The proposed research thus illustrates a transformative advancement in real-time emotion recognition compared to the benchmark studies.

In summary, deep learning concepts can process large amounts of data quantitatively and qualitatively to improve the accuracy of spoken-language sentiment analysis. Careful feature selection and stratified cross-validation allow the proposed model to outperform traditional methods in terms of recognition accuracy.

VI. CONCLUSION
This research develops a system that accurately identifies and analyzes speech-based emotions across various cultural and linguistic backgrounds using speech datasets encompassing a broad spectrum of accents, ensuring a comprehensive approach to cross-accent emotion recognition. Recognizing emotions in conversations from speakers of different languages presents unique challenges. The recommended approach involves collecting well-annotated speech datasets, extracting a wide range of speech features, applying data augmentation techniques, and employing advanced machine learning and deep learning classifiers.
The results underscore the importance of feature selection in boosting the performance of machine learning algorithms in speech emotion recognition. Due to their distinct acoustic characteristics, some accents pose challenges in emotion detection, highlighting the complexity of cross-accent analysis. Likewise, the system's performance varies when analyzing different accents, showcasing its strengths and pinpointing areas for improvement in a cross-accent setting. The RF model stands out as the most accurate conventional model, achieving a notable 76% accuracy on the test data. This model performs consistently across various emotions, illustrating its effectiveness, adaptability, and reliability, especially with imbalanced datasets. The deep learning models, such as the combination of 1D-CNN and LSTM and the 1D-CNN + Bi-LSTM, tap into the strengths of both structures to attain higher accuracy rates. The 1D-CNN model, when paired with stratified cross-validation, addresses class imbalances and significantly improves the loss and accuracy metrics, achieving a remarkable validation loss of 0.032 and an accuracy of 99%. Nevertheless, misclassification patterns are noticed, emphasizing the need for further research and for addressing potential biases arising from imbalanced emotion class datasets. This system has promising applications in human-computer interaction, mental health care, virtual assistants, and E-Learning. In the future, the recommendation is to examine diverse datasets to validate the model's efficacy in varied contexts. By incorporating more varied and balanced emotion datasets and through ongoing research in real-time emotion classification, this system's future applications could achieve exceptional accuracy in predicting the emotions of English speakers, regardless of their accent or dialect. Moreover, extracting additional features such as the fundamental frequency, Linear Predictive Coding (LPC) coefficients, and tonal features can further enhance these aspects. This research is a foundation for future studies, setting a precedent for further advancements in the SER field.

ACKNOWLEDGMENT
The authors would like to thank Prince Sultan University for paying the Article Processing Charges (APC) of this publication. They would also like to thank Prince Sultan University for their support.

REFERENCES
[1] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J.G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32–80, 2001.
[2] R.W. Picard, E. Vyzas, and J. Healey, "Toward machine emotional intelligence: analysis of affective physiological state," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 10, pp. 1175–1191, Oct. 2001, doi: 10.1109/34.954607.
[3] M.D. Pell and S.A. Kotz, "On the time course of vocal emotion recognition," PLoS One, vol. 6, no. 11, p. e27256, Nov. 2011.
[4] M. El Ayadi, M.S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.
[5] H. Yan, M.H. Ang, and A.N. Poo, "A survey on perception methods for human–robot interaction in social robots," International Journal of Social Robotics, vol. 6, pp. 85–119, 2014.
[6] S. Madanian, D. Parry, O. Adeleye, C. Poellabauer, F. Mirza, S. Mathew, and S. Schneider, "Automatic Speech Emotion Recognition Using Machine Learning: Digital Transformation of Mental Health," 2022.
[7] A.G. Harvey, E. Watkins, and W. Mansell, "Cognitive behavioural processes across psychological disorders: A transdiagnostic approach to research and treatment," Oxford University Press, USA, 2004.
[8] A. Grünerbl, A. Muaremi, V. Osmani, G. Bahle, S. Oehler, G. Tröster, O. Mayora, C. Haring, and P. Lukowicz, "Smartphone-based recognition of states and state changes in bipolar disorder patients," IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 1, pp. 140–148, 2014.
[9] M. Bojanić, V. Delić, and A. Karpov, "Call redistribution for a call center based on speech emotion recognition," Applied Sciences, vol. 10, no. 13, p. 4653, 2020.
[10] X. Li and R. Lin, "Speech emotion recognition for power customer service," in 2021 7th International Conference on Computer and Communications (ICCC), pp. 514–518, 2021.
[11] D. Tanko, S. Dogan, F.B. Demir, M. Baygin, S.E. Sahin, and T. Tuncer, "Shoelace pattern-based speech emotion recognition of the lecturers in distance education: ShoePat23," Applied Acoustics, vol. 190, p. 108637, 2022.
[12] T. Zhang, M. Hasegawa-Johnson, and S.E. Levinson, "Children's emotion recognition in an intelligent tutoring scenario," in Proc. Eighth European Conf. Speech Comm. and Technology (INTERSPEECH), 2004.
[13] R. AlSufayan and D.A. El-Dakhs, "Achievement Emotions in Paper-Based Exams vs. Computer-Based Exams: The Case of a Private Saudi University," International Journal of Online Pedagogy and Course Design (IJOPCD), vol. 13, no. 1, pp. 1–21, 2023, doi: 10.4018/IJOPCD.322084.
[14] P. Vasuki and C. Aravindan, "Hierarchical classifier design for speech emotion recognition in the mixed-cultural environment," Journal of Experimental & Theoretical Artificial Intelligence, vol. 33, no. 3, pp. 451–466, 2021.
[15] A. Wierzbicka, "Emotions across languages and cultures: Diversity and universals," Cambridge University Press, 1999.
[16] Z. Li, L. He, J. Li, L. Wang, and W.-Q. Zhang, "Towards Discriminative Representations and Unbiased Predictions: Class-Specific Angular Softmax for Speech Emotion Recognition," in INTERSPEECH, pp. 1696–1700, 2019.
[17] N. Shah, K. Sood, and J. Arora, "Speech emotion recognition for psychotherapy: an analysis of traditional machine learning and deep learning techniques," in 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0718–0723, 2023.
[18] L.-M. Zhang, Y. Li, Y.-T. Zhang, G.W. Ng, Y.-B. Leau, and H. Yan, "A Deep Learning Method Using Gender-Specific Features for Emotion Recognition," Sensors, vol. 23, no. 3, p. 1355, 2023.
[19] W. Alsabhan, "Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention," Sensors, vol. 23, no. 3, p. 1386, 2023.
[20] A. Muraleedharan and M. Garcia-Constantino, "Domestic Violence Detection Using Smart Microphones," in International Conference on Ubiquitous Computing and Ambient Intelligence, pp. 357–368, 2022.
[21] K. Jain, A. Chaturvedi, J. Dua, and R.K. Bhukya, "Investigation Using MLP-SVM-PCA Classifiers on Speech Emotion Recognition," in 2022 IEEE 9th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON), pp. 1–6, 2022.
[22] A. Agrima, A. Barakat, I. Mounir, A. Farchi, L. ElMazouzi, and B. Mounir, "Speech Emotion Recognition Using Energies in six bands and Multilayer Perceptron on RAVDESS Dataset," in 2022 5th International Conference on Advanced Communication Technologies and Networking (CommNet), pp. 1–5, 2022.
[23] S. Kakuba, A. Poulose, and D.S. Han, "Attention-based multi-learning approach for speech emotion recognition with dilated convolution," IEEE Access, vol. 10, pp. 122302–122313, 2022.
[24] A. Ochi and X. Kang, "Learning a Parallel Network for Emotion Recognition Based on Small Training Data," in 2022 8th International Conference on Systems and Informatics (ICSAI), pp. 1–5, 2022.
[25] R.R. Paul, S.K. Paul, and M.E. Hamid, "A 2D Convolution Neural Network Based Method for Human Emotion Classification from Speech Signal," in 2022 25th International Conference on Computer and Information Technology (ICCIT), pp. 72–77, 2022.
[26] A.S. Alluhaidan, O. Saidani, R. Jahangir, M.A. Nauman, and O.S. Neffati, "Speech Emotion Recognition through Hybrid Features and Convolutional Neural Network," Applied Sciences, vol. 13, no. 8, p. 4750, 2023.
[27] L. Huang and X. Shen, "Research on Speech Emotion Recognition Based on the Fractional Fourier Transform," Electronics, vol. 11, no. 20, p. 3393, 2022.
[28] J.L. Bautista, Y.K. Lee, and H.S. Shin, "Speech emotion recognition based on parallel CNN-attention networks with multi-fold data augmentation," Electronics, vol. 11, no. 23, p. 3935, 2022.
[29] A. Chaudhari, C. Bhatt, T.T. Nguyen, N. Patel, K. Chavda, and K. Sarda, "Emotion Recognition System via Facial Expressions and Speech Using Machine Learning and Deep Learning Techniques," SN Computer Science, vol. 4, no. 4, p. 363, 2023.
[30] A. Mohanty, R.C. Cherukuri, and A.R. Prusty, "Improvement of Speech Emotion Recognition by Deep Convolutional Neural Network and Speech Features," in Congress on Intelligent Systems, pp. 117–129, 2022.
[31] R. Chatterjee, S. Mazumdar, R.S. Sherratt, R. Halder, T. Maitra, and D. Giri, "Real-time speech emotion analysis for smart home assistants," IEEE Transactions on Consumer Electronics, vol. 67, no. 1, pp. 68–76, 2021.
[32] A. Al-Laith and M. Alenezi, "Monitoring People's Emotions and Symptoms from Arabic Tweets during the COVID-19 Pandemic," Information, vol. 12, no. 2, art. no. 86, 2021, doi: 10.3390/info12020086.
[33] A.S. Nasim, R.H. Chowdory, A. Dey, and A. Das, "Recognizing Speech Emotion Based on Acoustic Features Using Machine Learning," in 2021 International Conference on Advanced Computer Science and Information Systems (ICACSIS), pp. 1–7, 2021.
[34] P. Tiwari and A.D. Darji, "A novel S-LDA features for automatic emotion recognition from speech using 1-D CNN," International Journal of Mathematical, Engineering and Management Sciences, vol. 7, no. 1, p. 49, 2022.
[35] M.R. Ahmed, S. Islam, A.K.M. Islam, and S. Shatabda, "An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition," Expert Systems with Applications, vol. 218, p. 119633, 2023.
[36] Y.-C. Kao, C.-T. Li, T.-C. Tai, and J.-C. Wang, "Emotional speech analysis based on convolutional neural networks," in 2021 9th International Conference on Orange Technology (ICOT), pp. 1–4, 2021.
[37] S. Alghowinem, R. Goecke, M. Wagner, and A. Alwabil, "Evaluating and Validating Emotion Elicitation Using English and Arabic Movie Clips on a Saudi Sample," Sensors, vol. 19, no. 10, p. 2218, May 2019, doi: 10.3390/s19102218.
[38] M. Swain, A. Routray, and P. Kabisatpathy, "Databases, features and classifiers for speech emotion recognition: a review," International Journal of Speech Technology, vol. 21, pp. 93–120, 2018.
[39] R.A. Khalil, E. Jones, M.I. Babar, T. Jan, M.H. Zafar, and T. Alhussain, "Speech emotion recognition using deep learning techniques: A review," IEEE Access, vol. 7, pp. 117327–117345, 2019.
[40] B.J. Abbaschian, D. Sierra-Sosa, and A. Elmaghraby, "Deep learning techniques for speech emotion recognition, from databases to models," Sensors, vol. 21, no. 4, p. 1249, 2021.
[41] Toronto Emotional Speech Set (TESS), https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess.
[42] CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset, https://www.kaggle.com/datasets/ejlok1/cremad, Jun. 2018.
[43] Surrey Audiovisual Expressed Emotion (SAVEE), Kaggle, https://www.kaggle.com/datasets/ejlok1/surrey-audiovisual-expressed-emotion-savee.
[44] RAVDESS Emotional Speech Audio, https://www.kaggle.com/datasets/uwrfkaggler/ravdess-emotional-speech-audio.
[45] P. Sandhya, V. Spoorthy, S.G. Koolagudi, and N.V. Sobhana, "Spectral features for emotional speaker recognition," in 2020 Third International Conference on Advances in Electronics, Computers and Communications (ICAECC), pp. 1–6, 2020.
[46] P. Burk, L. Polansky, D. Repetto, M. Roberts, and D. Rockmore, "Music and computers: a theoretical and historical approach," Preface to the Archival Version (Spring, 2011), 2011.
[47] L. Malmqvist, "RapiCSF-A fast test of spectral contrast," 2013.
[48] T. Hastie, R. Tibshirani, and J.H. Friedman, "The elements of statistical learning: data mining, inference, and prediction," vol. 2, 2009.
[49] J.R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, pp. 81–106, 1986.
[50] L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5–32, 2001.
[51] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273–297, 1995.
[52] I. Rish, "An empirical study of the naive Bayes classifier," in IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, no. 22, pp. 41–46, 2001.
[53] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[54] T.G. Dietterich, "Ensemble methods in machine learning," in International Workshop on Multiple Classifier Systems, pp. 1–15, 2000.
[55] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[56] Y. Xu, H. Su, G. Ma, and X. Liu, "A novel dual-modal emotion recognition algorithm with fusing hybrid features of audio signal and speech context," Complex & Intelligent Systems, vol. 9, no. 1, pp. 951–963, 2023.
[57] S. Li, P. Song, and W. Zheng, "Multi-Source Discriminant Subspace Alignment for Cross-Domain Speech Emotion Recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2448–2460, 2023, doi: 10.1109/TASLP.2023.3288415.
[58] Z. Kexin and L. Yunxiang, "Speech Emotion Recognition Based on Transfer Emotion-Discriminative Features Subspace Learning," IEEE Access, vol. 11, pp. 56336–56343, 2023, doi: 10.1109/ACCESS.2023.3282982.
[59] S. Latif, R. Rana, S. Khalifa, R. Jurdak, and B. Schuller, "Self Supervised Adversarial Domain Adaptation for Cross-Corpus and Cross-Language Speech Emotion Recognition," IEEE Transactions on Affective Computing, vol. 14, no. 3, pp. 1912–1926, 2023, doi: 10.1109/TAFFC.2022.3167013.
[60] K.L. Ong, C.P. Lee, H.S. Lim, K.M. Lim, and A. Alqahtani, "Mel-MViTv2: Enhanced Speech Emotion Recognition With Mel Spectrogram and Improved Multiscale Vision Transformers," IEEE Access, vol. 11, pp. 108571–108579, 2023, doi: 10.1109/ACCESS.2023.3321122.
[61] L.-M. Zhang, G.W. Ng, Y.-B. Leau, and H. Yan, "A Parallel-Model Speech Emotion Recognition Network Based on Feature Clustering," IEEE Access, vol. 11, pp. 71224–71234, 2023, doi: 10.1109/ACCESS.2023.3294274.
[62] J. Liu, X. Wu, and X. Wu, "Prototype of educational affective arousal evaluation system based on facial and speech emotion recognition," International Journal of Information and Education Technology, vol. 9, no. 9, pp. 645–651, 2019.
[63] X. Huahu, G. Jue, and Y. Jian, "Application of Speech Emotion Recognition in Intelligent Household Robot," in 2010 International Conference on Artificial Intelligence and Computational Intelligence, vol. 1, pp. 537–541, 2010, doi: 10.1109/AICI.2010.118.
[64] S. Hamsa, I. Shahin, Y. Iraqi, and N. Werghi, "Emotion Recognition From Speech Using Wavelet Packet Transform Cochlear Filter Bank and Random Forest Classifier," IEEE Access, vol. 8, pp. 96994–97006, 2020, doi: 10.1109/ACCESS.2020.2991811.
[65] Md. Zia Uddin and E.G. Nilsson, "Emotion recognition using speech and neural structured learning to facilitate edge intelligence," Engineering Applications of Artificial Intelligence, vol. 94, art. no. 103775, 2020, doi: 10.1016/j.engappai.2020.103775.
[66] P. Hajek and M. Munk, "Speech emotion recognition and text sentiment analysis for financial distress prediction," Neural Computing and Applications, vol. 35, no. 29, pp. 21463–21477, Mar. 2023, doi: 10.1007/s00521-023-08470-8.
[67] H.-C. Li, T. Pan, M.-H. Lee, and H.-W. Chiu, "Make Patient Consultation Warmer: A Clinical Application for Speech Emotion Recognition," Applied Sciences, vol. 11, no. 11, p. 4782, May 2021, doi: 10.3390/app11114782.
[68] L. Mentch and S. Zhou, "Randomization as regularization: A degrees of freedom explanation for random forest success," Journal of Machine Learning Research, vol. 21, no. 1, pp. 6918–6953, 2020.
RAHEEL AHMAD received the Bachelor of Science (B.S.) degree in Electrical Engineering from the University of Sargodha, Pakistan, in 2017. After completing his bachelor's degree, he set out to enhance his knowledge and skills in Artificial Intelligence. He is currently pursuing the Master of Science (M.S.) degree in Artificial Intelligence at the Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology (PAF-IAST), Haripur, Khyber Pakhtunkhwa (KPK), Pakistan. In addition to his academic endeavors, he has been actively involved in implementing Artificial Intelligence and Machine Learning. He serves as an independent Machine Learning tutor, sharing knowledge and expertise with aspiring students and professionals in the field. His devotion to education is demonstrated by developing artificial intelligence (AI) solutions designed for academic and industrial contexts, closing the divide between theoretical knowledge and practical implementation.

ARSHAD IQBAL (Member, IEEE) received the B.S. degree in electrical and computer engineering from COMSATS (CIIT), Abbottabad, Pakistan, in 2013, and the M.S. and Ph.D. degrees in electrical and computer engineering from Sungkyunkwan University, Suwon, South Korea, in 2020. Since 2021, he has been an Assistant Professor with the Sino-Pak Center for Artificial Intelligence (SPCAI), Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology (PAF-IAST), Haripur, Pakistan. His research interests include medium access control, resource allocation, the Internet of Things, applied artificial intelligence, WLAN, sensor networks, energy harvesting networks, backscatter communication networks, power saving, distributed communication networks, and next generation communication networks. He was a recipient of the fully funded ICT research and development scholarship for undergraduates by the Ministry of Information Technology (IT), Pakistan. He was also a recipient of the HEC scholarship under the Human Resource Development (HRD) initiative, M.S. leading to Ph.D. program of faculty development for UESTPs, Phase-1 Batch-IV.

MUHAMMAD MOHSIN JADOON (Member, IEEE) received the B.S. degree in electronic engineering from COMSATS University Islamabad, Islamabad, Pakistan, in 2007, the M.S. degree in electronic engineering from International Islamic University Islamabad (IIUI), Islamabad, in 2011, and the Ph.D. degree through a split-degree program, i.e., course work from IIUI and research from Queen Mary University of London, U.K., in 2018. He is currently a Postdoctoral Research Fellow with the Department of Radiology and Imaging Processing, Yale University, New Haven, CT, USA. He is also a Lecturer with the Department of Electrical Engineering, IIUI. His research interests include signal processing, sensors, and biomedical imaging.
NAVEED AHMAD received the B.S. degree in computer science from the University of Peshawar, Pakistan, in 2007, and the Ph.D. degree in computer science from the University of Surrey, U.K., in 2013. He is currently working as an Associate Professor with the College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia. His research interests include security and privacy in emerging networks, such as VANETs, DTN, and the Internet of Things (IoT), as well as Machine Learning and Big Data.

YASIR JAVED is a highly qualified data scientist and senior programmer/developer with over 18 years' experience in research, security programming, software development, project management, and analytics. His research interests include data analytics, forensics, smart cities, network security, education sustainability, instructional development, learning, robotics, unmanned aerial vehicles, vehicular platoons, secure software development, signal processing, IoT analytics, intelligent applications, and predictive computing inspired by artificial intelligence. He holds a Ph.D. degree, with an outstanding Ph.D. student award from UNIMAS, Sarawak. In addition, he was awarded a rector's medal for his M.S. degree as well as a Distinguished Teaching Award from the President. Listed for the Top Researcher Award at PSU in recognition of his research contributions, he has published over 100 peer-reviewed articles in top-tier journals, conference proceedings, and book chapters. He also serves as a reviewer for several journals. Regarding his professional experience, he has undertaken a variety of national and international research funding projects and has served as an analyst programmer at the Prince Megren Data Center, the Center of Excellence, and the Research and Initiative Center at Prince Sultan University. He serves as Chair of the ACM Professional Chapter in KSA and is an active member of the RIOTU group at Prince Sultan University.