XEmoAccent: Embracing Diversity in Cross-Accent Emotion Recognition using Deep Learning
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3376379
ABSTRACT Speech is a powerful means of expressing thoughts, emotions, and perspectives. However,
accurately determining the emotions conveyed through speech remains a challenging task. Existing manual
methods for analyzing speech to recognize emotions are prone to errors, limiting our understanding and
response to individuals’ emotional states. To address diverse accents, an automated system capable of real-
time emotion prediction from human speech is needed. This paper introduces a speech emotion recognition
(SER) system that leverages supervised learning techniques to tackle cross-accent diversity. Distinctively, the
system extracts a comprehensive set of nine speech features, namely Zero Crossing Rate, Mel Spectrum, Pitch, Root
Mean Square values, Mel Frequency Cepstral Coefficients, chroma-stft, and three spectral features (Centroid,
Contrast, and Roll-off), for refined speech signal processing and recognition. Seven machine learning models
are employed (Random Forest, Logistic Regression, Decision Tree, Support Vector Machines,
Gaussian Naive Bayes, K-Nearest Neighbors, and an ensemble voting classifier), along with four individual and hybrid deep learning
models, including Long Short-Term Memory (LSTM) and a 1-Dimensional Convolutional Neural Network
(1D-CNN) with stratified cross-validation. Audio samples from diverse English regions are combined to
train the models. The performance evaluation results of conventional machine learning and deep learning
models indicate that the Random Forest-based feature selection model achieves the highest accuracy of up to
76% among the conventional machine learning models. Simultaneously, the 1D-CNN model with stratified
cross-validation reaches up to 99% accuracy. The proposed framework enhances the cross-accent emotion
recognition accuracy up to 86.3%, 89.87%, 90.27%, and 84.96% by margins of 14.71%, 10.15%, 9.6%, and
16.52%, respectively.
INDEX TERMS Machine learning, Deep learning, Speech Emotion Recognition (SER), Random Forest
(RF), Logistic Regression (LR), Decision Tree (DT), Support Vector Machines (SVM), K-Nearest Neighbors
(KNN), 1-Dimensional Convolutional Neural Network (1D-CNN)
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
Ahmad et al.: XEmoAccent: Embracing Diversity in Cross-Accent Emotion Recognition using Deep Learning
language biases and enhancing model generalization irrespective of cultural inclination.
• An enhanced approach for cross-accent speech recognition is introduced. It involves extracting up to nine speech features and eight emotion features from the cross-accent speech emotion dataset.
• A deep learning framework is presented to learn the temporal and spectral dependencies, enabling discernment of deep voice patterns. Thus, it enhances the performance of the cross-accent speech recognition framework.

The subsequent sections of the proposed work are structured in the following manner. Section 2 summarizes the existing literature on SER, covering feature extraction and classification methods. Section 3 discusses the background for datasets, models, and features. The fourth section discusses the proposed methodology, encompassing in-depth feature extraction, model classification, evaluation, and discussion. The experimental outcomes and subsequent analysis are presented in Section 5, while Section 6 concludes the paper.

II. RELATED WORKS
Several research studies have been conducted on extracting emotions from speech data. In conventional machine learning methodologies, the SER system comprises two fundamental procedures: feature extraction and multi-class emotion classification. This section examines current methodologies that utilize different datasets for real-time SER, along with the feature extraction techniques documented in the scholarly literature.

Alluhaidan et al. [26] aim to enhance SER using hybrid features in conjunction with Convolutional Neural Networks (CNNs). CNNs are advanced deep-learning models that excel in voice interpretation. They use automatic and adaptive learning to extract spatial hierarchies of characteristics from raw input data. The convolutional layer is a key component of CNNs, applying filters that scan the input data and perform convolution, which helps identify patterns such as edges and corners. The basic structure of the CNN model is shown in Figure 2: the max-pooling layer down-samples the input, reducing complexity and overfitting by selecting the highest value in each sliding window, while the dense layers, placed after the convolutional and pooling layers, perform classification on the extracted features. To classify seven emotional states, the authors deploy three popular datasets, namely the Emotion Database (EMO-DB), Surrey Audio-Visual Expressed Emotion (SAVEE), and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). They improve performance by extracting MFCCs and time-domain MFCC features (MFCCT) during feature extraction. Among the machine learning models tested, the 1D-CNN model outperformed the others, achieving an accuracy of 96.6% for EMO-DB, 92.6% for SAVEE, and 91.4% for RAVDESS. Their proposed strategy leads to an accuracy improvement of 10% for EMO-DB, 26% for SAVEE, and 21% for RAVDESS. They suggest comparatively examining SER methods grounded in deep learning strategies using different datasets.

FIGURE 2: Architecture of CNN. [Diagram: Input Layer, Convolution Layer, Max-Pooling Layer, Dense/FC Layers, Output Layer.]

Similarly, Huang et al. investigate SER using the fractional Fourier transform (FrFT) [27]. They extract the MFCC feature from the RAVDESS dataset with the FrFT. The optimal setting for the 'p' parameter of the FrFT algorithm, determined by the ambiguity function and the MFCC, is computed for each speech signal frame. They can detect all eight emotional states by deploying an LSTM network, and the results improve by up to 79.86% compared to the ordinary Fourier Transform (FT) method. Since they did not obtain satisfying findings for the neutral, happy, and calm emotions, they recommended improving these results in future work.

Automatic SER is highlighted in the study [28], which deploys parallel-based network training on the RAVDESS dataset. The researchers benchmarked various architectures, including standalone CNN designs (VGG-16, ResNet-50), attention-based networks (LSTM + Attention, Transformer), and hybrid architectures (time-distributed CNN + Transformer). Their proposed parallel networks, namely the CNN + Transformer and CNN + Bi-LSTM-Attention modules, aim to encapsulate spatial and temporal features. By converting raw audio into Mel-spectrograms and implementing data augmentation techniques, the parallel CNN-Transformer achieved an accuracy of 89.33%, while the CNN + Bi-LSTM-Attention network reached 85.67%. The Bi-LSTM-Attention module is a fusion of a Bi-directional Long Short-Term Memory (Bi-LSTM) network and Attention processes.

The Bi-LSTM is a variant of recurrent neural networks that effectively incorporates information from both past and future contexts by processing data in both directions. The Attention mechanism empowers the model to concentrate on particular aspects of the input, assigning varying degrees of significance to these parts while generating the output. The combined Bi-LSTM-Attention module can therefore handle data in both forward and backward directions while selectively concentrating on particular portions of the input. The block diagram of Bi-LSTM with the attention module is given in Figure 3. Overall results highlight the
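The convolution and max-pooling operations that Figure 2 summarizes can be sketched in a few lines of NumPy. The signal and kernel below are toy values chosen only to make the sliding-window mechanics visible; they are not a configuration used by any of the cited studies.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal and take a dot product at each position."""
    n = len(signal) - len(kernel) + 1
    return np.array([np.dot(signal[i:i + len(kernel)], kernel) for i in range(n)])

def max_pool1d(feature_map, pool_size=2):
    """Keep the highest value in each non-overlapping window (down-sampling)."""
    n = len(feature_map) // pool_size
    return np.array([feature_map[i * pool_size:(i + 1) * pool_size].max()
                     for i in range(n)])

signal = np.array([0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0])
kernel = np.array([1.0, -1.0])          # a simple difference ("edge") detector
fmap = conv1d_valid(signal, kernel)     # 7 convolution outputs
pooled = max_pool1d(fmap, pool_size=2)  # 3 pooled values
```

In a trained CNN, many such kernels are learned jointly, and the pooled feature maps are flattened into the dense classification layers.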
VOLUME 11, 2023 3
usefulness of parallel architectures for extracting emotional aspects from speech data, and they aid in creating effective and efficient SER models. As a future direction, they propose more complex data augmentation techniques using deep learning methods to enhance the prediction results.

Aayushi et al. present facial expressions and speech signals to implement emotion recognition systems for medical applications [29]. They operate the Gabor filter with SVM and CNN model architectures to retrieve features from images, and extract MFCC speech features from the speech signals. They used the RAVDESS, Toronto Emotional Speech Set (TESS), Crowd-sourced Emotional Multimodal Actors (CREMA-D), and SAVEE datasets for speech signals. Simultaneously, they select the Japanese Female Facial Expression (JAFFE) dataset, the Kaggle face expression recognition dataset, and Emotions in Context (EMOTIC) for image emotions. They achieve encouraging prediction results and expect to work toward enhanced sentiment classification encompassing audio and images using multimodal emotion identification.

FIGURE 3: Bi-LSTM with Attention Module. [Diagram: the input feeds the input, forget, and output gates and the memory cell of the Bi-LSTM, whose output passes through an attention-based module to the next layer.]

Mohanty et al. implement a Deep Convolutional Neural Network (D-CNN) for identifying emotions [30]. This method accurately categorizes seven different human sentiments by utilizing spoken language's spectral and prosodic characteristics. To examine the efficacy of their proposed method, the authors use a variety of datasets, namely RAVDESS, SAVEE, TESS, and CREMA-D. The findings demonstrate outstanding accuracy rates of 96.54%, 92.38%, 99.42%, and 87.90% for these datasets, respectively. Real-time testing is performed on combined datasets, and the results show an accuracy rate of 90.27%. The findings provide prospects for improving user experiences in interactive systems and contribute to advancing the field of emotion recognition.

The study [31] applies the 1D-CNN classifier model to the RAVDESS and TESS datasets and reaches accuracies of 90.48% and 95.79%, respectively. They introduce a model with enhanced emotion categorization outcomes that could be implemented in smart home assistants to identify a person's emotions.

Alenezi et al. [32] exploit a hybrid approach to monitor people's emotions from Arabic tweets, involving lexicon- and rule-based data labelling and Long Short-Term Memory (LSTM) neural network techniques, to comprehensively examine a dataset of 5.5 million Arabic tweets collected between January and August 2020. The rule-based technique demonstrates a commendable F1-score of 83% and is crucial in annotating a significant proportion of the data. Following this, the automatic annotation approach based on neural networks substantially increases the annotated dataset, improving emotion categorization accuracy by up to 5.9%. The meticulous methodology employed in this study not only establishes a solid basis for understanding the complexities of emotional dynamics in the context of a pandemic but also carries significant implications for enhancing the accuracy and flexibility of emotion detection systems in capturing emotional situations that occur in real-world scenarios.

In [33], various classifiers are applied to a merged dataset and to the RAVDESS and TESS datasets analyzed individually. A comparative performance analysis shows that Gradient Boosting performs better than the other classifiers when applied to the combined dataset, achieving up to 84.96% efficiency. Additionally, compared to the other classifiers, the MLP classifier achieves superior results across all three datasets.

Shifted Linear Discriminant Analysis (S-LDA) is proposed in paper [34] to derive dynamic attributes from static low-level variables like MFCC and pitch. These adjusted features go into a 1D-CNN to extract high-level features for automatic emotion recognition (AER). Three databases, SAVEE, eNTERFACE, and Berlin, assess the suggested methods' performance in a classification test. Their results demonstrate that the highest levels of accuracy for AER are obtained for eNTERFACE at 96.41%, Berlin at 99.59%, and SAVEE at 99.57%.

The findings presented in [35] acknowledge that former investigators fail to capture a voice signal's global, long-term context, since they only recover local hidden facets. Due to limited dataset availability and inadequate feature portrayals, they exhibit poor recognition performance. The authors propose ensemble approaches, combining the predictive performance of three model architectures, 1D-CNN, LSTM, and Gated Recurrent Unit (GRU), with a Fully Connected (FC) layer at the end to address contemporary problems. For each audio file in the TESS, EMO-DB, RAVDESS, SAVEE, and CREMA-D datasets, they extract both local and global features, including MFCC, Log-mel spectrum, ZCR, chromagram, and RMS value. The ensemble method achieves exceptional weighted average accuracy results of 99.46% for TESS, 95.42% for EMO-DB, 95.62% for RAVDESS, 93.22% for SAVEE, and 90.47% for CREMA-D, respectively.

The problem of distinguishing between positive and negative emotions in audio recordings is tackled using deep learning methods. This research [36] uses five publicly available emotion speech datasets to build a 1D-CNN model. The experimental outcomes prove the model's efficacy in identifying positive and negative emotional speech data. Dealing with ambiguous or noisy emotion labels in the datasets is just one feature extraction and classification challenge that needs to be cleared. The study demonstrates that the classification accuracy for voice emotion recognition increases when many models and datasets are used. These results show the potential
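The attention pooling that Figure 3 depicts, scoring each time step and forming a weighted sum of the recurrent hidden states, can be illustrated with a minimal NumPy sketch. The hidden-state matrix and query vector below are random stand-ins for what a trained Bi-LSTM and a learned attention parameter would produce.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(hidden_states, query):
    """Score each time step against a query vector, normalize the scores with
    softmax, and return the attention-weighted sum of the hidden states."""
    scores = hidden_states @ query           # one score per time step, shape (T,)
    weights = softmax(scores)                # attention weights, sum to 1
    context = weights @ hidden_states        # weighted sum, shape (d,)
    return context, weights

T, d = 5, 4
rng = np.random.default_rng(0)
h = rng.normal(size=(T, d))  # stand-in for Bi-LSTM outputs at T time steps
q = rng.normal(size=d)       # stand-in for a learned attention query
context, w = attention_pool(h, q)
```

The context vector, rather than only the final hidden state, is what the classification layer then consumes, which is why attention helps the model focus on emotionally salient frames.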
of applying deep learning methods to this study area and help advance the field.

Sharifa et al. [37] examine the impact of cultural acceptance on eliciting and recognizing emotions within an Arab culture. The researchers employed standardized English and introductory Arabic stimuli to elicit six universally recognized emotions and assess the physiological and behavioural reactions of a sample of 29 people. Notably, a clip's origin and language do not directly impact the elicitation of emotions; however, cultural acceptance is contingent upon religious and cultural ideals. The findings indicate that the multiclass Support Vector Machine (SVM) classifier consistently displays higher accuracy in recognizing emotions in Arabic clips (60% on average) than in English clips (52% on average). This suggests that using culturally relevant clips boosts the emotional response in the classification process.

Studies on SER systems between 2000 and 2017 are analyzed by [38] in 2018 from three points of view: the database, feature extraction, and classifiers. However, only classical machine learning methods are evaluated as potential classification tools, and the authors note that neural networks (NN) and deep learning approaches await investigation. The paper contains a substantial section on databases and feature extraction. One year later, Khalil et al. studied discrete methods in SER using deep learning [39]. The paper discusses several deep learning methods, such as auto-encoders, deep neural networks (DNNs), CNNs, and recurrent neural networks (RNNs), along with their benefits and drawbacks. Nevertheless, the research does not focus on readily available strategies to overcome these limitations. Similarly, a 2021 survey examines deep learning algorithms for SER with accessible datasets, followed by traditional machine learning techniques for SER. The authors provide a multi-faceted analysis of the differences and similarities between various practical neural network methods for voice emotion identification [40].

III. MACHINE LEARNING MODELS AND FEATURES COMPREHENSION
Machine learning models depend on comprehension of the features within datasets for their effective utilization. Features are used by the model to make informed decisions. Therefore, the selection of relevant features is crucial for machine learning models to learn the required patterns in the data. A well-curated dataset with a diversity of features is vital for effective and generalized model performance. Data scarcity that limits how well the features are represented can be mitigated by data augmentation techniques, which artificially expand the size of the dataset through data transformations.

A. MACHINE LEARNING MODELS
1) Logistic Regression (LR)
It is a supervised machine learning algorithm used for binary classification, aiming to predict the likelihood of an instance in the range of 0 to 1 [48]. For this purpose, the logistic (or sigmoid) function is used to model the association between the input attributes and the likelihood of belonging to a specific class. The sigmoid function converts the input to a number between 0 and 1 to calculate the likelihood that an instance belongs to the positive class. To extend it to multiple classes, SoftMax activation is used to adapt Logistic Regression for multiclass SER.

2) Decision Trees (DT)
A supervised learning model used for classification and regression analysis. In the case of classification, the Gini index and entropy are used to measure the impurity and disorder in the dataset, respectively, while the mean squared error metric is mostly used for decision tree regression. This model can effectively identify nonlinear relationships in data and is interpretable by nature; decision transparency depends on the Gini index, entropy, and squared error, and the model can deal with both numerical and categorical features [49]. A decision tree can help in emotion classification, where contributing features can be extracted by executing conditional rules.

3) RF
The RF algorithm is an ensemble learning technique integrating numerous decision trees to generate predictions [50]. The training of each decision tree involves a stochastic selection of a subset of the available data, followed by a computation of the final prediction using a voting or averaging procedure. The RF algorithm is extensively used in multi-class SER because it manages high-dimensional data, captures complex feature connections, and reduces over-fitting. Combining multiple decision trees augments the model's robustness and capacity for generalization.

4) SVM
SVM is a robust supervised learning method for high-dimensional feature spaces and complex decision boundaries. The method improves separation and generalization by maximizing the margin between classes [51]. SVMs are well suited for multiclass classification because of their resistance to noise and outliers.

5) Gaussian Naive Bayes (GNB)
Bayes' theorem [52] forms the basis of the GNB classifier, which assumes that attributes are conditionally independent given the class. It requires few computations and performs well on small datasets. Since it assumes feature independence, it can be used for either statistical or text-based approaches to SER. It works well when the class distribution follows a Gaussian distribution.

6) KNN
This method uses only the immediate proximity of data points to one another to find correlations between them [53]. In the
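A minimal scikit-learn sketch of the classical models described in this section, including the voting ensemble, is shown below. The synthetic data, hyperparameters, and 4-class setup are illustrative assumptions, not the paper's actual configuration.

```python
# Fit the classical models from Section III-A on synthetic multiclass data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),      # softmax over the 4 classes
    "DT": DecisionTreeClassifier(random_state=0), # Gini impurity by default
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "GNB": GaussianNB(),
    "KNN": KNeighborsClassifier(),
}
# Soft voting averages the per-class probabilities of all base models.
models["Voting"] = VotingClassifier([(k, v) for k, v in models.items()],
                                    voting="soft")
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

Swapping in the real 172-dimensional feature matrix would leave this structure unchanged; only the data loading differs.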
context of multiclass speech emotion classification, k-NN works well when comparable emotions cluster close together, allowing it to capture the local structure of the data effectively. It also works well when decision boundaries are fuzzy or nonlinear. By polling nearby neighbours, it is able to perform multiclass classification.

7) Ensemble Method (Voting classifier)
Due to their capacity to combine many base models and increase overall classification performance, ensemble approaches like the voting classifier are commonly utilized in multiclass speech emotion identification. The voting classifier takes the predictions from several models and uses the majority vote to determine the final classification. This reduces the impact of inaccuracies and biases in the individual models, resulting in more reliable classifications. Better predictions can be made with the help of many models working together to minimize bias and variance [54].

8) LSTM
The LSTM model analyzes sequential data in tasks such as sentiment analysis, machine translation, and speech emotion recognition. LSTM networks possess a distinctive structure comprising memory cells and gating mechanisms, allowing them to selectively retain and discard information across extended sequences. This characteristic makes them highly suitable for tasks that require capturing temporal dependencies and modeling sequential patterns. Within the domain of SER, the LSTM model accepts acoustic features in the form of sequences, such as MFCCs or spectrograms, and subsequently subjects them to a series of LSTM layers for processing. An LSTM layer comprises memory cells and input, output, and forget gates. The gates regulate the information flow and determine which information is retained, discarded, or transmitted. A SoftMax activation function enables the LSTM model to manage multiclass classification tasks by allocating probabilities to individual emotion classes [55].

9) Bi-LSTM
It is an advanced version of the LSTM model. Bi-LSTM differs from traditional LSTM models in that it simultaneously considers past and future data. It does this by using two LSTM layers: one goes through the input pattern from beginning to end, and the other goes through it backwards. By merging data collected from both sides, Bi-LSTM gets a complete picture of the input sequence, which makes it better at capturing dependencies. This two-way method makes it easier for the model to make precise forecasts and leads to better results in several sequential data applications [28].

10) 1D-CNN Network
It is a variation of the CNN developed expressly to manage one-dimensional sequential data, such as text or time series. It applies convolutional operations along the time axis of the input sequence to learn representations and extract features from the data. The convolutional process is carried out in a 1D-CNN by sliding a filter over the input sequence and calculating the dot product between the kernel and the local region of the input; this is known as the sliding-window method. Applying this operation at each position along the sequence yields the output feature map.

B. SPEECH FEATURES OVERVIEW
1) MFCC
The Mel-Frequency Cepstral Coefficients (MFCC) are a commonly used set of spectral features in emotional Speech Recognition (SR). These coefficients comprise a collection of values that convey relevant information regarding the configuration of the speech signal's spectrum. MFCC is applied in emotion classification, speaker identification, and SR systems. It can be represented as

Mel(f) = 2595 × log10(1 + f/700),    (1)

where f is the frequency in Hz, and Mel(f) is the frequency recognized by the human ear on the Mel scale [45].

2) ZCR
The ZCR measures the frequency of transitions between positive and negative values (or vice versa) within an audio signal. It provides insights into the temporal characteristics and the presence of high-frequency components in a speech signal. ZCR is applied in speech activity recognition and music information retrieval, and it produces a 1D array of retrieved data. The mathematical representation for ZCR is given below:

ZCR = (1/T) Σ_{t=1}^{T} |s(t) − s(t−1)|,    (2)

where T is the total number of signal frames, s(t) is the signal value at time t, and s(t−1) is the signal value at the previous time step t−1.

3) CHROMA-STFT
The chroma (STFT(t, f)) feature represents the distribution of pitch classes in the audio signal. It captures tonal information and helps identify the musical and emotional aspects of the speech signal. Chroma is applied in music information retrieval tasks such as chord recognition, melody extraction, and music genre classification. The mathematical expression for chroma-stft is

STFT(t, f) = Σ_n [x(n) · w(n − t) · e^(−j2πfn)],    (3)

where x(n) is the signal value at time n, w(n − t) is the windowing function value at time (n − t), and the exponential function creates complex sine signals at frequency f.
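Equations (1) and (2) are simple enough to verify numerically. The sketch below implements them exactly as printed (note that Eq. (2), as written, is the mean absolute first difference of the signal); in practice, Librosa's `librosa.feature.mfcc`, `librosa.feature.zero_crossing_rate`, and `librosa.feature.chroma_stft` compute framewise versions of these features.

```python
import numpy as np

def hz_to_mel(f):
    """Eq. (1): Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def zcr(s):
    """Eq. (2) as printed: (1/T) * sum over t of |s(t) - s(t-1)|,
    i.e. the average absolute difference between consecutive samples."""
    s = np.asarray(s, dtype=float)
    return np.abs(np.diff(s)).mean()

# A fully alternating signal flips sign at every step, giving the maximum value.
alternating = [1.0, -1.0, 1.0, -1.0]
mel_700 = hz_to_mel(700.0)  # 2595 * log10(2), by construction of Eq. (1)
rate = zcr(alternating)
```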
fields by giving access to a comprehensive collection of speech samples comprising various emotions [41].

2) CREMA-D
The audio-visual CREMA-D English-language dataset consists of 7442 audio and video recordings of male and female performers reciting 12 scripted sentences with various facial expressions and vocal inflections. It comprises a varied collection of six distinct sentiments: frustration, disgust, fear, happiness, neutrality, and sadness. Because it provides a comprehensive audio and video data collection, it is helpful for emotion identification and multi-modal analysis [42].

3) SAVEE
This dataset includes 480 audio files of female and male performers with various emotional states, such as neutral, pleased, sad, angry, afraid, disgusted, and surprised [43]. It is a helpful resource for studying emotion identification, particularly for examining emotions communicated through male voices.

4) RAVDESS
It is also an English-language dataset, with 1440 audio and video recordings of performers pronouncing scripted dialogue and singing songs with various facial expressions and vocal nuances. There are twenty-four actors (12 men and 12 women), encompassing a range of emotions, including peaceful, joyful, sad, furious, afraid, surprised, and disgusted. RAVDESS finds application in emotion recognition data collection and offers a large variety of emotional speech and song samples [44].

D. DATA AUGMENTATION TECHNIQUES
Data augmentation is used to strengthen and generalize speech datasets. It addresses imbalanced class problems and increases model prediction accuracy, precision, and recall.

1) Noise Injection
The addition of artificial noise to the dataset enlarges the data for learning features and does not drown out the natural speech signal. This strategy helps in reducing overfitting and boosts the model's capacity to handle noisy, practical speech input. It also helps reduce the time it takes to train the model.

2) Pitch Alteration
Pitch alteration simulates various voices and speech styles by introducing changes in the pitch contour. It captures various vocal expressions and increases the model's capacity to identify emotions across a broader range of pitches.

3) Time Stretching
Time stretching alters the temporal dimension of a speech signal, mimicking fluctuations in speech velocity and length. Incorporating time-related trends and dependencies enhances the model's ability to effectively capture and process speech data across different speaking rates, thereby increasing its robustness.

FIGURE 6: Proposed Cross-Accent Emotion Recognition Framework. [Diagram: eight cross-accent emotions (happy, sad, neutral, surprise, disgust, calm, fear, angry) drawn from TESS (e=1), SAVEE (e=2), CREMA-D (e=3), and RAVDESS (e=4) enter a signal-processing framework that extracts 172 features: ZCR (0-1), chroma-stft (2-12), MFCC (13-33), spectral centroid (34), spectral contrast (35-41), spectral roll-off (42), RMS (43), Mel-spectrogram (44-171), and pitch (172). The extracted features feed a machine learning framework with ensembling-based feature selection (RF, LR, DT, SVC, KNN, GNB, voting) and a deep learning framework (LSTM, LSTM + 1D-CNN, BiLSTM + 1D-CNN, and 1D-CNN with stratified cross-validation) for speech emotion classification.]

FIGURE 5: Count of Total Emotions

IV. PROPOSED ACCENT EMOTION RECOGNITION FRAMEWORK
The proposed framework targets four different stages for cross-accent emotion recognition, as depicted in Figure 4. First, it aims to develop an extensive database by merging diverse English datasets: TESS, CREMA-D, SAVEE, and RAVDESS. Then, it extracts the necessary speech characteristics, such as MFCC, ZCR, RMS, and others, from this extensive database using the Python-based Librosa library. The third stage employs conventional machine learning and deep learning models, with hyperparameter tuning, to suggest the most appropriate ones for multi-class emotion classification. The final, fourth stage focuses on the evaluation of the machine learning using possible evaluation metrics like accuracy, re-
8 VOLUME 11, 2023
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3376379
Ahmad et al.: XEmoAccent: Embracing Diversity in Cross-Accent Emotion Recognition using Deep Learning
The remaining feature dimensions are Spectral Contrast (7), Spectral roll-off (1), RMS (1), Mel-Spectrogram (128), and Pitch (1), giving a total of 172 extracted features.
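The index bookkeeping for the 172-dimensional feature vector can be sketched as below. The sizes for spectral contrast, roll-off, RMS, mel spectrogram, and pitch follow the counts above; the remaining sizes (1 ZCR value, 12 chroma bins, 20 MFCCs, 1 centroid) are assumptions chosen to be consistent with the stated total of 172:

```python
# One decomposition of the 172-dimensional feature vector consistent
# with the counts stated in the text. Each feature is assumed to be
# averaged over time frames before concatenation; the sizes marked
# "assumed" are not given explicitly in the paper.
FEATURE_DIMS = {
    "zcr": 1,                # mean zero crossing rate (assumed scalar)
    "chroma_stft": 12,       # assumed: one value per pitch class
    "mfcc": 20,              # assumed n_mfcc
    "spectral_centroid": 1,
    "spectral_contrast": 7,  # 6 sub-bands + 1, as counted above
    "spectral_rolloff": 1,
    "rms": 1,
    "mel_spectrogram": 128,  # n_mels = 128, as counted above
    "pitch": 1,
}

def feature_slices(dims):
    """Assign each feature a contiguous index range in the final vector."""
    slices, start = {}, 0
    for name, size in dims.items():
        slices[name] = (start, start + size)
        start += size
    return slices, start
```

This kind of mapping makes it possible to trace any index selected later by the feature-selection stage back to the speech feature it came from.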
C. MODEL TRAINING
The newly created dataset, with dimensions (36486, 172), is split into segments to train seven conventional machine learning models and four deep learning models. For the machine learning models, 95% of the data is allocated for training, while the remaining 5% is reserved for testing. This large training allocation aims to optimize model performance by enabling the capture of intricate patterns while mitigating overfitting risks. Similarly, for the deep learning models, 80% of the data is assigned for training and the remaining 20% is reserved for testing the trained models, as shown in Figure 6.

1) Conventional Machine Learning Models
Seven machine learning models, LR, DT, RF, SVM, KNN, GNB, and an ensemble, are employed for multiclass speech emotion classification. These models are first trained using default parameters, and performance is evaluated based on prediction accuracy, detailed in Table 5 for Case I. After this, the GNB, DT, and RF models undergo fine-tuning by changing hyperparameters, including max-depth, n-estimators, min-samples-split, and min-samples-leaf, to optimize predictive results. In cases where fine-tuning does not yield improved results, an RF-based feature selection technique is activated. This technique uses Gini impurity or information gain to assign an importance value between 0 and 1 to each feature, identifying features with strong correlations to the dependent variable for further analysis. The RF feature selection results are evaluated at various threshold levels, including 0.004, 0.005, 0.007, and 0.008, to select the most pertinent features from the initial 172. Table 5, Cases II, III, IV, and V, presents the testing accuracy of all machine learning models at these thresholds. The features selected at the final threshold of 0.008, which yields the highest prediction accuracy, contribute to the classification report in Table 6 and the confusion matrix in Figure 10.

2) Deep Learning Models
The proposed deep learning framework mainly employs four models: LSTM, LSTM + 1D-CNN, Bi-LSTM + 1D-CNN, and 1D-CNN with stratified cross-validation. In the deep learning phase, an initial approach uses a simple LSTM model, as illustrated in Figure 8. This model includes three LSTM layers with decreasing units, three batch normalization layers, and a dropout rate of 30%. L2 regularization with a coefficient of 0.001 is employed for each LSTM layer to enhance model generalization and reduce overfitting.

FIGURE 8: Visual Representation of LSTM Network with Optimizing Layers (BN = Batch Normalization)

TABLE 2: Proposed LSTM Architecture with Hyper-parameters

Layer Type    Architecture with Hyper-parameters
Layer 1       LSTM, Neurons = 128, Input Shape = (171, 1), Regularizer L2 = 0.001
Layer 2, 3    BatchNormalization, Dropout = 0.3
Layer 4       LSTM, Neurons = 64, Regularizer L2 = 0.001
Layer 5, 6    BatchNormalization, Dropout = 0.3
Layer 7       LSTM, Neurons = 32, Regularizer L2 = 0.001
Layer 8, 9    BatchNormalization, Dropout = 0.4
Layer 10      Fully Connected Layer, Neurons = 8, Activation = Softmax
Layer 11      Optimizer = Adam, Loss = Categorical Cross-Entropy
Layer 12      Batch Size = 16, Epochs = 200

Also, optimization is performed using the Adam optimizer, categorical cross-entropy operates as the loss function, and accuracy is the chosen evaluation metric. The detailed architecture of the LSTM model, with its layers and hyperparameters, is given in Table 2. To address the issues with the LSTM model's performance, two hybrid models, 1D-CNN + LSTM and 1D-CNN + Bi-LSTM, are employed with almost identical architectures, given in Table 3. These hybrid architectures mitigate the rising validation loss and stationary validation accuracy faced by the LSTM. The hybrid models are designed by combining three layers of 1D-CNN, with filter
sizes ranging from 32 to 128, the same kernel size of 3, and the ReLU activation function. Additionally, two LSTM layers are incorporated into the model, with units decreasing from 64 to 32, along with three dense layers whose units decrease from 128 to 8. Model performance is assessed with the categorical cross-entropy loss, and early stopping is implemented. While the integration of the two model types produces promising results compared to the LSTM alone, further optimization is required beyond the capabilities of this architecture.

TABLE 3: Proposed (1D-CNN + Bi-LSTM) Architecture with Hyper-parameters

Layer Type     Architecture with Hyper-parameters
Layer 1        Conv1D, Filters = 64, Kernel_size = 3, Activation = ReLU, Input_Shape = (171, 1)
Layer 2, 4, 6  MaxPooling1D (pool_size = 2)
Layer 3        Conv1D, Filters = 64, Kernel_size = 3, Activation = ReLU
Layer 5        Conv1D, Filters = 128, Kernel_size = 3, Activation = ReLU
Layer 7        Bi-LSTM, Neurons = 64
Layer 8, 10    Dropout = 0.5
Layer 9        Bi-LSTM, Neurons = 32
Layer 11       Fully Connected Layer 1, Neurons = 128, Activation = ReLU
Layer 12, 14   Dropout = 0.5
Layer 13       Fully Connected Layer 2, Neurons = 64, Activation = ReLU
Layer 15       Fully Connected Layer 3, Neurons = 8, Activation = Softmax
Layer 16       Optimizer = Adam, Loss = Categorical Cross-Entropy
Layer 17       Batch_size = 32, Epochs = 50

Then, 162 features are randomly selected, and the 1D-CNN model is trained to assess their impact on validation accuracy and loss. In Case five, the same 1D-CNN model is used with the KBest feature selection technique, leveraging the Analysis of Variance (ANOVA) F-value. This method discerns the correlation between input features and the target variable in classification tasks, highlighting the most relevant features. This systematic approach eliminates less significant features, ultimately identifying the top 40 and top 20 features that most influence classification accuracy. The results for each model are detailed in Table 7.

In the proposed framework, two models, a simple 1D-CNN and a 1D-CNN with stratified cross-validation, are employed with identical architectures: the simple model operates without cross-validation, while the other incorporates validation, as shown in Table 4. These architectures encompass multiple layers, including nine Conv1D (1D convolutional) layers, five MaxPooling1D layers, and two Dense (fully connected) layers, with categorical cross-entropy as the loss function. The Conv1D layers extract relevant features from the input data, while the MaxPooling1D layers down-sample the feature maps to retain the most pertinent features. Given the multiclass nature of the classification task, the categorical cross-entropy loss function is used. A callback function is implemented to adjust the learning rate dynamically during training and to halt the training process in the absence of observed improvements in validation loss, saving computational resources and time and ensuring efficient model training.

Finally, the stratified K-Fold cross-validation technique is employed, which partitions the data into training, validation,
and test sets. This technique enhances the model's robustness and inference capabilities. The model's efficacy on the test set is assessed through a classification report in Table 8 and a confusion matrix in Figure 11. The finalized architecture of the proposed deep learning framework is illustrated in Figure 9. Overall, hyperparameter tuning methods are explored for each deep learning model, for instance, varying activation functions, loss functions, and learning rates, as well as modifying the number of layers and neurons. While these adjustments are evaluated, potential remains for further enhancement of the models' performance on cross-accent emotion recognition tasks.

D. APPLICATIONS OF PROPOSED ACCENT EMOTION RECOGNITION FRAMEWORK
The potential applications of this framework include online education [62], family household robot assistants [63], noise detection for human-facing robots [64], stress identification for air traffic controllers [65], financial distress management [66], and remote health care [67]. For instance, in an online education system, the proposed system can be used to enhance teaching quality given the cross-cultural backgrounds of students and teachers [62]. An intelligent family household robot can employ the proposed SER framework to accurately estimate the sentiments of users from a global perspective, enabling friendly interaction between robots and human beings [63]. Human listeners can often identify noisy talk, but machines cannot perform this task without specific filters; the framework can therefore be used to detect the noisy signal, extract sentiment, and assist in taking the required action [64]. Similarly, air traffic controllers interact with pilots and people worldwide, which leads to diverse accent and emotion variations, and the proposed cross-accent SER can enhance their communications [65]. Furthermore, cross-accent SER can improve emotion estimates in diverse financial distress, social interaction, and healthcare management settings [66], [67].

V. PERFORMANCE EVALUATION AND DISCUSSION
Several machine learning and deep learning models for voice emotion recognition are developed and evaluated using the combined speech datasets. Table 5 presents the comparative prediction accuracy of the machine learning models for SER. In the scenario where all 172 features are selected (Case I), the models display diverse levels of accuracy, with nearly all models under-fitting at less than 35% training accuracy. These results suggest that models struggle when all nine feature groups, both temporal and spectral, are used, making testing accuracy uncertain. The RF-based feature selection technique is therefore employed to capture only the crucial features, whose importance values, denoted by α, range between 0 and 0.1. Applying this RF-derived feature selection at varying threshold levels 0 ≤ α ≤ 0.1 (Cases II to V) significantly boosts testing accuracy. At a threshold of α > 0.008 (Case V), the RF model achieves a peak accuracy of 0.76, outperforming all other models. Similarly, the Decision Tree's accuracy increases substantially, reaching 0.61.

The effectiveness of the Ensemble Voting technique in aggregating the predictions of multiple models is demonstrated by its accuracy of 0.64 in Case II. The findings thus highlight the significance of feature selection in enhancing the efficacy of machine learning algorithms for recognizing emotions in speech. This investigation identifies the RF model as the most precise: after assessing multiple models, the RF algorithm emerges as the most suitable choice for the emotion classification task, achieving a commendable accuracy of 76% on testing data. The final RF model is therefore used to produce the classification report on the testing dataset, given in Table 6.

As proposed, the RF model shows favourable performance across diverse emotion categories, as evidenced by the classification report. From Table 6, the precision values of the model range from 0.72 to 0.91, indicating accurate classification of instances for each emotion class. Meanwhile, recall values between 0.65 and 0.93 indicate that the RF model captures a significant proportion of instances for most emotions. The F1 scores, which vary between 0.71 and 0.89,
demonstrate a good balance between precision and recall, successfully capturing each emotional category's unique characteristics. The RF model performs consistently across the various emotions, even when facing varying degrees of support, which highlights its constancy in the face of imbalanced datasets. The model generalizes well, revealing high precision, recall, and F1 scores for emotions such as anger, surprise, and neutral, while other emotions exhibit slightly lower but still admirable performance. Despite its imperfections, the outcomes show significant achievement in precisely categorizing human emotions. The RF model's effectiveness, generalization capacity, and consistency make it a practical choice for recognizing emotions, especially in imbalanced data situations, although it requires additional scrutiny and assessment on unobserved or heterogeneous datasets.

TABLE 5: Comparative Analysis of Different Machine Learning Models in Terms of Prediction Accuracy

Models                      Case I   Case II    Case III   Case IV    Case V
                            α > 0    α > 0.004  α > 0.005  α > 0.007  α > 0.008
Logistic Regression         0.24     0.50       0.50       0.49       0.34
Decision Tree               0.31     0.57       0.56       0.50       0.61
Random Forest               0.31     0.74       0.72       0.61       0.76
Support Vector Classifier   0.23     0.53       0.51       0.51       0.37
K-Nearest Neighbor          0.25     0.56       0.55       0.54       0.43
Gaussian Naïve Bayes        0.24     0.23       0.23       0.26       0.22
Ensemble                    ...      0.64       0.62       ...        0.60

TABLE 6: Classification Report of Proposed RF Model

Class      Precision   Recall   F1-Score   Support
angry      0.77        0.86     0.81       281
calm       0.72        0.93     0.81       28
disgust    0.73        0.69     0.71       281
fear       0.83        0.65     0.73       281
happy      0.79        0.68     0.73       295
neutral    0.73        0.82     0.77       253
sad        0.72        0.82     0.77       311
surprise   0.91        0.86     0.89       95

The comparative analysis of the different deep learning models for SER, depicted in Table 7, provides further insights. Challenges arise with specific models throughout the experimentation process, leading to the exploration of alternative approaches to improve validation loss and accuracy. Initially, the LSTM model, a popular choice for sequence modelling tasks (Figure 8), is employed. Despite its potential, however, the LSTM model exhibits limited effectiveness in capturing the complex patterns and structures of speech data, with a validation accuracy of only 0.1568, which falls short of expectations. To address these limitations, a combination of 1D-CNN and LSTM is explored, exploiting the strengths of both architectures. This fusion improves accuracy to 0.5748 by effectively capturing local and temporal dependencies in speech data. The 1D-CNN + Bi-LSTM model builds upon this success by leveraging bidirectional LSTM layers, achieving an even higher accuracy of 0.6203.

Notably, the importance of feature selection is observed throughout the experiments. The 1D-CNN model using the 162 chosen features demonstrates the highest accuracy among these variants, 0.6785, a result that emphasizes the significance of a comprehensive feature set in accurately capturing and recognizing emotional cues in speech data. However, a slight decline in accuracy is observed when the KBest method is used to reduce the feature space. This highlights the delicate balance between dimensionality reduction and retaining crucial features, emphasizing the need for careful consideration when selecting a feature reduction technique to avoid sacrificing essential information. The 1D-CNN Simple model and the 1D-CNN with stratified cross-validation utilize all 172 features. The 1D-CNN Simple model achieves a validation loss of 3.112 and an accuracy of 0.6411, prioritizing simplicity in its architecture at the expense of accuracy compared to the previous models.

On the other hand, the 1D-CNN model with stratified cross-validation, trained for 50 iterations, improves performance significantly, with a validation loss of 0.032 and an accuracy of 0.99. This technique effectively addresses class imbalance and yields noteworthy improvements in both loss and accuracy.

TABLE 7: Proposed Deep Learning Models Analysis

Models                                Epochs   Loss    Accuracy
LSTM                                  20       1.947   0.157
1D-CNN + LSTM                         20       1.070   0.575
1D-CNN + Bi-LSTM                      20       1.048   0.6203
1D-CNN-(162)                          50       0.842   0.678
1D-CNN-KBest-(40)                     40       1.069   0.585
1D-CNN-KBest-(20)                     30       0.236   0.5129
1D-CNN-(172)                          25       3.112   0.6411
1D-CNN-Cross Validation (Proposed)    50       0.032   0.99

Classification results for the 1D-CNN framework using the stratified cross-validation method are displayed in Table 8. The outcomes show the model's ability to categorize emotions into the several groups correctly. The precision values, which range from 0.99 to 1.00, indicate the model's accuracy in identifying occurrences of each emotion class. Consistently high recall values suggest the model captures a sizable share of the true positive examples in each class. Excellent performance across all emotion classes is also indicated by the F1-score values, which combine precision and recall. These findings confirm the comparative analysis, suggesting that the 1D-CNN architecture with stratified cross-validation is viable for speech emotion identification tasks due to its high validation accuracy and low loss. Apart from the comparative analysis between the deep learning models, the confusion matrices on hold-out datasets, given in Tables 6 and 8, provide additional insights into the efficacy of
based solutions are proposed [57-59]. In these solutions, the accuracy is enhanced up to 91.31%, leaving a significant gap for improvement. Furthermore, Shah et al. [17] and Aayushi et al. [29] integrate multiple datasets, achieving accuracies of up to 86.3% and 89.87%; however, these systems are limited to a single speech feature, MFCC. Mohanty et al. [30] integrate four datasets and employ a D-CNN model to raise accuracy to 90.27%, underscoring the efficacy of careful model selection and optimization. Nasim et al. integrate two datasets and employ gradient boosting, which enhances classification accuracy up to 84.96% [33]. To address this challenge, we propose a multi-cultural, cross-accent emotion recognition system that considers multiple speech features and employs both conventional machine learning and a deep learning-based framework. The proposed scheme shows 14.71%, 10.15%, 9.6%, and 16.52% improvements compared to the conventional schemes [17, 29, 30, 33]. Overall, the comparison with previous SER studies and the detailed analysis demonstrate the efficiency of the proposed approach in terms of accuracy and comprehensiveness. The proposed research illustrates a transformative advancement in real-time emotion recognition compared to the benchmark studies.

In summary, deep learning concepts can process large amounts of data quantitatively and qualitatively to improve the accuracy of spoken-language sentiment analysis. Careful implementation of feature selection and stratified cross-validation allows this model to outperform traditional methods in terms of accuracy.

VI. CONCLUSION
This research develops a system that accurately identifies and analyzes speech-based emotions across various cultural and linguistic backgrounds, using speech datasets that encompass a broad spectrum of accents to ensure a comprehensive approach to cross-accent emotion recognition. Recognizing emotions in conversations from speakers of different languages presents unique challenges. The recommended approach involves collecting well-annotated speech datasets, extracting a wide range of speech features, applying data augmentation techniques, and employing advanced machine
learning and deep learning classifiers. The results underscore the importance of feature selection in boosting the performance of machine learning algorithms in speech emotion recognition. Due to their distinct acoustic characteristics, some accents pose challenges for emotion detection, highlighting the complexity of cross-accent analysis. Likewise, the system's performance varies when analyzing different accents, showcasing its strengths and pinpointing areas for improvement in a cross-accent setting. The RF model stands out as the most accurate conventional model, achieving a notable 76% accuracy on test data; it performs consistently across various emotions, illustrating its effectiveness, adaptability, and reliability, especially with imbalanced datasets. Deep learning models, such as the 1D-CNN + LSTM and 1D-CNN + Bi-LSTM combinations, tap into the strengths of both structures to attain higher accuracy rates. The 1D-CNN model, when paired with stratified cross-validation, addresses class imbalance and significantly improves loss and accuracy, achieving a remarkable validation loss of 0.032 and an accuracy of 99%. The confusion matrices nevertheless reveal misclassification patterns, emphasizing the need for further research into potential biases arising from imbalanced emotion-class datasets. The system has promising applications in human-computer interaction, mental health care, virtual assistants, and e-learning.

In the future, the recommendation is to examine diverse datasets to validate the model's efficacy in varied contexts. By incorporating more varied and balanced emotion datasets and through ongoing research in real-time emotion classification, this system's future applications could achieve exceptional accuracy in predicting the emotions of English speakers, regardless of accent or dialect. Moreover, extracting additional features such as fundamental frequency, Linear Predictive Coding (LPC) coefficients, and tonal features can further enhance these aspects. This research is a foundation for future studies, setting a precedent for further advancements in the SER field.

ACKNOWLEDGMENT
The authors would like to thank Prince Sultan University for paying the Article Processing Charges (APC) of this publication and for their support.

REFERENCES
[1] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J.G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32–80, 2001.
[2] R.W. Picard, E. Vyzas, and J. Healey, "Toward machine emotional intelligence: analysis of affective physiological state," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 10, pp. 1175–1191, Oct. 2001, doi: 10.1109/34.954607.
[3] M.D. Pell and S.A. Kotz, "On the time course of vocal emotion recognition," PLoS One, vol. 6, no. 11, p. e27256, Nov. 2011.
[4] M. El Ayadi, M.S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognition, vol. 44, no. 3, pp. 572–587, 2011.
[5] H. Yan, M.H. Ang, and A.N. Poo, "A survey on perception methods for human–robot interaction in social robots," International Journal of Social Robotics, vol. 6, pp. 85–119, 2014.
[6] S. Madanian, D. Parry, O. Adeleye, C. Poellabauer, F. Mirza, S. Mathew, and S. Schneider, "Automatic Speech Emotion Recognition Using Machine Learning: Digital Transformation of Mental Health," 2022.
[7] A.G. Harvey, E. Watkins, and W. Mansell, "Cognitive Behavioural Processes Across Psychological Disorders: A Transdiagnostic Approach to Research and Treatment," Oxford University Press, USA, 2004.
[8] A. Grünerbl, A. Muaremi, V. Osmani, G. Bahle, S. Oehler, G. Tröster, O. Mayora, C. Haring, and P. Lukowicz, "Smartphone-based recognition of states and state changes in bipolar disorder patients," IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 1, pp. 140–148, 2014.
[9] M. Bojanić, V. Delić, and A. Karpov, "Call redistribution for a call center based on speech emotion recognition," Applied Sciences, vol. 10, no. 13, p. 4653, 2020.
[10] X. Li and R. Lin, "Speech emotion recognition for power customer service," in 2021 7th International Conference on Computer and Communications (ICCC), pp. 514–518, 2021.
[11] D. Tanko, S. Dogan, F.B. Demir, M. Baygin, S.E. Sahin, and T. Tuncer, "Shoelace pattern-based speech emotion recognition of the lecturers in distance education: ShoePat23," Applied Acoustics, vol. 190, p. 108637, 2022.
[12] T. Zhang, M. Hasegawa-Johnson, and S.E. Levinson, "Children's emotion recognition in an intelligent tutoring scenario," in Proc. Eighth European Conference on Speech Communication and Technology (INTERSPEECH), 2004.
[13] R. AlSufayan and D.A. El-Dakhs, "Achievement Emotions in Paper-Based Exams vs. Computer-Based Exams: The Case of a Private Saudi University," International Journal of Online Pedagogy and Course Design (IJOPCD), vol. 13, no. 1, pp. 1–21, 2023, doi: 10.4018/IJOPCD.322084.
[14] P. Vasuki and C. Aravindan, "Hierarchical classifier design for speech emotion recognition in the mixed-cultural environment," Journal of Experimental & Theoretical Artificial Intelligence, vol. 33, no. 3, pp. 451–466, 2021.
[15] A. Wierzbicka, "Emotions Across Languages and Cultures: Diversity and Universals," Cambridge University Press, 1999.
[16] Z. Li, L. He, J. Li, L. Wang, and W.-Q. Zhang, "Towards Discriminative Representations and Unbiased Predictions: Class-Specific Angular Softmax for Speech Emotion Recognition," in INTERSPEECH, pp. 1696–1700, 2019.
[17] N. Shah, K. Sood, and J. Arora, "Speech emotion recognition for psychotherapy: an analysis of traditional machine learning and deep learning techniques," in 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0718–0723, 2023.
[18] L.-M. Zhang, Y. Li, Y.-T. Zhang, G.W. Ng, Y.-B. Leau, and H. Yan, "A Deep Learning Method Using Gender-Specific Features for Emotion Recognition," Sensors, vol. 23, no. 3, p. 1355, 2023.
[19] W. Alsabhan, "Human–Computer Interaction with a Real-Time Speech Emotion Recognition with Ensembling Techniques 1D Convolution Neural Network and Attention," Sensors, vol. 23, no. 3, p. 1386, 2023.
[20] A. Muraleedharan and M. Garcia-Constantino, "Domestic Violence Detection Using Smart Microphones," in International Conference on Ubiquitous Computing and Ambient Intelligence, pp. 357–368, 2022.
[21] K. Jain, A. Chaturvedi, J. Dua, and R.K. Bhukya, "Investigation Using MLP-SVM-PCA Classifiers on Speech Emotion Recognition," in 2022 IEEE 9th Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON), pp. 1–6, 2022.
[22] A. Agrima, A. Barakat, I. Mounir, A. Farchi, L. ElMazouzi, and B. Mounir, "Speech Emotion Recognition Using Energies in Six Bands and Multilayer Perceptron on RAVDESS Dataset," in 2022 5th International Conference on Advanced Communication Technologies and Networking (CommNet), pp. 1–5, 2022.
[23] S. Kakuba, A. Poulose, and D.S. Han, "Attention-based multi-learning approach for speech emotion recognition with dilated convolution," IEEE Access, vol. 10, pp. 122302–122313, 2022.
[24] A. Ochi and X. Kang, "Learning a Parallel Network for Emotion Recognition Based on Small Training Data," in 2022 8th International Conference on Systems and Informatics (ICSAI), pp. 1–5, 2022.
[25] R.R. Paul, S.K. Paul, and M.E. Hamid, "A 2D Convolution Neural Network Based Method for Human Emotion Classification from Speech Signal," in 2022 25th International Conference on Computer and Information Technology (ICCIT), pp. 72–77, 2022.
[26] A.S. Alluhaidan, O. Saidani, R. Jahangir, M.A. Nauman, and O.S. Neffati, "Speech Emotion Recognition through Hybrid Features and Convolutional Neural Network," Applied Sciences, vol. 13, no. 8, p. 4750, 2023.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3376379
Ahmad et al.: XEmoAccent: Embracing Diversity in Cross-Accent Emotion Recognition using Deep Learning
[27] L. Huang and X. Shen, "Research on Speech Emotion Recognition Based [49] J.R. Quinlan, "Induction of decision trees," Machine learning, vol. 1, pp.
on the Fractional Fourier Transform," Electronics, vol. 11, no. 20, p. 3393, 81–106, 1986. Springer.
2022. MDPI. [50] L. Breiman, "Random forests," Machine learning, vol. 45, pp. 5–32, 2001.
[28] J.L. Bautista, Y.K. Lee, and H.S. Shin, "Speech emotion recognition based Springer.
on parallel CNN-attention networks with multi-fold data augmentation," [51] C. Cortes and V. Vapnik, "Support-vector networks," Machine learning,
Electronics, vol. 11, no. 23, p. 3935, 2022. MDPI. vol. 20, pp. 273–297, 1995. Springer.
[29] A. Chaudhari, C. Bhatt, T.T. Nguyen, N. Patel, K. Chavda, and K. Sarda, [52] I. Rish, "An empirical study of the naive Bayes classifier," in IJCAI 2001
"Emotion Recognition System via Facial Expressions and Speech Using workshop on empirical methods in artificial intelligence, vol. 3, no. 22, pp.
Machine Learning and Deep Learning Techniques," SN Computer Science, 41–46, 2001.
vol. 4, no. 4, p. 363, 2023. Springer. [53] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE
[30] A. Mohanty, R.C. Cherukuri, and A.R. Prusty, "Improvement of Speech transactions on information theory, vol. 13, no. 1, pp. 21–27, 1967. IEEE.
Emotion Recognition by Deep Convolutional Neural Network and Speech [54] T.G. Dietterich, "Ensemble methods in machine learning," in International
Features," in Congress on Intelligent Systems, pp. 117–129, 2022. Springer. workshop on multiple classifier systems, pp. 1–15, 2000. Springer.
[31] R. Chatterjee, S. Mazumdar, R.S. Sherratt, R. Halder, T. Maitra, and D. [55] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural
Giri, "Real-time speech emotion analysis for smart home assistants," IEEE computation, vol. 9, no. 8, pp. 1735–1780, 1997. MIT press.
Transactions on Consumer Electronics, vol. 67, no. 1, pp. 68–76, 2021. [56] Y. Xu, H. Su, G. Ma, and X. Liu, "A novel dual-modal emotion recognition
IEEE. algorithm with fusing hybrid features of audio signal and speech context,"
[32] A. Al-Laith and M. Alenezi, "Monitoring People’s Emotions and Symp- Complex & Intelligent Systems, vol. 9, no. 1, pp. 951–963, 2023. Springer.
toms from Arabic Tweets during the COVID-19 Pandemic," Informa- [57] S. Li, P. Song, and W. Zheng, "Multi-Source Discriminant Subspace
tion, vol. 12, no. 2, article number 86, 2021. ISSN: 2078-2489. DOI: Alignment for Cross-Domain Speech Emotion Recognition," IEEE/ACM
10.3390/info12020086. Transactions on Audio, Speech, and Language Processing, vol. 31, pp.
[33] A.S. Nasim, R.H. Chowdory, A. Dey, and A. Das, "Recognizing Speech 2448–2460, 2023. doi:10.1109/TASLP.2023.3288415.
Emotion Based on Acoustic Features Using Machine Learning," in 2021 [58] Z. Kexin and L. Yunxiang, "Speech Emotion Recognition Based on Trans-
International Conference on Advanced Computer Science and Information fer Emotion-Discriminative Features Subspace Learning," IEEE Access,
Systems (ICACSIS), pp. 1–7, 2021. IEEE. vol. 11, pp. 56336–56343, 2023. doi:10.1109/ACCESS.2023.3282982.
[34] P. Tiwari and A.D. Darji, "A novel S-LDA features for automatic emotion [59] S. Latif, R. Rana, S. Khalifa, R. Jurdak, and B. Schuller, "Self Supervised
recognition from speech using 1-D CNN," International Journal of Mathe- Adversarial Domain Adaptation for Cross-Corpus and Cross-Language
matical, Engineering and Management Sciences, vol. 7, no. 1, p. 49, 2022. Speech Emotion Recognition," IEEE Transactions on Affective Computing,
International Journal of Mathematical, Engineering and Management Sci- vol. 14, no. 3, pp. 1912–1926, 2023. doi:10.1109/TAFFC.2022.3167013.
ences. [60] K. L. Ong, C. P. Lee, H. S. Lim, K. M. Lim, and A. Alqahtani, "Mel-
[35] M.R. Ahmed, S. Islam, A.K.M. Islam, and S. Shatabda, "An ensemble MViTv2: Enhanced Speech Emotion Recognition With Mel Spectrogram
1D-CNN-LSTM-GRU model with data augmentation for speech emotion and Improved Multiscale Vision Transformers," IEEE Access, vol. 11, pp.
recognition," Expert Systems with Applications, vol. 218, p. 119633, 2023. 108571–108579, 2023. doi:10.1109/ACCESS.2023.3321122.
Elsevier. [61] L.-M. Zhang, G. W. Ng, Y.-B. Leau, and H. Yan, "A Parallel-Model Speech
[36] Y.-C. Kao, C.-T. Li, T.-C. Tai, and J.-C. Wang, "Emotional speech anal- Emotion Recognition Network Based on Feature Clustering," IEEE Access,
ysis based on convolutional neural networks," in 2021 9th International vol. 11, pp. 71224–71234, 2023. doi:10.1109/ACCESS.2023.3294274.
Conference on Orange Technology (ICOT), pp. 1–4, 2021. IEEE. [62] J. Liu, X. Wu, and X. Wu, "Prototype of educational affective arousal
[37] S. Alghowinem, R. Goecke, M. Wagner, and A. Alwabil, "Evaluating and evaluation system based on facial and speech emotion recognition," Inter-
Validating Emotion Elicitation Using English and Arabic Movie Clips national Journal of Information and Education Technology, vol. 9, no. 9,
on a Saudi Sample," Sensors, vol. 19, no. 10, p. 2218, May 2019, doi: pp. 645–651, 2019.
10.3390/s19102218. [63] X. Huahu, G. Jue, and Y. Jian, "Application of Speech Emotion Recogni-
[38] M. Swain, A. Routray, and P. Kabisatpathy, "Databases, features and tion in Intelligent Household Robot," in 2010 International Conference on
classifiers for speech emotion recognition: a review," International Journal Artificial Intelligence and Computational Intelligence, vol. 1, pp. 537–541,
of Speech Technology, vol. 21, pp. 93–120, 2018. Springer. 2010. doi:10.1109/AICI.2010.118.
[39] R.A. Khalil, E. Jones, M.I. Babar, T. Jan, M.H. Zafar, and T. Alhussain, [64] S. Hamsa, I. Shahin, Y. Iraqi, and N. Werghi, "Emotion Recognition
"Speech emotion recognition using deep learning techniques: A review," From Speech Using Wavelet Packet Transform Cochlear Filter Bank and
IEEE Access, vol. 7, pp. 117327–117345, 2019. IEEE. Random Forest Classifier," IEEE Access, vol. 8, pp. 96994–97006, 2020.
[40] B.J. Abbaschian, D. Sierra-Sosa, and A. Elmaghraby, "Deep learning tech- doi:10.1109/ACCESS.2020.2991811.
niques for speech emotion recognition, from databases to models," Sensors, [65] Md. Zia Uddin and Erik G. Nilsson, "Emotion recognition using speech and
vol. 21, no. 4, p. 1249, 2021. MDPI. neural structured learning to facilitate edge intelligence," Engineering Ap-
[41] Toronto Emotional Speech Set (TESS), https://www.kaggle.com/datasets/ plications of Artificial Intelligence, vol. 94, 2020, article number 103775,
ejlok1/toronto-emotional-speech-set-tess. ISSN 0952-1976, https://doi.org/10.1016/j.engappai.2020.103775.
[42] Cremad: Crowd-sourced Emotional Multimodal Actors Dataset, https:// [66] P. Hajek and M. Munk, "Speech emotion recognition and text sen-
www.kaggle.com/datasets/ejlok1/cremad, Jun 2018. timent analysis for financial distress prediction," Neural Computing
[43] Surrey Audiovisual Expressed Emotion (SAVEE) | Kaggle, https://www. and Applications, vol. 35, no. 29, pp. 21463–21477, Mar. 2023, doi:
kaggle.com/datasets/ejlok1/surrey-audiovisual-expressed-emotion-savee. 10.1007/s00521-023-08470-8. [Online]. Available: http://dx.doi.org/10.
[44] RAVDESS Emotional speech audio, https://www.kaggle.com/datasets/ 1007/s00521-023-08470-8.
uwrfkaggler/ravdess-emotional-speech-audio. [67] H.-C. Li, T. Pan, M.-H. Lee, and H.-W. Chiu, "Make Patient Con-
[45] P. Sandhya, V. Spoorthy, S.G. Koolagudi, and N.V. Sobhana, "Spectral sultation Warmer: A Clinical Application for Speech Emotion Recog-
features for emotional speaker recognition," in 2020 Third International nition," Applied Sciences, vol. 11, no. 11, p. 4782, May 2021,
Conference on Advances in Electronics, Computers and Communications doi: 10.3390/app11114782. [Online]. Available: http://dx.doi.org/10.3390/
(ICAECC), pp. 1–6, 2020. IEEE. app11114782.
[46] P. Burk, L. Polansky, D. Repetto, M. Roberts, and D. Rockmore, "Mu- [68] L. Mentch and S. Zhou, "Randomization as regularization: A degrees of
sic and computers: a theoretical and historical approach," Preface to the freedom explanation for random forest success," The Journal of Machine
Archival Version (Spring, 2011), 2011. Learning Research, vol. 21, no. 1, pp. 6918–6953, 2020, publisher: JML-
[47] L. Malmqvist, "RapiCSF-A fast test of spectral contrast," 2013. RORG.
[48] T. Hastie, R. Tibshirani, and J.H. Friedman, "The elements of statistical
learning: data mining, inference, and prediction," vol. 2, 2009. Springer.
RAHEEL AHMAD received the B.S. degree in Electrical Engineering from the University of Sargodha, Pakistan, in 2017. He is currently pursuing the M.S. degree in Artificial Intelligence at the Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology (PAF-IAST), Haripur, Khyber Pakhtunkhwa (KPK), Pakistan. In addition to his academic work, he has been actively engaged in applying Artificial Intelligence and Machine Learning, and he serves as an independent Machine Learning tutor, sharing his knowledge and expertise with aspiring students and professionals in the field. His devotion to education is demonstrated by the artificial intelligence (AI) solutions he develops for academic and industrial contexts, reflecting his dedication to closing the divide between theoretical knowledge and practical implementation.

ARSHAD IQBAL (Member, IEEE) received the B.S. degree in electrical and computer engineering from COMSATS (CIIT), Abbottabad, Pakistan, in 2013, and the M.S. and Ph.D. degrees in electrical and computer engineering from Sungkyunkwan University, Suwon, South Korea, in 2020. Since 2021, he has been an Assistant Professor with the Sino-Pak Center for Artificial Intelligence (SPCAI), Pak-Austria Fachhochschule: Institute of Applied Sciences and Technology (PAF-IAST), Haripur, Pakistan. His research interests include medium access control, resource allocation, the Internet of Things, applied artificial intelligence, WLAN, sensor networks, energy harvesting networks, backscatter communication networks, power saving, distributed communication networks, and next-generation communication networks. He was a recipient of the fully funded ICT research and development scholarship for undergraduates from the Ministry of Information Technology (IT), Pakistan. He was also a recipient of an HEC scholarship under the Human Resource Development (HRD) initiative, M.S. leading to Ph.D. program of faculty development for UESTPs, Phase-1, Batch-IV.
NAVEED AHMAD received the B.S. degree in computer science from the University of Peshawar, Pakistan, in 2007, and the Ph.D. degree in computer science from the University of Surrey, U.K., in 2013. He is currently working as an Associate Professor with the College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia. His research interests include security and privacy in emerging networks, such as VANETs, DTN, the Internet of Things (IoT), Machine Learning, and Big Data.

YASIR JAVED is a data scientist and senior programmer/developer with over 18 years' experience in research, security programming, software development, project management, and analytics. His research interests include data analytics, forensics, smart cities, network security, education sustainability, instructional development, robotics, unmanned aerial vehicles, vehicular platoons, secure software development, signal processing, IoT analytics, intelligent applications, and predictive computing inspired by artificial intelligence. He holds a Ph.D. degree from UNIMAS, Sarawak, where he received an Outstanding Ph.D. Student Award; he was also awarded a rector's medal for his M.S. degree and a Distinguished Teaching Award from the President. Recognized with a Top Researcher Award at PSU for his research contributions, he has published over 100 peer-reviewed articles in top-tier journals, conference proceedings, and book chapters, and he serves as a reviewer for several journals. He has undertaken a variety of national and international research funding projects and has served as an analyst programmer at the Prince Megren Data Center, the Center of Excellence, and the Research and Initiative Center at Prince Sultan University. He is the Chair of the ACM Professional Chapter in KSA and an active member of the RIOTU group at Prince Sultan University.