
1. ABSTRACT

In the field of audio analysis, the challenges of comprehending human emotions and discerning speakers are of
paramount importance due to their practical implications. This research introduces an innovative multimodal approach
designed to tackle both tasks, with a primary focus on practical applications such as speaker authentication and accurate
speech emotion recognition for customer service.
To address the critical issue of speaker identification, we utilize a subset of the LibriSpeech test-clean dataset and
employ Mel-frequency cepstral coefficients (MFCCs) as acoustic features. Our model adopts a three-layer Long Short-Term
Memory (LSTM) architecture, fine-tuned with triplet loss. Through extensive experimentation, we demonstrate the
effectiveness of deep learning techniques in achieving exceptional accuracy in speaker recognition, particularly in handling
diverse acoustic conditions. Remarkably, our model attains an impressive Equal Error Rate (EER) of 6.89% in speaker
recognition, highlighting its robustness and high accuracy, especially for practical applications like speaker authentication.
Simultaneously, we delve into the complex realm of emotional analysis with a focus on practical customer service
applications. Utilizing the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset, we
construct a classification model based on Convolutional Neural Networks (CNNs). This model excels in classifying eight
distinct emotions (neutral, relaxed, happy, unhappy, angry, fearful, disgust, surprise), encompassing Ekman's foundational
emotional categories while introducing neutral and relaxed states. On the test set, the model achieves a weighted average
F1 score of 0.85.
This research underscores the potential of our multimodal approach that effectively combines speaker identification and
emotion analysis for real-world applications. The deep neural networks utilized in this study have significant implications
in various domains, including voice-based user authentication, sentiment analysis, and notably, improving customer service
through precise and nuanced speech emotion recognition. By concurrently addressing these intricate tasks, we lay the
groundwork for more comprehensive and context-aware audio analysis, poised to benefit applications in speaker
authentication and customer service alike.
2. INTRODUCTION
In the ever-evolving domain of audio analysis, researchers and innovators are increasingly drawn to two core challenges:
understanding human emotions and distinguishing speakers from audio data. This research endeavor centers on addressing
both of these critical tasks, with a particular focus on their relevance in two specific domains: speaker authentication and
speech emotion recognition customized for customer service.
Speaker identification systems can be generally categorized into two groups based on the nature of the speech content:
text-dependent and text-independent systems [1]. In text-dependent systems, individuals are required to utter a specific
phrase during both the training and testing phases, whereas text-independent systems [2] have the ability to recognize
speakers from any spoken phrase, regardless of its content. Text-independent speaker recognition is the more demanding
task, since the system must distinguish individuals without regard to the specific words they speak. This
distinction underscores the intricate challenges and opportunities inherent in speaker authentication, reflecting the
complexities of real-world authentication requirements.
In the realm of speaker authentication, our research leverages advanced deep learning techniques on acoustic features
extracted from the LibriSpeech test-clean dataset [3]. The core of our feature extraction process relies on Mel-Frequency
Cepstral Coefficients (MFCCs) [4], which excel in capturing the distinct spectral attributes of speech. Our model, supported
by a three-layer LSTM architecture [5], undergoes fine-tuning with triplet loss [6]. This approach equips our model not only
to excel in differentiating between different speakers but also to enhance its robustness against diverse acoustic conditions.
Through meticulous experimentation, we demonstrate the effectiveness of deep learning techniques, achieving impressive
accuracy in speaker recognition, particularly in addressing challenges posed by acoustic variations. A significant
achievement is our model's remarkable Equal Error Rate (EER) of 6.89% in speaker recognition, confirming its resilience
and precision, especially in practical applications like speaker authentication.
Simultaneously, in the context of customer service, we explore the intricate field of speech emotion recognition [7].
Utilizing Convolutional Neural Networks (CNNs) [8] on the RAVDESS dataset [9], our classification model extends its
capabilities to recognize eight distinct emotional states: neutral, relaxed, happy, unhappy, angry, fearful, disgust, and
surprise. This expansion of the emotional spectrum builds upon Ekman's foundational emotional categories [10],
introducing nuances such as neutral and relaxed states. Our evaluation metrics, including the weighted average F1 score,
attest to the model's effectiveness, achieving a score of 0.85 on the test dataset.
This research paper exemplifies the fusion of cutting-edge technology and practical utility. By simultaneously
addressing the complexities of speaker authentication and customer service-oriented speech emotion recognition, we aim
to establish the groundwork for a more comprehensive and context-aware approach to audio analysis. Our work holds
promise in the fields of security and healthcare, opening doors to innovative applications that not only comprehend human
emotions and identify speakers but also enrich our lives through practical solutions.
3. RELATED WORK
Recent advancements in audio analysis have led to significant progress in recognizing speakers by their voices and
understanding emotions from audio signals. These research efforts encompass a range of innovative approaches in both text-
independent speaker recognition and speech emotion recognition. In the context of speaker recognition, methods have been
developed that address the limitations of traditional text-dependent systems, emphasizing the advantages of text-
independent approaches with techniques like MFCC+LSTM [1] and 3D Convolutional Neural Networks (3DCNN)
combined with LSTM networks [11]. Another approach extracts acoustic features using Linear Predictive Coding (LPC)
and the Log-Mel spectrum, which are then fed into an LSTM architecture [12].
Our approach, utilizing Mel-frequency cepstral coefficients (MFCCs) and a three-layer LSTM architecture fine-tuned
with triplet loss, achieves a notable EER of 6.89% in speaker recognition. Our model's simplicity and robustness make it suitable for
practical applications such as speaker authentication.
On the other front, in speech emotion recognition, various models have emerged, including those leveraging parallel
Convolutional Neural Networks (CNNs) and Transformer encoders [13], as well as architectures capable of recognizing
emotions from raw audio data without the need for visual representations, incorporating a diverse set of audio features
including mel-frequency cepstral coefficients (MFCCs), chromagram, mel-scale spectrogram, Tonnetz representation, and
spectral contrast features, as inputs to a one-dimensional Convolutional Neural Network (CNN) [14].
Our model, designed with simplicity and effectiveness in mind, takes raw audio data as input and employs a
Convolutional Neural Network (CNN) architecture that captures intricate temporal and spatial features critical for emotion
recognition. With two Conv1D layers, ReLU activation, dropout layers (0.2), and a Softmax output, our model achieves
superior accuracy while maintaining model simplicity.

4. METHODOLOGY:
In this section, we present the methodology employed in our multimodal approach, addressing two critical tasks:
Speaker Recognition and Speech Emotion Recognition. For Speaker Recognition, we leverage the LibriSpeech test-clean
dataset and employ MFCC as acoustic features. Our model consists of a three-layer LSTM architecture, fine-tuned with
triplet loss. In parallel, for Speech Emotion Recognition, we utilize the RAVDESS dataset, employing raw audio data as
input. Our emotion recognition model is based on a Convolutional Neural Network (CNN) architecture. The following
subsections provide a detailed breakdown of our approach for each task, from data preprocessing to model architecture and
evaluation.

4.1 Feature Extraction:

To prepare the sound waves for input into the neural network, it was necessary to process and transform them into a set
of distinct features. In our research, we focused on extracting Mel Frequency Cepstral Coefficients (MFCCs), including
their corresponding differentials (Delta) and accelerations (Delta-Delta). These MFCCs are known for encapsulating crucial
speech signal characteristics, particularly those relevant to phonetic information. Below, we provide a concise outline of the
steps involved in the extraction of Mel Frequency Cepstral Coefficients (MFCCs) from a speech signal.
a) Preemphasis: The initial step involves applying pre-emphasis to the audio signal, achieved through a first-order
high-pass filter. This filter is typically implemented using a straightforward difference equation:

y[n] = x[n] - α⋅x[n−1]

In this equation:
y[n] is the output signal after pre-emphasis.
x[n] is the input audio signal.
α is the pre-emphasis coefficient, which typically has a value between 0.9 and 1.0.

The primary purpose of this equation is to enhance the amplitudes of higher-frequency components in the signal. It
accomplishes this by subtracting a scaled version of the previous sample (x[n−1]) from the current sample (x[n]),
giving more emphasis to high-frequency changes between consecutive samples.
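For illustration, a minimal NumPy sketch of this difference equation is shown below, assuming a commonly used coefficient of α = 0.97 (the exact value used in our pipeline is not restated here):

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] to a 1-D audio signal."""
    # Keep the first sample unchanged, then subtract the scaled previous sample.
    return np.append(x[0], x[1:] - alpha * x[:-1])
```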

b) Computing Mel-Filterbanks: Subsequently, the preemphasized audio signal is divided into short overlapping
frames, typically ranging from 20 to 40 milliseconds. These frames serve to capture the signal's spectral
characteristics over time. For each frame, a Fast Fourier Transform (FFT) [15] is computed, facilitating the
conversion of the signal from the time domain to the frequency domain. The magnitude of FFT coefficients is then
determined, yielding the magnitude spectrum that represents energy distribution across different frequency
components.

The Mel Filterbanks, illustrated in Fig.1, comprise a set of triangular-shaped filters. These Mel filterbanks,
designed to emulate the frequency sensitivity of the human auditory system, are applied to the spectrum, effectively
measuring energy within specific frequency bands.

Fig. 1 Mel Filterbanks

Finally, a logarithm is applied to the energies of each filterbank. This step is crucial as it approximates the
logarithmic response of the human auditory system, which aligns with perceptually relevant audio processing. The
outcome of this step is the set of log Mel-filterbank energies, from which the MFCCs are derived in the next step.
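As an illustration, the framing, FFT, mel filterbank, and logarithm steps can be reproduced with librosa as sketched below; the sampling rate, frame length, hop length, and number of mel bands are assumed example values, not parameters taken from the paper:

```python
import numpy as np
import librosa

# Assumed illustrative parameters: 16 kHz audio, 25 ms frames (n_fft=400),
# 10 ms hop (hop_length=160), 40 triangular mel bands.
signal, sr = librosa.load("speech.wav", sr=16000)
signal = preemphasis(signal)                      # pre-emphasis sketch from step (a)

# Framing + FFT + mel filterbank applied in one call.
mel_energies = librosa.feature.melspectrogram(
    y=signal, sr=sr, n_fft=400, hop_length=160, n_mels=40)

# Logarithm approximates the auditory system's compressive response.
log_mel = np.log(mel_energies + 1e-10)            # shape: (40, n_frames)
```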

c) Discrete Cosine Transform (DCT): To obtain the MFCC coefficients, a Discrete Cosine Transform (DCT) [16] is
applied to the logarithmic filterbank energies. Once the initial 13 MFCC coefficients are obtained, delta and double-
delta coefficients are calculated. Delta coefficients reflect the rate of change of the MFCCs over time, while double-
delta coefficients represent the acceleration of the MFCCs. The final MFCC feature vector encompasses the 13
original MFCC coefficients, 13 delta coefficients, and 13 double delta coefficients. These features are amalgamated
into a feature vector with a size of 39, serving as the input data for the neural network.
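For illustration, the full 39-dimensional feature sequence (13 MFCCs plus their delta and delta-delta coefficients) can be obtained with librosa as in the sketch below; librosa is one possible implementation choice, not necessarily the toolkit used in our experiments:

```python
import numpy as np
import librosa

def extract_mfcc_sequence(path, sr=16000, n_mfcc=13):
    """Return a (n_frames, 39) matrix: 13 MFCCs + 13 deltas + 13 delta-deltas."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (13, n_frames)
    delta = librosa.feature.delta(mfcc)                      # rate of change
    delta2 = librosa.feature.delta(mfcc, order=2)            # acceleration
    features = np.vstack([mfcc, delta, delta2])              # (39, n_frames)
    return features.T                                        # (n_frames, 39)
```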
4.2 Speaker Recognition Model:

Our speaker recognition model relies on the LSTM architecture. In speaker recognition, where we often work with
sequential data such as audio signals and phonetic features extracted from speech, LSTMs prove invaluable. These
networks are meticulously designed to model sequences effectively, capturing the extended dependencies inherent in
the data. This is particularly vital due to the temporal nature of speech signals, as LSTMs excel at retaining information
over longer intervals and adeptly grasping the contextual intricacies of sequential data [17]. We refined the model by
utilizing triplet loss, a technique that incentivizes the model to reduce the distance between embeddings originating
from the same speaker while simultaneously maximizing the separation between embeddings associated with distinct
speakers. Below is a brief overview of LSTM architecture.

LSTM is a type of recurrent neural network (RNN) that is specifically designed to learn long-term dependencies in
sequential data. LSTMs are more powerful than traditional RNNs because they can avoid the vanishing gradient
problem, which can make it difficult to train RNNs on long sequences.

The architecture of an LSTM cell consists of four main components:

Input gate: Responsible for regulating the extent to which the current input contributes to the cell state.
Forget gate: Manages the extent to which the previous cell state is disregarded or erased.
Cell state: Functions as the primary memory unit within the LSTM cell, preserving the long-term information that the
LSTM has acquired.
Output gate: Dictates the extent to which the cell state is directed to the subsequent layer of the network.

Each of these gates is implemented as a sigmoid neural network layer. The sigmoid function outputs values between 0
and 1, which can be used to control how much information is allowed to pass through the gate.

Fig.2 shows the architecture of an LSTM cell:

Fig. 2 LSTM cell architecture

The operation of the LSTM cell commences with the processing of the current input, a task entrusted to the input
gate. This gate plays a pivotal role in ascertaining the extent to which the current input contributes to the cell state.
Subsequently, the forget gate comes into play, serving as the arbiter of how much of the previous cell state should be
discarded or "forgotten." The cell state undergoes an update process, wherein it assimilates fresh information from the
input gate while concurrently purging outdated information as determined by the forget gate [18]. Lastly, the output
gate takes charge, determining the portion of the cell state that is forwarded to the subsequent layer within the network.

The following equations [5] describe the architecture of an LSTM cell:

Input gate:            i_t = σ(W_i [x_t, h_(t-1)] + b_i)      (1)
Forget gate:           f_t = σ(W_f [x_t, h_(t-1)] + b_f)      (2)
Output gate:           o_t = σ(W_o [x_t, h_(t-1)] + b_o)      (3)
Candidate cell state:  c̃_t = tanh(W_c [x_t, h_(t-1)] + b_c)   (4)
Cell state:            c_t = f_t * c_(t-1) + i_t * c̃_t        (5)
Output:                h_t = o_t * tanh(c_t)                  (6)

where:
i_t, f_t, o_t = the input, forget, and output gates, respectively.
σ = the sigmoid function.
W_x, b_x = the weight matrix and bias for the respective gate x.
x_t = the input at the current timestep t.
h_(t-1) = the output of the previous LSTM block (at timestep t-1).
c_t = the cell state (memory) at timestep t.
c̃_t = the candidate cell state at timestep t.

In addition to these components, LSTM cells maintain a hidden state, which acts as a vector encapsulating insights
into the information gleaned about the sequence up to the current time step. The propagation of this hidden state to
subsequent time steps empowers LSTMs to discern and model long-term dependencies within sequences.

Furthermore, it's noteworthy that LSTMs can be stacked together to construct deeper networks. This stacking
enables the output of one LSTM cell to serve as the input to the next, allowing the network to unravel more complex
and intricate long-term dependencies within the data.
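Before describing our specific model, the equations above can be summarized in a minimal NumPy sketch of a single LSTM time step; the weight and bias shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Eqs. (1)-(6).

    W and b are dicts with keys 'i', 'f', 'o', 'c'; each W[k] has shape
    (hidden_dim, input_dim + hidden_dim) and acts on the concatenation [x_t, h_(t-1)].
    """
    z = np.concatenate([x_t, h_prev])
    i_t = sigmoid(W['i'] @ z + b['i'])        # input gate, Eq. (1)
    f_t = sigmoid(W['f'] @ z + b['f'])        # forget gate, Eq. (2)
    o_t = sigmoid(W['o'] @ z + b['o'])        # output gate, Eq. (3)
    c_tilde = np.tanh(W['c'] @ z + b['c'])    # candidate cell state, Eq. (4)
    c_t = f_t * c_prev + i_t * c_tilde        # cell state update, Eq. (5)
    h_t = o_t * np.tanh(c_t)                  # hidden output, Eq. (6)
    return h_t, c_t
```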

In our specific research context as shown in Fig. 3, we employ the power of LSTM cells in the domain of speaker
recognition. The initial step involves the extraction of Mel-frequency cepstral coefficients (MFCCs) from audio signals,
capturing essential spectral and temporal features. These MFCC features are then fed into a 3-layer LSTM network,
chosen for its prowess in processing sequential data. In our experiments, the three-layer configuration converged
noticeably faster during training and delivered strong speaker recognition performance, indicating that this depth
captures the features essential for accurate recognition.

Crucially, during the training phase, we employ the triplet loss function. This loss function serves the purpose of
minimizing intra-speaker variability (the distance between anchor and positive examples) while concurrently
maximizing inter-speaker variability (the distance between anchor and negative examples). This strategic approach
ensures that our LSTM-based model excels in distinguishing between different speakers, making it a powerful and
effective solution for speaker recognition tasks.
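A hedged tf.keras sketch of this encoder and loss is given below; the LSTM width (128 units) and d-vector dimensionality (64) are illustrative assumptions rather than the exact hyperparameters of our model:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_speaker_encoder(n_mfcc=39, lstm_units=128, embedding_dim=64):
    """3-layer LSTM mapping a variable-length MFCC sequence to a d-vector."""
    inputs = layers.Input(shape=(None, n_mfcc))              # (frames, 39)
    x = layers.LSTM(lstm_units, return_sequences=True)(inputs)
    x = layers.LSTM(lstm_units, return_sequences=True)(x)
    x = layers.LSTM(lstm_units)(x)                           # last hidden state
    x = layers.Dense(embedding_dim)(x)
    # L2-normalise so distances between embeddings are well behaved.
    outputs = layers.Lambda(lambda v: tf.math.l2_normalize(v, axis=1))(x)
    return Model(inputs, outputs, name="speaker_encoder")

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull anchor/positive embeddings together, push anchor/negative apart."""
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))
```

During training, the same encoder is applied to anchor, positive, and negative utterances and the loss above is minimized over sampled triplets.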

Fig. 3 Proposed Model for Speaker Recognition.


4.2.1 Speaker Authentication:

Speaker authentication plays a crucial role in ensuring the security and accuracy of various
applications. Speaker recognition, a subset of speaker authentication, allows us to verify and identify
individuals based on their unique vocal characteristics.

For our speaker authentication task as displayed in Fig. 4, we employ a straightforward yet
effective approach that involves comparing two audio files, one from the test sample and one from the
enrolled speaker. This methodology is designed to verify whether the person represented by the test sample
matches the enrolled speaker's identity.

The process begins with the enrollment phase, where a user's unique vocal characteristics are
initially captured. During this phase, the enrolled speaker provides a sample of their voice, which is then
processed using the MFCC feature extraction and LSTM architecture. This enrollment audio file serves
as the reference point for the individual's vocal identity within our system. The LSTM network transforms
the features extracted from this audio into a d-vector [19], which is then securely stored for future
verification.

In the authentication phase, a new audio sample is provided for testing. This test audio file
represents the vocal input from an individual seeking authentication. We extract MFCC features from this
test sample and pass them through the same LSTM network architecture to generate a corresponding d-
vector. The key objective at this stage is to determine if the d-vector extracted from the test audio file is
sufficiently similar to the d-vector of the enrolled speaker.

To make this determination, we utilize the cosine similarity metric [20]. This involves calculating
the cosine similarity between the d-vector extracted from the test sample and the d-vector representing the
enrolled speaker. The resulting numerical value quantifies the similarity between the two voices. If this
value surpasses our predefined similarity threshold, typically set at 0.8 within our system, we confidently
authenticate the speaker as the enrolled individual. The choice of a threshold at 0.8 strikes a balance
between ensuring a high level of accuracy in speaker authentication while mitigating the risk of false
positives. Employing higher thresholds could establish stricter authentication criteria but might lead to the
rejection of legitimate speakers with minor vocal variations. Conversely, lower thresholds might be more
permissive but potentially heighten the risk of unauthorized access. Thus, a threshold of 0.8 represents a
well-considered compromise, optimizing both security and usability.
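The decision rule can be summarized by the short sketch below, using the 0.8 threshold described above (function names are illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two d-vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def authenticate(test_dvector, enrolled_dvector, threshold=0.8):
    """Accept the claimed identity if the similarity exceeds the threshold."""
    score = cosine_similarity(test_dvector, enrolled_dvector)
    return score >= threshold, score
```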

The results of the tests (1 to 3) are shown below in Fig. 4, Fig. 5 and Fig. 6:
Fig. 4 Test: 1

Fig. 5 Test: 2

Fig. 6 Test: 3
4.3 Speech Emotion Recognition:

In our research framework, we employ a Convolutional Neural Network (CNN) as shown in Fig. 7 to tackle the
task of emotion classification based on audio features. Our baseline model consists of one-dimensional convolutional
layers, integrated with crucial components such as dropout layers, batch normalization, and activation functions. The
input layer of our CNN is designed to accept 40 × 1 arrays, corresponding to the audio feature representations extracted
from sound files. Following this, the network initiates with an initial convolutional layer featuring 64 filters, each with
a kernel size of 5 and 'same' padding. This layer employs the Rectified Linear Unit (ReLU) activation function and
includes dropout with a rate of 0.2 to mitigate overfitting. A subsequent convolutional layer follows, comprising 128
filters and mirroring the configurations of the preceding layer. It also employs ReLU activation and dropout at the same
rate, contributing to the model's resilience.
Fig. 7 Proposed Model for Speech Emotion Recognition

Upon the convolutional layers, a flattening layer transforms the output into a one-dimensional tensor for further
processing. Subsequently, a fully connected layer adapts its size according to the number of distinct emotion classes, serving
as the output layer. This layer incorporates a softmax activation function to compute class probabilities. In terms of model
training, we configure it with categorical crossentropy loss, the Adam optimizer, and accuracy as the evaluation metric.
During training, we utilize a batch size of 16 and train for 50 epochs while validating the model's performance on a separate
dataset. Our evaluation metrics encompass confusion matrices and classification reports, providing a comprehensive
assessment of the model's capability to accurately predict emotions from audio data.
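A minimal tf.keras sketch of this baseline, following the layer configuration described above, is given below; the batch-normalization placement mentioned earlier is not specified in detail and is therefore omitted from this illustration:

```python
from tensorflow.keras import layers, models

def build_emotion_cnn(n_features=40, n_classes=8):
    """1-D CNN over the 40-dimensional audio feature vector described above."""
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.Dropout(0.2),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model

# Training setup as described in the text (X/y variables are placeholders):
# model.fit(X_train, y_train, batch_size=16, epochs=50,
#           validation_data=(X_val, y_val))
```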

4.3.1 Speech Emotion Recognition (SER) for Customer Service:


In the domain of customer service, the level of customer engagement is pivotal for the effective utilization
of Speech Emotion Recognition (SER) systems. These SER models are adept at discerning and analyzing emotional
expressions conveyed through speech, making them invaluable tools for gauging customer satisfaction and
emotional well-being during remote interactions. Our SER system excels at detecting a wide spectrum of emotional
states as shown in Fig. 8, including contentment, happiness, frustration, dissatisfaction, anger, surprise, or
confusion, thus providing a comprehensive framework for understanding and addressing customer emotions in
virtual service settings.
Fig. 8 Emotional States Detected by SER

In essence, customer engagement within SER-supported customer service translates to early detection of customer
concerns and a more holistic approach to service delivery. It fosters continuous emotional monitoring, improved customer
interaction, and customer-centric service, all of which resonate with the overarching objective of customer service—to
provide accessible and comprehensive assistance that caters to both practical and emotional customer needs. In today's era
of remote service provision, customer engagement within SER models plays a pivotal role in ensuring a compassionate,
responsive, and effective approach to addressing emotional aspects of customer interactions.

4.4 Multimodal Architecture:

Our research introduces a powerful multimodal architecture as shown in Fig. 9 that combines Speaker Recognition
and Speech Emotion Recognition, designed to enhance audio analysis by simultaneously identifying speakers and
recognizing emotional cues. This fusion is of particular significance in applications like speaker authentication, where it
strengthens security and accuracy, and in customer service, where it enables personalized emotional understanding
during remote interactions.

Fig. 9 Proposed multimodal approach


In this architecture, alongside speaker recognition, we apply speech emotion recognition to the test audio. This
dual-purpose approach not only verifies the speaker's identity but also offers valuable
insights into their emotional state, contributing to a more comprehensive understanding of customer needs.
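As an illustration of how the two branches can be combined at inference time, the sketch below ties together the helpers from the earlier sketches (extract_mfcc_sequence, authenticate) with a hypothetical extract_emotion_features routine; it is illustrative glue code under those assumptions, not our exact implementation:

```python
import numpy as np

def analyze_utterance(audio_path, enrolled_dvector, speaker_encoder,
                      emotion_model, emotion_labels, threshold=0.8):
    """Run speaker authentication and emotion recognition on one test utterance."""
    # Speaker branch: MFCC sequence -> d-vector -> cosine-similarity decision.
    mfcc_seq = extract_mfcc_sequence(audio_path)              # (frames, 39)
    test_dvec = speaker_encoder.predict(mfcc_seq[np.newaxis])[0]
    accepted, score = authenticate(test_dvec, enrolled_dvector, threshold)

    # Emotion branch: 40-d feature vector -> CNN -> predicted emotion label.
    feats = extract_emotion_features(audio_path)              # hypothetical helper, (40, 1)
    probs = emotion_model.predict(feats[np.newaxis])[0]
    emotion = emotion_labels[int(np.argmax(probs))]

    return {"authenticated": bool(accepted), "similarity": score, "emotion": emotion}
```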

5. RESULTS & DISCUSSIONS:

In this section, we present and discuss the outcomes of our experiments in the fields of speaker recognition and
speech emotion recognition (SER). Our analysis begins with a detailed examination of the speaker recognition results, where
we evaluate the performance of our proposed LSTM-based model against alternative architectures. Following this, we delve
into the results of our speech emotion recognition experiments using a CNN-based model. Both sections shed light on the
effectiveness of our models and their potential applications.

5.1 Speaker Recognition Results:

In this section, we delve deeper into the results of our speaker recognition experiments using the proposed LSTM-based
model. The accuracy of speaker recognition systems is of paramount importance in applications like voice-based user
authentication and security. Table 1 compares our proposed architecture with alternative architectures for speaker
recognition.

Table 1: Comparison of Speaker Recognition Performance

Architecture                      EER (%)
Proposed Model (LSTM + MFCC)         6.89
LSTM + LPC                           9.14
LSTM + Log-Mel                       7.89

The speaker recognition results presented in Table 1 highlight the performance of different architectures in terms of
Equal Error Rate (EER). The proposed model, which combines LSTM with MFCC features, achieves the lowest EER of
6.89%. This indicates its effectiveness in accurately identifying speakers, making it a promising choice for voice-based user
authentication and security applications.

Comparatively, the LSTM model combined with LPC features yields an EER of 9.14%, while the LSTM model
with Log-Mel features results in an EER of 7.89%. These findings suggest that the choice of feature representation
significantly impacts the performance of the speaker recognition system. The superior performance of the proposed model
reinforces the importance of using MFCC features in conjunction with LSTM for speaker recognition tasks.

These results underscore the potential of our proposed LSTM-based model, emphasizing its practical applicability
in real-world scenarios where accurate speaker recognition is essential for enhancing security and user authentication
systems.
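For completeness, the EER values reported above can be computed from genuine and impostor cosine-similarity scores; the scikit-learn-based sketch below illustrates one standard way to do so (it is not the exact evaluation script used here):

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for genuine trials, 0 for impostor trials; scores: similarities."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # point where FAR and FRR are closest
    return (fpr[idx] + fnr[idx]) / 2.0
```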

5.2 Speech Emotion Recognition Results:


Our research also encompasses the vital field of Speech Emotion Recognition (SER). Detecting emotions expressed
through speech has significant applications in customer service, sentiment analysis, and human-computer interaction.
Below, we present the outcomes of our emotion classification experiments utilizing the CNN-based model.
The confusion matrix shown in Table 2 provides a visual representation of the model's classification results across eight
emotion classes: Angry, Happy, Neutral, Unhappy, Relaxed, Fearful, Disgusted, and Surprised. The performance and
accuracy of the model is displayed in Table 3.
Table 2: Confusion Matrix (rows: true labels, columns: predicted labels)

             Angry  Happy  Neutral  Unhappy  Relaxed  Fearful  Disgusted  Surprised
Angry          173      8        1        7        1        0          1          1
Happy            1    102        7        4        2        2          4          1
Neutral          0     18      212        3       11        7          7          6
Unhappy          3      7        9      227        5       14          2          8
Relaxed          1      1        9        5      216       10          4          6
Fearful          3      4        7       17        3      200          1          6
Disgusted        0      5        1        4        7        7        171          2
Surprised        0      0        7        1        3        4         10        165

Table 3: Performance of the model on the test set for each individual class.

Emotion        Precision   Recall   F1-score   Support
Angry             0.96      0.90      0.93        192
Happy             0.70      0.83      0.76        123
Neutral           0.84      0.80      0.82        264
Unhappy           0.85      0.83      0.84        275
Relaxed           0.87      0.86      0.86        252
Fearful           0.82      0.83      0.82        241
Disgusted         0.85      0.87      0.86        197
Surprised         0.85      0.87      0.86        190
Accuracy                              0.85       1734
Overall avg       0.84      0.85      0.84       1734
Weighted avg      0.85      0.85      0.85       1734

The weighted average F1-score for the entire dataset was 0.85, indicating strong overall performance. The model
achieved an accuracy rate of 85% across all emotion categories.
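For reference, the confusion matrix in Table 2 and the per-class metrics in Table 3 can be produced with scikit-learn as in the sketch below; X_test, y_test (one-hot), and emotion_labels are placeholders for our held-out split:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

y_pred = np.argmax(emotion_model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

print(confusion_matrix(y_true, y_pred))                              # Table 2
print(classification_report(y_true, y_pred,
                            target_names=emotion_labels, digits=2))  # Table 3
```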
The outcomes from our Speech Emotion Recognition (SER) model emphasize its proficiency in detecting a broad
spectrum of emotional states in speech. Particularly, the model demonstrates exceptional performance in identifying
emotions like "Angry," "Relaxed," and "Disgusted," with impressive F1-scores. This capability holds significant relevance
in customer service, where timely recognition of emotional distress plays a pivotal role in delivering suitable assistance to
clients.
The robustness of the model's performance across multiple emotion categories, as indicated by the weighted average
F1-score of 0.85, demonstrates its versatility and suitability for various applications. In real-world scenarios, this level of
accuracy ensures that the emotional nuances conveyed through speech can be analyzed accurately and comprehensively.
Figure 10 depicts the gradual decline in the model's loss function across 50 training epochs, indicating a steady
enhancement in its ability to minimize prediction errors. Simultaneously, Figure 11 exhibits a corresponding rise in
accuracy, highlighting the model's improved proficiency in making accurate predictions as it fine-tunes its parameters.
These visual representations underscore the model's ongoing learning process and its applicability in real-world use cases,
including speaker recognition and emotion analysis in the context of customer service.
Fig. 10 Model loss trajectory over the span of 50 epochs.

Fig. 11 Accuracy trajectory over the span of 50 epochs.


In Fig. 12 we present the terminal output showcasing the outcomes of our multimodal approach. As depicted, the
system has successfully authenticated the speaker, indicating the robustness and precision of our speaker recognition model.
Simultaneously, the system has detected the speaker's emotional state as 'Relaxed,' highlighting the capability of our speech
emotion recognition component. This dual achievement exemplifies the potential of our research in enhancing practical
applications, particularly in speaker authentication scenarios. The successful authentication ensures the security of
interactions, while the emotion detection provides valuable insights into the user's emotional state, contributing to context-
aware audio analysis.

Fig. 12 Authentication and Emotion Detection Results.

6. CONCLUSIONS:
In this research, we have presented a comprehensive multimodal approach that addresses two crucial tasks in the
field of audio analysis: speaker recognition and speech emotion recognition. Our primary focus has been on practical
applications such as speaker authentication and emotion analysis for customer service.
For speaker recognition, we employed advanced deep learning techniques on acoustic features extracted from the
LibriSpeech dataset. Our model, based on a three-layer LSTM architecture fine-tuned with triplet loss, achieved remarkable
accuracy with an Equal Error Rate (EER) of 6.89%. This highlights its robustness and precision, making it well-suited for
practical applications like voice-based user authentication and security.

Simultaneously, we explored speech emotion recognition using Convolutional Neural Networks (CNNs) on the
RAVDESS dataset. Our model successfully recognized eight distinct emotional states, achieving an F1-score of 0.85. This
capability holds significant implications for applications in telemedicine, sentiment analysis, and customer service, where
understanding and responding to emotional cues are crucial.

Our multimodal approach, which combines speaker recognition and emotion analysis, lays the foundation for more
comprehensive and context-aware audio analysis. This approach has the potential to enhance security, improve customer
service interactions, and provide valuable insights into user emotions.

In future work, we aim to extend our multimodal approach to real-time applications, support multiple languages,
and address privacy and ethical considerations. We will explore continuous learning techniques to adapt our models over
time, apply them to human-robot interaction scenarios, investigate cross-modal fusion methods, and benchmark them
against state-of-the-art approaches. These efforts will further enhance the practical applications of speaker recognition and
emotion analysis, ensuring their relevance and effectiveness across various domains.

REFERENCES:

[1] Samia Abd El-Moneim, M. A. Nassar, Moawad I. Dessouky, Nabil A. Ismail, Adel S. El-Fishawy and Fathi E. Abd
El-Samie, “Text-independent speaker recognition using LSTM-RNN and speech enhancement,” Multimed Tools
Appl 79, 24013–24028, 2020.
[2] Kinnunen, Tomi and Li, Haizhou, “An Overview of Text-Independent Speaker Recognition: from Features to
Supervectors.” Speech Communication, 12-40, 2010.
[3] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio
books,” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5206-5210,
2015.
[4] Durairaj, Prabakaran and Sriuppili, S, “Speech Processing: MFCC Based Feature Extraction Techniques- An
Investigation,” Journal of Physics: Conference Series, 2021.
[5] Christian Bakke Vennerød, Adrian Kjærran and Erling Stray Bugge, “Long Short-term Memory RNN,” arXiv, 2021.
[6] Emmanuel Maqueda, Javier Alvarez-Jimenez, Carlos Mena and Ivan Meza, “Triplet loss-based embeddings for
forensic speaker identification in Spanish,” Springer Science and Business Media LLC, Volume 35, 18177-18186,
2021.
[7] Zhen-Tao Liu, Qiao Xie, Min Wu, Wei-Hua Cao, Ying Mei and Jun-Wei Mao, “Speech emotion recognition based
on an improved brain emotion learning model,” Neurocomputing, Volume 309, 145-156, 2018.
[8] M. G. de Pinto, M. Polignano, P. Lops and G. Semeraro, “Emotions Understanding Model from Spoken Language
using Deep Neural Networks and Mel-Frequency Cepstral Coefficients,” 2020 IEEE Conference on Evolving and
Adaptive Intelligent Systems (EAIS), 2020.
[9] Livingstone SR and Russo FA. "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS):
A dynamic, multimodal set of facial and vocal expressions in North American English," PLoS One, 2018.
[10] Ekman, P., “Basic emotions,” in Handbook of Cognition and Emotion, 45-60, 1999.
[11] Hu, Z.F. & Si, X.T. & Luo, Y. & Tang, S.S. and Jian, F. “Speaker recognition based on 3dcnn-lstm,” Engineering
Letters, 29, 463-470, 2021.
[12] Q. Xu, M. Wang, C. Xu and L. Xu, “Speaker Recognition Based on Long Short-Term Memory Networks,” 2020
IEEE 5th International Conference on Signal and Image Processing (ICSIP), 318-322, 2020.
[13] Ullah, Rizwan & Asif, Muhammad & Ali Shah, Wahab & Anjam, Fakhar & Ullah, Ibrar & Khurshaid,
Tahir & Wuttisittikulkij, L. & Shah, Shashi & Ali, Syedmansoor & Alibakhshikenari, Mohammad, “Speech Emotion
Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer,” Sensors, 23, 2023.
[14] Dias Issa, M. Fatih Demirci and Adnan Yazici, “Speech emotion recognition with deep convolutional neural
networks,” Biomedical Signal Processing and Control, Volume 59, 2020.
[15] P. Manikandan, K. Shrimathi, M. Kiruthika and A. Mubeena, "Speech Recognition using Fast Fourier Transform
Algorithm," International Journal of Engineering Research & Technology (IJERT),
Volume 10, 2022.
[16] Amit Meghanani and A G Ramakrishnan, “Pitch-synchronous DCT features: A pilot study on speaker
identification,” arXiv, 2018.
[17] J. Oruh, S. Viriri and A. Adegun, “Long Short-Term Memory Recurrent Neural Network for Automatic Speech
Recognition,” in IEEE Access, Volume 10, 30069-30079, 2022.
[18] Zazo R., “Language identification in short utterances using long short-term memory (LSTM) recurrent neural
networks,” PLoS One, 2016.
[19] Jung, Jeeweon & Heo, Heesoo & Yang, Ilho & Yoon, Sunghyun & Shim, Hye-Jin & Yu, Hajin, “D-vector based
speaker verification system using Raw Waveform CNN,” 2017 International Seminar on Artificial Intelligence,
Networking and Information Technology (ANIT 2017), 2018.
[20] Kuruvachan K. George, C. Santhosh Kumar, Sunil Sivadas, K.I. Ramachandran and Ashish Panda, “Analysis of
cosine distance features for speaker verification,” Pattern Recognition Letters, Volume 112, 285-289, 2018.
