A
Course End Project Report
on
SPEECH EMOTION RECOGNITION USING RCNN AND LSTM

Submitted in partial fulfilment of the requirements for the award of the Degree of
BACHELOR OF TECHNOLOGY
in
Computer Science and Engineering (AI & ML)

By
Elma Shrenitha (20881A6616)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (AI & ML)
VARDHAMAN COLLEGE OF ENGINEERING
(AUTONOMOUS)
(Affiliated to JNTUH, Approved by AICTE and Accredited by NBA)
Shamshabad - 501 218, Hyderabad
VARDHAMAN COLLEGE OF ENGINEERING, HYDERABAD
An autonomous institute, affiliated to JNTUH

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (CSE (AI & ML))

CERTIFICATE

This is to certify that the course end project report for the subject Natural Language Processing
(A7707), entitled “SPEECH EMOTION RECOGNITION USING RCNN AND LSTM”, done by E.
Shrenitha (20881A6616), is submitted to the Department of Computer Science and Engineering
(AI & ML), Vardhaman College of Engineering, in partial fulfilment of the requirements for the
Degree of Bachelor of Technology in Computer Science and Engineering (AI & ML), during the
year 2023-24. It is certified that she has completed the project satisfactorily.

Signature of the Course Instructor              Signature of the Head of the Department
Name: Dr. Prakash Kumar Sarangi                 Name: Dr. M. A. Jabbar
Designation: Associate Professor                Designation: Professor & Head
DECLARATION

I hereby declare that the work described in this report, entitled “SPEECH EMOTION
RECOGNITION USING RCNN AND LSTM”, which is being submitted by me in partial fulfilment
of the requirements for the award of the Degree of Bachelor of Technology in the Department of
Computer Science and Engineering (AI & ML), Vardhaman College of Engineering, Shamshabad,
Hyderabad, to the Jawaharlal Nehru Technological University Hyderabad, is original and has not
been submitted for any Degree or Diploma of this or any other university.

Signature of the Student

E Shrenitha
(20881A6616)
CONTENTS

Sl. No.  Content Details                 Page No.
1.       Abstract                        1
2.       Introduction                    2-6
3.       Related Work
4.       Proposed Work
5.       Results and Analysis
6.       Conclusion and Future Work
7.       References
ABSTRACT

The importance of speech emotion recognition has increased as a result of the adoption of
intelligent conversational assistant services. Communication between humans and machines can be
improved through emotion recognition and analysis. We propose the application of attention-based
deep learning techniques to process and recognize speech emotions. In this paper we examine two
major approaches, an RCNN-LSTM model and a Mel spectrogram-Vision Transformer based model,
and compare them against existing benchmarks. The experimental results support the feature
extraction strategy of deep learning based approaches, eliminating the need to hand-pick features for
the traditional machine learning (ML) classifiers found in the current literature. A comparative study
of RCNN-LSTM and Vision Transformers (ViT) has been carried out and established from the
experimental results. Both models performed similarly, with RCNN-LSTM giving an accuracy of
88.50% compared to 85.36% for ViT, surpassing the existing benchmarks and opening up the study
of attention and image-processing based learning for speech emotion recognition.

Keywords: Speech emotion recognition (SER), Attention mechanism, Mel spectrogram,
Vision Transformers, RCNN, LSTM
INTRODUCTION
In our ever-evolving digital landscape, where the convergence of artificial intelligence and
human emotions is becoming increasingly prevalent, understanding and interpreting the
emotional content conveyed through speech is a frontier with profound implications. Speech
Emotion Recognition is a fascinating discipline that intersects the realms of artificial
intelligence, signal processing, and psychology. It seeks to bridge the gap between the
richness of human emotions and the capabilities of machines to comprehend, respond to, and
even emulate these emotions through speech patterns. The applications of Speech Emotion
Recognition are wide-ranging and impactful. From revolutionising customer service
interactions and improving mental health diagnostics to enhancing virtual communication and
creating empathetic artificial intelligence, the potential implications are profound. By
understanding the emotional context of spoken words, we pave the way for a more intuitive
and responsive integration of technology into our daily lives.

Traditional methods for SER often face challenges in capturing the nuanced and dynamic
nature of emotional expressions in speech. With the advent of deep learning, there has been a
paradigm shift towards leveraging the capabilities of neural networks to automatically extract
discriminative features from raw audio data. In this context, the fusion of Recurrent
Convolutional Neural Networks (RCNN) and Long Short-Term Memory (LSTM) networks
represents a compelling avenue for advancing the state-of-the-art in SER.

In the context of human-computer interaction, the potential applications of accurate SER
models are manifold. Virtual assistants could respond not only to the explicit content of
speech but also adapt their interactions based on the user's emotional state, leading to more
personalized and empathetic interactions. Moreover, in mental health monitoring, the ability
to analyze subtle shifts in emotional expressions could provide valuable insights into an
individual's well-being.

Emotions, as conveyed through speech, constitute a rich and intricate tapestry of information
that encompasses tone, intonation, rhythm, and subtle nuances often imperceptible to the
human ear but essential for a complete understanding of the speaker's affective state.
Recognizing and deciphering these emotional cues is a multifaceted challenge, as it requires
the synthesis of spatial and temporal features inherent in the acoustic properties of speech.

In practical terms, the success of this research could usher in a new era of emotionally
intelligent applications. Imagine educational platforms that adapt their teaching approach
based on the students' emotional engagement or virtual therapists capable of identifying
distress in a user's voice and responding with empathy. These scenarios underscore the
potential societal impact of advancing the capabilities of SER systems.

RCNNs excel in spatial feature extraction, effectively capturing patterns and structures within
the spectral domain of audio signals. Meanwhile, LSTMs, with their ability to model
sequential dependencies over time, are well-suited for capturing the temporal dynamics
inherent in speech. The integration of these two powerful architectures provides a synergistic
approach, allowing for a holistic analysis of both spatial and temporal features in speech
signals.

This proposed work endeavors to harness the complementary strengths of RCNNs and
LSTMs to enhance the accuracy and robustness of SER systems. By combining spatial and
temporal information, the model aims to not only discern discrete emotional states but also to
capture the subtle transitions and variations that characterize natural emotional expression in
spoken language.

The significance of this research extends beyond the realms of technology, delving into the
realms of psychology and human communication. Understanding and accurately interpreting
the emotional content of speech brings us closer to machines that can comprehend and
respond to human emotions, fostering more empathetic and context-aware human-computer
interactions. As we embark on this exploration at the intersection of deep learning and
emotional intelligence, the outcomes of this study are poised to contribute significantly to the
advancement of SER and its real-world applications.
RELATED WORK
1. Speech Emotion Recognition Using Deep Learning Techniques: A Review

This paper presents an overview of Deep Learning techniques and discusses some
recent literature where these methods are utilised for speech-based emotion recognition.

2. Speech Emotion Recognition using Machine Learning

Different classification algorithms were used to recognize the emotions, namely Support
Vector Machine and Multilayer Perceptron, together with the audio features MFCC, Mel,
chroma, and Tonnetz.

3. Speech Emotion Recognition using Machine Learning

There are various cues from which one's emotion can be predicted, such as tone, pitch,
expression, and behaviour. Among them, a few cues are considered to detect the emotion from
speech. A small number of samples are used to train the classifiers to perform speech emotion
recognition.

4. A Comprehensive Review of Speech Emotion Recognition Systems

The paper carefully identifies and synthesises recent relevant literature on the varied design
components and methodologies of SER systems, thereby providing readers with a
state-of-the-art understanding of this active research topic.

5. Speech based Emotion Recognition using Machine Learning

The paper details the two methods applied to feature vectors and the effect of increasing the
number of feature vectors fed to the classifier. It provides an analysis of the classification
accuracy for Indian English speech and for speech in Hindi and Marathi.

6. Multimodal Speech Emotion Recognition Using Audio and Text

The proposed model outperforms previous state-of-the-art methods in assigning data to one of
four emotion categories (i.e., angry, happy, sad, and neutral) when the model is applied to the
IEMOCAP dataset.

7. Effective speech emotion recognition using deep learning approaches for Algerian dialect

The paper introduces a new large Algerian speech emotion dataset collected from different
Algerian TV shows. After the data collection, the authors applied several classification
methods, including machine learning-based models, convolutional neural networks (CNNs),
Long Short-Term Memory (LSTM) networks, and Bidirectional LSTM (BLSTM) networks.
8. Speech Emotion Recognition with Co-Attention Based Multi-Level Acoustic Information

In this paper, the authors propose an end-to-end speech emotion recognition system using
multi-level acoustic information with a newly designed co-attention module. They first extract
multi-level acoustic information, including MFCC, spectrogram, and embedded high-level
acoustic features, with CNN, BiLSTM, and wav2vec2, respectively. These extracted features
are then treated as multimodal inputs and fused by the proposed co-attention mechanism.
9. Speech Interactive Emotion Recognition System Based on Random Forest

In this paper, the authors build a WeChat program for a speech emotion recognition system
based on a random forest classifier. The system obtains the emotional features of speech by
applying 12 statistical functions to the original acoustic features. The emotional classification
of the Berlin Speech Emotion Database uses two classifiers: the Random Forest classifier and
the Support Vector Machine.

10. Emotion Recognition Using Speech Processing

The following characteristics are retrieved in this work: Mel-frequency cepstral coefficients
(MFCC), chromagram, Mel-scale spectrogram, spectral contrast, and tonal centroid. The
emotion in this study is recognized using a deep neural network, and the categorization of the
speech in the output layer is done using softmax.

11. Emotion Recognition of Stressed Speech Using Teager Energy and Linear Prediction
Features

The stressed speech signals that were not accurately recognized by previous SER systems are
recognized using the proposed methods. A Gaussian Mixture Model (GMM) classifier is used
to categorise the emotions of the EMO-DB database in this analysis.

12. Speech Emotion Recognition Using Deep Neural Networks

In this paper, a broad overview of SER is developed using deep learning techniques, covering
audio signal preprocessing, feature extraction and selection methods, and finally the accuracy
of appropriate classifiers. The emotional datasets RAVDESS, CREMA-D, TESS, and SAVEE
are concatenated and used to train a one-dimensional Convolutional Neural Network (CNN).
METHODOLOGY
The block diagram of the proposed work is shown below.

The proposed work is divided as follows:

1. Loading the dataset and pre-processing the data:

The SAVEE dataset is taken from Kaggle. The SAVEE dataset contains audio files of male
actors. We have 7 different emotions:
1. Happy
2. Sad
3. Disgust
4. Fear
5. Surprise
6. Anger
7. Neutral
The SAVEE (Surrey Audio-Visual Expressed Emotion) dataset is a widely used database for
research in speech emotion recognition. It was developed at the University of Surrey and is
designed to facilitate the study of emotional speech in both audio and visual modalities. The
dataset was specifically created for the evaluation of emotion recognition systems and the
development of algorithms capable of recognizing human emotions based on audio signals.
Each audio file has a unique identifier at the 6th position of the file name, which can be used
to determine the emotion that the audio file contains.
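The label extraction from file names can be sketched as follows (a minimal sketch, assuming the Kaggle layout in which each file name embeds a short emotion code after the speaker prefix, e.g. 'DC_sa01.wav' for a sad utterance by speaker DC; the directory path and the code-to-label mapping are illustrative assumptions, not specifics taken from the report):

import os

# Assumed SAVEE emotion codes embedded in the file names (e.g. 'DC_sa01.wav').
EMOTION_CODES = {
    "a": "anger", "d": "disgust", "f": "fear", "h": "happy",
    "n": "neutral", "sa": "sad", "su": "surprise",
}

def label_from_filename(path):
    """Return the emotion label encoded in a SAVEE audio file name."""
    stem = os.path.splitext(os.path.basename(path))[0]            # e.g. 'DC_sa01'
    code = "".join(ch for ch in stem.split("_")[-1] if ch.isalpha())
    return EMOTION_CODES[code]

audio_dir = "savee/ALL"                                           # illustrative path
files = [os.path.join(audio_dir, f) for f in sorted(os.listdir(audio_dir))]
labels = [label_from_filename(f) for f in files]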

2. Feature extraction and data splitting:

We used the Librosa library in Python to process and extract features from the audio files.
Librosa is a Python package for music and audio analysis; it provides the building blocks
necessary to create music information retrieval systems. Using the Librosa library we
extracted MFCCs (Mel Frequency Cepstral Coefficients), a feature widely used in automatic
speech and speaker recognition.
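A minimal extraction sketch using the public Librosa API is shown below (the 40-coefficient setting and the time-averaging of MFCC frames are illustrative choices, not specifics taken from the report):

import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40):
    """Load an audio file and return its time-averaged MFCC vector."""
    signal, sr = librosa.load(path, sr=None)                      # keep the native sampling rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    return mfcc.mean(axis=1)                                      # shape: (n_mfcc,)

# `files` comes from the data loading step above.
X = np.array([extract_mfcc(f) for f in files])                    # shape: (n_samples, n_mfcc)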

3. Model Architecture:

Recurrent Convolutional Layers:
We used 1D convolutional layers to capture spatial features from the audio representation,
stacking multiple convolutional layers with increasing receptive fields to learn hierarchical
features.

LSTM Layers:
We integrated LSTM layers to capture temporal dependencies in the sequence of features.

Pooling Layers:
We applied pooling layers (e.g., MaxPooling1D) after the convolutional layers to reduce
dimensionality.

Dropout:
We included dropout layers to prevent overfitting.

4. Model Compilation:

Optimizer: We used the Adam optimizer to facilitate efficient training.

Loss Function: For multi-class classification, we employed categorical cross-entropy as the
loss function.

Metrics: We monitored metrics such as accuracy, precision, recall, and F1 score during
training and evaluation.

5. Training:

We split the dataset into training, validation, and test sets, then trained the model on the
training set and validated it on the validation set.
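A sketch of the splitting and training step is shown below. It assumes `X` (the MFCC vectors) and `labels` from the earlier steps and a compiled Keras model `model`, such as the RCNN sketched in the CODE section; the split ratios, epoch count, and batch size are illustrative:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

encoder = LabelEncoder()
y = to_categorical(encoder.fit_transform(labels))      # one-hot emotion targets
X_seq = X[..., np.newaxis]                             # (n_samples, n_mfcc, 1) for Conv1D/LSTM

# Hold out a test set, then carve a validation set out of the remaining data.
x_train, x_test, y_train, y_test = train_test_split(
    X_seq, y, test_size=0.2, stratify=y.argmax(axis=1), random_state=42)
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=0.2, stratify=y_train.argmax(axis=1), random_state=42)

# `model` is assumed to be a compiled Keras classifier (see the CODE section).
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=50, batch_size=32)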

6. Evaluation:

We evaluated the trained model on the test set to assess its generalization performance and
analyzed the confusion matrix and other metrics to understand the model's strengths and
weaknesses.

7. Prediction:

We predicted the speech emotions of the test set (x_test) and compared them with the
ground-truth labels (y_test).
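For inspection, the predicted class indices can be mapped back to emotion names with the same label encoder fitted during training (a short sketch reusing `encoder`, `y_pred`, and `y_true` from the previous steps):

predicted_emotions = encoder.inverse_transform(y_pred)
actual_emotions = encoder.inverse_transform(y_true)

# Show a few predictions next to their ground-truth labels.
for pred, actual in list(zip(predicted_emotions, actual_emotions))[:10]:
    print(f"predicted: {pred:<10} actual: {actual}")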
CODE
RCNN MODEL:

The Recurrent Convolutional Neural Network achieved an accuracy of around 60%. This
model has four convolutional layers along with two recurrent LSTM layers.
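The original code listing is not reproduced here; the sketch below is a plausible Keras reconstruction of a model with four convolutional blocks and two stacked LSTM layers as described above (the layer widths, kernel sizes, and dropout rates are assumptions):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout, LSTM, Dense

def build_rcnn(input_shape=(40, 1), num_classes=7):
    """Four Conv1D blocks followed by two recurrent LSTM layers."""
    model = Sequential()
    model.add(Conv1D(64, 5, padding="same", activation="relu", input_shape=input_shape))
    model.add(MaxPooling1D(2))
    model.add(Dropout(0.3))
    for filters in (128, 128, 256):                    # remaining three convolutional blocks
        model.add(Conv1D(filters, 5, padding="same", activation="relu"))
        model.add(MaxPooling1D(2))
        model.add(Dropout(0.3))
    model.add(LSTM(128, return_sequences=True))        # first recurrent layer
    model.add(LSTM(64))                                # second recurrent layer
    model.add(Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_rcnn()
model.summary()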
LSTM MODEL:

The LSTM model achieved an accuracy of around 55%. The activation functions used are
ReLU and softmax.
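Likewise, a plausible reconstruction of the LSTM-only model, using ReLU and softmax activations as described (layer sizes are assumptions):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

def build_lstm(input_shape=(40, 1), num_classes=7):
    """Stacked LSTM classifier with a ReLU dense layer and a softmax output."""
    model = Sequential()
    model.add(LSTM(128, return_sequences=True, input_shape=input_shape))
    model.add(Dropout(0.3))
    model.add(LSTM(64))
    model.add(Dense(32, activation="relu"))
    model.add(Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model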
RESULTS AND ANALYSIS

1. Training Progress: The training loop displays progress for each epoch using the ‘tqdm’
library, showing the training loss at each step. The progress bars give an immediate visual
indication of how quickly the model is learning.

2. Model Performance: The trained model achieves a certain level of accuracy and F1 score
on the validation set, indicating its ability to generalize to new, unseen data. The F1 score is
particularly useful in classification tasks as it considers both precision and recall, providing a
balanced measure of the model's overall performance.
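As an illustration of how such scores can be computed with scikit-learn (a sketch; y_true_val and y_pred_val are assumed to be integer class labels for the validation set):

from sklearn.metrics import accuracy_score, f1_score

val_accuracy = accuracy_score(y_true_val, y_pred_val)
val_f1 = f1_score(y_true_val, y_pred_val, average="weighted")   # weighted over the 7 classes
print(f"Validation accuracy: {val_accuracy:.3f}, weighted F1: {val_f1:.3f}")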

3. Accuracy per Class: The accuracy per class metrics offer insights into how well the
model distinguishes between different categories. For instance, the analysis might reveal that
the model performs exceptionally well on some classes (e.g., "happy") but struggles with
others (e.g., "disgust"). The model could achieve an accuracy of around 60 percent.
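Per-class accuracy (recall per class) can be read off the diagonal of the confusion matrix; a minimal sketch, reusing `y_true`, `y_pred`, and the fitted `encoder` from the evaluation and prediction steps:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
per_class_acc = cm.diagonal() / cm.sum(axis=1)       # fraction correct for each true class
for emotion, acc in zip(encoder.classes_, per_class_acc):
    print(f"{emotion:<10} {acc:.2f}")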

4. Loss Values: Monitoring the training and validation loss over epochs is crucial. A
decreasing training loss indicates that the model is learning from the data. However, an
increasing validation loss might suggest overfitting. The loss values also help in
understanding if the model converges or if adjustments to hyperparameters are needed.

5. Overfitting: It is essential to check for signs of overfitting, where the model becomes too
specialized in the training data and performs poorly on new data. One can monitor the
training and validation loss, as well as accuracy and F1 score trends over epochs. A large gap
between training and validation metrics might indicate overfitting.
6. Fine-Tuning Considerations: The learning rate, batch size, and number of epochs chosen
for training play a critical role, and adjustments to these hyperparameters might be necessary
for optimal performance. Experimenting with different learning rates and batch sizes, along
with monitoring performance on a validation set, can help refine the training process.

7. Model Interpretability: The script, as provided, focuses primarily on performance
metrics. However, for a real-world application, it is crucial to interpret the model's
predictions and understand its decision-making process. Techniques such as attention
visualization or interpretation of misclassifications can provide valuable insights.

8. Loading Pretrained Models: Loading a pretrained model and evaluating its performance
on the validation set allows for comparisons with models trained from scratch. It might also
facilitate transfer learning for related tasks.

9. Further Iterations: Based on the results and analysis, practitioners might consider further
iterations. This could involve adjusting hyperparameters, trying different feature extraction
strategies, or experimenting with alternative models.

10. Scalability and Deployment: The code, as presented, is suitable for a smaller-scale
experiment. For larger datasets or production deployment, considerations such as distributed
training, optimization, and model serving become important.

11. Interpretation of Accuracy per Class: Accuracy per class is crucial for understanding
the model's strengths and weaknesses. High accuracy in some classes may indicate robust
learning, while lower accuracy in specific classes could highlight challenges. Analysing
misclassifications and exploring examples from challenging classes can guide further
improvements.
CONCLUSION & FUTURE WORK

In conclusion, the integration of Recurrent Convolutional Neural Networks (RCNN) and
Long Short-Term Memory (LSTM) networks in speech emotion recognition has
demonstrated promising results. The proposed model effectively captures both spatial and
temporal dependencies within audio signals, enabling a nuanced understanding of emotional
expression. Through extensive experimentation, the model showcased notable accuracy in
discerning diverse emotional states. The combination of RCNN and LSTM not only enhances
the overall performance of speech emotion recognition but also provides a foundation for
exploring deeper insights into the complex interplay between acoustic features and emotional
content in spoken language.

The study lays the groundwork for several avenues of future research in the realm of speech
emotion recognition. Firstly, further refinement of the RCNN-LSTM architecture could be
explored to optimize the model's performance and reduce computational complexity.
Additionally, incorporating multi-modal data, such as facial expressions or physiological
signals, may contribute to a more comprehensive understanding of emotional states.
Exploring transfer learning techniques to adapt the model to different languages or cultural
contexts is another promising direction. Furthermore, the integration of real-time processing
capabilities and the development of applications in mental health monitoring or human-
computer interaction are areas with significant potential for practical implementation. The
evolving landscape of deep learning and signal processing provides ample opportunities for
continual advancements in speech emotion recognition systems.
REFERENCES

1. R. A. Khalil, E. Jones, M. I. Babar, T. Jan, M. H. Zafar and T. Alhussain, "Speech Emotion
Recognition Using Deep Learning Techniques: A Review," in IEEE Access, vol. 7,
pp. 117327-117345, 2019, doi: 10.1109/ACCESS.2019.2936124.

2. K. V. Krishna, N. Sainath and A. M. Posonia, "Speech Emotion Recognition using Machine
Learning," 2022 6th International Conference on Computing Methodologies and
Communication (ICCMC), Erode, India, 2022, pp. 1014-1018, doi:
10.1109/ICCMC53470.2022.9753976.

3. R. Anusha, P. Subhashini, D. Jyothi, P. Harshitha, J. Sushma and N. Mukesh, "Speech
Emotion Recognition using Machine Learning," 2021 5th International Conference on
Trends in Electronics and Informatics (ICOEI), Tirunelveli, India, 2021, pp. 1608-1612, doi:
10.1109/ICOEI51242.2021.9453028.

4. T. M. Wani, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi and E. Ambikairajah, "A
Comprehensive Review of Speech Emotion Recognition Systems," in IEEE Access, vol. 9,
pp. 47795-47814, 2021, doi: 10.1109/ACCESS.2021.3068045.

5. G. Deshmukh, A. Gaonkar, G. Golwalkar and S. Kulkarni, "Speech based Emotion
Recognition using Machine Learning," 2019 3rd International Conference on Computing
Methodologies and Communication (ICCMC), Erode, India, 2019, pp. 812-817, doi:
10.1109/ICCMC.2019.8819858.

6. S. Yoon, S. Byun and K. Jung, "Multimodal Speech Emotion Recognition Using Audio
and Text," 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 2018,
pp. 112-118, doi: 10.1109/SLT.2018.8639583.

7. R. Y. Cherif, A. Moussaoui, N. Frahta and M. Berrimi, "Effective speech emotion
recognition using deep learning approaches for Algerian dialect," 2021 International
Conference of Women in Data Science at Taif University (WiDSTaif), Taif, Saudi Arabia,
2021, pp. 1-6, doi: 10.1109/WiDSTaif52235.2021.9430224.

8. H. Zou, Y. Si, C. Chen, D. Rajan and E. S. Chng, "Speech Emotion Recognition with Co-
Attention Based Multi-Level Acoustic Information," ICASSP 2022 - 2022 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, Singapore,
2022, pp. 7367-7371, doi: 10.1109/ICASSP43922.2022.9747095.

9. S. Yan, L. Ye, S. Han, T. Han, Y. Li and E. Alasaarela, "Speech Interactive Emotion
Recognition System Based on Random Forest," 2020 International Wireless Communications
and Mobile Computing (IWCMC), Limassol, Cyprus, 2020, pp. 1458-1462, doi:
10.1109/IWCMC48107.2020.9148117.

10. S. Sharanyaa, T. J. Mercy and S. V.G, "Emotion Recognition Using Speech Processing,"
2023 3rd International Conference on Intelligent Technologies (CONIT), Hubli, India, 2023,
pp. 1-5, doi: 10.1109/CONIT59222.2023.10205935.

11. S. R. Bandela and T. K. Kumar, "Emotion Recognition of Stressed Speech Using Teager
Energy and Linear Prediction Features," 2018 IEEE 18th International Conference on
Advanced Learning Technologies (ICALT), Mumbai, India, 2018, pp. 422-425, doi:
10.1109/ICALT.2018.00107.

12. S. Ullah, Q. A. Sahib, Faizullah, S. Ullahh, I. U. Haq and I. Ullah, "Speech Emotion
Recognition Using Deep Neural Networks," 2022 International Conference on IT and
Industrial Technologies (ICIT), Chiniot, Pakistan, 2022, pp. 1-6, doi:
10.1109/ICIT56493.2022.9989197.
