
JSS ACADEMY OF TECHNICAL EDUCATION

JSS Campus, Dr. Vishnuvardhan Road, Bangalore – 560060

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

TECHNICAL SEMINAR
2021-2022

“SPEECH EMOTION RECOGNITION USING DEEP LEARNING”

Presented By
Yashu S K
1JS19EC164

Seminar Incharge

NAME
DESIGNATION
DEPT. OF E&C
JSSATE-BANGALORE-60
 
Outline
• Introduction
• Literature survey
• Explanation of the chosen topic (concept,
principle of working, technology
considerations, with diagrams)
• Advantages and Limitations
• Applications
• References
Introduction
• Human machine interaction is widely used nowadays in many applications, and one
of the mediums of interaction is speech. One of the main challenges in human
machine interaction is the detection of emotion from speech.
• Emotion can play an important role in decision making. Emotion can be detected
from different physiological signals. If emotion can be recognized properly from
speech, then a system can act accordingly.
• Emotion is identified by extracting features or other distinguishing
characteristics from the speech; training on a large speech database is
needed to make the system accurate.
Literature Survey
• In [1], the authors review neural networks and their applications in Speech
Emotion Recognition, review the MLP (Multilayer Perceptron) classifier and
its usage in Speech Emotion Recognition, and compare the performance of
various approaches and techniques used for the task.
• In [2], the authors give a detailed review of Convolutional Neural Networks
(CNNs) and their applications in Voice Emotion Recognition, along with an
overview of Decision Tree algorithms and their usage in Voice Emotion Recognition.
• In [3], the authors provide a detailed review of various deep learning-
based SER techniques, including convolutional neural networks (CNNs),
recurrent neural networks (RNNs), and hybrid models, and compare
different feature extraction methods such as Mel-frequency
cepstral coefficients (MFCCs), log Mel spectrograms, and deep neural
networks (DNNs).
• In [4], the authors provide a detailed review of various SER techniques, including
traditional machine learning techniques such as support vector machines (SVMs),
k-nearest neighbors (k-NN), and Gaussian mixture models (GMMs), as well as deep
learning techniques such as convolutional neural networks (CNNs). The paper also
covers the datasets used for SER, including the Berlin Emotional Speech Database
(Emo-DB), the Emotional Prosody Speech and Transcripts (EPST) corpus, and the
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS).

• In [5], the proposed model is based on a bidirectional long short-term memory
(BLSTM) network with an attention mechanism that focuses on a local segment of the
input sequence. The model is evaluated on two benchmark datasets, the Emo-DB and
the Toronto Emotional Speech Set (TESS). A minimal sketch of such a model appears below.
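
As an illustration of the kind of architecture reviewed in [5], here is a minimal PyTorch sketch of a BLSTM with soft attention pooling over frame-level features. The layer sizes, class count, and the use of global (rather than local) attention are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class BLSTMAttention(nn.Module):
    """Sketch of a BLSTM with attention pooling; sizes are illustrative."""
    def __init__(self, n_features=40, hidden=128, n_classes=7):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden,
                             batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)   # one attention score per frame
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, frames, n_features)
        h, _ = self.blstm(x)                    # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over frames
        pooled = (w * h).sum(dim=1)             # weighted sum of frame states
        return self.out(pooled)

logits = BLSTMAttention()(torch.randn(2, 100, 40))  # smoke test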

• In [6], the authors use transfer learning to adapt a DNN pre-trained on a large-scale
audio dataset to the SER task. They fine-tune the DNN on a smaller labeled SER dataset
and use data augmentation techniques to improve the model's robustness to noise and
variability. The proposed model is evaluated on the Berlin Database of Emotional
Speech (Emo-DB).

• In [7], the proposed model is based on a CNN architecture consisting of multiple
convolutional and pooling layers, followed by a fully connected layer. The authors use
log-mel filter banks to extract features from the speech signals and train the CNN model
to classify emotions into six categories: happy, sad, angry, fearful, surprised, and
neutral. A sketch of log-mel feature extraction follows.
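
To make the feature pipeline described in [7] concrete, here is a minimal log-mel extraction sketch using librosa; the file name, sampling rate, and frame parameters are illustrative assumptions, not values from the paper.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # illustrative input file

# 40-band log-mel filter bank features; n_fft/hop sizes are illustrative.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                     hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel, ref=np.max)  # shape: (40, n_frames)
```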
System Block Diagram
Methodology

1. Preprocessing

Preprocessing removes unwanted noise and variability from the speech signal
(a minimal sketch follows the list):

• Silence removal
• Background noise removal
• Windowing
• Normalization
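
A minimal sketch of these preprocessing steps, assuming librosa is available; the file path, dB threshold, and frame sizes are illustrative, and background noise removal is omitted.

```python
import numpy as np
import librosa

# Illustrative input file; RAVDESS clips would be loaded the same way.
y, sr = librosa.load("speech.wav", sr=16000)

# Silence removal: trim leading/trailing regions more than 30 dB below peak.
y, _ = librosa.effects.trim(y, top_db=30)

# Normalization: scale to unit peak amplitude.
y = y / np.max(np.abs(y))

# Windowing: 25 ms frames with a 10 ms hop, each tapered by a Hamming window.
# (Background noise removal, e.g. spectral gating, is left out of this sketch.)
frames = librosa.util.frame(y, frame_length=400, hop_length=160)
windowed = frames * np.hamming(400)[:, np.newaxis]
```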
2. Feature Extraction

Extract features from the audio file (a sketch follows the list):

• Pitch
• Loudness
• Rhythm
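
A minimal sketch of extracting these three features with librosa; the specific estimators (pYIN for pitch, RMS energy as a loudness proxy, tempo as a rhythm proxy) are illustrative choices.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)  # illustrative input file

# Pitch: frame-wise fundamental frequency via the pYIN tracker.
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C7"), sr=sr)

# Loudness proxy: root-mean-square energy per frame.
rms = librosa.feature.rms(y=y)[0]

# Rhythm proxy: a global tempo estimate in beats per minute.
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

features = {
    "mean_pitch_hz": float(np.nanmean(f0)),   # NaNs mark unvoiced frames
    "mean_rms": float(rms.mean()),
    "tempo_bpm": float(np.atleast_1d(tempo)[0]),
}
print(features)
```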
DATASET

• Ryerson Audio-Visual Database of Emotional Speech and
Song (RAVDESS) dataset.
• The RAVDESS dataset has recordings of 24 actors, 12 male
and 12 female, numbered from 01 to 24, speaking in a
North American accent.
• All emotional expressions are uttered at two levels of
intensity, normal and strong, except for the 'neutral'
emotion, which is produced only at normal intensity. Thus
the portion of the RAVDESS that we use contains 60
trials for each of the 24 actors, making 1440 files
in total. (A filename-parsing sketch follows.)
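
RAVDESS file names encode the emotion label directly, so a sketch like the following can build the labeled file list; the dataset root path is a placeholder.

```python
from pathlib import Path

# RAVDESS names use seven two-digit fields, e.g. "03-01-05-01-02-01-12.wav":
# modality-vocalchannel-emotion-intensity-statement-repetition-actor.
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def label_from_filename(path):
    """Return (emotion, actor) parsed from a RAVDESS file name."""
    fields = path.stem.split("-")
    return EMOTIONS[fields[2]], fields[6]

# "RAVDESS" is a placeholder for the local dataset root.
dataset = [(f, *label_from_filename(f)) for f in Path("RAVDESS").rglob("*.wav")]
```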
Training process workflow
Testing process workflow
3. Classification

Match the extracted features with the corresponding emotions.

Multi-Layer Perceptron Classifier

A multilayer perceptron (MLP) is a class of feedforward artificial neural
network (ANN).
An MLP consists of at least three layers of nodes: an input layer, a hidden
layer, and an output layer.
MLPs are suitable for classification prediction problems where inputs are
assigned a class or label. A minimal training sketch follows.
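
A minimal training sketch with scikit-learn's MLPClassifier; the feature dimensionality, hidden layer size, and hyperparameters are illustrative assumptions, and the random arrays stand in for real extracted features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Stand-ins for real per-utterance features (e.g., mean MFCCs) and the
# 8 RAVDESS emotion classes; shapes match the 1440-file corpus above.
X = np.random.rand(1440, 40)
y = np.random.randint(0, 8, size=1440)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Input layer -> one hidden layer -> output layer, as described above;
# the hidden size and training settings are illustrative.
clf = MLPClassifier(hidden_layer_sizes=(300,), alpha=0.01, batch_size=256,
                    learning_rate="adaptive", max_iter=500)
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```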
Classification Matrix
Advantages
• Improved Accuracy
• Reduced Need for Feature Engineering
• Robustness to Variations in Speech
• Real-time Processing
• Scalability
Limitations

• Speaker variability
• Channel variability
• Data Availability
• Lack of Diversity in Datasets
• Processing Time
Applications
• Customer Service Chatbots
• Human-Computer Interaction
• Sentiment Analysis
• Mental Health Diagnosis
• Entertainment Industry
• Voice-Based Personal Assistants
References
[1] Jerry Joy, Aparna Kannan, Shreya Ram, S. Rama, "Speech Emotion Recognition
using Neural Network and MLP Classifier," IJESC, April 2020.

[2] Navya Damodar, Vani H Y, Anusuya M A, "Voice Emotion Recognition using CNN
and Decision Tree," International Journal of Innovative Technology and Exploring
Engineering (IJITEE), October 2019.

[3] Y. Fan, et al., "Speech Emotion Recognition Based on Deep Learning: A Review,"
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp.
2060-2083, 2020.

[4] P. Pandey, et al., "A Survey on Speech Emotion Recognition: Techniques, Datasets,
and Applications," Journal of Ambient Intelligence and Humanized Computing,
vol. 11, pp. 4073-4093, 2020.

[5] Z. Zhang, et al., "Speech Emotion Recognition using Recurrent Neural Networks
with Local Attention," Proceedings of the 2018 IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), pp. 5239-5243, 2018.

[6] H. Lee and D. Yoon, "Speech Emotion Recognition using Deep Neural Network
with Transfer Learning," Proceedings of the 2018 IEEE International Conference
on Big Data and Smart Computing (BigComp), pp. 167-170, 2018.

[7] S. Kwon, et al., "Speech Emotion Recognition with Convolutional Neural
Networks," Proceedings of the 2017 IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), pp. 2227-2231, 2017.
Thank You
