
Emotion Recognition from Speech

Dinesh Kumarasamy 1, S. Avinash 2, S. Balaji 3, R. S. Deepak Raj 4 and J. Jenita Hermina 5

1,2,3,4 Department of CSE, Kongu Engineering College, Perundurai, Erode, India.
5 Department of CSE, Er. Perumal Manimekalai College of Engineering, Hosur, India.

Abstract. Emotions play a vital role in human life, and recognizing human
emotions is important for interpersonal communication. The most common
human emotions are sadness, happiness, anger, disgust, fear and surprise,
which are reflected through speech, facial expressions and body gestures.
Among these, speech is the mode most commonly used to recognize human
emotions. Researchers have therefore introduced several techniques for
predicting emotions from speech; however, the existing techniques achieve
only limited accuracy. The main objective of the proposed work is to increase
the prediction accuracy. Hence, this work introduces a Speech Emotion
Recognition (SER) technique for predicting emotions such as angry, calm,
fear, happy and sad. The SER technique recognizes emotion in multiple levels.
In the first level, the speech audio is pre-processed to remove noise. In the
second level, the features of the speech audio are extracted from the speech
data-set. In the third level, the features are selected using an Auto Encoder
(AE). After feature selection, a Support Vector Machine (SVM) is used to
identify the gender and a Deep Convolutional Neural Network (DCNN) is used
to classify the emotions. The proposed work is experimented in Colab, and the
results from the SVM and DCNN outperform the existing algorithms by
accurately predicting the different emotions.

Keywords: Emotions, Machine learning, Deep learning, Speech recognition.

1 Introduction

Emotions are the feelings of an individual and are used to describe an individual's
situation based on those feelings. The most common human emotions are happiness,
sadness, anger, fear, disgust and surprise [1]. Humans express their emotions in
various modes such as facial expressions, body gestures, speech and so on [2]. Among
these, speech is the mode most commonly used to recognize emotions. Thus,
predicting human emotions through speech is the preferred way of determining a
person's present state. Recognition of human emotions plays an important role in
healthcare, interpersonal communication and marketing [3]. For interpersonal
communication, emotion plays a vital role in the evolution of recent technologies such
as Google speech recognition, Cortana, Alexa and so on [4].

Researchers have therefore introduced several techniques, such as image processing,
multi-hop attention mechanisms, Recurrent Neural Networks (RNN) and subtraction
pre-processing, to predict emotion [5]. The existing methods show less accuracy in
predicting emotions from speech. In order to improve accuracy, some researchers have
applied machine learning techniques for predicting emotions [6]. Machine learning is
a subset of Artificial Intelligence; it is the ability of machines to learn from past
experience and improve based on what has been learned. Machine learning offers
various algorithms to predict and classify speech emotion. Even though researchers
have introduced various techniques for predicting emotions, the existing techniques
sometimes do not recognize emotions correctly. Therefore, this work introduces a new
technique called the Speech Emotion Recognition (SER) technique.
The SER technique has four levels for predicting the emotion correctly: preprocessing,
feature extraction, feature selection and classification. The preprocessing level
removes the noise from the speech signals. Feature extraction then extracts 39
coefficients from the audio signal. After extracting the features, feature selection
chooses the required features through an Auto Encoder [7]. After selecting the
features, an SVM classifier is used to identify the gender [8] and a DCNN is used to
classify the emotions [9]. The following section describes the literature review.
Section 3 explains the motivation of the proposed work. Section 4 gives a detailed
explanation of the working principle of the proposed work. Section 5 provides the
experimental setup and result analysis, and Section 6 provides the conclusion and
future work.

2 Literature Review

Due to the evolution of new technologies, human emotions play a major role in
interpersonal communication. Researchers have introduced different algorithms in
different technologies to accurately predict emotions. The approaches used for
recognizing emotions are broadly classified into three types: statistical methods,
knowledge-based techniques and hybrid approaches. Among these, this work chooses
statistical methods to analyze and predict emotions using a statistical data-set.
Researchers have introduced several approaches with different technologies, such as
deep learning, image processing, machine learning and Natural Language Processing
(NLP), for detecting emotions. For predicting from a statistical data-set, most
researchers prefer machine learning over the other methods.
Initially, Sudarsana Reddy Kadiri introduced a work called "Excitation Features of
Speech for Emotion Recognition Using Neutral Speech as Reference" to identify
emotions from speech. The work captures deviations from the neutral state in a 2-D
feature space and uses two data-sets, for the Berlin (German) and Telugu languages,
to recognize the emotions. The recognition rate is 76.1%, and the work handles only
short speech audio of about 2 seconds; it cannot recognize the emotions correctly
from such short audio segments [10].

In order to increase the recognition rate, Pengcheng Wei proposed a system called
"A novel speech emotion recognition algorithm based on wavelet kernel sparse
classifier in stacked deep auto-encoder model". To improve speech emotion
recognition, the system uses an improved stacked kernel sparse deep model algorithm,
which is based on an auto-encoder, a denoising auto-encoder and a sparse
auto-encoder. The recognition rate of the system is 80.95%. Although the work gives
a good recognition rate, the model is complex because it uses many encoders [11].
In order to make the model simpler and to gain good accuracy, Linhui Sun proposed
a system called "Decision tree SVM model with Fisher feature selection for speech
emotion recognition". This work introduces Fisher feature selection to extract the
features which are used to train each SVM in the decision tree. The system used two
data-sets, for the Chinese and Berlin languages, and achieved recognition rates of
83.75% for the Chinese data-set and 87.55% for the Berlin data-set. Although the
work gives a good recognition rate, it does not correctly distinguish between fear
and sadness [12].
Similarly, Xingfeng Li proposed a work called "Improving multilingual speech
emotion recognition by combining acoustic features in a three-layer model". The work
initially extracts acoustic features from the speech data-set, normalizes them using a
speaker normalization method and selects some of the features using the Fisher
Discriminant Ratio (FDR). With the help of the selected features, the different
emotion dimensions, namely arousal, pleasure and power, are identified by training
logistic model trees. However, the work achieves different accuracies for different
languages. In order to improve the accuracy, Xingfeng Li introduced a technique
called segment repetition with data augmentation, which yields a high accuracy of
98.16% after data augmentation. Even though the method gives high accuracy, it fails
to classify the emotion based on gender [13].
Due to the evolution of recent technologies, authors have also used deep learning
techniques to recognize emotion from speech. Bagas Adi Prayitno proposed a
methodology called "Segment Repetition Based on High Amplitude to Enhance a
Speech Emotion Recognition" to recognize emotion using deep learning. The work
uses the Berlin Emotional Speech Database (Berlin EMO-DB) and a Long Short Term
Memory (LSTM) network with 3 core layers and 2 recurrent layers for the
classification of different emotions. The method does not classify the emotions
correctly and gives a low accuracy of 66.18% [14].
To identify emotion based on gender, Ftoon Abu Shaqra proposed a model called
"Recognizing Emotion from Speech Based on Age and Gender Using Hierarchical
Models". The proposed model uses hierarchical classification models to study the
benefit of identifying gender and age in the process of emotion recognition, and uses
the Toronto Emotional Speech Set (TESS) for emotion recognition. Although the
work classifies the emotions based on gender, the system obtains a low accuracy of
74% [15].
To effectively differentiate emotions from the given speech data-set, Mohit Shah
proposed a method called "Articulation constrained learning with application to
speech emotion recognition". A discriminative learning method is used to effectively
recognize emotions by separating the features based on vowel arousal. The model
distinguishes happiness from the other emotions more accurately in the
ElectroMagnetic Articulography (EMA) database and the Interactive Emotional
Dyadic Motion Capture (IEMOCAP) database. However, the method fails to
recognize emotions from long audio speech [16].
To recognize emotion from long audio speech, Jian-Hua Tao introduced a method
called "Semi-supervised Ladder Networks for Speech Emotion Recognition" to
recognize emotions through a semi-supervised ladder network. This method has two
encoders for reducing the noise and cleaning the input signal. The method is trained
on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, which
has about 12 hours of audio. The results show that the ladder network obtains 83.3%,
outperforming existing networks such as Denoising Autoencoders (DAE) and
Variational Autoencoders (VAE). Moreover, the method has a higher classification
rate for the angry and happy emotions [17].
To effectively reduce the confusion between emotions and to improve the speech
emotion recognition rate, Linhui Sun proposed a method called "Speech Emotion
Recognition Based on DNN-Decision Tree SVM Model". The input signal is
preprocessed by the pre-emphasis method and the MFCC features are extracted by
framing. The extracted features are used to train a decision tree SVM and a DNN to
classify the emotions. The results obtained from the method show that the average
emotion recognition rate is 75.83% [18].
From the state of the art, emotion recognition still has some limitations in accurately
predicting emotions. These limitations motivate the introduction of a new approach.
The following section describes the motivation of the proposed work.

3 Motivation of the proposed work

Emotions play an important role in human mental life, but people often find it
difficult to understand other people's emotions. Moreover, due to the evolution of
technology, emotions also have an impact on interacting with systems, for example to
play songs. Detecting emotions is therefore addressed in this work, which also helps
to improve the interaction between machines and humans. Scholars have researched
this topic for many years and introduced different techniques to classify various
emotions such as happy, sad, anger and neutral. However, the existing techniques still
have some limitations, such as a low identification rate, support for a limited number
of languages and the ability to predict only a small number of emotions. In addition,
the existing systems have complex data extraction processes and find it difficult to
handle people of different ages. These difficulties in the existing work motivate the
introduction of a new approach for predicting human emotion. Therefore, this work
proposes a new technique called the Speech Emotion Recognition (SER) technique.
The following section briefly describes the proposed system.

4 Proposed System

This work proposes a new technique called Speech Emotion Recognition (SER) to
recognize emotions from a given speech input. The proposed technique has various
stages to process the input speech and predict the emotion. The human speech is
given as input to the system; for this purpose the RAVDESS data-set is used [19].
The entire SER system is implemented in Google Colab [20].

Fig. 1. Block Diagram for Speech Emotion Recognition

Fig. 1 represents the overall view of the emotion recognition. The SER technique has
different phases, namely preprocessing, feature extraction, feature selection and
classification of emotion. Initially, the audio samples are preprocessed using a
Convolutional Neural Network (CNN) to eliminate the noise. After preprocessing,
various features are extracted with the aid of different methods such as Mel-frequency
Cepstral Coefficients (MFCC), Harmonics to Noise Ratio (HNR), Zero Crossing Rate
(ZCR), Teager Energy Operator (TEO) and so on. After extracting the features, an
Auto-Encoder (AE) selects the relevant features to reduce the input vector dimension.
After feature selection, a Support Vector Machine (SVM) classifies the gender from
the input audio and a Deep Convolutional Neural Network (DCNN) approach is used
to predict the different emotions.

4.1 Pre-processing
Data pre-processing is a technique used mainly to convert raw speech data into a
useful data-set. The given data-set is not in a standard form because the audio signals
are obtained from different sources, which makes it unsuitable for direct analysis.
Therefore, the audio signal needs to be preprocessed to standardize it. In addition, the
pre-processing stage eliminates the noise in the audio signal: the audio signals are
preprocessed using a voice activity detection method, which removes noise by
distinguishing voiced speech, unvoiced speech and silence, and the proposed work
also normalizes the audio signal. Further, the preprocessed audio signal would
otherwise occupy more space and increase the time complexity in a large neural
network; hence, the proposed work uses a dimensionality reduction technique to
reduce the size of the audio signal and to minimize the time complexity. Finally, the
preprocessed audio signal is given as input to the feature extraction phase.
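The following minimal sketch illustrates this stage in Python with the librosa library (which the experimental setup in Section 5 uses); the energy-based trimming and peak normalization shown here are only a stand-in for the voice activity detection and normalization described above, not the exact implementation.

import librosa

def preprocess(path, target_sr=22050):
    # Load the audio and resample to a common rate
    signal, sr = librosa.load(path, sr=target_sr)
    # Energy-based trimming as a simple stand-in for voice activity detection
    signal, _ = librosa.effects.trim(signal, top_db=25)
    # Peak-normalize the signal so all recordings share a common scale
    signal = librosa.util.normalize(signal)
    return signal, sr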

4.2 Feature extraction

After pre-processing, the audio signals are handed over to the feature extraction
phase, which extracts different features from the audio signal using various methods
such as MFCC, ZCR and HNR. Among these, the MFCC method yields 39
coefficients: 12 MFCC features, 12 Delta MFCC features, 12 Delta Delta MFCC
features, 1 (log) frame energy, 1 Delta (log) frame energy and 1 Delta Delta (log)
frame energy. The MFCC algorithm has different steps to extract the features:
framing and windowing, FFT, Mel filter bank and frequency warping, and DCT.
Fig. 2 represents the different stages of feature extraction.

Fig. 2. Stages of Feature Extraction


Framing: The preprocessed audio signal consists of N samples, and adjacent frames
are separated using delta coefficients. The signal is fragmented into small blocks of
20-30 ms, called frames; the number of frames varies based on the time duration of
the signal. Moreover, short-time spectral analysis is carried out to examine the
necessary short periods of time and to check whether the time frame is fixed for the
given set.
Windowing: The short-duration audio signal is grouped into sets of frames (for
example, multiples of 5) in order to maintain the continuity of the signal. Generally, a
window function is applied to reduce spectral distortion by tapering the voice samples
to zero at both the beginning and the end of each frame.
Fast Fourier Transform (FFT): The FFT converts the signal from the time domain
into the frequency domain. Therefore, this work uses the FFT to obtain the magnitude
frequency response (spectrum or periodogram) of each frame.
Mel Filter bank and frequency warping: The SER system multiplies the magnitude
frequency response by a set of 20 triangular band-pass filters to obtain a smooth
magnitude spectrum. This also reduces the number of emotional features involved.
Discrete Cosine Transform (DCT): The SER system applies the DCT to the smoothed
(log) magnitude spectrum to obtain the Mel-scale cepstral coefficients.
After extracting the features, the extracted features are passed to the next phase
called feature selection.
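As an illustration of the 39-coefficient feature vector described above, the sketch below (a hedged example assuming librosa and the preprocessed signal from the previous stage; frame sizes are illustrative, not the paper's exact settings) stacks 12 MFCCs and the log frame energy with their delta and delta-delta coefficients.

import librosa
import numpy as np

def extract_39_features(signal, sr=22050):
    # 12 MFCCs plus the log frame energy give 13 static coefficients per frame
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=12,
                                n_fft=512, hop_length=256)
    energy = librosa.feature.rms(y=signal, frame_length=512, hop_length=256)
    log_energy = np.log(energy + 1e-10)
    static = np.vstack([mfcc, log_energy])            # 13 x frames
    delta = librosa.feature.delta(static)             # 13 x frames
    delta2 = librosa.feature.delta(static, order=2)   # 13 x frames
    return np.vstack([static, delta, delta2])         # 39 x frames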

4.3 Feature selection

Feature selection is a mechanism to choose a subset of features that is sufficient to
predict the output, because irrelevant features of the signal decrease the accuracy and
increase the processing time. Therefore, this work uses a feature dimension reduction
method which reduces the feature dimensions without removing the relevant
information. In this phase, an Auto Encoder (AE) is used to select the features. The
input and output layers of the AE have an equal number of dimensions, but the
hidden layer has fewer dimensions; it compresses the information from the input
layer and thereby reduces the size of the original input. The newly obtained data is
then taken as training data to train the SVM model to predict the test samples.
ΔMFCC, ΔΔMFCC, ZCR, HNR and TEO are the features selected by the AE.
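A minimal Keras sketch of such an auto-encoder is shown below; the layer sizes and variable names are assumptions for illustration, not the dimensions used in the reported experiments. The encoder output is the compressed representation handed to the classifiers.

from tensorflow import keras
from tensorflow.keras import layers

input_dim = 39        # size of the extracted feature vector (assumed)
bottleneck_dim = 16   # compressed representation (assumed)

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(bottleneck_dim, activation="relu")(inputs)   # hidden layer with fewer dimensions
decoded = layers.Dense(input_dim, activation="linear")(encoded)     # reconstructs the input

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)
autoencoder.compile(optimizer="adam", loss="mse")

# X is a (num_samples, 39) array of extracted features (assumed name)
# autoencoder.fit(X, X, epochs=50, batch_size=64)
# reduced = encoder.predict(X)   # compressed features passed to the SVM and DCNN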

4.4 Emotion Recognition

An SVM is used for classification. SVMs are applied in many fields and have proved
their effectiveness in identifying the different classes of data present in a data-set.
The SVM technique is used for classification and regression analysis; it classifies the
data by finding hyperplanes that separate the classes with the highest margin. The
training data in the feature space are separated by a kernel function K, and the
commonly used kernel functions are the linear, polynomial and Radial Basis Function
(RBF) kernels. It is therefore important to use a classification method with a good
kernel function and to tune its parameters to obtain the maximum identification rate.
Hence, the SVM is used to classify gender from the audio.
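A hedged scikit-learn sketch of this gender classification step is given below; the variable names, the RBF kernel and the parameter values are assumptions for illustration rather than the exact configuration reported in the paper.

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: (n_samples, n_features) selected features; y: gender labels (assumed names)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gender_clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
gender_clf.fit(X_train, y_train)
print("Gender identification accuracy:", gender_clf.score(X_test, y_test))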
The preprocessed audio is given as input to the CNN architecture. The input audio is
typically given as a two-dimensional array of neurons which denotes a set of features
of the audio. The CNN architecture has different layers, namely the convolution
layer, activation layer and max pooling layer, to classify the emotion. Among these,
the convolution layer identifies the salient regions at intervals and produces the
feature map sequence. A non-linear activation function is applied to pass each
neuron's output to the next convolutional layer; the SER uses the Rectified Linear
Unit (ReLU) and passes the output to the max pooling layer, which selects the
maximum value before the dense layers. This helps to map variable-length inputs to a
fixed-size feature array. The audio signal is a 3D signal which has 3 axes, namely
time, amplitude and frequency.
The audio file is converted into a 1D array, i.e. a time series x. The sampling rate
value and the MFCC features are stored in a list and zipped together. The speech can
also be represented as an image with 3 layers, where the 1st and 2nd derivatives of
the speech image over time and frequency form the additional channels of the CNN
input.
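A minimal Keras sketch of a 1D convolutional network matching the layer types described above is given below; the number of filters, kernel sizes and layer counts are illustrative assumptions, not the paper's reported architecture.

from tensorflow import keras
from tensorflow.keras import layers

num_classes = 6   # angry, calm, fearful, happy, neutral, sad
feature_dim = 39  # length of the selected feature vector (assumed)

model = keras.Sequential([
    layers.Input(shape=(feature_dim, 1)),
    layers.Conv1D(64, kernel_size=5, padding="same"),  # convolution layer finds salient regions
    layers.Activation("relu"),                         # non-linear activation layer (ReLU)
    layers.MaxPooling1D(pool_size=2),                  # max pooling keeps the strongest responses
    layers.Conv1D(128, kernel_size=5, padding="same"),
    layers.Activation("relu"),
    layers.GlobalMaxPooling1D(),                       # fixed-size vector from variable-length input
    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),   # one output per emotion class
])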

4.5 Prediction

After extracting the features of the audio, the input audio is passed to the prediction
phase, in which the model is compiled, trained and used for prediction. There are
different types of loss function, namely regression loss functions, binary classification
loss functions and multi-class classification loss functions. Among these, a multi-class
classification loss function called categorical cross-entropy is used to reduce the error
in prediction; it is chosen because it operates on the predicted class probabilities. In
prediction, the audio is converted to a data frame and displayed in structured form,
then compared with the loaded model to predict the emotion. The model learns to
map the input characteristics to an output characteristic, which is a label such as
angry, calm, fearful, sad, neutral or happy. Evaluation metrics are required to quantify
the model performance. There are many classification metrics, such as classification
accuracy, logarithmic loss, Area Under Curve (AUC) and F-measure. Among these,
classification accuracy is used to measure the performance of the model; it is the
ratio of correct predictions to all predictions made.
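Continuing the model sketch above, the compile, train and predict steps could look like the following; the variable names, epoch count and label ordering are assumptions for illustration rather than the paper's exact code.

import numpy as np

model.compile(optimizer="adam",
              loss="categorical_crossentropy",   # multi-class classification loss
              metrics=["accuracy"])

# X_train / X_val: (samples, feature_dim, 1) arrays; y_*: one-hot emotion labels (assumed)
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=100, batch_size=32)

# Map the predicted probabilities back to an emotion label for one test sample
labels = ["angry", "calm", "fearful", "happy", "neutral", "sad"]
probs = model.predict(X_test[:1])
print("Predicted emotion:", labels[int(np.argmax(probs, axis=1)[0])])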

5 Result and Discussion

5.1 Dataset Description


The proposed work chooses the Ryerson Audio-Visual Database of Emotional Speech
and Song (RAVDESS) data-set, which contains recordings from 24 actors (12 male
and 12 female). The speech part of the data-set consists of 1440 audio samples. In all
the sets, 60% of the audio is taken for training, 20% for validation and 20% for
testing. The proposed work is deployed in the Google Colaboratory (Colab) platform
to recognize the various kinds of emotion from speech; hence, the RAVDESS data-set
is uploaded to Colab. The proposed work uses Python to build and deploy the SER
technique for predicting the emotions. Librosa is used to analyze the audio signals
and to resample them at a rate of 22 kHz for the CNN. In Colab, the CNN can analyze
the speech data, learn from the speech and identify words or utterances. The training
data-set is used to train the model while the validation set is used to fine-tune the
model. To improve the model's performance, different numbers of epochs, such as
100, 200 and 700, are used. An epoch is one complete pass through the training data.
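A hedged sketch of loading the RAVDESS speech files and producing the 60/20/20 split described above is given below; the directory layout is an assumption, and the filename parsing relies on the standard RAVDESS convention in which the third dash-separated field encodes the emotion.

import glob
import os
import librosa
import numpy as np
from sklearn.model_selection import train_test_split

# Standard RAVDESS emotion codes (third field of each filename)
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

X, y = [], []
for path in glob.glob("RAVDESS/Actor_*/*.wav"):       # directory layout assumed
    code = os.path.basename(path).split("-")[2]
    signal, sr = librosa.load(path, sr=22050)         # 22 kHz sampling rate
    features = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40), axis=1)
    X.append(features)
    y.append(EMOTIONS[code])

X, y = np.array(X), np.array(y)
# 60% train, 20% validation, 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)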

5.2 Accuracy

Accuracy is treated as one of the most important parameters for analyzing the
proposed work. The accuracy is the ratio of the sum of true positives and true
negatives to the total number of samples.

Table 1. General Confusion Matrix

                   Predicted as true    Predicted as false
Actually True      True Positive        False Negative
Actually False     False Positive       True Negative

The confusion matrix is used to indicate the performance of the proposed model
based on the true values identified.

Table 2. Confusion matrix for the model

                   Predicted as true    Predicted as false
Actually True      112                  27
Actually False     30                   119

From Table 2 we can understand that the number of correct outputs is 231 and the
number of incorrect outputs is 57. The accuracy is calculated using Equation (1).

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

Accuracy = (112 + 119) / (112 + 27 + 30 + 119) = 0.8012

The accuracy of the model is 80.12%.
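For reference, a confusion matrix and accuracy like those reported above can be computed directly with scikit-learn; y_test and y_pred are assumed names for the true and predicted labels from the trained model.

from sklearn.metrics import accuracy_score, confusion_matrix

# y_test: true labels, y_pred: model predictions (assumed names)
print(confusion_matrix(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))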

5.3 Result Analysis

5.3.1 Dataset Wise Accuracy

Fig. 3. Proposed dataset wise accuracy result


Fig. 3 shows the data-set wise accuracy for each emotion present in the input audio.
The SER model has classified the different kinds of emotions, and the actual and
predicted emotions are compared. With the help of the audio parameters, our model
achieves a detection rate of 80.125% using the CNN.

Fig. 4. Comparison of existing and proposed method

Fig. 4 shows the comparison chart between the existing and proposed models. The
graph shows that the SER model has a higher prediction rate for the different
emotions when compared to the existing models.

5.3.2 Epoch Wise Accuracy

Fig. 5. Epoch Wise Accuracy Result

Fig. 5 shows the epoch-wise accuracy result on the data-set. An epoch is one
complete pass through the training data. The accuracy increases as the number of
epochs increases. Usually, the accuracy continues to improve because the model
keeps searching for a better fit to the training data, which eventually tends to overfit.
The epoch value should therefore be chosen in such a way that the model does not
overfit.
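One common way to pick the number of epochs without overfitting is early stopping on the validation set; the sketch below is an illustrative Keras example, not the procedure reported in the paper.

from tensorflow.keras.callbacks import EarlyStopping

# Stop training once validation accuracy has not improved for 20 epochs
early_stop = EarlyStopping(monitor="val_accuracy", patience=20,
                           restore_best_weights=True)
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=700, batch_size=32,
          callbacks=[early_stop])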

5.3.3 Performance Metrics

Fig. 6. Performance metrics.

Fig. 6 shows the performance metrics of the SVM compared with the Naive Bayes
and decision tree algorithms. Performance metrics such as accuracy, precision, recall
and F-measure are higher for the SVM than for the decision tree and Naive Bayes
algorithms.
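Such a comparison can be reproduced with scikit-learn as sketched below; the classifier settings and variable names are assumptions, and macro averaging is chosen here only as one reasonable way to aggregate the per-class scores.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Compare the three classifiers on the same held-out test set (assumed names)
for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("Decision Tree", DecisionTreeClassifier()),
                  ("Naive Bayes", GaussianNB())]:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(name,
          "accuracy:", accuracy_score(y_test, y_pred),
          "precision:", precision_score(y_test, y_pred, average="macro"),
          "recall:", recall_score(y_test, y_pred, average="macro"),
          "f-measure:", f1_score(y_test, y_pred, average="macro"))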

6 Conclusion and Future work

In this work, a model for detecting emotion from speech is proposed. The important
parts of emotion recognition are pre-processing, feature extraction, feature selection
and classification. The motive of the proposed work is to increase the accuracy of
recognizing various emotions. In order to achieve this objective, the work focuses on
machine learning algorithms such as SVM and DCNN. The system provides an
accuracy of about 80.125% for the various emotions. The proposed work can be
extended in the future to identify other kinds of emotions such as shame, shyness,
rage, surprise and frustration.

References
1. Yao, Qingmei. "Multi-sensory emotion recognition with speech and facial expression."
(2014).
2. Suja, P., Shikha Tripathi, and J. Deepthy. "Emotion recognition from facial expressions
using frequency domain techniques." Advances in signal processing and intelligent
recognition systems, pp.299-310. Springer, Cham (2014).
3. Lugović, Sergej, Ivan Dunđer, and Marko Horvat. "Techniques and applications of
emotion recognition in speech." In 2016 39th international convention on information and
communication technology, electronics and microelectronics (mipro), pp. 1278-1283.
IEEE (2016).
4. V. Këpuska and G. Bohouta, "Next-generation of virtual personal assistants (Microsoft
Cortana, Apple Siri, Amazon Alexa and Google Home)," 2018 IEEE 8th Annual
Computing and Communication Workshop and Conference (CCWC), pp. 99-103, Las
Vegas, NV, USA (2018).
5. Anagnostopoulos, Christos-Nikolaos, Theodoros Iliou, and Ioannis Giannoukos. "Features
and classifiers for emotion recognition from speech: a survey from 2000 to
2011." Artificial Intelligence Review 43, no. 2, pp. 155-177 (2015).
6. Deshmukh, Girija, Apurva Gaonkar, Gauri Golwalkar, and Sukanya Kulkarni. "Speech
based Emotion Recognition using Machine Learning." In 2019 3rd International
Conference on Computing Methodologies and Communication (ICCMC), pp. 812-817.
IEEE (2019).
7. Zhao, Jianfeng, Xia Mao, and Lijiang Chen. "Speech emotion recognition using deep 1D
& 2D CNN LSTM networks." Biomedical Signal Processing and Control 47, 312-323
(2019).
8. Bisio, Igor, et al. "Gender-driven emotion recognition through speech signals for ambient
intelligence applications." IEEE transactions on Emerging topics in computing 1.2, 244-
257 (2013).
9. Santamaria-Granados, Luz, Mario Munoz-Organero, Gustavo Ramirez-Gonzalez, Enas
Abdulhay, and N. J. I. A. Arunkumar. "Using deep convolutional neural network for
emotion detection on a physiological signals dataset (AMIGOS)." IEEE Access 7, pp. 57-
67 (2018).
10. Kadiri, Sudarsana Reddy, et al. "Excitation Features of Speech for Emotion Recognition
Using Neutral Speech as Reference." Circuits, Systems, and Signal Processing 39.9, 4459-
4481 (2020).
11. Wei, Pengcheng, and Yu Zhao. "A novel speech emotion recognition algorithm based on
wavelet kernel sparse classifier in stacked deep auto-encoder model." Personal and
Ubiquitous Computing 23.3, 521-529 (2019).
12. Sun, Linhui, Sheng Fu, and Fu Wang. "Decision tree SVM model with Fisher feature
selection for speech emotion recognition." EURASIP Journal on Audio, Speech, and
Music Processing 2019.1, 1-14 (2019).
13. Li, Xingfeng, and Masato Akagi. "Improving multilingual speech emotion recognition by
combining acoustic features in a three-layer model." Speech Communication 110 , 1-12
(2019).
14. Prayitno, Bagas Adi, and Suyanto Suyanto. "Segment Repetition Based on High
Amplitude to Enhance a Speech Emotion Recognition." Procedia Computer Science 157,
420-426 (2019).
15. Shaqra, Ftoon Abu, Rehab Duwairi, and Mahmoud Al-Ayyoub. "Recognizing emotion
from speech based on age and gender using hierarchical models." Procedia Computer
Science 151, 37-44 (2019).
16. Shah, Mohit, et al. "Articulation constrained learning with application to speech emotion
recognition." EURASIP journal on audio, speech, and music processing 2019.1, 1-17
(2019).
17. Tao, Jian-Hua, et al. "Semi-supervised ladder networks for speech emotion recognition."
International Journal of Automation and Computing 16.4, 437-448 (2019).
18. Sun, Linhui, et al. "Speech emotion recognition based on DNN-decision tree SVM model."
Speech Communication 115,  29-37 (2019).
19. Alshamsi, Humaid, et al. "Automated Speech Emotion Recognition on Smart
Phones." 2018 9th IEEE Annual Ubiquitous Computing, Electronics & Mobile
Communication Conference (UEMCON), pp. 44-50 , IEEE (2018).
20. Gunawan, Teddy Surya, et al. "Development of video-based emotion
recognition using deep learning with Google Colab." TELKOMNIKA 18.5,
2463-2471 (2020).
