
Human Emotion Detection with Speech Recognition

Using Mel-frequency Cepstral Coefficient and CNN


Dr Aruna
Assistant Professor, Department of Computing Technologies, SRM Institute of Science and Technology, Chennai, Tamil Nadu
arunas@srmist.edu.in

Akshat Kumar Tiwari
Computer Science and Engineering, SRM Institute of Science and Technology, Chennai, Tamil Nadu
at7989@srmist.edu.in

Anika Pandey
Computer Science and Engineering, SRM Institute of Science and Technology, Chennai, Tamil Nadu

Abstract— Speech is one of the most common and effective modes of communication: the listener receives not only the message but also the intent, motive and emotions of the individual communicating. It is not only the facial expression but also the signals generated by one's speech that help us in recognizing the emotion of the individual. The main hurdle that emotion recognition systems face is the fact that people have varied emotional responses when exposed to the same stimuli. We have used a dataset from --. It is also observed that using CNN along with MFCC improves the accuracy of the model.

Keywords—Speech Emotion Recognition, Recursive Neural Network, Long Short-Term Memory, K Neural Network

I. INTRODUCTION

Deep learning behaves like our nervous system: when presented with a problem to solve, it processes the data it already has and builds a solution upon it. The data already present to the system is called training data. In other words, deep learning models let a computer process information and compute results like the human brain, but with much more computational power.

Since the rapid advancement in the field of AI, and especially deep learning, it is being widely used in various fields nowadays, such as medical treatment [1], fraud detection and emotion detection. Speech emotion recognition is the process of categorizing and classifying emotions. It has a varied field of applications: it can be used to upscale and optimize existing business operations, and it also has applications that help in resolving social issues such as suicide prevention.

Multiple studies use Support Vector Machines (SVM) for emotion detection, but they were found to be less optimal than other deep learning algorithms [2]. We have used a Convolutional Neural Network (CNN) in our model, which yields better results than SVM. For prediction we pre-train the data and construct a training set and a testing set, so that the prediction reaches an optimum node at which the predicted node provides satisfactory output. We can look at a CNN as a feed-forward artificial network in which the connection pattern among its nodes is motivated by the animal visual cortex; its dimension is bounded by the depth of the structure, the size of the filters and the zero-padding. Within the confines of this inquiry, we discuss a number of concepts connected to intelligent ML modelling techniques as well as their bearing on emotion recognition systems.

Based on existing emotion recognition research, feature extraction is central to the implementation of emotion recognition; the feature extraction method used here is the Mel-frequency Cepstral Coefficient (MFCC), because this method is considered the most appropriate for modeling the frequencies of human speech.

II. LITERATURE SURVEY

a. Machine learning-based speech emotion recognition system [2]: this paper provides a block diagram of a speech emotion recognition system. It also proposes the methodology of breaking the speech signal into segments of small time intervals, which makes it easier to extract features.

b. Speech emotion recognition using Deep Learning [3]: this project uses deep learning models to detect emotions and attempts to classify them according to audio signals. The IEMOCAP dataset is used for detection of emotions. It uses SER to gauge the emotional state of a driver, which comes in handy to prevent accidents.

c. Emotion detection in dialog systems: applications, strategies and challenges [4]: this paper has two conclusions. The first is to label audio signals as angry, and the second is to gauge how satisfied a customer is. It uses 21 hours of recorded data in which users report phone connectivity issues; the callers then rate their experience on a scale of 1-5. Finally the model labels the data as "garbage", "angry", "not angry" or "don't know".

d. Emotion detection: a technology overview [5]: this paper explores various factors that stimulate emotions, and emotion recognition from voice is dealt with here. It describes three technologies, among them Vokaturi, which runs on a computer, and Beyond Verbal, a cloud-based service used in the project to send API requests and get results; it needs at least 13 seconds of voice audio.

e. Speech emotion recognition using feature and word embedding [6]: this paper describes categorical speech recognition. It tells us that we can combine text and voice features to improve the accuracy of the model.
f. A review on emotion detection and classification using speech [7]: this paper demonstrates the techniques we can use for feature extraction, which, when coupled with certain algorithms, can yield improved efficiency.

g. Physiological signals based human emotion recognition [8]: this paper lays emphasis on the need for systems that are user independent, so that the results found are more accurate.

h. Facial emotion recognition using convolutional neural network (FERC) [9]: it elaborates on the difficulty of detecting emotions via machines and suggests a system named FERC that uses datasets from CMU, NIST and the Cohn-Kanade expression database.

III. METHOD

The process of designing a machine learning system is iterative in nature. In most cases the procedure is divided into stages: setting up the project, building the data pipeline, and delivery. The results from a later phase are fed back into the earlier ones. We have used a CNN for our model, and the architecture diagram of the system is given below.

Fig 1 Process of emotion detection

Training and testing the model: we feed training data, consisting of labeled examples, to the system, and weight training is performed on the network. An audio clip is taken as input and normalized before training the CNN, so that the presentation order of the examples does not affect training performance. The outcome of this process is that the network acquires the best result from the learning data. The system is fed with energy along with pitch, and the trained network weights give the determined emotion. The output is represented as a numerical value, each value corresponding to one of the expressed emotions.

Collecting data: we searched a wide variety of websites to find the dataset required to train our model, and settled on an already existing dataset that was open for research related to this field.

It is essential to verify that the data is dispersed uniformly and that the ordering of the database has no impact on the learning of our algorithm during training. We therefore put all of the retrieved information together and shuffle it randomly; this helps ensure that the material is fairly dispersed and that the learning process is not impacted by the ordering of the information. We may have to reorganize the gathered dataset, adjusting rows and columns, in order to clean up the data and get rid of extraneous information such as redundant values, data type mismatches, and incompleteness such as missing data. The data should also be visualized, so that its structure can be understood, as well as the links between the many classes and variables that are present.

Next we create two different sets, training and testing, by separating the cleansed data. The training dataset is the data from which the model learns; after training, we use the testing set to validate whether or not the model is accurate.

Selecting a model: when we execute a machine learning algorithm on the collected data, the output we obtain is determined by the machine learning model we have created. It is essential to go with a model that is applicable to the task at hand. Over the course of many years, researchers and engineers in numerous fields have produced a wide variety of models well suited to certain tasks, such as voice recognition, image identification and prediction. In addition, we need to determine whether the model works best with numerical or categorical data, so that we can choose the appropriate method.

Testing: after training our model, we must test it by measuring its performance on new data; the unseen data is our held-out testing set. If testing is done on the same data as used for training, we won't get an accurate measure, since the model is already familiar with it and rediscovers the same patterns, which artificially inflates the measured precision.

Tuning: when we have finished developing and analyzing our model, we should check whether there is any way its precision may be increased. Adjusting the values of the model's hyperparameters allows us to achieve this: accuracy will be at its highest when a parameter is set to a certain value, and finding these settings is referred to as parameter tuning.

Finally, we import all of the libraries needed to train the selected model and inspect the database. The dataset is loaded using the read_csv() method available in Pandas, and the first five rows of the data set are shown. After that, we carry out exploratory data analysis (EDA). The first step in EDA is to check the shape of the data set; the next step is to determine whether there are any null values. Features like pitch, tone and voice all have values that are null. The next step is to double-check the information contained in the data set.
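The steps above (load with read_csv, check shape and null counts, mean-impute, shuffle, split) can be sketched as follows. This is a minimal illustration, not the paper's actual code: the column names and the small in-memory table are hypothetical stand-ins, since the real CSV is not specified.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table standing in for pd.read_csv("<dataset>.csv")
df = pd.DataFrame({
    "pitch":  [220.0, np.nan, 180.5, 210.0, 190.0, 240.0],
    "energy": [0.61, 0.42, np.nan, 0.55, 0.70, 0.48],
    "emotion": ["happy", "sad", "angry", "garbage", "happy", "sad"],
})

print(df.shape)           # EDA step 1: shape of the data set
print(df.isnull().sum())  # EDA step 2: null values per column
print(df.describe())      # count, mean, std, min, max of numeric columns

# Handle missing values by substituting each feature's mean for its nulls
for col in ["pitch", "energy"]:
    df[col] = df[col].fillna(df[col].mean())
assert df.isnull().sum().sum() == 0

# Shuffle so the ordering of the database does not affect learning,
# then split into training and testing portions (roughly 80/20)
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
cut = int(0.8 * len(df))
train, test = df.iloc[:cut], df.iloc[cut:]
```

In a real run, `scikit-learn`'s `train_test_split` (used later in the paper) would replace the manual slicing.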

Now provide a description of the dataset, noting the lowest value, the highest value, the mean, the count, the standard deviation, etc.

The last step is to deal with the values that are missing. We handled incomplete data by using the mean value in place of the null entries; this was accomplished by assigning the mean value of each feature to its missing values. After that, check a second time to see whether any null values remain.

Next, look at the value counts of the target characteristic, and use seaborn's countplot function to create a visual representation of them. A distplot can be used to check whether a feature follows a normal distribution, and a boxplot to identify outliers. As researchers we did not eliminate such outliers, since doing so would result in an erroneous depiction of the data.

The data set must now be prepared: separate the data into independent and dependent characteristics, and use the train_test_split function, which outputs four different data sets, to divide the data into a training portion and a testing portion. The emotions that are detected here are happy, angry, sad and garbage.

Fig 2 Methodology diagram

The dataset is collected from Kaggle: the RAVDESS dataset. It has 1440 files, from 24 professional actors with 60 trials each. RAVDESS consists of lexically matched statements spoken in a North American accent, labeled happy, sad, angry or garbage. Every expression is generated at two levels of emotional intensity (light, bold), along with a neutral expression. Each of the 1440 files has a unique filename holding a 7-part numerical identifier (e.g., 03-02-05-01-02-02-11.wav); these parts encode the characteristics of the recording.

MFCC for feature extraction: it is one of the methods used for feature extraction. The human auditory system is assumed to process speech signals non-linearly, and this is modeled on a mel-frequency scale. In speech recognition, the mel-frequency cepstrum represents the short-term power spectrum of a speech frame using a linear cosine transform of the log power spectrum on a non-linear mel-frequency scale [8].
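As an illustration of what the MFCC computation just described involves, here is a from-scratch NumPy sketch of the pipeline: framing, power spectrum, mel filterbank, log, and cosine transform. In practice a single library call such as librosa.feature.mfcc is used; all sizes below are illustrative defaults, not the paper's settings.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    """Minimal MFCC: frame -> power spectrum -> mel filterbank -> log -> DCT."""
    # Frame the signal with a Hann window
    frames = np.array([signal[s:s + n_fft] * np.hanning(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank between 0 Hz and the Nyquist frequency
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    # Log mel energies, then DCT-II to decorrelate (the "cepstral" step)
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2.0 * n_mels)))
    return logmel @ dct.T  # shape: (num_frames, n_mfcc)

# Example: one second of a 440 Hz tone yields 61 frames of 13 coefficients
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
feats = mfcc(tone, sr=sr)
```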
Algorithm used:

1. Sample audio is given as input.
2. We then plot the waveform from the input.
3. Using librosa we extract the MFCCs (mel-frequency cepstral coefficients).
4. The data is then segregated into training and testing sets.
5. Then we predict the emotions using the model.

Modules of CNN: in our CNN module we have four important layers:

1. Convolution layer: identifies salient regions within variable-length utterances and produces the sequence of feature maps.
2. Activation layer: a non-linear activation function customarily applied to the convolution layer outputs; we have used the rectified linear unit (ReLU).
3. Max pooling layer: passes the options with the maximum value on to the dense layers, and helps map variable-length inputs to a fixed-size feature array.
4. Dense layer.
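The four layers above can be illustrated with a minimal NumPy forward pass over a single 1-D feature sequence (for instance one MFCC coefficient over time). This is a conceptual sketch, not the paper's network: the kernel count, widths and random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels):
    """Convolution layer: slide each kernel over the sequence ('valid' mode)."""
    k = kernels.shape[1]
    out = np.empty((kernels.shape[0], len(x) - k + 1))
    for i, w in enumerate(kernels):
        for t in range(out.shape[1]):
            out[i, t] = np.dot(x[t:t + k], w)
    return out

def relu(x):
    """Activation layer: rectified linear unit."""
    return np.maximum(0.0, x)

def global_max_pool(x):
    """Max pooling over time: variable-length input -> fixed-size vector."""
    return x.max(axis=1)

def dense_softmax(x, W, b):
    """Dense layer with softmax to obtain one probability per class."""
    z = W @ x + b
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative shapes: 8 convolution kernels of width 5, 4 emotion classes
# (happy, angry, sad, garbage).
kernels = rng.standard_normal((8, 5))
W, b = rng.standard_normal((4, 8)), np.zeros(4)

x = rng.standard_normal(100)  # a variable-length feature sequence
probs = dense_softmax(global_max_pool(relu(conv1d(x, kernels))), W, b)
```

Because the max pooling collapses the time axis, the same network handles utterances of any length, which is the role the paper assigns to that layer.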

Future work could incorporate the required sensors to integrate this system with IoT technologies and an online monitoring system. To increase the effectiveness of the classification process, other techniques and algorithms can be tried, such as ensemble learning, SVM and K-NN.

Happy: a positive emotion that is associated with a high tone, though lower than that of the angry emotion.

Fig 6 Happy's spectrum

IV. RESULTS AND DISCUSSIONS

We feed training data, consisting of labeled examples, to the system and train the network weights. An audio clip taken as input is normalized before training the CNN so that the presentation order of the examples does not affect training performance; the system is fed with energy along with pitch, and the trained network weights give the determined emotion. The results of our investigation demonstrated that the suggested models are capable of appropriately classifying emotions based on voice signals. Based on the findings of dedicated research and rigorous experiments, it was determined that the CNN algorithm had the most successful outcome in terms of prediction when compared to the other algorithms. Within the scope of this investigation we discuss several ideas associated with machine learning techniques and their implications for the detection of emotions. For the purpose of prediction, various machine learning techniques, including SVM and CNN, were used. The data used consists of 10 distinct characteristics, and reliability and accuracy metrics were employed in order to analyse the methods utilised by the computational models and their performance.

In the near future this technique may be built into a deployed emotion detection system, for example as a web-based resource that gathers the required audio data from sensors, acquires the stream data needed to estimate emotions, and keeps the model updated dynamically. Emotions could then be monitored in places where people are supposed to be calm.
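The accuracy metric used to compare the models can be reproduced in a few lines. The labels below are made-up placeholders to show the computation, not the paper's measured outputs.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical ground truth and per-model predictions over the four classes
true_labels = ["happy", "sad", "angry", "garbage", "happy", "angry"]
cnn_preds   = ["happy", "sad", "angry", "garbage", "sad",   "angry"]
svm_preds   = ["happy", "sad", "happy", "garbage", "sad",   "sad"]

print(accuracy(true_labels, cnn_preds))  # 5 of 6 correct
print(accuracy(true_labels, svm_preds))  # 3 of 6 correct
```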

Emotions we identified:

Anger: anger is an emotion which comes under the negative spectrum; it is associated with a high tone and therefore has more oscillations.

Fig 3 Angry's spectrum

Neutral: it lies between the positive and negative emotions; generally it has fewer oscillations and falls under the category of an unbroken low spectrum.

Fig 4 Neutral's spectrum

Sad: it falls under the spectrum of negative emotions. It is observed that it has a low tone and a disconnected spectrum.

Fig 5 Sad's spectrum

Fig 7 Result with CNN


Fig 8 Result with SVM

REFERENCES

[1] L. Yi-Lin and G. Wei, "Speech emotion recognition based on HMM and SVM." 2005 International Conference on Machine Learning and Cybernetics. Vol. 8. IEEE, 2005.
[2] C. Yashpalsing, M. L. Dhore, and P. Yesaware, "Speech emotion recognition using support vector machine." International Journal of Computer Applications 1.20 (2010): 6-9.
[3] N. S. Nehe and R. S. Holambe, "Power spectrum difference teager energy features for speech recognition in noisy environment." 2008 IEEE Region 10 and the Third International Conference on Industrial and Information Systems. IEEE, 2008.
[4] S. K. Kopparapu and M. Laxminarayana, "Choice of Mel filter bank in computing MFCC of a resampled speech." 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010). IEEE, 2010.
[5] J. Jiang et al., "Comparison of adaptation methods for GMM-SVM based speech emotion recognition." 2012 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2012.
[6] S. S. Kumar and T. RangaBabu, "Emotion and gender recognition of speech signals using SVM." International Journal of Engineering Science and Innovative Technology 4.3 (2015): 128-137.
[7] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition." International Journal of Data Mining and Knowledge Discovery, Vol. 2, 1998.
[8] D. B. Manurung, B. Dirgantoro, and C. Setianingsih, "Speaker Recognition For Digital Forensic Audio Analysis Using Learning Vector Quantization Method." 2018 IEEE International Conference on Internet of Things and Intelligence System (IOTAIS), 2018, pp. 221-226.
[9] F. Alifani, T. W. Purboyo, and C. Setianingsih, "Implementation of Voice Recognition in Disaster Victim Detection Using Hidden Markov Model (HMM) Method." 2019 International Seminar on Intelligent Technology and Its Applications (ISITIA). IEEE, 2019.
[10] A. S. Haq, M. Nasrun, C. Setianingsih, and M. A. Murti, "Speech Recognition Implementation Using MFCC and DTW Algorithm for Home Automation." Proceeding of the Electrical Engineering Computer Science and Informatics 7.2 (2020): 78-85.
[11] R. A. Malik, C. Setianingsih, and M. Nasrun, "Speaker Recognition for Device Controlling using MFCC and GMM Algorithm." 2020 2nd International Conference on Electrical, Control and Instrumentation Engineering (ICECIE). IEEE, 2020.
[12] H. Z. Muhammad, M. Nasrun, C. Setianingsih, and M. A. Murti, "Speech Recognition for English to Indonesian Translator Using Hidden Markov Model." International Conference on Signals and Systems, 2018, pp. 255-260.
[13] Z. Nurthohari, M. A. Murti, and C. Setianingsih, "Wood Quality Classification Based on Texture and Fiber Pattern Recognition using HOG Feature and SVM Classifier." 2019 IEEE International Conference on Internet of Things and Intelligence System (IoTaIS). IEEE, 2019.
[14] N. A. Arifin, B. Irawan, and C. Setianingsih, "Traffic sign recognition application using speeded-up robust features (SURF) and support vector machine (SVM) based on android." 2017 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob). IEEE, 2017.
[15] D. F. Azid, B. Irawan, and C. Setianingsih, "Translation of Russian Cyrillic to Latin alphabet using SVM (support vector machine)." 2017 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob). IEEE, 2017.
[16] R. N. Chory, M. Nasrun, and C. Setianingsih, "Sentiment analysis on user satisfaction level of mobile data services using Support Vector Machine (SVM) algorithm." 2018 IEEE International Conference on Internet of Things and Intelligence System (IOTAIS). IEEE, 2018.
[17] R. Ratnasari, B. Irawan, and C. Setianingsih, "Traffic sign recognition application using scale invariant feature transform method and support vector machine based on android." 2017 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob). IEEE, 2017.
[18] R. Mardhotillah, B. Dirgantoro, and C. Setianingsih, "Speaker Recognition for Digital Forensic Audio Analysis using Support Vector Machine." 2020 3rd International Seminar on Research of Information Technology and Intelligent Systems (ISRITI). IEEE, 2020.
