
Speech Emotion Recognition from Audio Files

Using Feedforward Neural Network


Khorshed Alam
Computer Science & Engineering
United International University
Dhaka, Bangladesh
mohdkhurshed120@gmail.com

Nishargo Nigar
Co-Founder, Bolo
Hamburg, Germany
somcnish@gmail.com

Heidy Erler
Co-Founder, Bolo
Hamburg, Germany
heidy.erler@gmail.com

Anonnya Banerjee
Information & Communication Systems
Hamburg University of Technology
Hamburg, Germany
anonnyabanerjee95@ymail.com

Abstract— Speech Emotion Recognition (SER) is the task of recognizing human emotions and affective states from speech, since the voice often reflects underlying emotion through tone and pitch. Emotion recognition is a component of speech recognition that is gaining popularity, and the need for it is increasing rapidly. Our paper aims to use a Feedforward Neural Network to recognize emotions from unseen data (i.e., audio files) and label them across a range of different emotions using appropriate variables (such as modality, emotion, intensity, and repetition) found in the data.

Keywords—machine learning, deep learning, speech emotion recognition
I. INTRODUCTION
People are turning their attention away from the physical world towards the spiritual society as it is getting more materialistic. Human-machine interaction systems have been developed to identify and respond to people's emotions. The existing human-machine interaction systems frequently facilitate human-robot interaction in a line-of-sight (LOS) propagation environment, although most human-to-human and human-to-machine communications are non-LOS (NLOS). Identifying emotions from speech is now a crucial part of human-computer interaction (HCI) [10].

The speaker's emotional state is produced by a combination of internal physiological changes occurring throughout the utterance of a phrase (or even a single word) and variations in voice tone. Even when listening to one another, it can be challenging for humans to accurately identify each other's deepest feelings. When the speaker must repress emotions, some parts of the internal sensation are buried and are not audible in speech. Therefore, computer-based systems are limited to what can be observed from the input speech samples [13]. As a result of the lengthy dispute over the definition of "emotion" and the appropriate emotional classes, classifying emotional speech samples is a difficult task. To avoid that "fruitless discussion," Batliner et al. [14] favor the idea of emotion-related states.

As one might expect, because identifying a person's emotional state can be difficult even for humans, it is significantly more difficult for automated systems, necessitating efficient emotion identification. Recently, Deep Learning approaches have been put forward as an alternative to conventional SER techniques [12].

A discrete emotional model is one of the many categories used to classify feelings and is one of the fundamental approaches. The SER literature uses a range of discrete emotions, including rage, boredom, contempt, surprise, fear, pleasure, happiness, neutrality, and melancholy [10][11]. A three-dimensional continuous space with properties such as arousal, valence, and potency is another significant model that is used.

II. LITERATURE REVIEW
Speech is a form of communication that carries data about the speaker, the message, the emotion, and more. Depending on how it is said, spoken text can be interpreted in a number of different ways. For instance, in English, the word "FINE" can be used to convey agreement, disappointment, a statement, or disinterest. The semantics of a spoken word therefore cannot be understood solely by comprehending the text.

The feature extraction and feature classification phases make up most speech emotion recognition (SER) techniques [1]. Researchers have developed a number of features in the field of speech processing, including source-based excitation features, vocal tract features, and so on. The second phase entails feature classification, utilizing both linear and nonlinear classifiers. Bayesian Networks are among the most widely used linear classifiers for recognizing emotions [2][3].

Recent years have seen increased interest in the machine learning research area known as "deep learning." These techniques, when used for SER, have a few benefits over conventional methods, including the ability to detect complex structure and features without manual tuning and feature engineering, and the ability to handle unlabeled data [4]. Numerous deep learning methods have been created for SER [5][6]. However, there is promising potential and a hospitable environment for future study not only in SER but in many other disciplines [7]. Neural networks' layer-wise architecture allows them to adaptively learn characteristics from raw data in a hierarchical manner [8].

III. SPEECH EMOTION RECOGNITION FRAMEWORK
We have used a Feedforward Neural Network to develop our model; this architecture has proven very effective in image classification and has shown promise for audio. The feedforward neural network is the first and simplest type of artificial neural network. In this network, the information moves in only one direction—forward—from the input nodes, through the hidden nodes (if any), to the output nodes. We followed an approach from a research paper [9]; however, the accuracy of that work was not satisfactory due to limited training data and characteristics. Hence, we modified the architecture and improved the framework.
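To make the forward-only information flow concrete, the following is a minimal sketch of a single forward pass through a feedforward network with one hidden layer; the layer sizes and weights are illustrative only and are not the architecture we trained.

import numpy as np

def relu(z):
    # Rectified linear activation, applied element-wise.
    return np.maximum(0.0, z)

def softmax(z):
    # Convert output scores into a probability distribution over classes.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Illustrative shapes: 5 input features, 8 hidden units, 3 output classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 5)), np.zeros(8)   # input -> hidden
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)   # hidden -> output

x = rng.normal(size=5)          # one input feature vector
h = relu(W1 @ x + b1)           # information flows forward to the hidden layer
y = softmax(W2 @ h + b2)        # and then forward again to the output layer
print(y)                        # class probabilities; no feedback connections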
A. Dataset Details
In this research, we have used two datasets, RAVDESS and TESS. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 1440 audio files recorded by 24 professional actors (12 male and 12 female), all native North American speakers. The target classes of the RAVDESS dataset are anger, calm, fearful, happy, surprise, sad, and disgust. Figure 1 shows an example of the dataset that we considered.

Figure 1: Our dataset example

The second dataset is the Toronto emotional speech set (TESS), which contains a set of 200 target words spoken by two professional female speakers aged 26 and 64. The target classes of the TESS dataset are anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral.
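As a rough sketch of how the audio files and their emotion labels can be gathered, the snippet below walks the two dataset folders and builds a single table of file paths and emotions. It assumes the standard RAVDESS filename convention (modality-vocal channel-emotion-intensity-statement-repetition-actor) and the usual TESS naming where the emotion word ends the filename; the directory names "RAVDESS" and "TESS" are placeholders.

import os
import pandas as pd

# Standard RAVDESS emotion codes (third field of the filename).
RAVDESS_EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
                    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def ravdess_rows(root):
    # Filenames look like 03-01-06-01-02-01-12.wav.
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith(".wav"):
                emotion_code = name.split("-")[2]
                yield {"path": os.path.join(dirpath, name),
                       "emotion": RAVDESS_EMOTIONS[emotion_code]}

def tess_rows(root):
    # TESS filenames typically end with the emotion word, e.g. OAF_back_angry.wav.
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith(".wav"):
                emotion = name.rsplit("_", 1)[-1].replace(".wav", "").lower()
                yield {"path": os.path.join(dirpath, name), "emotion": emotion}

# Concatenate both datasets into a single data frame used for feature extraction.
df = pd.DataFrame(list(ravdess_rows("RAVDESS")) + list(tess_rows("TESS")))
print(df["emotion"].value_counts())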
B. Data Preprocessing
We have developed a data frame with the associated emotions and audio file paths for each dataset in order to get the most out of them. To develop our database for speech emotion recognition, we concatenated both datasets. The data frame is then used for feature extraction. For data augmentation, we used noise injection, time shifting, and pitch shifting. We also performed feature extraction to put the data into an appropriate format that our model can understand: we employed Zero Crossing Rate, Chroma, MFCC, RMS value, and Mel Spectrogram features for these datasets. Figure 2 shows the spectrogram for audio containing the angry emotion, where the x axis plots time and the y axis plots the frequency of the audio.

Figure 2: Spectrogram of audio containing the angry emotion
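As a rough illustration of this step, the sketch below augments one audio file and extracts the listed features with librosa; the augmentation strengths (noise level, shift amount, pitch steps) and the number of MFCC coefficients are illustrative assumptions, not the exact values used in our experiments.

import numpy as np
import librosa

def augment(y, sr):
    # Noise injection, time shifting, and pitch shifting (illustrative parameters).
    noisy   = y + 0.005 * np.random.randn(len(y))
    shifted = np.roll(y, int(0.1 * sr))
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    return [noisy, shifted, pitched]

def extract_features(y, sr):
    # Each feature is averaged over time so every clip maps to a fixed-length vector.
    zcr    = np.mean(librosa.feature.zero_crossing_rate(y), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    mfcc   = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    rms    = np.mean(librosa.feature.rms(y=y), axis=1)
    mel    = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    return np.hstack([zcr, chroma, mfcc, rms, mel])

y, sr = librosa.load("example.wav")          # the path is a placeholder
vectors = [extract_features(y, sr)] + [extract_features(a, sr) for a in augment(y, sr)]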
Figure 3 presents the spectrogram for audio containing the fear emotion, where the x axis plots time and the y axis plots the frequency of the audio.

Figure 3: Spectrogram of audio containing the fear emotion

Figure 4 illustrates the spectrogram for audio containing the happy emotion, where the x axis plots time and the y axis plots the frequency of the audio.

Figure 4: Spectrogram of audio containing the happy emotion

C. Model Information
The layered feedforward neural network we designed for our model achieves 93% accuracy on test data. To mitigate overfitting, our model consists of numerous hidden layers comprising Dense layers, Activation layers, Flatten layers, and Dropout layers. The proposed solution is shown in Figure 5. The three subparts of the block diagram show how we created the model after the collection and preprocessing of data; finally, after feature extraction, we divided the data into training and test sets to obtain the output.

Figure 5: Diagram of our proposed solution
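Since the exact layer sizes are not spelled out above, the following Keras sketch only illustrates the kind of stack described (Dense, Activation, Flatten, and Dropout layers) together with the train/test split; all hyperparameters and the placeholder data are assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Flatten

n_features, n_classes = 182, 8              # e.g. stacked librosa features, 8 emotions

model = Sequential([
    Dense(256, input_shape=(n_features,)),
    Activation("relu"),
    Dropout(0.3),
    Dense(128),
    Activation("relu"),
    Dropout(0.3),
    Flatten(),
    Dense(n_classes),
    Activation("softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# X: feature vectors, y: one-hot emotion labels (placeholders here).
X = np.random.rand(1000, n_features)
y = np.eye(n_classes)[np.random.randint(0, n_classes, 1000)]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=64)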
D. Dense Layer
By doing matrix-vector multiplication, the dense layer, which is highly connected to the layer above it, changes the dimension of the output. The last stages of the neural network use a dense layer, also known as a fully connected layer. This layer helps alter the dimensionality of the output from the preceding layer so that the model can more easily establish the relationship between the values of the data it is working with. In any neural network, a layer that is densely connected to its preceding layer means that every neuron in the layer is connected to every neuron in the layer above it. In artificial neural networks, this layer is the one that is most frequently utilized.
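As a toy illustration of that matrix-vector view (the sizes below are arbitrary), a dense layer mapping a 4-dimensional input to 3 outputs is simply:

import numpy as np

W = np.random.rand(3, 4)        # weight matrix: 3 output neurons, 4 input neurons
b = np.zeros(3)                 # one bias per output neuron
x = np.random.rand(4)           # output vector of the previous layer

out = W @ x + b                 # matrix-vector multiplication changes the dimension 4 -> 3
print(out.shape)                # (3,)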



In a model, each neuron in the preceding layer sends signals to the neurons in the dense layer, which multiply matrices and vectors. During this matrix-vector multiplication, the row vector of the output from the previous layer is matched against the column vector of the dense layer: the row vector must have the same number of columns as the column vector for the multiplication to be defined.

The most popular algorithm for training feedforward neural networks is backpropagation. In a neural network, backpropagation calculates the gradient of the loss function with respect to the network weights for a single input-output example. According to the preceding idea, the dense layer produces an N-dimensional vector as its output, and we can observe that the dimension of the vectors is being reduced. As a result, each neuron in a dense layer is employed to change the dimension of the vectors.

E. Activation Functions
An activation function in a neural network describes how a node or nodes in a layer of the network translate the weighted sum of the input into an output. A "transfer function" is another name for the activation function. It may be referred to as a "squashing function" if the output range of the activation function is constrained. Numerous activation functions have nonlinear behavior, which is referred to as "nonlinearity" in network or layer design. Different activation functions may be used in different regions of the model, and the choice of activation function has a significant impact on the neural network's capacity and performance.

Although networks are built to utilize the same activation function for all nodes in a layer, technically the activation function is applied before or after the internal processing of each node. A network may have three different kinds of layers: input layers that take raw input directly from the domain, hidden layers that receive data from one layer and transfer it to another, and output layers that produce predictions. Typically, the same activation function is used by all hidden layers. The sort of prediction needed by the model determines the activation function used in the output layer, which is often different from that of the hidden layers.

A first-order derivative can be computed for a given input value for activation functions that are differentiable. This is necessary because neural networks are typically trained using the backpropagation of error algorithm, which needs the derivative of the prediction error to update the model's weights. Although there are many kinds of activation functions used in neural networks, only a few are really used for the hidden and output layers in practice.

The activation function in neural networks is a function that transforms the input values of the neurons. In essence, it adds nonlinearity to neural networks so that they can figure out how input and output values relate to one another. Perhaps the most often utilized function for hidden layers is the rectified linear activation function, or ReLU. Equation (1) shows how the ReLU function is computed:

f(x) = max(0.0, x)    (1)

Accordingly, a value of 0.0 is returned if the input value (x) is negative; otherwise, the input value itself is returned.

We also used the sigmoid activation function, also known as the logistic function, for hidden layers. The function accepts any real value as input and outputs numbers between 0 and 1. The larger (more positive) the input, the closer the output is to 1.0; the smaller (more negative) the input, the closer the output is to 0.0. Equation (2) shows how the sigmoid activation function is calculated:

f(x) = 1.0 / (1.0 + e^(-x))    (2)

where e, the base of the natural logarithm, is a mathematical constant.
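For completeness, equations (1) and (2) translate directly into code; this small sketch simply evaluates both functions on a few sample inputs.

import numpy as np

def relu(x):
    # Equation (1): negative inputs map to 0.0, non-negative inputs pass through.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Equation (2): squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))      # [0.  0.  0.  0.5 3. ]
print(sigmoid(x))   # approximately [0.047 0.378 0.5 0.622 0.953]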
Our custom feedforward-based deep learning model achieved a test accuracy of 93% and a training accuracy of 97.44% for speech emotion recognition. In addition, the test loss is 0.20, whereas the training loss is 0.081. Figure 6 shows the training and testing loss and accuracy, respectively. The test performance looks favorable.

Figure 6: Training & Testing Performance

Figure 7 shows the confusion matrix of our model, where the x axis represents the predicted labels and the y axis plots the actual labels. It is evident that our model is performing well, as it has been successful in most cases.

Figure 7: Confusion Matrix of our model

Table I illustrates the performance analysis of our model. The proportion of relevant examples among the retrieved instances is known as precision (also known as positive predictive value), whereas the proportion of relevant instances that were retrieved is known as recall (also known as sensitivity). Equations (3) and (4) denote precision and recall. Thus, relevance serves as the foundation for both precision and recall.

Precision = TP / (TP + FP)    (3)

Recall = TP / (TP + FN)    (4)

The harmonic mean of precision and recall is given by the F1-score. The score assigned to each class indicates how accurately the classifier classified the data points of that class in comparison to all other classes. Equation (5) defines the F1-score.

F1 Score = 2 * (Precision * Recall) / (Precision + Recall)    (5)

The support is the number of samples of the true response that belong to each class.
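In practice, these per-class metrics and the confusion matrix can be produced directly from the predicted and true labels; below is a minimal sketch with scikit-learn, where the label arrays are placeholders.

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# y_true and y_pred would be the actual and predicted emotion labels of the test set.
y_true = np.array(["angry", "calm", "sad", "happy", "angry", "neutral"])
y_pred = np.array(["angry", "calm", "neutral", "happy", "angry", "neutral"])

# Per-class precision, recall, F1-score, and support, as reported in Table I.
print(classification_report(y_true, y_pred))

# Rows are actual labels, columns are predicted labels, as plotted in Figure 7.
print(confusion_matrix(y_true, y_pred))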
TABLE I. PERFORMANCE ANALYSIS OF OUR MODEL

                PRECISION   RECALL   F1-SCORE   SUPPORT
ANGRY           0.95        0.94     0.94       356
CALM            0.84        0.88     0.86       120
DISGUST         0.92        0.95     0.93       366
FEAR            0.94        0.92     0.93       354
HAPPY           0.92        0.94     0.93       335
NEUTRAL         0.95        0.97     0.96       294
SAD             0.92        0.88     0.90       344
SURPRISE        0.95        0.93     0.94       375
ACCURACY                             0.93       2544
MACRO AVG       0.92        0.93     0.92       2544
WEIGHTED AVG    0.93        0.93     0.93       2544

Table II indicates the predicted vs. actual labels of our model. Our model usually performs very well; however, it has some trouble differentiating between the emotions of sad and neutral, as their frequency characteristics are similar.

TABLE II. PREDICTED VS ACTUAL LABELS OF OUR MODEL

      PREDICTED LABELS   ACTUAL LABELS
0     DISGUST            DISGUST
1     FEAR               FEAR
2     HAPPY              HAPPY
3     ANGRY              ANGRY
4     SURPRISE           SURPRISE
5     SURPRISE           SURPRISE
6     HAPPY              HAPPY
…     …                  …
19    CALM               CALM

IV. CONCLUSION
Our research was primarily based on the identifiers found in the datasets, such as modality, vocal channel, emotional intensity, statement, repetition, and actor. These identifiers mostly resemble the stimulus characteristics. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. Our goal was to obtain correctly labeled emotions after assessing the audio files. Although we succeeded in that with a good result, there is scope for improvement, such as noise filtering and analysis of the words used. From time to time, it can be challenging to find the correct labels using just the audio. To reduce this confusion, we plan to apply more attributes and techniques in the future.

ACKNOWLEDGMENT
We are thankful to our volunteers Hareesh, Kinga, Sevim and Farhan, who helped us to test our model.

REFERENCES
[1] S. G. Koolagudi and K. S. Rao, "Emotion recognition from speech: A review," Int. J. Speech Technol., vol. 15, no. 2, pp. 99–117, 2012.
[2] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognit., vol. 44, no. 3, pp. 572–587, 2011.
[3] A. D. Dileep and C. C. Sekhar, "GMM-based intermediate matching kernel for classification of varying length patterns of long duration speech using support vector machines," IEEE Trans. Neural Netw. Learn. Syst., vol. 25, no. 8, pp. 1421–1432, Aug. 2014.
[4] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, Jan. 2015.
[5] L. Cen, W. Ser, and Z. L. Yu, "Speech emotion recognition using canonical correlation analysis and probabilistic neural network," in Proc. 7th Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2008, pp. 859–862.
[6] D. Yu, M. L. Seltzer, J. Li, J.-T. Huang, and F. Seide, "Feature learning in deep neural networks—Studies on speech recognition tasks," 2013, arXiv:1301.3605. [Online]. Available: https://arxiv.org/abs/1301.3605
[7] A. Satt, S. Rozenberg, and R. Hoory, "Efficient emotion recognition from speech using deep learning on spectrograms," in Proc. INTERSPEECH, 2017, pp. 1089–1093.
[8] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," 2014, arXiv:1404.2188. [Online]. Available: https://arxiv.org/abs/1404.2188
[9] X. Wu et al., "Speech emotion recognition using capsule networks," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2019, pp. 6695–6699, doi: 10.1109/ICASSP.2019.8683163.
[10] M. Chen, P. Zhou, and G. Fortino, "Emotion communication system," IEEE Access, vol. 5, pp. 326–337, 2016.
[11] O. Kwon, K. Chan, J. Hao, and T. Lee, "Emotion recognition by speech signal," in Proc. EUROSPEECH, Geneva, Switzerland, 2003, pp. 125–128.
[12] R. W. Picard, "Affective computing," Perceptual Comput.
[13] R. W. Picard, E. Vyzas, and J. Healey, "Toward machine emotional intelligence: Analysis of affective physiological state," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 10, pp. 1175–1191, 2001.
[14] A. Batliner et al., "Whodunnit – Searching for the most important feature
