EpochSER MTA
Abstract  Features related to the glottal closure instants (GCIs) exhibit different
patterns for different emotions. In this work, our main objective was to explore the
effectiveness of these features in speech emotion recognition (SER). To this end,
we proposed two distance-based classifiers built on four features related to GCIs.
This constituted the first phase of our work. In two later phases, we extended this
idea to develop hierarchical two-stage SER systems that couple the GCI features
with other features to improve our SER systems. The first stage in phase 2 was
based on prosodic features, while in phase 3 we used power spectral features in the
first stage. The second stage in both systems was based on the GCI features. The
best performance was observed for the phase 3 systems, which outperformed the
phase 1 systems by as much as about 10% for the IEMOCAP corpus and by about
20% for the EMO-DB corpus. They also outperformed a related and recent work
by Kadiri et al. (2020) by 9.6% for the EMO-DB corpus.
Keywords Speech emotion recognition · epoch features · power spectral features ·
KL distance
1 Introduction
In the evolving field of human-computer interaction (HCI), there are a large num-
ber of modes by which humans communicate with computers. Modes like speech,
text, GUI-based interaction using mouse or touchscreen devices are the most com-
mon. Among these, speech is one of the most intuitive and natural modes of
communication. Apart from the message conveyed by speech, one of the most
important aspects that makes the message meaningful is emotion. Therefore, for
more natural HCI through speech, machines also need to deal with paralinguistic
aspects of speech such as emotion. Computers should be able to understand the
emotion in speech conveyed by a human, as well as generate speech in response
that corresponds to both the message and the emotion identified. The first task
Arijul Haque¹ [corresponding author], K. Sreenivasa Rao²
1. IIT Kharagpur, India
2. IIT Kharagpur, India
requires machines to be able to recognize emotions from human speech. Our work
focuses on this aspect, i.e., identifying emotions automatically from human speech.
There are many previous works in this area, spanning about two decades, and it
is still an active topic of research. Emotion recognition has applications in making
HCI better and more natural, as well as many other applications. As an example,
marketing strategies can be formulated based
on identifying emotions from audio recordings of customer feedback. Another ex-
ample is effective e-learning by adjusting teaching strategies based on the emotions
detected from voice recordings of student feedback [1].
Emotion recognition tasks, like other pattern recognition tasks in multimedia such
as text, audio, video and image processing, have gone through some paradigm
shifts over the past few decades. The field started with rule-based systems, evolved
into machine learning techniques, and today the state-of-the-art technique for
almost all pattern recognition tasks in general, and SER in particular, is a special
kind of machine learning (ML) known as deep learning. The motivation behind
the paradigm shift from general machine learning
techniques to deep learning comes from the fact that in many tasks, deep neural
networks have been found to classify better with raw data (or data with little
pre-processing) than with other ML techniques with carefully engineered hand-
crafted features. As an example, in the case of speech emotion recognition (SER),
deep learning can be applied directly to raw speech spectrograms, whereas ordinary
machine learning additionally requires a feature extraction module. This is why
most SER techniques from around
2017 up to this day are based on deep learning on direct, raw speech spectro-
grams. In fact, within these 3-4 years of the deep learning boom, there have been
numerous works on deep learning-based SER. They have been found, in general,
to perform better than ordinary ML techniques. However, despite the advantages
of high performance and the ability of the deep networks to learn appropriate fea-
tures (hidden representations) by themselves, they suffer from some drawbacks.
The first problem with deep learning is that deep networks require huge amounts of data
to perform well. However, huge amounts of relevant data are not always easy to
obtain. Secondly, training deep neural networks (DNN) is extremely expensive in
view of computational complexity. The most complex models may take a week
to train even with several GPUs. Thirdly, the DNN models are so complex that
determining the proper topology and hyperparameters is something of a black art,
with hardly any theory, or only very complicated theory, to guide us. Moreover,
a trial-and-error approach to these problems is very time consuming because
each trial can take very long to run (a few days to even a week). These problems
are more intense in low resource scenarios where several expensive GPUs and huge
datasets are not available. These problems are alleviated to some extent by transfer
learning, in which a network pretrained on one task is reused for another by
retraining only the last layer or last few layers. This reduces the computational
complexity to a large extent. However, if large datasets for the desired problem
are not available, the problem of overfitting is likely to occur. This is because pre-
trained networks are generally large networks with often millions of parameters
to be tuned. Despite this problem, we still plan to explore transfer learning
techniques for SER in future work. Meanwhile, because of these problems, there
is still a need to explore ordinary ML techniques and new features to improve SER.
Among the works in speech emotion recognition (SER) using ordinary ML,
there have been many works using the source-filter model [2]. Using this model,
some source and/or vocal tract (VT) features are extracted. Using either of these
feature types or a combination of them, a standard pattern recognition algorithm
such as GMM, SVM, ANN, etc., or a combination of two of them, is used to
predict emotions. There are many works on VT features. In comparison, works
using only excitation source features, or their combination with VT features, are
few. A few works
on emotion analysis had shown that some parameters derived from the glottal
closure instants (GCIs) had different trends for different emotions [3][4]. Works
exploiting these differences in the patterns to build speech emotion recognition
(SER) systems are very rare. Among those rare works on SER is one that had
used these parameters related to GCI (also known as epochs) as features [5]. Since
these features are extracted at a sub-segmental level (using frames as short as 5-10
ms), they are also referred to as sub-segmental excitation source features (another
name for the same is epoch features). In this work, three sub-segmental features
have been used: pitch, the energy of excitation (EoE) and the strength of exci-
tation (SoE) [6]. These features were fed to classifiers proposed by the authors,
based on the KL distance metric. Recently, the same authors extended that work
by adding one more feature [7]: the ratio between the high-frequency and
low-frequency spectral energy (β) [8]. Besides that, the statistics of the variations
of these features among four different emotions (anger, happiness, neutral and
sadness) were also shown. The performance of the classifiers on the four emotions
was comparable with other existing works. However, the classifiers performed
remarkably well when emotion groups, rather than distinct emotions, were
identified. Happiness and anger were considered as one group, and neutral and
sadness as another. They used this idea to finally develop hierarchical classifiers
in which the emotion group was identified in the first stage and the actual emotion
was subsequently identified in the second stage. These results triggered in us the
idea of examining specific features with which groups of emotions can be
discriminated from one another. If the recognition accuracy is found to be
sufficiently high, it seems reasonable to use the same features, or a different
set of features, in a next step to distinguish among the emotions within each
group.
In this work, we have used two datasets: the IEMOCAP dataset [9], a semi-natural
dataset in English, and the German emotional dataset known as EMO-DB [10],
an acted dataset. Following the trend of most works related to SER, we have
worked with four common emotions: anger, happiness, neutral and sadness.
This work has been done in three phases. In the first phase, we applied our
proposed classifiers to the epoch features for the emotion identification task.
One reason for using epoch features was that we expected these features to classify
groups of emotions well. Then, for comparison, the same classifiers were applied
to state-of-the-art MFCC features. We observed that though the epoch features
did not perform miserably, they did not seem good enough on their own for
building decent-quality SER systems. However, they may play a complementary
role to the state-of-the-art MFCC features, improving performance compared to
using MFCC features alone. Therefore, for purposes of comparison, we have
also attempted a combination of the epoch features and MFCC features.
It should also be noted that instead of attempting a hierarchical classification in
this phase, we decided to try all these classifiers directly on the four emotions to
get an idea of how to group the emotions for hierarchical classification, which is
later attempted in the next two phases of this work. Based on some observations
(especially regarding the sad emotion) in the first phase, in the second phase of
our work, we proposed a two-stage classifier, in which sad vs not-sad (rest) was
identified in the first stage using prosodic features, and if the speech was identified
as not sad, then, in the second stage, another classifier (based on phase 1 systems
trained only on the other three emotions) would classify the speech into one of the
other three categories. There were considerable improvements in the performance
in this phase compared to the first phase. However, the confusion between anger
and happiness was still significant. Therefore, in the third phase of our work, in
order to deal with this confusion between anger and happiness, we tried to look for
features that could potentially be useful. In that regard, we analyzed the power
spectrum of speech representing different emotions and derived features from the
power spectra. A set of 14 features were extracted. A two-stage classifier similar
to the one in the second phase was applied, and the classification rates were found
to be even better than those in the previous two phases. The confusion between
happiness and anger had also reduced in the process.
2 Related work
Before the advent of the deep learning era, different machine learning algorithms
were used to identify emotions in speech. These algorithms comprise two main
steps: feature extraction and pattern recognition. A good review of some previous
works on the features and the pattern recognition algorithms used for SER can
be found in [11]. Broadly three types of features have been used in the literature:
excitation source features, vocal tract features and prosodic features. Some works
combine features from these three categories before feeding those to a classifier.
Works on vocal tract features and prosodic features can be found in abundance.
Here, we mention a few important ones. One attempt identified emotions using
LFPC (log frequency power coefficients), LPCC and MFCC as features, with
discrete HMMs as classifiers [17]. Lee et al. used MFCC features with HMM-based
classifiers to classify speech into four emotion categories, obtaining 65% accuracy
[18]. Another work used MFCC features derived from speech utterances
from the LDC emotional speech database, prepared by the authors, and EMO-DB
[19]. Koolagudi et al. used MFCC features derived from speech and fed them to a
GMM classifier to identify emotions [20].
Among the works on prosodic features, an early work by Petrushin et al. an-
alyzed the potential of an ensemble of neural networks employed on prosodic fea-
tures like pitch, the first and second formants, energy and the speaking rate,
obtaining an accuracy of 77% on two emotional states [21]. Prosodic features like
fundamental frequency (F0), energy, duration, the first and second formant fre-
quencies were used for detection of negative and non-negative emotions using spo-
ken language data obtained from a call centre application [22]. Kao et al. extracted
pitch and power-based features from frame, syllable, and word levels for recogni-
tion of four emotions in Mandarin [23]. In [24], the authors derived 35-dimensional
prosodic feature vectors including pitch, energy, and duration from speech for clas-
sification into seven emotions using neural networks, from the EMO-DB corpus,
getting an accuracy of 51%. Koolagudi et al. used duration patterns, average pitch,
the standard deviation of pitch and average energy to classify speech into one of
eight emotions from the IITKGP-SESC corpus [25].
Apart from the above-mentioned three types of features, different feature com-
binations have also been used in the literature. Nakatsu et al. used a combination
of LPCCs and pitch related features for the identification of eight emotions using
neural networks [26]. Bozkurt et al. used prosodic, spectral and HMM-based fea-
tures for the classification of five emotions of the Interspeech 2009 challenge and
achieved recognition accuracy of 63% [27]. Spectral, prosody and lexical features
were derived from the semi-natural USC-IEMOCAP database [9] in [28], yielding
an accuracy of 65.7%.
With the advent of the deep learning era around the mid-2010s, a lot
of work has been done on identifying emotions using deep neural networks (DNNs)
from speech. Fayek et al. proposed an end-to-end DNN architecture to identify
emotions from speech using the eNTERFACE [29] and SAVEE [30] databases,
yielding accuracies of 60.53% and 59.7%, respectively, on the two datasets [31].
Zhao et al. used a recurrent convolutional neural network (RCNN) to categorize
speech into seven emotions using the IEMOCAP corpus [32]. This was the first
time a hybrid model was used for SER and it achieved an accuracy of 83.4%. RNNs
were used in [33] to identify the same emotions as in [32] using the IEMOCAP
corpus. Its weighted accuracy (WA) and unweighted accuracy (UA) outperformed
SVMs by 5.7% and 3.1% respectively. Tzirakis et al. developed an SER system
[34] on four emotions from the RECOLA and AVEC 2016 [35] datasets using an
end-to-end system using convolutional neural networks (CNN) and ResNet of 50
layers along with long short term memory (LSTM). An accuracy of 78.7% was
achieved in that work. In [36], a deep convolutional neural network (DCNN) was
used on EMO-DB and IEMOCAP. They merged deep 1D and 2D CNN for high-
level learning of features from speech spectrograms. Seven emotions were used for
the study and accuracy of 92.71% was obtained. Other variants of deep neural
networks have also been used for SER. Examples are adversarial autoencoders
[37], variational autoencoders [38], HMM-based hybrid DNNs [39], etc. The works
mentioned in this paragraph pertain to clean speech. The accuracies obtained make
clear how much better deep learning performs compared to traditional ML.
3 Datasets
In this work, two publicly available datasets have been used: the IEMOCAP
dataset and the German EMO-DB dataset. The IEMOCAP dataset is a multi-modal
corpus in English specifically designed for the analysis of emotions. It is a semi-
natural corpus that has been recorded from 10 actors in dyadic sessions in two
modes: scripted and improvised. The utterances in the corpus have been labelled
according to both categorical and continuous models of emotion. It comprises nine
emotions, among which four emotions, namely, neutral, anger, happiness and sad-
ness are the most commonly used. The approximate duration of the entire corpus
is about 12 hours.
4 Proposed framework
Different features, classifiers and techniques have been used in different phases.
In this section, we give a detailed description of all these aspects. Since this work
is based on three phases, we dedicate a subsection to each. In those subsections,
the results of the experiments in each phase will also be presented. Finally, each
subsection will be concluded by our remarks on the results of each phase.
4.1 Phase-1
The following features have been extracted from the excitation source:
(i) Pitch: The reciprocal of the time interval between two consecutive glottal
closure instants (GCIs). The positions of the positive zero crossings of the
zero-frequency filtered (ZFF) [48] signal are a fairly good approximation of the GCIs.
(ii) Energy of excitation (EoE): The ratio of the root mean square energy
(RMSE) of the samples of the Hilbert envelope of the LP residual [49] to the
RMSE of the samples of the speech signal, over 2 ms around each GCI. This is
indicative of the vocal effort.
(iii) Strength of excitation (SoE): It is measured by the slope of the ZFF
signal at the positive zero crossings. Also known as epoch strength, it gives an
idea of the energy of the excitation signal at the epoch locations.
(iv) Epoch sharpness: The ratio of the standard deviation to the mean of the
samples of the Hilbert envelope of the LP residual around each GCI. This can
represent the loudness level of the excitation source signal.
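A minimal sketch of how these per-epoch features could be computed, assuming the GCI locations and the Hilbert envelope of the LP residual have already been obtained (e.g. via ZFF and LP analysis). The function name and interface are ours, not from the original work; SoE is omitted since it additionally requires the ZFF signal itself (its slope at the zero crossings):

```python
import numpy as np

def epoch_features(speech, gcis, fs, env):
    """Per-epoch excitation source features (illustrative sketch).

    speech : 1-D array of speech samples
    gcis   : GCI sample indices, e.g. positive zero crossings of the ZFF signal
    fs     : sampling rate in Hz
    env    : Hilbert envelope of the LP residual, same length as `speech`

    Returns pitch (Hz, one value per GCI interval), energy of excitation
    (EoE) and epoch sharpness (one value each per GCI).
    """
    gcis = np.asarray(gcis)
    # (i) Pitch: reciprocal of the interval between consecutive GCIs.
    pitch = fs / np.diff(gcis)

    half = int(0.001 * fs)  # 2 ms window: 1 ms on each side of the GCI
    rms = lambda x: np.sqrt(np.mean(np.square(x)))
    eoe, sharpness = [], []
    for g in gcis:
        lo, hi = max(0, g - half), min(len(speech), g + half)
        e = env[lo:hi]
        # (ii) EoE: RMS of the residual envelope over RMS of the speech.
        eoe.append(rms(e) / (rms(speech[lo:hi]) + 1e-12))
        # (iv) Epoch sharpness: std/mean of the residual envelope samples.
        sharpness.append(np.std(e) / (np.mean(e) + 1e-12))
    return pitch, np.array(eoe), np.array(sharpness)
```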
It has already been discussed that our proposed emotion identification methods
are based on two types of distances: the KL distance and the Euclidean distance.
The effectiveness of using the KL distance for discriminating between emotions
was already demonstrated in [7]. However, that work did not make similar
explorations for other distance metrics, and we are not aware of any work that
has investigated the effectiveness of other distance metrics in this regard. This
triggered in us the idea that, just like the KL distance, other distance metrics
might also be useful for the same task. Therefore, we decided to explore another
distance metric, the Euclidean distance, chosen because it is one of the simplest
and most intuitive of all distance metrics.
Hence, before using the Euclidean distance directly to develop SER systems, a
similar analysis is performed in this work to assess the effectiveness of using this
distance for this task. In this regard, random samples of 200 utterances from each
emotion have been considered for an analysis from the IEMOCAP corpus. For an
analysis of the excitation source features, we have taken the four sub-segmental
features just described above. For vocal tract features, the state of the art MFCC
features along with their delta coefficients have been used. Thirteen dimensional
MFCC features have been extracted for this purpose. After extracting both these
types of features (epoch features and MFCC features), we process both these fea-
ture sets (epoch features and MFCC features) separately as follows. First of all,
GMMs are built for each emotion with these features. We refer to these GMMs for
each emotion as a template of that emotion, parameterized by its mean vectors
and covariance matrices.
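The template-building step can be sketched as follows. The original work fits a GMM per emotion and keeps its mean vectors as the template; the dependency-free stand-in below approximates those means with a few iterations of k-means using farthest-first initialisation (the function name and the number of components are our assumptions), which suffices to illustrate the idea:

```python
import numpy as np

def build_template(X, n_components=4, n_iter=20):
    """Build an emotion 'template': a set of mean vectors summarising the
    emotion's feature distribution. A GMM's means are approximated here
    by k-means centroids (a simple stand-in for GMM fitting).

    X : (N, D) array of feature vectors from one emotion.
    Returns an (n_components, D) array of mean vectors.
    """
    X = np.asarray(X, dtype=float)
    # Farthest-first initialisation: robust for well-separated clusters.
    means = [X[0]]
    for _ in range(n_components - 1):
        dist = np.min(np.stack([np.linalg.norm(X - m, axis=1)
                                for m in means]), axis=0)
        means.append(X[int(dist.argmax())])
    means = np.stack(means)
    for _ in range(n_iter):
        # Assign each vector to its nearest mean, then recompute the means.
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_components):
            if np.any(labels == k):
                means[k] = X[labels == k].mean(axis=0)
    return means
```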
After extracting the two above-mentioned feature sets (epoch features and
MFCC features), in order to analyze how effective Euclidean distances are in dis-
tinguishing emotions, we measured d(i, j), the average distance of feature vectors
of the ith emotion from the template of the j th emotion, for both the feature sets.
To measure distances from a template, we use only the mean vectors of the GMMs
representing the template. d(i, j) is measured as follows. Let v_i(1), v_i(2), ..., v_i(N_i)
represent the feature vectors extracted from all utterances of the i-th emotion,
where N_i is the total number of feature vectors extracted from that emotion.
Let \mu_{j1}, \mu_{j2}, ..., \mu_{jM} represent the mean vectors of the M-component GMM
representing the j-th emotion. First of all, we measure d(v_i(l), \mu_j), the average distance
of the l-th feature vector of the i-th emotion from the mean vectors of the GMM of
the j-th emotion. Here, the averaging is done over the M mean vectors. Therefore,
d(v_i(l), \mu_j) = \frac{1}{M} \sum_{k=1}^{M} \| v_i(l) - \mu_{jk} \|_2, \qquad l = 1, 2, \ldots, N_i \qquad (1)
where \|v\|_2 represents the Euclidean norm of vector v. Now, after finding
d(v_i(l), \mu_j) for l = 1, 2, ..., N_i, we average these values to finally calculate d(i, j), i.e.,
d(i, j) = \frac{1}{N_i} \sum_{l=1}^{N_i} d(v_i(l), \mu_j) = \frac{1}{N_i} \sum_{l=1}^{N_i} \frac{1}{M} \sum_{k=1}^{M} \| v_i(l) - \mu_{jk} \|_2 = \frac{1}{M N_i} \sum_{l=1}^{N_i} \sum_{k=1}^{M} \| v_i(l) - \mu_{jk} \|_2 \qquad (2)
In this way, we find d(i, j) for i, j = 1, 2, 3, 4. Let anger, happiness, neutral and
sadness represent the first, second, third and fourth emotions respectively. These
values of d(i, j) are shown in Tables 1 and 2.
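The average distance d(i, j) defined above translates directly into a few vectorised numpy lines; the function name is ours:

```python
import numpy as np

def avg_distance(vectors_i, means_j):
    """d(i, j): mean Euclidean distance between the feature vectors of
    emotion i and the GMM mean vectors (template) of emotion j.

    vectors_i : (N_i, D) feature vectors of emotion i
    means_j   : (M, D) template mean vectors of emotion j
    """
    V = np.asarray(vectors_i, dtype=float)
    U = np.asarray(means_j, dtype=float)
    # Pairwise differences v_i(l) - mu_jk, shape (N_i, M, D), then
    # norms averaged over both l and k.
    diffs = V[:, None, :] - U[None, :, :]
    return np.linalg.norm(diffs, axis=2).mean()
```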
It is clear from both the tables that d(i, j) has the minimum values when i = j.
In other words, the distance between feature vectors and a template is minimum
when both the feature vectors and the template represent the same emotion. This
shows that the Euclidean distances (on average) between utterances of an emotion
from the templates of the four emotions are the least when the template is of the
same emotion. However, we also observe from the first two rows of both tables
that, though d(i, j) is minimum for i = j in all cases, the values of d(i, j) are also
low for i = 1, j = 2 and i = 2, j = 1 compared to the other off-diagonal cases.
This might imply that if the Euclidean distance is used to distinguish between
emotions, the confusion between anger and happiness might be greater than the
confusion between other pairs of emotions. Nevertheless, it is clear that the Euclidean distance may
still be used as a reliable metric for measuring the deviation of utterances from
the templates representing an emotion. Another important point worth mentioning
here is that, on observing that the distances for MFCC vectors are about twice
those of the epoch features, we should not erroneously conclude that epoch
features might perform better than MFCC features. These distances also depend
on the dimensionality of the feature vectors: the higher the dimensionality, the
greater the distance tends to be. Therefore, it is quite reasonable that the Euclidean
distances for the MFCC vectors are greater than those for the epoch features, as
MFCC features have a much higher dimensionality than the epoch features.
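This dependence on dimensionality is easy to check numerically. For random standard-normal vectors, the average pairwise Euclidean distance grows roughly as the square root of the dimensionality; the following illustrative snippet (not from the original work) demonstrates it:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pair_dist(dim, n=2000):
    """Average Euclidean distance between n random pairs of
    standard-normal vectors of the given dimensionality."""
    a = rng.normal(size=(n, dim))
    b = rng.normal(size=(n, dim))
    return np.linalg.norm(a - b, axis=1).mean()

# Distance scales roughly as sqrt(dim): going from 4 to 16 dimensions
# (a factor of 4) roughly doubles the average distance.
ratio = mean_pair_dist(16) / mean_pair_dist(4)
```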
Now, we use the Euclidean distance and the KL distance to propose two emotion
recognition techniques, each based on one of these two distance metrics. A
classification technique using KL distances was also proposed by Kadiri et al. [7].
That technique was also template-based; however, it relied on a large number of
stored templates, which is feasible for small to medium-sized datasets like the
widely used EMO-DB corpus [10] and the IIIT-H Telugu Emotion dataset [5],
prepared by the authors. For large datasets like the IEMOCAP dataset, which
are generally semi-natural and mostly resemble real-life, natural emotional speech,
the technique used by Kadiri et al. may not scale well, as the number of stored
templates will be very high and the template matching used there will become
cumbersome. Therefore, instead of using their classification technique, we used
our proposed classification technique, which stores only as many templates as
there are candidate emotions. Hence, the number of stored templates in our
technique is remarkably lower, and the technique is also simple to understand.
The two proposed techniques were used to develop different emotion recognition
systems, each based on one of the following feature sets:
(i) the four epoch features, (ii) MFCC features, and (iii) a combination of both.
A total of six systems have been developed (2 distances × 3 feature sets). Below
is a detailed description of our proposed distance-based techniques.
[Training-phase and test-phase procedures for the Euclidean distance-based technique, followed by the training-phase and test-phase procedures for the KL distance-based technique.]
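The test phase of the Euclidean distance-based technique can be sketched as follows: average the utterance's feature-vector distances to each emotion's template means, as in d(i, j) above, and pick the closest emotion. The function name and dictionary interface are ours; the KL-distance variant would swap in a different distance computation:

```python
import numpy as np

def predict_emotion(test_vectors, templates):
    """Sketch of the Euclidean distance-based classifier's test phase.

    test_vectors : (N, D) feature vectors of one test utterance
    templates    : dict mapping emotion name -> (M, D) template means
    Returns the emotion whose template is closest on average.
    """
    V = np.asarray(test_vectors, dtype=float)

    def avg_dist(means):
        U = np.asarray(means, dtype=float)
        return np.linalg.norm(V[:, None, :] - U[None, :, :], axis=2).mean()

    # Pick the emotion with the minimum average distance.
    return min(templates, key=lambda emo: avg_dist(templates[emo]))
```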
We will also build classifiers based on the above two algorithms with only
MFCCs as features. Each such classifier will output a distance value for each
emotion for a given utterance. These distances will be combined with the distances
from the excitation features using the following equation:
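The combination equation itself is not reproduced in this excerpt. Purely as an illustration, one common form of such score-level fusion is a convex combination of min-max normalised per-emotion distances; the weight alpha and the normalisation below are our assumptions, not the paper's equation:

```python
import numpy as np

def fuse_distances(d_mfcc, d_epoch, alpha=0.5):
    """Hypothetical score-level fusion of two sets of per-emotion distances.

    d_mfcc, d_epoch : arrays of distances to each emotion's template,
                      one entry per candidate emotion.
    alpha           : weight on the MFCC-based distances (assumed).
    Lower fused score indicates the predicted emotion.
    """
    def norm(d):
        # Min-max normalise so the two score ranges are comparable.
        d = np.asarray(d, dtype=float)
        span = d.max() - d.min()
        return (d - d.min()) / span if span > 0 else np.zeros_like(d)

    return alpha * norm(d_mfcc) + (1 - alpha) * norm(d_epoch)
```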
Tables 3 and 4 show the recognition accuracies for the different systems we
developed for the IEMOCAP and EMO-DB datasets respectively. From the tables,
we observe that performance is best with the combined features for both the KL
distance and the Euclidean distance on both datasets. While it is expected that
MFCCs will outperform the four subsegmental features, the improvement in the
combined systems proves that the subsegmental features also carry significant
emotion-related information, which can be exploited along with state-of-the-art
features to develop a decent SER system. Also, the performance of the systems
based on the subsegmental features alone, though not spectacular, shows that they
contain emotion-related information. For the IEMOCAP dataset, the improvements
of the combined systems over the MFCC-based systems were 0.6% for the
Euclidean distance-based technique and 4.7% for the KL distance-based technique.
Overall, KL distance-based systems have been found to perform slightly better
than systems based on the Euclidean distance. Almost the same trends are observed
for the EMO-DB corpus, for which the improvements of the combined systems over
the MFCC-based systems were 5.9% for the Euclidean distance-based technique
and 2% for the KL distance-based technique. It can also be observed that the
improvement of the combined system over the MFCC-based system was lower for
the Euclidean distance-based system than for the KL distance-based system for
the IEMOCAP corpus, whereas the opposite holds for the EMO-DB corpus.
Tables 5 and 6 show the confusion matrices for the two classification systems
using the combined features, on the IEMOCAP corpus. Though the identification
of anger, happiness and sadness is fairly satisfactory, the identification rate of
the neutral emotion is quite low for both techniques. For the Euclidean distance-based
classification, 28% of neutral utterances get confused with sadness, while for
the KL distance-based algorithm, 29% of neutral utterances get confused with
sadness.
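For reference, the TPR and FNR columns of the confusion matrices reported in this section follow directly from the row-normalised matrix: TPR is the diagonal entry of a row and FNR = 100 - TPR, since each row sums to 100%. A small helper illustrating this (names are ours):

```python
import numpy as np

def confusion_percentages(counts):
    """Row-normalise a confusion matrix of raw counts to percentages and
    derive the TPR and FNR columns used in the confusion tables."""
    counts = np.asarray(counts, dtype=float)
    # Each row (true emotion) is normalised to sum to 100%.
    pct = 100.0 * counts / counts.sum(axis=1, keepdims=True)
    tpr = np.diag(pct)       # correct classifications, per emotion
    fnr = 100.0 - tpr        # everything else in the row
    return pct, tpr, fnr
```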
Table 5: Confusion matrix (%) of the Euclidean distance-based system with combined features (IEMOCAP).
Emotions   Anger (%)   Happy (%)   Neutral (%)   Sad (%)   TPR (%)   FNR (%)
Anger          58          15           10           17        58        42
Happy          16          51           12           21        51        49
Neutral        14          18           40           28        40        60
Sad            12          14           12           62        62        38
Table 6: Confusion matrix (%) of the KL distance-based system with combined features (IEMOCAP).
Emotions   Anger (%)   Happy (%)   Neutral (%)   Sad (%)   TPR (%)   FNR (%)
Anger          64          12            6           18        64        36
Happy          18          50            8           24        50        50
Neutral        17          16           38           29        38        62
Sad            11          11            7           71        71        29
Tables 7 and 8 show the confusion matrices for the Euclidean distance-based
classification system and the KL distance-based classification system respectively,
on the EMO-DB corpus. The confusion matrices for this dataset also show similar
trends as in the IEMOCAP corpus.
Table 7: Confusion matrix (%) of the Euclidean distance-based system with combined features (EMO-DB).
Emotions   Anger (%)   Happy (%)   Neutral (%)   Sad (%)   TPR (%)   FNR (%)
Anger          65          11            9           15        65        35
Happy          13          56           10           21        56        44
Neutral        11           9           55           25        55        45
Sad             8           7           11           74        74        26
Table 8: Confusion matrix (%) of the KL distance-based system with combined features (EMO-DB).
Emotions   Anger (%)   Happy (%)   Neutral (%)   Sad (%)   TPR (%)   FNR (%)
Anger          67          11            8           14        67        33
Happy          13          58            9           20        58        42
Neutral        10          10           56           24        56        44
Sad             8           7           10           75        75        25
It is also noticeable that, in both systems and on both datasets, many of the
misclassified utterances of the different emotions end up misclassified as sadness.
This suggests that sadness might have some peculiarity which is well reflected in
neither the subsegmental source features nor the vocal tract features. Since this
behaviour is common to both proposed techniques and both datasets, we surmise
that neither the datasets nor the techniques are responsible for this confusion;
rather, the features used have probably played the major role. Therefore, we
decided to perform an analysis to find out which types of features may be more
appropriate for this task. The details of this analysis follow in subsection 4.2.
However, all these observations have
given rise to the following idea. Given any speech utterance at test time, we can
use an appropriate set of features to first determine whether the speech belongs
to the sad category. If it does, there will be no further processing and the speech
will be classified as sad. If not, we will use our previous classifiers (based on
epoch features and MFCC features) to identify the speech as one of the remaining
three emotions. Given this framework, we now move on to the next phase of our
work based on this plan.
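At test time, the two-stage scheme just outlined reduces to a simple decision rule; a minimal sketch with placeholder callables for the stage-1 sad/not-sad detector and the stage-2 three-emotion classifier (the interface is ours):

```python
def two_stage_classify(utterance, is_sad, classify_rest):
    """Two-stage hierarchical SER decision rule (sketch).

    is_sad        : stage-1 predicate, e.g. a prosody-based sad vs.
                    not-sad classifier.
    classify_rest : stage-2 classifier over {anger, happiness, neutral},
                    e.g. the phase-1 distance-based classifier trained
                    only on the three non-sad emotions.
    """
    if is_sad(utterance):
        return "sadness"          # no further processing needed
    return classify_rest(utterance)
```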
4.2 Phase-2
In an attempt to find an appropriate set of features to identify the sad emotion
well, we first decided to check whether any noticeable pattern could be discerned
from the speech waveforms themselves. Therefore, we manually inspected the
time-domain waveforms of speech segments from each emotion. The first thing we
noticed was the remarkable difference between the intensity levels of sad speech
and those of other speech: in most cases, the intensity level of sadness is
significantly lower than that of the other emotions. Another, less prominent,
observation was that the variations (which can be quantified by variance or range)
in intensity across an utterance were lower for sad speech than for the other
emotions. At the utterance level, there are significant variations in the intensities
of the anger and happiness emotions; for sadness, however, the variations in
intensity at the utterance level also seemed quite low. Compared to anger and
happiness, the variations of neutral speech were also low, but not as low as the
variations in sad
speech. Furthermore, it is well known that pitch patterns also vary for different
emotions at the utterance level. Therefore, we also decided to study the patterns
of pitch at the utterance level.
Fig. 1: Scatter plot of the three prosodic parameters extracted from four
emotions from the IEMOCAP dataset at the utterance level.
Fig. 2: Scatter plot of the three prosodic parameters extracted from four
emotions from the EMO-DB dataset at the utterance level.
From Figure 1, all the observations we had made in the manual analysis are
confirmed. The intensity values and the variations in intensity values at the
utterance level are quite low for sadness compared to the other emotions. However,
the variation in pitch for sadness compared to the other emotions is quite high.
In Figure 2, only the pattern of pitch variation at the utterance level matches
that of the previous figure; the intensity values and the variations in intensity
at the utterance level are not significantly different from the other emotions.
This hints at the possibility that variation in pitch can potentially be an important
parameter in distinguishing the sad emotion from other emotions across the
datasets. Regarding the mismatch of the patterns of the other two parameters
between the IEMOCAP and EMO-DB datasets, we think that this is because
of the nature of differences between the two corpora. EMO-DB is a simulated
emotion database with 10 sentences enacted by actors in seven pre-determined
emotions/styles. In this situation, both the sentences and the desired emotion are
given to the actors, and they are constrained to articulate those sentences using
an expression that corresponds to the given emotion. As a result, the enacted
emotional speech that is recorded will not completely resemble full-blown emo-
tional speech that is found in real-life scenarios. Even if the actors are extremely
skilled at imitating any style of speech, the constraints imposed are bound to make
the enacted emotions somewhat artificial. On the other hand, for the IEMOCAP
dataset, the constraints are very few. Some of the utterances are impromptu, while
the scripted ones also consist of a wide variety of sentences. Furthermore, the ac-
tors were not constrained to enact any particular style of speech. The choice of
style in the dyadic conversations, whether impromptu or scripted, was totally at
the discretion of the actors. In other words, they could freely use any style of
expression they deemed appropriate for the conversation. This resulted in speech
that is much closer to real-life speech with natural expressions. Therefore, we have
considered the observations from the IEMOCAP corpus to be more reliable as far as
the analysis of emotions is concerned.
The above analysis prompted us to pursue these three prosodic parameters for
recognition of emotions on both the datasets based on a two-stage framework.
First of all, we use the prosodic parameters to train a classifier that can identify
whether a given speech segment is sad speech or not. At test time, given a
speech segment, the classifier predicts whether the given speech segment belongs
to the sad emotion. If that is the case, there is no further processing of the speech,
and the assigned emotion for the speech is sadness. Otherwise (if it is detected as
non-sad), the speech segment passes through a second stage of processing in which
phase 1 systems are used to classify this non-sad speech into one of the other three
emotions– neutral, happiness and anger. In this stage, we have trained our phase
1 systems based on only the other three emotions (excluding sadness). The entire
classification scheme is shown in Figure 3.
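As a rough sketch, the two-stage scheme just described might be implemented along the following lines. The nearest-centroid rule and all feature values here are illustrative assumptions for the sketch, not the exact classifiers or features used in this work:

```python
from math import dist

def nearest_centroid(x, centroids):
    """Assign x to the class whose mean feature vector is nearest (Euclidean)."""
    return min(centroids, key=lambda c: dist(x, centroids[c]))

def two_stage_predict(prosody_feats, stage2_feats, sad_centroids, phase1_centroids):
    """Stage 1: sad vs. non-sad from prosodic features.
    Stage 2: a phase-1-style classifier over the remaining three emotions."""
    if nearest_centroid(prosody_feats, sad_centroids) == "sad":
        return "sad"                      # sad detected: no further processing
    return nearest_centroid(stage2_feats, phase1_centroids)

# Toy centroids: 3 prosodic dimensions, 4 second-stage dimensions (made up).
sad_centroids = {"sad": [0.2, 0.1, 0.8], "non-sad": [0.7, 0.6, 0.3]}
phase1_centroids = {"anger": [0.9, 0.8, 0.1, 0.2],
                    "happiness": [0.8, 0.3, 0.6, 0.4],
                    "neutral": [0.4, 0.4, 0.4, 0.4]}

print(two_stage_predict([0.25, 0.15, 0.75], [0.85, 0.75, 0.15, 0.25],
                        sad_centroids, phase1_centroids))  # prints "sad"
```

The key design point is that an utterance routed to "sad" in the first stage never reaches the second-stage classifier, which is trained only on the other three emotions.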
From the scatter plots, sad speech appears fairly well separated from
the other emotions, especially anger and happiness. Though neutral and sadness
share some common space, this is more so for the IEMOCAP corpus than for the
EMO-DB corpus. The confusion matrices of the prosody-based classifier on the
IEMOCAP and EMO-DB datasets are shown in Tables 9 and 10 respectively.
Table 11: Recognition accuracies for the two-stage systems for the four
emotions on the IEMOCAP dataset
The confusion matrices of the two systems have been shown in Tables 12 and 13.
The true positive rate of sadness of the modified system is 74%, which is 3% better
than the best of the previous two direct classifiers. Also, the misclassifications of
angry and happy speech as sad speech are remarkably less compared to phase
1 systems. However, the misclassification rates of neutral speech as sad speech
have not changed significantly. This might be explained by the scatter diagram in
Figure 1, in which we can observe that sad speech is very well separable in the
three-dimensional space from anger and happiness. However, its separability
from neutral speech is poorer, as many instances of neutral and sad speech lie
close to one another. In all, compared to phase 1 systems, the percentage
of the other emotions being misclassified as sadness has been drastically reduced,
except for the neutral emotion. Also, more instances of sad speech get classified
correctly. This has given rise to better overall recognition accuracies in the entire
modified systems.
Table 12: Confusion matrix for the two stage system (Euclidean based
technique in the second stage) on the IEMOCAP dataset
Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 69 13 12 6 69 31
Happy 20 53 16 11 53 47
Neutral 16 15 42 28 42 58
Sad 8 8 10 74 74 26
Table 13: Confusion matrix for the two stage system (KL distance based
technique in the second stage) on the IEMOCAP dataset
Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 73 13 9 5 73 27
Happy 20 57 16 7 57 43
Neutral 17 14 43 26 43 57
Sad 4 10 12 74 74 26
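The TPR and FNR columns in these confusion matrices follow directly from the row-normalized entries: TPR is the diagonal entry of each row, and FNR is the remainder of the row. A small sketch recomputing them from the entries of Table 13:

```python
# Rows: true emotion; columns: predicted emotion, in percent (Table 13 values).
labels = ["anger", "happy", "neutral", "sad"]
confusion = [
    [73, 13,  9,  5],
    [20, 57, 16,  7],
    [17, 14, 43, 26],
    [ 4, 10, 12, 74],
]

for i, row in enumerate(confusion):
    tpr = 100 * row[i] / sum(row)   # diagonal entry over the row total
    fnr = 100 - tpr                 # all off-diagonal mass in the row
    print(f"{labels[i]}: TPR={tpr:.0f}%, FNR={fnr:.0f}%")
```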
Similarly, we use the same two-stage classifiers on the EMO-DB corpus. The
recognition accuracies of the systems are shown in Table 14. It is observed that
the improvements in these systems compared to the old systems are drastic. For
the Euclidean-distance based system in the second stage, we get an improvement
of 12.6%. For the other system, we get an improvement of 11.2%. Our proposed
systems slightly outperform the system in [7], which used broadly similar features and
had an accuracy of 76%. Also, an important point worth reiterating here is that
the accuracy of the prosody-based systems in the first stage was more than 90%.
This might be the reason behind the drastic improvements in the performance
of the phase 2 systems compared to the phase 1 systems. Thus, this gives us a
motivation for using prosodic features in this way for conducting a study on more
emotions in a future work.
Table 14: Recognition accuracies for the two-stage systems for the four
emotions on the EMO-DB dataset
Table 15: Confusion matrix for the two stage system (Euclidean based
technique in the second stage) on the EMO-DB dataset
Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 83 8 6 3 83 17
Happy 21 56 13 9 56 44
Neutral 6 4 78 12 78 22
Sad 3 5 7 85 85 15
Table 16: Confusion matrix for the two stage system (KL distance based
technique in the second stage) on the EMO-DB dataset
Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 83 8 5 4 83 17
Happy 21 57 14 8 57 43
Neutral 5 5 79 11 79 21
Sad 3 5 7 85 85 15
Tables 15 and 16 represent the confusion matrices for the two two-stage systems
based on the EMO-DB dataset. The misclassification rates of
other emotions as sadness have been greatly reduced. Also, the recognition accuracy of
sad speech has dramatically improved because of the prosodic processing in the
first stage. However, despite these improvements, the confusion between anger and
happiness is still high compared to the confusions between other pairs of emotions.
Therefore, we attempt to find out ways to handle this problem.
4.3 Phase-3
We had already remarked that though sad speech could be identified far better
using the systems in phase 2, which ultimately led to dramatic improvements in
the overall SER accuracy, especially in the EMO-DB corpus, the confusion between
anger and happiness was still high compared to the confusions between other pairs
of emotions. Therefore, we decided to explore other features which might possibly
alleviate this problem. In phase 2, which was based on a time-domain analysis
(the prosodic features were derived from the time domain), we did not find much
difference between anger and happiness. We also considered the fact that in audio
processing tasks, especially for the modification of audio signals, most of the
manipulations are done on the magnitude/power spectrum itself. As an example, modifying
parts of the power spectra in music often leads to different effects in the modified
audio. Enhancing the lowermost frequencies (also called bass) makes drum beats
in audio more prominent. If parts of a music segment are abnormally shrill with
occasional hissings at different regions, then those deficiencies are compensated by
reducing magnitudes of the high frequencies (also called treble). Enhancing the
mid-range frequencies can sometimes make dialogues clearer. In other words, ma-
nipulating the power spectrum has the potential to change the entire profile of an
audio and how it sounds to a listener. Even in speech, changing the shape of the
power spectra can sometimes result in creating an illusion that a different speaker
is speaking. Motivated by these facts, we hypothesized that different emotions
should also have different patterns in the general shape of their power spectra. To
verify this, we decided to manually observe the magnitude spectra for different
emotions from the two datasets for a preliminary analysis. We observed that some
differences are such that a mere inspection of the shape of the magnitude spectra
is enough to distinguish between emotions. Figure 4 shows sample power spectra
of the four emotions based on a particular utterance in the EMO-DB dataset.
It can be seen from the figure (and other samples we had inspected) that the
power contained in angry and happy speech is quite high compared to those in the
other two emotions, except for the extremely low-frequency regions (0-500 Hz) and
the high-frequency regions (around 4-5 kHz onwards). We can, therefore, conclude
that the power spectrum may be effectively used to classify emotions at least ac-
cording to two emotion categories: type 1 representing anger and happiness and
type 2 representing neutral and sadness. This might be possible if we can char-
acterize the overall shape of the power spectra using features related to the power
spectrum. Therefore, we made a statistical analysis of the spectra of different emo-
tions based on different bands. We divided the entire spectrum of any speech signal
into different non-overlapping bands and found the (geometric) mean energies
of each band. We experimented with different bands (varying band lengths) and
plotted the mean energies of each band in a box plot to estimate the usefulness of
that band in distinguishing among the different emotions. Furthermore, we also
analyzed the low to high-frequency energy ratios and found even these ratios to
be useful. After experimenting with different bandwidths for these ratios, we came
up with two particularly useful features: (i) narrowband ratio (0-0.1 kHz to 7.9-8
kHz) and (ii) wideband ratio (0-4 kHz to 4-8 kHz). Also, after experimenting with
different bands, we came up with a set of bands that, we hoped, would be useful
for our task. A box plot of the energies of a sample band (1-1.5 kHz) has been
shown in Figure 5. From this plot, we observe that the median values of all the
emotions are quite far apart from one another. The overlap among the different
emotions is also small. This suggests that this band might be useful to
distinguish all four emotions from one another. Box plots of the two low to high
energy ratios for different emotions have been shown in Figures 6 and 7.
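The band-energy and ratio features described above can be sketched as follows. The FFT framing, the log floor, and the use of arithmetic band totals for the two ratios are our assumptions for illustration; the band edges below are only a subset of the ones discussed here:

```python
import numpy as np

def band_features(signal, fs, bands):
    """Geometric-mean power per band, plus wideband (0-4 kHz / 4-8 kHz) and
    narrowband (0-0.1 kHz / 7.9-8 kHz) low-to-high energy ratios."""
    spec = np.abs(np.fft.rfft(signal)) ** 2            # power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)

    def mean_energy(lo, hi, geometric=True):
        p = np.maximum(spec[(freqs >= lo) & (freqs < hi)], 1e-12)  # avoid log(0)
        return np.exp(np.log(p).mean()) if geometric else p.mean()

    feats = [mean_energy(lo, hi) for lo, hi in bands]
    # Ratios computed from arithmetic band means (an assumption of this sketch).
    wide = mean_energy(0, 4000, False) / mean_energy(4000, 8000, False)
    narrow = mean_energy(0, 100, False) / mean_energy(7900, 8000, False)
    return feats + [wide, narrow]

fs = 16000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 200 * t) + 0.1 * np.sin(2 * np.pi * 3000 * t)
bands = [(0, 80), (80, 250), (250, 600), (1000, 1500)]  # a subset of Table 17
print(len(band_features(sig, fs, bands)))  # 4 band energies + 2 ratios = 6
```

For this synthetic low-frequency signal, the wideband ratio comes out well above one, since nearly all the energy sits below 4 kHz.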
Fig. 6: Box plot of the low to high frequency energy ratio (wide band)
Fig. 7: Box plot of the low to high frequency energy ratio (narrow band)
Our summarized observations for the bands found to be potentially useful for
the SER task are shown in Table 17. We observed that the high frequencies
(greater than 5 kHz) were not very useful. The energies of all the emotions in
the high-frequency regions slowly fade out. However, bands lower than 5 kHz hold
significant distinctive information. As an example, the lowermost frequencies up
to 80 Hz are potentially important for distinguishing between type 1 and type 2
emotions outlined above. For the 80-250 Hz band, sadness and neutral have been
found to have distinct patterns from the other two emotions. Even sadness and
neutral seem to be well-distinguishable from one another. The next band (250-600
Hz) might be useful in detecting only sad speech against other emotions (see Table
17). Five separate bands starting at 1 kHz and ending at 3.5 kHz might be useful
to distinguish each emotion from one another. These five bands might be the most
useful among all the features we are planning to use. The next three sets of bands
(3.5-5 kHz) might help us distinguish anger from other emotions.
Similarly, Table 18 shows our summarized observations for the ratio between
the energies of low frequencies to high frequencies.
From the above analysis, we can surmise that using the above spectral param-
eters as features with a good classifier might result in a good SER system. From
Table 17, we obtain 12 features from the 12 bands. From Table 18, we obtain two
features. Therefore, we will be using a total of 14 features for our SER task. We
now explain our classification scheme below.
From Table 17, we observe that some bands are potentially very good at distinguishing
all four emotions from one another (rows 5-9). The table also hints at the
possibility that all the bands together can potentially distinguish between type 1
and type 2 emotions well. Therefore, we proposed a two-stage framework similar to
that in phase 2. It differs from the phase 2 systems by using the 14 features related
to the power spectrum, and type 2 includes both sadness and neutral emotions
instead of only sadness as in phase 2. The entire scheme is shown in Figure 8.
First of all, before implementing the classification system just explained above, we
decided to assess the effectiveness of these 14 features in identifying all four emo-
tions. Therefore, using the above-mentioned 14 features, we decided to develop a
system that classifies any given speech utterance into one of the four emotions. For
that purpose, we trained an SVM each on the four emotions for both the datasets.
We obtained accuracies of 57.5% on the IEMOCAP corpus and 81% on the
EMO-DB corpus. The accuracy for the IEMOCAP dataset is comparable to those
of phase 1 systems. Compared to the Euclidean-distance based technique, it is
better by 3.1%, while the KL-distance based technique outperforms this system
by only 0.6%. However, for the EMO-DB corpus, the improvements of this system
compared to phase 1 systems are dramatic. It outperforms the Euclidean-distance
based and KL-distance based systems by 16.9% and 15.2% respectively. The con-
fusion matrices for the systems developed on the two datasets have been shown
in Tables 19 and 20. It can be observed from the two tables that, at a gross level,
though the confusions between type 1 and type 2 emotions are low, intra-
type confusions are still high. This problem can potentially be addressed using the
two-stage scheme just explained above.
Table 19: Confusion matrix for the one-stage system based on power spectral
features on the IEMOCAP dataset
Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 49 44 1 5 49 51
Happy 13 59 3 29 59 41
Neutral 3 8 62 27 62 38
Sad 1 8 31 60 60 40
Table 20: Confusion matrix for the one-stage system based on power spectral
features on the EMO-DB dataset
Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 91 9 0 0 91 9
Happy 44 50 6 0 50 50
Neutral 2 3 84 11 84 16
Sad 1 1 5 93 93 7
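For reference, the KL distance used by our second-stage technique admits a closed form under the assumption (made here only for this sketch, not necessarily our exact formulation) that each feature is modeled by a univariate Gaussian per emotion class; the divergence is symmetrized so that it behaves like a distance:

```python
from math import log

def kl_gauss(mu0, var0, mu1, var1):
    """KL(N0 || N1) for univariate Gaussians with means mu and variances var."""
    return 0.5 * (log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def symmetric_kl(mu0, var0, mu1, var1):
    """Symmetrized KL divergence, usable as a distance-like emotion-model score."""
    return kl_gauss(mu0, var0, mu1, var1) + kl_gauss(mu1, var1, mu0, var0)

# Identical distributions give zero distance; the score grows with the mean gap.
print(symmetric_kl(0.0, 1.0, 0.0, 1.0))  # 0.0
print(symmetric_kl(0.0, 1.0, 2.0, 1.0))  # 4.0
```

A test utterance would then be assigned to the emotion class whose Gaussian model minimizes this score, summed over features.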
The corresponding confusion matrices have been shown in Tables 24 and 25.
It can be observed that while the first stage resulted in a dramatic reduction in
the confusion between type 1 and type 2 emotions, there is also a slight reduction
of intra-type confusions. This has increased overall accuracies for both systems.
Table 24: Confusion matrix for the two-stage system (Euclidean distance based
technique in the second stage) on the IEMOCAP dataset
Emotions Anger (%) Happiness (%) Neutral (%) Sadness (%) TPR (%) FNR (%)
Anger 55 35 1 9 55 45
Happiness 14 59 1 26 59 41
Neutral 3 2 63 32 63 37
Sadness 1 9 29 61 61 39
Table 25: Confusion matrix for the two-stage system (KL distance based
technique in the second stage) on the IEMOCAP dataset
Emotions Anger (%) Happiness (%) Neutral (%) Sadness (%) TPR (%) FNR (%)
Anger 63 34 1 2 63 37
Happiness 20 56 1 23 56 44
Neutral 6 1 71 22 71 29
Sadness 2 4 34 60 60 40
For the EMO-DB corpus, the accuracy of the two-stage system with the Eu-
clidean distance-based system in the second stage is 83.4%, which is better than
the one-stage system (based on power spectral features) by 2.4%. For the system
based on KL distance in the second stage, the accuracy is 85.6%, which is better
by 4.6% (Table 26).
The confusion matrices of the two systems have been shown in Tables 27 and
28. We get the same trends as we got for the IEMOCAP dataset.
Table 27: Confusion matrix for the two-stage system (Euclidean distance based
technique in the second stage) on the EMO-DB dataset
Emotions Anger (%) Happiness (%) Neutral (%) Sadness (%) TPR (%) FNR (%)
Anger 88 12 0 0 88 12
Happiness 29 71 0 0 71 29
Neutral 1 2 79 18 79 21
Sadness 1 1 4 94 94 6
Table 28: Confusion matrix for the two-stage system (KL distance based
technique in the second stage) on the EMO-DB dataset
Emotions Anger (%) Happiness (%) Neutral (%) Sadness (%) TPR (%) FNR (%)
Anger 97 3 0 0 97 3
Happiness 35 65 0 0 65 35
Neutral 3 2 84 11 84 16
Sadness 2 0 10 88 88 12
5 Final remarks
After a detailed exposition of the methods used and the results obtained from
all the phases, let us now summarize the results. We will present the overall per-
formance of the systems of all the phases. We take the system based on only
MFCCs (in phase 1) as the baseline system and compare our improved systems
in the different phases with it. There are two candidate baseline systems for
each dataset: one based on the Euclidean distance-based technique and the other
based on the KL distance-based technique. For each dataset, we choose the one
giving the better performance as our baseline system.
The summarized performance is shown in Tables 29 and 30. It can be seen
from both the tables that when the power spectral features with SVMs are used
in the first stage and combined features with the KL distance-based technique are
used in the second stage, we get dramatic improvements. We get improvements as
high as 10.6% and 21.8% compared to the baseline classifiers for the IEMOCAP
and EMO-DB datasets respectively. Also our best system (last row in Table 30)
outperforms the recent work by Kadiri et al. [7] on the EMO-DB corpus by 9.6%.
Table 29: Performance comparison of the systems developed for the IEMOCAP dataset
Phase First stage classifier Second stage classifier Accuracy (%) Improvement (%)
1 MFCC-based (Euclidean) NA 53.8 0 (baseline)
1 Combined features (Euclidean) NA 54.4 0.6
1 Combined features (KL) NA 58.1 4.3
2 Prosody (Sad vs others) Combined features (Euclidean) 59.6 5.8
2 Prosody (Sad vs others) Combined features (KL) 61.2 7.4
3 Power spectrum (all four emotions) NA 57.5 3.7
3 Power spectrum (type 1 vs type 2) Combined features (Euclidean) 60 6.2
3 Power spectrum (type 1 vs type 2) Combined features (KL) 64.4 10.6
Table 30: Performance comparison of the systems developed for the EMO-
DB dataset
Phase First stage classifier Second stage classifier Accuracy (%) Improvement (%)
1 MFCC (KL) NA 63.8 0 (baseline)
1 Combined features (Euclidean) NA 64.1 0.3
1 Combined features (KL) NA 65.8 2
2 Prosody (Sad vs others) Combined features (Euclidean) 76.7 12.9
2 Prosody (Sad vs others) Combined features (KL) 77 13.2
3 Power spectrum (all four emotions) NA 81 17.2
3 Power spectrum (type 1 vs type 2) Combined features (Euclidean) 83.4 19.6
3 Power spectrum (type 1 vs type 2) Combined features (KL) 85.6 21.8
In phase 2, the second stage was based on our phase 1 techniques trained on all
the other emotions, excluding sadness. The performance of these systems was
considerably better than that of the phase 1 systems. However, we observed that the confusion between happiness and anger
was high. Therefore, in phase 3, we attempted to develop another set of two-stage
SER systems in which the first stage was based on power spectral features that
identified whether a given speech utterance belonged to type 1 (anger and happiness)
or type 2 (neutral and sadness). Then, in the second stage, our phase 1
classifiers further classified the speech utterance into a distinct emotion using the
emotion type information obtained from the first stage. The performance of these phase
3 systems improved dramatically. It outperformed the best of our phase 1 systems
by as much as about 10% for the IEMOCAP corpus and by about 20% for the
EMO-DB corpus. It also outperformed a very recent and related work [7] by 9.6%
for the EMO-DB corpus. The use of these power spectral features seems quite
promising, and we plan to explore them further in future work.
7 Data Availability
Two datasets have been used in this work. One of these is the German EMO-DB
corpus. It is freely available at http://emodb.bilderbar.info/index-1280.html.
The other one is the English IEMOCAP dataset, which is freely available at https:
//sail.usc.edu/iemocap/.
The authors declare that they have no known competing financial interests or
personal relationships that could have appeared to influence the work reported in
this paper.
References
1. A. Satt, S. Rozenberg, and R. Hoory, “Efficient emotion recognition from speech using
deep learning on spectrograms.” in Interspeech, pp. 1089–1093, 2017.
2. G. Fant, “The source filter concept in voice production,” STL-QPSR, vol. 1, no. 1981, pp.
21–37, 1981.
3. S. G. Koolagudi, R. Reddy, and K. S. Rao, “Emotion recognition from speech signal using
epoch parameters,” in 2010 international conference on signal processing and communi-
cations (SPCOM). IEEE, pp. 1–5, 2010.
4. P. Gangamohan, S. R. Kadiri, S. V. Gangashetty, and B. Yegnanarayana, “Excitation
source features for discrimination of anger and happy emotions,” in Fifteenth Annual
Conference of the International Speech Communication Association, 2014.
5. S. R. Kadiri, P. Gangamohan, S. V. Gangashetty, and B. Yegnanarayana, “Analysis of ex-
citation source features of speech for emotion recognition,” in Sixteenth Annual Conference
of the International Speech Communication Association, 2015.
6. A. Haque and K. S. Rao, “Modification of energy spectra, epoch parameters and prosody
for emotion conversion in speech,” International Journal of Speech Technology, vol. 20,
no. 1, pp. 15–25, 2017.
7. S. R. Kadiri, P. Gangamohan, S. V. Gangashetty, P. Alku, and B. Yegnanarayana, “Ex-
citation features of speech for emotion recognition using neutral speech as reference,”
Circuits, Systems, and Signal Processing, vol. 39, no. 9, pp. 4459–4481, 2020.
30. P. Jackson and S. Haq, “Surrey audio-visual expressed emotion (SAVEE) database,” Uni-
versity of Surrey: Guildford, UK, 2014.
31. H. M. Fayek, M. Lech, and L. Cavedon, “Towards real-time speech emotion recognition
using deep neural networks,” in 2015 9th international conference on signal processing
and communication systems (ICSPCS). IEEE, pp. 1–5, 2015.
32. Y. Zhao, X. Jin, and X. Hu, “Recurrent convolutional neural network for speech process-
ing,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, pp. 5300–5304, 2017.
33. S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion recognition using
recurrent neural networks with local attention,” in 2017 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 2227–2231, 2017.
34. P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou, “End-to-end
multimodal emotion recognition using deep neural networks,” IEEE Journal of Selected
Topics in Signal Processing, vol. 11, no. 8, pp. 1301–1309, 2017.
35. M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Torres Torres, S. Scherer,
G. Stratou, R. Cowie, and M. Pantic, “Avec 2016: Depression, mood, and emotion recog-
nition workshop and challenge,” in Proceedings of the 6th international workshop on au-
dio/visual emotion challenge, pp. 3–10, 2016.
36. J. Zhao, X. Mao, and L. Chen, “Learning deep features to recognise speech emotion using
merged deep CNN,” IET Signal Processing, vol. 12, no. 6, pp. 713–721, 2018.
37. S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, “Adversarial
auto-encoders for speech based emotion recognition,” arXiv preprint arXiv:1806.02146,
2018.
38. S. E. Eskimez, Z. Duan, and W. Heinzelman, “Unsupervised learning approach to feature
analysis for automatic speech emotion recognition,” in 2018 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5099–5103,
2018.
39. L. Li, Y. Zhao, D. Jiang, Y. Zhang, F. Wang, I. Gonzalez, E. Valentin, and H. Sahli,
“Hybrid deep neural network–hidden Markov model (DNN-HMM) based speech emotion
recognition,” in 2013 Humaine association conference on affective computing and intelli-
gent interaction. IEEE, pp. 312–317, 2013.
40. B. Xia and C. Bao, “Speech enhancement with weighted denoising auto-encoder.” in In-
terspeech, pp. 3444–3448, 2013.
41. Z. Zhang, F. Ringeval, J. Han, J. Deng, E. Marchi, and B. Schuller, “Facing realism in
spontaneous emotion recognition from speech: Feature enhancement by autoencoder with
lstm neural networks,” in Proceedings INTERSPEECH 2016, 17th Annual Conference of
the International Speech Communication Association (ISCA), pp. 3593–3597, 2016.
42. N. Cummins, S. Amiriparian, G. Hagerer, A. Batliner, S. Steidl, and B. W. Schuller,
“An image-based deep spectrum feature representation for the recognition of emotional
speech,” in Proceedings of the 25th ACM international conference on Multimedia, pp.
478–484, 2017.
43. S. Steidl, Automatic classification of emotion related user states in spontaneous children’s
speech. Logos-Verlag, 2009.
44. F. Weninger, F. Eyben, and B. Schuller, “Single-channel speech separation with memory-
enhanced recurrent neural networks,” in 2014 IEEE international conference on acoustics,
speech and signal processing (ICASSP). IEEE, pp. 3709–3713, 2014.
45. F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers,
J. Epps, P. Laukka, S. S. Narayanan et al., “The geneva minimalistic acoustic parameter
set (GeMAPS) for voice research and affective computing,” IEEE Transactions on Affective
Computing, vol. 7, no. 2, pp. 190–202, 2015.
46. F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in openSMILE,
the Munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM
international conference on Multimedia, pp. 835–838, 2013.
47. S. Amiriparian, M. Gerczuk, S. Ottl, N. Cummins, S. Pugachevskiy, and B. Schuller,
“Bag-of-deep-features: Noise-robust deep feature representations for audio analysis,” in
2018 International Joint Conference on Neural Networks (IJCNN). IEEE, pp. 1–7,
2018.
48. K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from speech signals,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 16, no. 8, pp. 1602–1613,
2008.