
Noname manuscript No.

(will be inserted by the editor)

Hierarchical Emotion Recognition from Speech Using Source, Power Spectral and Prosodic Features

Arijul Haque1 [corresponding author] · K. Sreenivasa Rao2

1, 2: IIT Kharagpur, India

Received: date / Accepted: date

Abstract Features related to the glottal closure instants (GCI) exhibit different
patterns for different emotions. In this work, our main objective was to explore the
effectiveness of these features in speech emotion recognition (SER). In this regard,
we had proposed two distance-based classifiers based on four features related to
GCI. This was the first phase of our work. In two later phases of this work, we
extended this idea to develop hierarchical two-stage SER systems in order to couple
the GCI features with other features to improve our SER systems. The first stage
in phase 2 was based on prosodic features and for phase 3, we had used power
spectral features in the first stage. The second stage in both the systems was
based on the GCI features. The best performance was observed for the phase
3 systems, which outperformed the phase 1 systems by about 10%
for the IEMOCAP corpus and by about 20% for the EMO-DB corpus. It also
outperformed a related and recent work by Kadiri et al. (2020) by 9.6% for the
EMO-DB corpus.
Keywords Speech emotion recognition · epoch features · power spectral features ·
KL distance

1 Introduction

In the evolving field of human-computer interaction (HCI), there are a large num-
ber of modes by which humans communicate with computers. Modes like speech,
text, and GUI-based interaction using a mouse or touchscreen are the most com-
mon. Among these, speech is one of the most intuitive and natural modes of
communication. Apart from the message conveyed by speech, one of the most important aspects that makes the message more meaningful is emotion. Therefore, for
more natural HCI through speech, machines also need to deal with paralinguis-
tic aspects of speech like emotions. Computers should be able to understand the
emotion in speech conveyed by a human, as well as generate speech in response
that corresponds to both the message and the emotion identified. The first task
requires machines to be able to recognize emotions from human speech. Our work
focuses on this aspect, i.e., identifying emotions automatically from human speech.
There are many previous works in this area spanning about two decades and this
is still a hot topic of research. Emotion recognition has applications in making
HCI better and more natural. In addition to that, emotion recognition has many
other applications. As an example, marketing strategies can be formulated based
on identifying emotions from audio recordings of customer feedback. Another ex-
ample is effective e-learning by adjusting teaching strategies based on the emotions
detected from voice recordings of student feedback [1].

Emotion recognition tasks, like other pattern recognition tasks related to mul-
timedia like text, audio, video and image processing, have gone through some
paradigm shifts in the past few decades. It started off with rule-based systems,
evolved into machine learning techniques, and today the state of the art technique in almost all kinds of pattern recognition tasks in general, and in SER in particular, is a special kind of machine learning (ML) technique known as deep learning. The motivation behind the paradigm shift from general machine learning
techniques to deep learning comes from the fact that in many tasks, deep neural
networks have been found to classify better with raw data (or data with little
pre-processing) than with other ML techniques with carefully engineered hand-
crafted features. As an example, in the case of speech emotion recognition (SER),
deep learning can be directly applied to raw speech spectrograms. Whereas, in the
case of ordinary machine learning, there is an additional requirement of a feature
extraction module. This is why most of the SER techniques starting from around
2017 up to this day are based on deep learning on direct, raw speech spectro-
grams. In fact, within these 3-4 years of the deep learning boom, there have been
numerous works on deep learning-based SER. They have been found, in general,
to perform better than ordinary ML techniques. However, despite the advantages
of high performance and the ability of the deep networks to learn appropriate fea-
tures (hidden representations) by themselves, they suffer from some drawbacks.
The first problem with deep learning is that deep networks require huge amounts of data
to perform well. However, huge amounts of relevant data are not always easy to
obtain. Secondly, training deep neural networks (DNN) is extremely expensive in
view of computational complexity. The most complex models may take a week
to train even with several GPUs. Thirdly, the DNN models are so complex that
determining the proper topology and hyperparameters is something of a black art, with hardly any theory, or only very complicated theory, to guide us. Moreover, the trial
and error approach towards these problems is also very time consuming because
each trial will take very long to run (a few days to even a week). These problems
are more intense in low resource scenarios where several expensive GPUs and huge
datasets are not available. These problems are alleviated to some extent by using
transfer learning, in which a network pretrained on one task is reused for another task by retraining only the last layer or last few layers. This reduces the computational
complexity to a large extent. However, if large datasets for the desired problem
are not available, the problem of overfitting is likely to occur. This is because pre-
trained networks are generally large networks with often millions of parameters
to be tuned. Despite this limitation, we still plan to explore transfer learning techniques for SER in future work. In fact, because of these problems, there
is still a need to explore ordinary ML techniques and new features to improve SER
performance without imposing a serious demand on resources. It is necessary to
seek better ways of using ordinary ML, whether it is by designing new classifiers
or by using effective features. This work endeavours to make an exploration in this
regard.

Among the works in speech emotion recognition (SER) using ordinary ML,
there have been many works using the source-filter model [2]. Using this model,
some source and/or vocal tract (VT) features are extracted. Using either or a com-
bination of these features, a standard pattern recognition algorithm (GMM, SVM, ANN, etc.), or a combination of two of them, is used to predict emotions.
There are many works on VT features. Compared to that, works on only exci-
tation source features or its combination with VT features are few. A few works
on emotion analysis had shown that some parameters derived from the glottal
closure instants (GCIs) had different trends for different emotions [3][4]. Works
exploiting these differences in the patterns to build speech emotion recognition
(SER) systems are very rare. Among those rare works on SER is one that had
used these parameters related to GCI (also known as epochs) as features [5]. Since
these features are extracted at a sub-segmental level (using frames as short as 5-10
ms), they are also referred to as sub-segmental excitation source features (another
name for the same is epoch features). In that work, three sub-segmental features were used: pitch, the energy of excitation (EoE) and the strength of exci-
tation (SoE) [6]. These features were fed to classifiers proposed by the authors,
based on the KL distance metric. Recently, the same authors have extended that
work by adding one more feature [7]– the ratio between the high frequency and
low-frequency spectral energy (β) [8]. Besides that, the statistics of the variations
of these features among four different emotions (anger, happiness, neutral and
sadness) were also shown. The performance of the classifiers on the four emotions
was comparable with other existing works. However, the classifiers performed re-
markably well when instead of distinct emotions, emotion groups were identified.
Happiness and anger were considered as one group and neutral and sadness as another group. They used this idea to finally develop hierarchical classifiers in which
the emotion category was identified in the first stage and the actual emotion was
then subsequently identified in the second stage. The results triggered in us the idea that we can examine some specific features with which we can attempt to discriminate groups of emotions from one another. If the recognition accuracy is
then found to be quite high, then it seems reasonable to use the same features or
a different set of features in the next step to distinguish among emotions within
each group.

Motivated by these observations and thoughts based on the above work, we
felt a strong need to explore hierarchical classification of emotions further. Our
present work is one such attempt in exploring hierarchical frameworks for clas-
sifying emotions from speech, with new techniques and more comprehensiveness.
We propose two distance-based classifiers for identifying emotions– one based on
KL distance and the other based on Euclidean distance. Although the above two
works have also used the KL distance metric for emotion identification, we have
used the KL distance in a different and simpler way. The details will be explained
in a relevant section. Our proposed classifiers will also be explained in detail in
that section. Two publicly available datasets have been used for this study, the
IEMOCAP dataset [9], a semi-natural dataset in English and the German emo-
tional dataset also known as EMO-DB [10], an acted dataset. Following the trend
of most works related to SER, we have worked with four common emotions– anger,
happiness, neutral and sadness.

This work has been done in three phases. In the first phase, we first applied
our proposed classifiers on the epoch features for the emotion identification task.
A reason for using epoch features was that we expected these features to classify
groups of emotions well. Then, for comparison, those same classifiers were applied
on state of the art MFCC features. We observed that though the epoch features
did not perform miserably, they did not seem to be good enough as features for
building decent quality SER systems. However, they may play a complementary
role to state of the art MFCC features by improving the performance compared to using MFCC features alone. Therefore, for purposes of comparison, we
have also attempted a combination of the epoch features and MFCC features.
It should also be noted that instead of attempting a hierarchical classification in
this phase, we decided to try all these classifiers directly on the four emotions to
get an idea of how to group the emotions for hierarchical classification, which is
later attempted in the next two phases of this work. Based on some observations
(especially regarding the sad emotion) in the first phase, in the second phase of
our work, we proposed a two-stage classifier, in which sad vs not-sad (rest) was
identified in the first stage using prosodic features, and if the speech was identified
as not sad, then, in the second stage, another classifier (based on phase 1 systems
trained only on the other three emotions) would classify the speech into one of the
other three categories. There were considerable improvements in the performance
in this phase compared to the first phase. However, the confusion between anger
and happiness was still significant. Therefore, in the third phase of our work, in
order to deal with this confusion between anger and happiness, we tried to look for
features that could potentially be useful. In that regard, we analyzed the power
spectrum of speech representing different emotions and derived features from the
power spectra. A set of 14 features was extracted. A two-stage classifier similar
to the one in the second phase was applied, and the classification rates were found
to be even better than those in the previous two phases. The confusion between
happiness and anger had also reduced in the process.

The paper is organized as follows. Section 2 explains some related work on
SER. Section 3 outlines the details of the datasets used in the work. Section 4
is divided into three subsections, each dedicated to a phase of the work. In each
subsection, the details of the features, classifiers and the techniques used have been
outlined. The details of the performance of the developed SER system have also
been provided with comments on the performance. Section 5 finally summarizes
the results from all the phases and provides some final comments. Lastly, section
6 wraps up the paper by summarizing the work and giving some conclusions.
Directions for future work have also been discussed therein.

2 Related work

Before the advent of the deep learning era, different machine learning algorithms
were used to identify emotions in speech. These algorithms comprise two main
steps: feature extraction and pattern recognition. A good review of some previous
works on the features and the pattern recognition algorithms used for SER can
be found in [11]. Broadly three types of features have been used in the literature:
excitation source features, vocal tract features and prosodic features. Some works
combine features from these three categories before feeding those to a classifier.

An analysis of the variations in the glottal waveform patterns for speech in
different emotions was performed in [12]. Parameters related to glottal closure in-
stants (GCI), also known as epochs, were used in [13] to identify emotions. The
epoch parameters were extracted from the LP residual and the zero frequency fil-
tered (ZFF) signal. Six emotions were used in this study. Using GMMs and SVMs
as classifiers, recognition rates were 61% for the GMM classifier and 58% for the
SVM classifier. An analysis of speech in different emotions was performed in [14]
with the following epoch parameters: pitch, epoch strength, epoch sharpness and
energy of excitation. These parameters were found to vary significantly across the
four emotions studied: neutral, anger, sadness and happiness. This work provided
evidence for the fact that sub-segmental features have information related to the
emotion embedded in them. Several other emotion studies like [15] and [16] have
attempted emotion recognition using different features and found that though
identifying emotions using those features yields satisfactory results, the confusion
between the emotions anger and happiness is high. In an attempt to address this
problem, [4] used three excitation source features, namely, pitch, epoch strength
and spectral band magnitude energy ratio (β) to analyze the confusion between
anger and happiness using the IIIT-H Telugu Emotion Database [14] and a German
emotional database known as EMO-DB. The confusion scores obtained indicated
that those features do well in reducing the confusion between those two emotions.
A work on SER using sub-segmental excitation source features on emotion recog-
nition by Kadiri et al. [5] was already discussed in the introduction.

Works on vocal tract features and prosodic features can be found in abundance.
Here, we mention a few important ones. An attempt was made to identify emotions
using LFPC (log frequency power coefficients), LPCC and MFCC as features [17]
using discrete HMMs as classifiers. Lee et al. used MFCC features with HMM-
based classifiers for classifying speech into four emotion categories, obtaining 65%
accuracy [18]. Another work used MFCC features derived from speech utterances
from the LDC emotional speech database, prepared by the authors, and EMO-DB
[19]. Koolagudi et al. used MFCC features derived from speech and fed them to a
GMM classifier to identify emotions [20].

Among the works on prosodic features, an early work by Petrushin et al. an-
alyzed the potential of an ensemble of neural networks employed on prosodic fea-
tures like pitch, the first and second formants, energy and the speaking rate,
obtaining an accuracy of 77% on two emotional states [21]. Prosodic features like
fundamental frequency (F0), energy, duration, the first and second formant fre-
quencies were used for detection of negative and non-negative emotions using spo-
ken language data obtained from a call centre application [22]. Kao et al. extracted
pitch and power-based features from frame, syllable, and word levels for recogni-
tion of four emotions in Mandarin [23]. In [24], the authors derived 35-dimensional
prosodic feature vectors including pitch, energy, and duration from speech for clas-
sification into seven emotions using neural networks, from the EMO-DB corpus,
getting an accuracy of 51%. Koolagudi et al. used duration patterns, average pitch,
the standard deviation of pitch and average energy to classify speech into one of
eight emotions from the IITKGP-SESC corpus [25].

Apart from the above-mentioned three types of features, different feature com-
binations have also been used in the literature. Nakatsu et al. used a combination
of LPCCs and pitch related features for the identification of eight emotions using
neural networks [26]. Bozkurt et al. used prosodic, spectral and HMM-based fea-
tures for the classification of five emotions of the Interspeech 2009 challenge and
achieved recognition accuracy of 63% [27]. Spectral, prosody and lexical features
were derived from the semi-natural USC-IEMOCAP database [9] in [28], yielding
an accuracy of 65.7%.

With the advent of the deep learning era around the middle of this decade, a lot
of work has been done on identifying emotions using deep neural networks (DNNs)
from speech. Fayek et al. proposed an end-to-end DNN architecture to identify
emotions from speech using the eNTERFACE [29] and SAVEE [30] databases,
yielding accuracies of 60.53% and 59.7%, respectively, on the two datasets [31].
Zhao et al. used a recurrent convolutional neural network (RCNN) to categorize
speech into seven emotions using the IEMOCAP corpus [32]. This was the first
time a hybrid model was used for SER and it achieved an accuracy of 83.4%. RNNs
were used in [33] to identify the same emotions as in [32] using the IEMOCAP
corpus. Its weighted accuracy (WA) and unweighted accuracy (UA) outperformed
SVMs by 5.7% and 3.1% respectively. Tzirakis et al. developed an SER system
[34] on four emotions from the RECOLA and AVEC 2016 [35] datasets using an
end-to-end system using convolutional neural networks (CNN) and ResNet of 50
layers along with long short term memory (LSTM). An accuracy of 78.7% was
achieved in that work. In [36], a deep convolutional neural network (DCNN) was
used on EMO-DB and IEMOCAP. They merged deep 1D and 2D CNN for high-
level learning of features from speech spectrograms. Seven emotions were used for
the study and accuracy of 92.71% was obtained. Other variants of deep neural
networks have also been used for SER. Examples are adversarial autoencoders
[37], variational autoencoders [38], HMM-based hybrid DNNs [39], etc. The works
mentioned in this paragraph pertain to clean speech. It is clear from the accuracies
obtained how much better deep learning performs compared to traditional ML.

3 Speech datasets used

In this work, two publicly available datasets: the IEMOCAP dataset and the Ger-
man EMO-DB dataset, have been used. The IEMOCAP dataset is a multi-modal
corpus in English specifically designed for the analysis of emotions. It is a semi-
natural corpus that has been recorded from 10 actors in dyadic sessions in two
modes: scripted and improvised. The utterances in the corpus have been labelled
according to both categorical and continuous models of emotion. It comprises nine
emotions, among which four emotions, namely, neutral, anger, happiness and sad-
ness are the most commonly used. The approximate duration of the entire corpus
is about 12 hours.

The Berlin EMO-DB dataset is a simulated emotion speech corpus, in which


10 German sentences were recorded by 10 actors in seven different styles/emotions
in an anechoic chamber. Initially, 800 utterances were recorded (7 emotions × 10 actors × 10 sentences, plus some second versions). This was followed by a perceptual
test based on the naturalness and recognizability of emotions. Utterances that
did not meet the set standards based on speech naturalness and recognizability
of emotions were discarded. This resulted in a total of 535 utterances in the final
corpus. Recordings were collected at a sampling frequency of 48 kHz with a bit
depth of 16 bits per sample. These were later downsampled to 16 kHz.

4 Proposed framework

Different features, classifiers and techniques have been used in different phases.
In this section, we give a detailed description of all these aspects. Since this work
is based on three phases, we dedicate a subsection to each. In those subsections,
the results of the experiments in each phase will also be presented. Finally, each
subsection will be concluded by our remarks on the results of each phase.

4.1 Phase-1

The following features have been extracted from the excitation source (a code sketch illustrating their computation is given after the list):

(i) Pitch: The reciprocal of the time interval between two consecutive glottal
closure instants (GCI). The positions of the positive zero crossings of the zero
frequency filtered (ZFF) [48] signal are a fairly good approximation of GCI.
(ii) Energy of excitation (EoE): The ratio of the root mean square energy
(RMSE) of the samples of the Hilbert envelope of the LP residual [49] to the
RMSE of the samples of the speech signal, over 2 ms around each GCI. This is
indicative of the vocal effort.
(iii) Strength of excitation (SoE): It is measured by the slope of the ZFF
signal at the positive zero crossings. Also known as epoch strength, it gives an
idea of the energy of the excitation signal at the epoch locations.
(iv) Epoch sharpness: The ratio of the standard deviation to the mean of the
samples of the Hilbert envelope of the LP residual around each GCI. This can
represent the loudness level of the excitation source signal.
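
Below is a minimal Python sketch, not the authors' code, of how these four epoch features might be computed once the GCI sample indices, the LP residual and the ZFF signal are available (obtaining those is outside the scope of this sketch). The SoE slope is approximated by a first difference, and all names are illustrative.

```python
# Hypothetical helper: computes the four epoch (sub-segmental) features per epoch
# cycle, assuming gci holds integer sample indices of the glottal closure instants.
import numpy as np
from scipy.signal import hilbert

def epoch_features(speech, lp_residual, zff, gci, fs):
    """Return one [F0, EoE, SoE, sharpness] vector per epoch cycle."""
    env = np.abs(hilbert(lp_residual))          # Hilbert envelope of the LP residual
    half = int(0.001 * fs)                      # +/- 1 ms, i.e. a 2 ms window around each GCI
    feats = []
    for k in range(len(gci) - 1):
        t0, t1 = int(gci[k]), int(gci[k + 1])
        f0 = fs / float(t1 - t0)                # (i) pitch: reciprocal of the GCI interval
        lo, hi = max(t0 - half, 0), t0 + half
        rmse_env = np.sqrt(np.mean(env[lo:hi] ** 2))
        rmse_speech = np.sqrt(np.mean(speech[lo:hi] ** 2)) + 1e-12
        eoe = rmse_env / rmse_speech            # (ii) energy of excitation
        soe = zff[t0 + 1] - zff[t0]             # (iii) SoE: ZFF slope at the zero crossing (first difference)
        sharp = np.std(env[lo:hi]) / (np.mean(env[lo:hi]) + 1e-12)   # (iv) epoch sharpness
        feats.append([f0, eoe, soe, sharp])
    return np.array(feats)
```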

In addition to these four excitation source features, MFCC features, which represent the vocal tract filter, have also been used for comparison.

4.1.1 An analysis of the effectiveness of the Euclidean distance in distinguishing emotions

It has already been discussed that our proposed emotion identification methods
are based on two types of distances: the KL distance and the Euclidean distance.
The effectiveness of using the KL distance for discriminating between emotions
was already demonstrated in [7]. However, it did not make similar explorations for
other distance metrics. This has triggered in us the idea that just like the KL dis-
tance, other distance metrics might also be useful for the same task. However, we
are not aware of any work that has investigated the effectiveness of other distance
metrics in this regard. Therefore, we decided to explore another distance metric,
the Euclidean distance, for our work. We have chosen the Euclidean distance as
it is one of the simplest of all distance metrics and is also the most intuitive.
Hence, before using the Euclidean distance directly to develop SER systems, a
similar analysis is performed in this work to assess the effectiveness of using this
distance for this task. In this regard, random samples of 200 utterances from each
emotion have been considered for an analysis from the IEMOCAP corpus. For an
analysis of the excitation source features, we have taken the four sub-segmental
features just described above. For vocal tract features, the state of the art MFCC
features along with their delta coefficients have been used. Thirteen dimensional
MFCC features have been extracted for this purpose. After extracting both these
types of features (epoch features and MFCC features), we process both these fea-
ture sets (epoch features and MFCC features) separately as follows. First of all,
GMMs are built for each emotion with these features. We refer to these GMMs for
each emotion as a template of that emotion, parameterized by its mean vectors
and covariance matrices.

We had decided to take M = 8 as a starting point to assess the effectiveness
of the Euclidean distance in distinguishing emotions. How this assessment is done
is explained in the next paragraph. If this effectiveness could not be established
for M = 8, then we decided to repeat building the GMMs with other numbers of
GMM components. If GMMs with different values of M could not establish the
effectiveness of the Euclidean distance in distinguishing emotions, then we could
safely conjecture that the Euclidean distance is not useful for this task. In that
case, we would have abandoned our plan of using the Euclidean distance to de-
velop SER systems. However, our first attempt with M = 8 proved to be sufficient
for our purpose as we will later see from Tables 1 and 2.

After extracting the two above-mentioned feature sets (epoch features and
MFCC features), in order to analyze how effective Euclidean distances are in dis-
tinguishing emotions, we measured $d(i, j)$, the average distance of the feature vectors of the i-th emotion from the template of the j-th emotion, for both the feature sets. To measure distances from a template, we use only the mean vectors of the GMMs representing the template. $d(i, j)$ is measured as follows. Let $[v(1), v(2), \dots, v(N_i)]$ represent the feature vectors extracted from all utterances of the i-th emotion, where $N_i$ is the total number of feature vectors extracted from the i-th emotion. Let $[\mu_{j1}, \mu_{j2}, \dots, \mu_{jM}]$ represent the mean vectors of the M-component GMM representing the j-th emotion. First of all, we measure $d(v_i(l), \mu_j)$, the average distance of the l-th feature vector of the i-th emotion from the mean vectors of the GMM of the j-th emotion. Here, the averaging is done over the M mean vectors. Therefore,

$$d(v_i(l), \mu_j) = \frac{1}{M} \sum_{k=1}^{M} \lVert v_i(l) - \mu_{jk} \rVert_2, \qquad l = 1, 2, \dots, N_i \qquad (1)$$

where $\lVert v \rVert_2$ represents the Euclidean norm of vector $v$. Now, after finding $d(v_i(l), \mu_j)$ for $l = 1, 2, \dots, N_i$, we find their average to finally calculate $d(i, j)$, i.e.,

$$d(i, j) = \frac{1}{N_i} \sum_{l=1}^{N_i} d(v_i(l), \mu_j) = \frac{1}{N_i} \sum_{l=1}^{N_i} \frac{1}{M} \sum_{k=1}^{M} \lVert v_i(l) - \mu_{jk} \rVert_2 = \frac{1}{M N_i} \sum_{l=1}^{N_i} \sum_{k=1}^{M} \lVert v_i(l) - \mu_{jk} \rVert_2 \qquad (2)$$

In this way, we find d(i, j) for i, j = 1, 2, 3, 4. Let anger, happiness, neutral and
sadness represent the first, second, third and fourth emotions respectively. These
values of d(i, j) have been shown in tables 1 and 2.
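
For concreteness, the following sketch shows how the emotion templates and the $d(i, j)$ values of Eqs. (1) and (2) could be computed with scikit-learn GMMs of M = 8 components. The variable names are ours, and only the GMM mean vectors are used for the distance computation, as described above.

```python
# Illustrative computation of the templates and of d(i, j); not the authors' code.
import numpy as np
from sklearn.mixture import GaussianMixture

M = 8  # number of GMM components used in the paper

def build_template(feature_vectors):
    """Fit an M-component GMM to all feature vectors of one emotion; keep its means."""
    gmm = GaussianMixture(n_components=M, covariance_type='diag', random_state=0)
    gmm.fit(feature_vectors)
    return gmm.means_                                    # shape (M, feature_dim)

def d_ij(vectors_i, means_j):
    """Average Euclidean distance of emotion-i vectors from the emotion-j template (Eq. 2)."""
    # ||v_i(l) - mu_jk||_2 for every vector l and every mean k, averaged over l and k
    dists = np.linalg.norm(vectors_i[:, None, :] - means_j[None, :, :], axis=2)
    return dists.mean()

# Usage sketch: templates = {e: build_template(X[e]) for e in emotions}
#               D = [[d_ij(X[i], templates[j]) for j in emotions] for i in emotions]
```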

Table 1: d(i, j) values for subsegmental features

Emotion i \ Template j    Anger    Happy    Neutral    Sad
Anger                     0.402    0.571    0.582      0.559
Happy                     0.507    0.408    0.561      0.538
Neutral                   0.478    0.534    0.321      0.516
Sad                       0.516    0.537    0.538      0.365

Table 2: d(i, j) values for MFCC features

Emotion i \ Template j    Anger    Happy    Neutral    Sad
Anger                     0.969    1.054    1.076      1.065
Happy                     1.058    0.955    1.068      1.059
Neutral                   1.055    1.077    0.697      1.078
Sad                       1.113    1.095    1.111      0.683

It is clear from both the tables that d(i, j) has the minimum values when i = j.
In other words, the distance between feature vectors and a template is minimum
when both the feature vectors and the template represent the same emotion. This
shows that the Euclidean distances (on average) between utterances of an emotion
from the templates of the four emotions are the least when the template is of the
same emotion. However, we also observe from the first two rows of both tables
that though d(i, j) is minimum for i = j for all cases, the values of d(i, j) are also
low when i = 1, j = 2 and i = 2, j = 1, compared to the other cases. This might
imply that if the Euclidean distance is used to distinguish between emotions, the
confusion between anger and happiness might be greater than the confusions between
other pairs of emotions. Nevertheless, it is clear that the Euclidean distance may
still be used as a reliable metric for measuring the deviation of utterances from
the templates representing an emotion. Another important point worth mention-
ing here is that on observing that the distances for MFCC vectors are about twice
than those of the epoch features, we should not erroneously come to the conclusion
that epoch features might perform better than MFCC features. These distances
also depend on the dimensionality of the feature vectors. The more the dimension-
ality, the more the distance. Therefore, it is quite reasonable that the Euclidean
distances for the MFCC vectors will be greater than those for the epoch features as
MFCC features have a much higher dimensionality compared to the epoch features.

Now, we use the Euclidean distance and the KL distance to propose two emo-
tion recognition techniques, each based on one of these two distance metrics. A clas-
sification technique using KL distances was also proposed by Kadiri et al. [7]. The technique they had used was based on many templates, which works well for small to medium-sized datasets like the EMO-DB corpus [9] and the IIIT-H Telugu Emotion dataset [5], prepared by the authors. However, for large datasets like the IEMOCAP dataset, which are generally semi-natural and closely resemble real-life emotional speech, the technique used by Kadiri et al. may not scale well, as the number of stored templates will be very high and the template matching will become cumbersome. Therefore, instead of using their classifica-
tion technique, we used our proposed classification technique, which consists of as
many stored templates as the number of candidate emotions. Hence, the number
of stored templates in our technique is remarkably lower compared to the other
technique. Furthermore, our technique is also simple to understand. Therefore,
the two proposed techniques were used to develop different emotion recognition systems, each based on one of the following feature sets:

(i) The four epoch parameters
(ii) MFCC features
(iii) A combination of (i) and (ii) using a score combination technique.

A total of six systems have been developed (2 distances × 3 feature sets). Below
is a detailed description of our proposed distance-based techniques.

4.1.2 Our proposed classification techniques

Classification technique 1: using KL distance

Training phase

For each emotion E:
    For each training utterance U:
        For each epoch cycle C:
            Extract F0, SoE, EoE, η.
    For each x in (F0, SoE, EoE, η):
        Collect x values from all epoch cycles from all utterances.
        Normalize x: x_normalized = (x - x_min) / (x_max - x_min).
    Combine all normalized F0, SoE, EoE and η values to form 4-D vectors.
    Fit all these vectors into a single 4-D normal distribution.
    Treat this distribution as a template T(E) for emotion E, parameterized by its mean vector µ and covariance matrix Σ.

Test phase

For each test utterance U:
    For each epoch cycle C:
        Extract F0, SoE, EoE, η.
    For each x in (F0, SoE, EoE, η):
        Collect x values from all epoch cycles in U.
        Normalize x.
    Combine all normalized F0, SoE, EoE and η values to form 4-D vectors.
    Fit all these vectors into a 4-D normal distribution.
    Treat this distribution as a test model M.
    For each emotion e in emotion set E:
        Let d(e) = KLDistance(M, T(e)).
    Assign emotion = argmin_e d(e).
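
A rough Python rendering of this technique is given below. The closed-form KL divergence between two multivariate normal distributions is used; since the text does not state the direction of the divergence (or whether it is symmetrized), KL(test model || template) is assumed here, and the function names are ours.

```python
# Sketch of classification technique 1 (KL distance), under the assumptions above.
import numpy as np

def fit_gaussian(vectors):
    """Fit a single 4-D normal distribution: mean vector and covariance matrix."""
    return vectors.mean(axis=0), np.cov(vectors, rowvar=False)

def kl_divergence(mu0, cov0, mu1, cov1):
    """Closed-form KL( N(mu0, cov0) || N(mu1, cov1) )."""
    k = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def classify_kl(test_vectors, templates):
    """templates: dict mapping emotion -> (mean, covariance) fitted on training data."""
    mu_t, cov_t = fit_gaussian(test_vectors)              # test model M
    d = {e: kl_divergence(mu_t, cov_t, mu_e, cov_e)
         for e, (mu_e, cov_e) in templates.items()}
    return min(d, key=d.get)                              # argmin over the emotion set
```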

Classification technique 2: using Euclidean distance

Training phase

For each emotion E:
    For each training utterance U:
        For each epoch cycle C:
            Extract F0, SoE, EoE, η.
    For each x in (F0, SoE, EoE, η):
        Collect x values from all epoch cycles from all utterances.
        Normalize x: x_normalized = (x - x_min) / (x_max - x_min).
    Combine all normalized F0, SoE, EoE and η values to form 4-D vectors.
    Fit all these vectors into an M-component GMM.
    Treat this distribution as a template T(E) for emotion E, parameterized by M mean vectors and covariance matrices.

Test phase

For each test utterance U:
    For each epoch cycle C:
        Extract F0, SoE, EoE, η.
    For each x in (F0, SoE, EoE, η):
        Collect x values from all epoch cycles in U.
        Normalize x.
    Combine the normalized values to form 4-D feature vectors x.
    For each emotion e in E:
        For each feature vector x in U:
            Let d_{x,e} = meanDist(x, µ_{i,e}) = (1/M) Σ_{i=1}^{M} dist(x, µ_{i,e})
            (this captures the average deviation of the feature vector x from the mean vectors of the GMM representing emotion e).
    For each emotion e in E:
        Let d_e = mean(d_{x,e}) = (1/|x|) Σ_x d_{x,e}, where |x| represents the number of feature vectors in utterance U
        (for an utterance U, d_e is the mean value of d_{x,e}, explained above, averaged over all feature vectors in U, for emotion e).
    Assign emotion = argmin_e d_e.
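
The test phase of this technique can be sketched as follows, assuming the normalized 4-D feature vectors of the utterance and the per-emotion GMM mean vectors are already available; again, this is only an illustration, not the authors' implementation.

```python
# Sketch of the test phase of classification technique 2 (Euclidean distance).
import numpy as np

def classify_euclidean(test_vectors, gmm_means):
    """gmm_means: dict mapping emotion -> array of shape (M, 4) of GMM mean vectors."""
    d = {}
    for e, means in gmm_means.items():
        # d_{x,e}: mean distance of each feature vector x to the M means of emotion e
        d_xe = np.linalg.norm(test_vectors[:, None, :] - means[None, :, :], axis=2).mean(axis=1)
        d[e] = d_xe.mean()               # d_e: average over all feature vectors of the utterance
    return min(d, key=d.get)             # assign the emotion with minimum d_e
```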

We will also build classifiers based on the above two algorithms with only
MFCCs as features. Each such classifier will output distance values for each emotion for a given
utterance. These distances will be combined with the distances from the excitation
features using the following equation:

$$d_i = k\, d_{e,i} + (1 - k)\, d_{v,i} \qquad (3)$$

where $d_i$ is the final distance obtained for the i-th emotion, $k$ is a scalar, $d_{e,i}$ is the distance obtained using the excitation source features and $d_{v,i}$ is the distance obtained using MFCC features. The value of $k$ ($0 < k < 1$) will be chosen in a way that maximizes the performance on the validation set.
The emotion with the minimum value of di will be chosen as the identified
emotion. This combined system will then be compared with the baseline system
to see how much improvement is made.
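
A possible realization of this score-level fusion is sketched below; the per-emotion distances from the two systems are assumed to be available as dictionaries, and the weight k is a free parameter tuned on the validation set.

```python
# Score-level fusion of Eq. (3); a sketch with illustrative names.
def fuse_and_classify(d_epoch, d_mfcc, k):
    """d_epoch, d_mfcc: dicts emotion -> distance from the two systems; 0 < k < 1."""
    combined = {e: k * d_epoch[e] + (1 - k) * d_mfcc[e] for e in d_epoch}
    return min(combined, key=combined.get)

# Typical usage: sweep k over, e.g., 0.1, 0.2, ..., 0.9 on a validation set and
# keep the value giving the highest recognition accuracy.
```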

4.1.3 Results and discussions

Tables 3 and 4 show the recognition accuracies for the different systems we devel-
oped for the IEMOCAP and EMO-DB datasets respectively. From the tables, we observe that the performance is best with the combined features for both the KL distance and the Euclidean distance, on both datasets. While it is expected that MFCCs will outperform the four subsegmental features, the improvement in the combined systems proves that the subsegmental features also have significant emotion-related information embedded in them, which can be exploited along with some state of the art features to develop a decent SER system. Also, the performance of the systems based on subsegmental features, though not spectacular, shows that they
contain emotion-related information. The improvement of the combined systems
compared to the MFCC-based systems were 0.6% and 4.7% for the Euclidean distance-based and KL distance-based techniques, respectively, for the IEMO-
CAP dataset. Overall, KL distance based systems have been found to perform
slightly better than the systems based on Euclidean distance. Almost the same
trends are observed in the case of the EMO-DB corpus. For this corpus, the im-
provement of the combined systems compared to the MFCC-based systems were
5.9% for the Euclidean distance-based technique and 2% for the KL distance-
based technique. It can also be observed that the improvement in performance of
the combined systems compared to the MFCC-based systems was lower for the
Euclidean distance-based system than for the KL distance-based system for the
IEMOCAP dataset. However, we can observe opposite trends in the improvement
figures for the EMO-DB dataset. This may be due to differences in the nature
of the two datasets. In empirical analyses, anomalies can often be observed, and it is sometimes difficult to pinpoint their exact cause. This is because several factors (for example, datasets, features and classifiers) may contribute to the anomaly. The best we can do in those cases is to make an educated guess. Overall, from the results, we can sum up that combined systems are the best SER choice and that, though epoch features by themselves may not be very good at distinguishing emotions, they can play a complementary
role to some other state of the art features in identifying emotions better.

Table 3: Recognition accuracies for the different systems on the IEMOCAP dataset

Features + Classifier          Recognition Accuracy (%)
Epoch features + Euclidean 44.0
MFCC + Euclidean 53.8
Combined + Euclidean 54.4
Epoch features + KL 45.4
MFCC + KL 53.4
Combined + KL 58.1

Table 4: Recognition accuracies for the different systems on the EMO-DB dataset

Features + Classifier          Recognition Accuracy (%)
Epoch features + Euclidean 45.8
MFCC + Euclidean 58.2
Combined + Euclidean 64.1
Epoch features + KL 48.7
MFCC + KL 63.8
Combined + KL 65.8

Tables 5 and 6 show the confusion matrices for the two classification systems
using the combined features, on the IEMOCAP corpus. Though the identification
of anger, happiness and sadness is fairly satisfactory, the identification rate of the neutral emotion is quite low for both techniques. For the Euclidean distance based
classification, 28% of neutral utterances are getting confused with sadness, while
for the KL distance based algorithm, 29% of neutral utterances get confused with
sadness.

Table 5: Confusion matrix for Euclidean distance-based classification system with combined features on the IEMOCAP dataset. TPR means true positive rate and FNR means false negative rate

Emotions Anger (%) Happy (%) Neutral(%) Sad (%) TPR (%) FNR (%)
Anger 58 15 10 17 58 42
Happy 16 51 12 21 51 49
Neutral 14 18 40 28 40 60
Sad 12 14 12 62 62 38

Table 6: Confusion matrix for KL distance-based classification system with combined features on the IEMOCAP dataset

Emotions Anger (%) Happy (%) Neutral(%) Sad (%) TPR (%) FNR (%)
Anger 64 12 6 18 64 36
Happy 18 50 8 24 50 50
Neutral 17 16 38 29 38 62
Sad 11 11 7 71 71 29

Tables 7 and 8 show the confusion matrices for the Euclidean distance-based
classification system and the KL distance-based classification system respectively,
on the EMO-DB corpus. The confusion matrices for this dataset also show similar
trends as in the IEMOCAP corpus.

Table 7: Confusion matrix for Euclidean distance-based classification system with combined features on the EMO-DB dataset

Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 65 11 9 15 65 35
Happy 13 56 10 21 56 44
Neutral 11 9 55 25 55 45
Sad 8 7 11 74 74 26

Table 8: Confusion matrix for KL distance-based classification system with combined features on the EMO-DB dataset

Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 67 11 8 14 67 33
Happy 13 58 9 20 58 42
Neutral 10 10 56 24 56 44
Sad 8 7 10 75 75 25

It is also noticeable that, in both systems and on both datasets, many utterances of other emotions are getting misclassified as sadness. This suggests that sadness might have some peculiarity that is not well captured by either the subsegmental source features or the vocal tract features. Since this is common to both the proposed techniques and both the datasets, we conjecture that neither the datasets nor the techniques are responsible for this confusion; the features used have probably played a major role
in this confusion. Therefore, we decided to perform an analysis to find out which
types of features may be more appropriate for this task. The details of these
analyses will soon follow in subsection 4.2. However, all these observations have
given rise to the following idea. Given any speech during the test case, we can
use an appropriate set of features to first determine whether the speech belongs
to the sad category. If it is, there will be no further processing of the speech and
it will be classified as sad. However, if that is not the case, then we will use our
previous classifiers (based on epoch features and MFCC features) to identify the
speech among the rest of the three emotions. Given this framework, we now move
on to the next phase of our work based on this plan.

4.2 Phase-2

In an attempt to find out an appropriate set of features to identify the sad emotion
well, we, first of all, decided to find out if any noticeable pattern can be discerned
from the waveforms of the speech itself. Therefore, we manually observed the time
domain waveforms of different speech segments from each emotion. After observ-
ing some waveforms, the first thing that we observed was the remarkable difference
between the intensity levels of sad speech versus other speech. We noticed that
in most cases, the intensity levels of sadness are significantly lower than those of other emotions. Another observation, though less prominent than the one already mentioned, was that the variations (which can be quantified by variance or range) in the intensity of sad speech across an utterance were lower than those of other emotions. At the utterance level, there are significant variations in the intensities of
anger and happiness emotions. However, for sadness, the variations in intensity at
the utterance level also seemed to be quite low. Compared to anger and happiness,
the variations of neutral speech were also low, but not as low as variations in sad
speech. Furthermore, it is well known that pitch patterns also vary for different
emotions at the utterance level. Therefore, we also decided to study the patterns
of variations in pitch for different emotions. In all, we decided to analyze three
prosodic parameters: global intensity, the standard deviation of the intensity and
the standard deviation of the pitch. These observations were quite apparent for
the IEMOCAP dataset. However, for the EMO-DB dataset, we did not notice any
significant patterns. Nevertheless, we decided to conduct a study of all these pa-
rameters at the utterance level on both datasets. We considered 50 utterances per
emotion from both datasets for our analysis. We extracted the three parameters
at the utterance level and derived the scatter plots of these parameters based
on the four emotions. Since the parameters were found at the utterance level, we
obtained one point per utterance in the three dimensional space. Therefore, there
are a total of 50 points in the scatter plot per emotion. Figures 1 and 2 show the
three dimensional scatter plots of these parameters on the IEMOCAP dataset and
the EMO-DB dataset, respectively.
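
A hedged sketch of extracting these three utterance-level parameters with librosa is shown below. The exact intensity definition, frame sizes and pitch tracker used here are not specified in the text, so the choices in the sketch (RMS energy as intensity, pYIN for pitch) are only illustrative.

```python
# Illustrative extraction of the three utterance-level prosodic parameters.
import numpy as np
import librosa

def prosodic_parameters(wav_path, fmin=65.0, fmax=400.0):
    """Return [global intensity, std of intensity, std of pitch] for one utterance."""
    y, sr = librosa.load(wav_path, sr=16000)
    rms = librosa.feature.rms(y=y)[0]                       # frame-level intensity proxy
    f0, voiced_flag, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    f0 = f0[~np.isnan(f0)]                                  # keep voiced frames only
    return np.array([rms.mean(), rms.std(), f0.std()])
```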

Fig. 1: Scatter plot of the three prosodic parameters extracted from four
emotions from the IEMOCAP dataset at the utterance level.

Fig. 2: Scatter plot of the three prosodic parameters extracted from four
emotions from the EMO-DB dataset at the utterance level.

From Figure 1, all the observations we had made after manual analysis are
confirmed. The intensity values and the variations in intensity values at an utter-
ance level are quite low in sadness compared to the other emotions. However, the
variations in pitch for sadness are quite high compared to the other emotions. From Figure 2, only the variation pattern in pitch at the utterance level matches with
that of the previous figure. However, the intensity values and the variations in
intensity levels at the utterance level are not significantly different from the other
emotions. This hints at the possibility that variations in pitch can potentially be
an important parameter in distinguishing sad emotion from other emotions, across
the datasets. Regarding the mismatch of the patterns of the other two parame-
ters between the IEMOCAP and EMO-DB datasets, we think that this is because
of the nature of differences between the two corpora. EMO-DB is a simulated
emotion database with 10 sentences enacted by actors in seven pre-determined
emotions/styles. In this situation, both the sentences and the desired emotion are
given to the actors, and they are constrained to articulate those sentences using
an expression that corresponds to the given emotion. As a result, the enacted
emotional speech that is recorded will not completely resemble full-blown emo-
tional speech that is found in real-life scenarios. Even if the actors are extremely
skilled at imitating any style of speech, the constraints imposed are bound to make
the enacted emotions somewhat artificial. On the other hand, for the IEMOCAP
dataset, the constraints are very few. Some of the utterances are impromptu, while
the scripted ones also consist of a wide variety of sentences. Furthermore, the ac-
tors were not constrained to enact any particular style of speech. The choice of
style in the dyadic conversations, whether impromptu or scripted, was totally at
the discretion of the actors. In other words, they could freely use any style of
expression they deemed appropriate for the conversation. This resulted in speech
that is much closer to real-life speech with natural expressions. Therefore, we have
considered the observations of the IEMOCAP to be more reliable as far as analysis
of emotions is concerned.

4.2.1 Proposed two-stage classification scheme based on prosodic features in the first stage

The above analysis prompted us to pursue these three prosodic parameters for
recognition of emotions on both the datasets based on a two-stage framework.
First of all, we use the prosodic parameters to train a classifier that can identify
whether a given speech segment is sad speech or not. In the test case, given a
speech segment, the classifier predicts whether the given speech segment belongs
to the sad emotion. If that is the case, there is no further processing of the speech
and the assigned emotion for the speech is sadness. Otherwise (if it is detected as
non-sad), the speech segment passes through a second stage of processing in which
phase 1 systems are used to classify this non-sad speech into one of the other three
emotions– neutral, happiness and anger. In this stage, we have trained our phase
1 systems based on only the other three emotions (excluding sadness). The entire
classification scheme is shown in Figure 3.
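
The decision logic of Figure 3 can be summarized by the short sketch below, in which the stage-1 binary classifier (an SVM over the three prosodic parameters, as described in the next subsection) and the stage-2 phase-1 system retrained on the other three emotions are assumed to be already trained; names are illustrative.

```python
# Schematic of the proposed two-stage classification; not the authors' code.
def two_stage_predict(prosody_vector, phase1_features, sad_vs_rest, stage2_classify):
    # Stage 1: sad vs not-sad from the utterance-level prosodic parameters
    if sad_vs_rest.predict([prosody_vector])[0] == 'sad':
        return 'sad'                                   # no further processing
    # Stage 2: phase-1 classifier trained on anger, happiness and neutral only
    return stage2_classify(phase1_features)
```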

Fig. 3: Proposed two-stage classification scheme based on prosodic features in the first stage.

4.2.2 Results and discussions

For training the prosody-based classifier, we had attempted different classifiers
like decision trees, logistic regression, support vector machines (SVM) and nearest
neighbour classifiers. Among them, SVMs performed the best. Therefore, we have
taken SVMs as our final classifier in the first stage. We had not explored these
classifiers in phase 1 because our first target was to explore the effectiveness of the
Euclidean distance in identifying emotions. A second reason is that our proposed
techniques based on the two distance metrics in phase 1 were suitable to develop
a combined system based on the epoch features and the MFCC features. While
epoch features are extracted at the subsegmental level, MFCCs are segmental fea-
tures. Therefore, as the frame sizes of these two feature sets are quite different, it
is not possible to concatenate these two feature sets to realize a combined system.
As a result, we had to resort to a score level combination technique, in which the
distance values obtained for each emotion were used as scores. Our proposed classification techniques, which produce exactly such distance values, turned out to be a perfect choice for realizing the fusion of two systems to achieve a combined
system. On the other hand, since the first stage is based on only prosodic features,
there is no need of combining features. Therefore, we simply resorted to standard
and already available classification techniques.

After training, the recognition accuracies of the prosody-based classifier on the two-class (sadness vs others) problem were 79.3% and 94.4% on the IEMOCAP and EMO-DB datasets, respectively. We find that the binary classifier performs far better on the EMO-DB corpus than on IEMOCAP. The scatter plots probably give us some clue as to why EMO-DB performs better. For the EMO-DB corpus, most of the points representing the sad emotion are quite far from
the other emotions, especially anger and happiness. Though neutral and sadness
share some common space, it is more so for the IEMOCAP corpus than for the
EMO-DB corpus. The confusion matrices of the prosody-based classifier on the
IEMOCAP and EMO-DB datasets are shown in Tables 9 and 10 respectively.

Table 9: Confusion matrix for prosody-based classification system for sadness vs other three emotions on the IEMOCAP dataset

Emotions Others (%) Sad (%) TPR (%) FNR (%)


Others 85 15 85 15
Sad 26 74 74 26

Table 10: Confusion matrix for prosody-based classification system for sadness vs other three emotions on the EMO-DB dataset

Emotions Others (%) Sad (%) TPR (%) FNR (%)


Others 96 4 96 4
Sad 15 85 85 15

In phase 1, we had six systems in all (2 distances × 3 feature sets). Out
of those systems, the systems with combined features for both the classification
techniques performed the best. Therefore, for comparison with these systems, we
had considered only those two systems in the second stage of our new two-stage
classifier, namely the systems with combined features, one of which is based on
the KL distance-based method and the other is based on the Euclidean distance-
based method. Therefore, after passing the test speech through the prosody-based
SVM classifier in the first stage, it was determined whether the speech belongs to
the sadness category. If the answer is no, then the speech is passed through these
two systems mentioned above. Therefore, in all, we have two two-stage classifiers,
where the first stage is identical and the second stage is based on one of these
two systems. The recognition accuracies of these two two-stage systems on the
IEMOCAP corpus have been shown in Table 11. We observed that there is an
improvement of 5.2% (Euclidean based classifier in the second stage) and 3.1%
(KL based classifier in the second stage) in the two-stage systems compared to the
corresponding previous systems.

Table 11: Recognition accuracies for the two-stage systems for the four
emotions on the IEMOCAP dataset

Second Stage Classifier Recognition Accuracy (%)


Euclidean 59.6
KL 61.2

The confusion matrices of the two systems have been shown in Tables 12 and 13.
The true positive rate of sadness of the modified system is 74%, which is 3% better
than the best of the previous two direct classifiers. Also, the misclassifications of
angry and happy speech as sad speech are remarkably less compared to phase
1 systems. However, the misclassification rates of neutral speech as sad speech
has not significantly changed. This might be explained by the scatter diagram in
Figure 1, in which we can observe that sad speech is very well-separable in the
three-dimensional space from anger and happiness. However, their separability
from neutral speech is not much as many instances of neutral and sad speech are
close to one another. However, in all, compared to phase 1 systems, the percentage
of the other emotions being misclassified as sadness has been drastically reduced,
except for the neutral emotion. Also, more instances of sad speech get classified
correctly. This has given rise to better overall recognition accuracies in the entire
modified systems.

Table 12: Confusion matrix for the two stage system (Euclidean based
technique in the second stage) on the IEMOCAP dataset

Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 69 13 12 6 69 31
Happy 20 53 16 11 53 47
Neutral 16 15 42 28 42 58
Sad 8 8 10 74 74 26

Table 13: Confusion matrix for the two stage system (KL distance based
technique in the second stage) on the IEMOCAP dataset

Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 73 13 9 5 73 27
Happy 20 57 16 7 57 43
Neutral 17 14 43 26 43 57
Sad 4 10 12 74 74 26

Similarly, we use the same two-stage classifiers on the EMO-DB corpus. The
recognition accuracies of the systems are shown in Table 14. It is observed that
the improvements in these systems compared to the old systems are drastic. For
the Euclidean-distance based system in the second stage, we get an improvement
of 12.6%. For the other system, we get an improvement of 11.2%. Our proposed
systems slightly outperform the system in [7] using almost similar features, which
had an accuracy of 76%. Also, an important point worth reiterating here is that
the accuracy of the prosody-based systems in the first stage was more than 90%.
This might be the reason behind the drastic improvements in the performance
of the phase 2 systems compared to the phase 1 systems. Thus, this gives us a
motivation for using prosodic features in this way for conducting a study on more
emotions in a future work.

Table 14: Recognition accuracies for the two-stage systems for the four
emotions on the EMO-DB dataset

Second Stage Classifier Recognition Accuracy (%)


Euclidean 76.7
KL 77

Table 15: Confusion matrix for the two stage system (Euclidean based
technique in the second stage) on the EMO-DB dataset

Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 83 8 6 3 83 17
Happy 21 56 13 9 56 44
Neutral 6 4 78 12 78 22
Sad 3 5 7 85 85 15

Table 16: Confusion matrix for the two stage system (KL distance based
technique in the second stage) on the EMO-DB dataset

Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 83 8 5 4 83 17
Happy 21 57 14 8 57 43
Neutral 5 5 79 11 79 21
Sad 3 5 7 85 85 15

Tables 15 and 16 represent the confusion matrices for the two two-stage systems
based on the EMO-DB dataset. The misclassification rates of other emotions as sadness have been greatly reduced. Also, the recognition accuracy of
sad speech has dramatically improved because of the prosodic processing in the
first stage. However, despite these improvements, the confusion between anger and
happiness is still high compared to the confusions between other pairs of emotions.
Therefore, we attempt to find out ways to handle this problem.

4.3 Phase-3

We had already remarked that though sad speech could be identified far better
using the systems in phase 2, which ultimately led to dramatic improvements in
the overall SER accuracy, especially in the EMO-DB corpus, the confusion between
anger and happiness was still high compared to the confusions between other pairs
of emotions. Therefore, we decided to explore other features which might possibly
alleviate this problem. Also in phase 2, which was based on a time-domain analysis
(prosodic features were derived from the time domain), we did not find much difference between anger and happiness. We also contemplated the fact that in audio processing tasks, especially for the modification of audio signals, most of the manipula-
tions are done on the magnitude/power spectrum itself. As an example, modifying
parts of the power spectra in music often leads to different effects in the modified
audio. Enhancing the lowermost frequencies (also called bass) makes drum beats
in audio more prominent. If parts of a music segment are abnormally shrill with
occasional hissings at different regions, then those deficiencies are compensated by
reducing magnitudes of the high frequencies (also called treble). Enhancing the
mid-range frequencies can sometimes make dialogues clearer. In other words, ma-
nipulating the power spectrum has the potential to change the entire profile of an
audio and how it sounds to a listener. Even in speech, changing the shape of the
power spectra can sometimes result in creating an illusion that a different speaker
is speaking. Motivated by these facts, we hypothesized that different emotions
should also have different patterns in the general shape of their power spectra. To
verify this, we decided to manually observe the magnitude spectra for different
emotions from the two datasets for a preliminary analysis. We observed that some
differences are such that a mere inspection of the shape of the magnitude spectra
is enough to distinguish between emotions. Figure 4 shows sample power spectra
of the four emotions based on a particular utterance in the EMO-DB dataset.

Fig. 4: Sample power spectra of four emotions for a particular utterance in the EMO-DB dataset.

It can be seen from the figure (and from the other samples we inspected) that the
power contained in angry and happy speech is considerably higher than that in the
other two emotions, except in the extremely low-frequency region (0-500 Hz) and the
high-frequency region (roughly 4-5 kHz onwards). We can therefore conclude that the
power spectrum may be used to classify emotions at least into two categories: type 1,
representing anger and happiness, and type 2, representing neutral and sadness. This
should be possible if we can characterize the overall shape of the power spectrum
using suitable spectral features. Therefore, we carried out a band-wise statistical
analysis of the spectra of the different emotions. We divided the spectrum of each
speech signal into non-overlapping bands and computed the mean (geometric mean)
energy of each band. We experimented with different band configurations (varying
band widths) and plotted the mean energies of each band in a box plot to estimate
the usefulness of that band in distinguishing among the emotions. We also analyzed
low to high-frequency energy ratios and found these ratios to be useful as well.
After experimenting with different bandwidths for these ratios, we arrived at two
particularly useful features: (i) a narrowband ratio (0-0.1 kHz to 7.9-8 kHz) and
(ii) a wideband ratio (0-4 kHz to 4-8 kHz). Similarly, after experimenting with
different bands, we arrived at a set of bands that we expected to be useful for our
task. A box plot of the energies of a sample band (1-1.5 kHz) is shown in Figure 5.
From this plot, we observe that the median values of all the emotions are quite far
apart from one another, and the overlap among the emotions is small. This suggests
that this band might be useful for distinguishing all four emotions from one another.
Box plots of the two low to high frequency energy ratios for the different emotions
are shown in Figures 6 and 7.

Fig. 5: Box Plot of energies of a sample band (1-1.5 kHz)

Fig. 6: Box Plot of low to high frequency energy ratio (wide band)

Fig. 7: Box Plot of low to high frequency energy ratio (narrow band)

Our summarized observations for the bands found to be potentially useful for
the SER task are shown in Table 17. We observed that the high frequencies
(greater than 5 kHz) were not very useful, as the energies of all the emotions
slowly fade out in the high-frequency region. However, the bands below 5 kHz hold
significant distinctive information. For example, the lowermost frequencies, up
to 80 Hz, are potentially important for distinguishing between the type 1 and type 2
emotions outlined above. For the 80-250 Hz band, sadness and neutral have been
found to have distinct patterns from the other two emotions, and sadness and
neutral also seem to be well distinguishable from one another. The next band (250-600
Hz) might be useful for detecting only sad speech against the other emotions (see
Table 17). Five separate bands starting at 1 kHz and ending at 3.5 kHz might be
useful for distinguishing each emotion from every other; these five bands are likely
to be the most useful among all the features we plan to use. The next three bands
(3.5-5 kHz) might help us distinguish anger from the other emotions.

Table 17: Observations on energies in different bands

Bands (kHz) Observations


Up to 0.08 Sadness and neutral are distinct from the other emotions
0.08-0.25 Sadness and neutral are distinct from the other emotions
0.25-0.6 Sadness is distinct from the other emotions
0.6-1 Sadness and neutral are distinct from the other emotions
1-1.5 All are distinct from one another
1.5-2 All are distinct from one another
2-2.5 All are distinct from one another
2.5-3 All are distinct from one another
3-3.5 All are distinct from one another
3.5-4 Anger is distinct from the other emotions
4-4.5 Anger is distinct from the other emotions
4.5-5 Anger is distinct from the other emotions

Similarly, Table 18 shows our summarized observations for the ratios of
low-frequency to high-frequency energy.

Table 18: Observations on low to high frequency energy ratios (narrow band and wide band)

Pairs of bands (kHz) Observations


1-4 and 4-8 Sadness is distinct from the other emotions
0-0.1 and 7.9-8 Sadness and neutral are distinct from the other emotions

From the above analysis, we can surmise that using the above spectral param-
eters as features with a good classifier might result in a good SER system. From
Table 17, we obtain 12 features from the 12 bands. From Table 18, we obtain two
features. Therefore, we use a total of 14 features for our SER task. A sketch of
how such features might be computed is given below, after which we explain our
classification scheme.
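As an illustration only (this is not the authors' implementation), the following Python sketch computes the 14 features described above: geometric-mean energies for the 12 bands of Table 17 and the two low to high frequency energy ratios of Table 18. The use of a single whole-utterance FFT, the flooring constant, and the file name utterance.wav are assumptions made for the example; any standard framing and averaging scheme could be substituted.

import numpy as np
from scipy.io import wavfile

# Band edges (Hz) following Table 17; the last two features are the
# wideband (0-4 kHz / 4-8 kHz) and narrowband (0-0.1 kHz / 7.9-8 kHz)
# low to high frequency energy ratios of Table 18.
BAND_EDGES_HZ = [(0, 80), (80, 250), (250, 600), (600, 1000),
                 (1000, 1500), (1500, 2000), (2000, 2500), (2500, 3000),
                 (3000, 3500), (3500, 4000), (4000, 4500), (4500, 5000)]


def geometric_mean_band_energy(freqs, power, lo, hi):
    """Geometric mean of the power spectrum within [lo, hi) Hz."""
    mask = (freqs >= lo) & (freqs < hi)
    values = power[mask] + 1e-12           # floor to avoid log(0)
    return float(np.exp(np.mean(np.log(values))))


def power_spectral_features(signal, fs):
    """14-dimensional feature vector: 12 band energies + 2 energy ratios."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    feats = [geometric_mean_band_energy(freqs, power, lo, hi)
             for lo, hi in BAND_EDGES_HZ]
    wide = (geometric_mean_band_energy(freqs, power, 0, 4000)
            / geometric_mean_band_energy(freqs, power, 4000, 8000))
    narrow = (geometric_mean_band_energy(freqs, power, 0, 100)
              / geometric_mean_band_energy(freqs, power, 7900, 8000))
    return np.array(feats + [wide, narrow])


if __name__ == "__main__":
    fs, x = wavfile.read("utterance.wav")   # hypothetical 16 kHz mono utterance
    print(power_spectral_features(x.astype(float), fs))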

4.3.1 Proposed two-stage classification system based on power spectral features in the first stage

From Table 17, we observe that some bands are potentially very good at
distinguishing all four emotions from one another (rows 5-9). The table also hints
at the possibility that, taken together, all the bands can distinguish well between
type 1 and type 2 emotions. Therefore, we proposed a two-stage framework similar to
that of phase 2. It differs from the phase 2 systems in that the first stage uses
the 14 features related to the power spectrum, and type 2 includes both sadness and
neutral instead of only sadness as in phase 2. The entire scheme is shown in
Figure 8, and a sketch of the corresponding decision flow is given after the figure caption.

Fig. 8: Proposed two-stage classification scheme based on power spectral features in the first stage
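As a rough sketch of the decision flow in Figure 8 (not the authors' code), the first stage can be a binary SVM trained on the 14 power spectral features, and the second stage routes the utterance to the phase 1 distance-based classifier trained on the emotions of the predicted type. The scikit-learn SVC with an RBF kernel, the helper names train_first_stage and classify_two_stage, and the callables type1_classifier and type2_classifier are assumptions made for illustration.

from sklearn.svm import SVC


def train_first_stage(features, type_labels):
    """Stage 1: binary SVM separating type 1 (anger/happiness)
    from type 2 (neutral/sadness) using the 14 power spectral features."""
    svm = SVC(kernel="rbf")              # kernel choice is an assumption
    svm.fit(features, type_labels)       # type_labels in {1, 2}
    return svm


def classify_two_stage(feat_vec, stage1_svm, type1_classifier, type2_classifier):
    """Stage 2: route the utterance to the distance-based classifier
    trained only on the emotions of the predicted type."""
    emotion_type = stage1_svm.predict([feat_vec])[0]
    if emotion_type == 1:
        return type1_classifier(feat_vec)    # -> "anger" or "happiness"
    return type2_classifier(feat_vec)        # -> "neutral" or "sadness"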

4.3.2 Results and discussions

First, before implementing the two-stage system just described, we decided to assess
the effectiveness of these 14 features in identifying all four emotions. Using these
14 features, we therefore developed a system that classifies a given speech utterance
into one of the four emotions. For that purpose, we trained an SVM classifier on the
four emotions for each of the two datasets. We obtained accuracies of 57.5% on the
IEMOCAP corpus and 81% on the EMO-DB corpus. The accuracy on the IEMOCAP dataset is
comparable to those of the phase 1 systems: it is better than the Euclidean
distance-based technique by 3.1%, while the KL distance-based technique outperforms this system
by only 0.6%. However, for the EMO-DB corpus, the improvements over the phase 1
systems are dramatic: this system outperforms the Euclidean distance-based and KL
distance-based systems by 16.9% and 15.2%, respectively. The confusion matrices of
the systems developed on the two datasets are shown in Tables 19 and 20. It can be
observed from the two tables that, at a gross level, the confusions between type 1
and type 2 emotions are small, while intra-type confusions are still high. This
problem can potentially be addressed using the two-stage scheme described above.
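For illustration, a one-stage system of this kind might be trained and evaluated along the following lines. The use of scikit-learn's multi-class SVC with an RBF kernel, the 80/20 train/test split, the helper name evaluate_one_stage, and the placeholder data are assumptions; the actual partitioning and SVM settings used in this work may differ.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix


def evaluate_one_stage(features, labels, test_size=0.2, seed=0):
    """Train a four-class SVM on the 14 power spectral features and
    report accuracy and the confusion matrix on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=test_size, random_state=seed,
        stratify=labels)
    svm = SVC(kernel="rbf").fit(X_train, y_train)
    y_pred = svm.predict(X_test)
    return accuracy_score(y_test, y_pred), confusion_matrix(y_test, y_pred)


# Example with random placeholder data (real features would come from the
# 14-dimensional extractor sketched earlier).
rng = np.random.default_rng(0)
X = rng.random((200, 14))
y = rng.integers(0, 4, size=200)          # 0: anger, 1: happy, 2: neutral, 3: sad
acc, cm = evaluate_one_stage(X, y)
print(acc)
print(cm)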

Table 19: Confusion matrix of the power spectrum-based classification system on all four emotions of the IEMOCAP dataset

Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 49 44 1 5 49 51
Happy 13 59 3 29 59 41
Neutral 3 8 62 27 62 38
Sad 1 8 31 60 60 40

Table 20: Confusion matrix of the power spectrum-based classification system on all four emotions of the EMO-DB dataset

Emotions Anger (%) Happy (%) Neutral (%) Sad (%) TPR (%) FNR (%)
Anger 91 9 0 0 91 9
Happy 44 50 6 0 50 50
Neutral 2 3 84 11 84 16
Sad 1 1 5 93 93 7

However, before implementing this two-stage scheme, we decided to see how well these
power spectral features can identify the emotion types (type 1/type 2) rather than
the actual emotions. Therefore, we trained two binary SVM classifiers, one on each of
the two datasets. These classifiers yielded accuracies of 78.3% on the IEMOCAP
dataset and 97.6% on the EMO-DB dataset. The confusion matrices of the two systems
are shown in Tables 21 and 22. It is noticeable that for the type 1 emotions of the
EMO-DB dataset, the accuracy is 100%. For the IEMOCAP dataset, although the
identification accuracy for type 1 emotions is only 69%, that for the type 2 emotions
is as high as 84%. Since these features can identify the two emotion types reasonably
well, we are more confident of obtaining a satisfactory performance from the proposed
two-stage systems.

Table 21: Confusion matrix of the power spectrum-based classification system on the two emotion types of the IEMOCAP dataset

Emotions Type 1 (%) Type 2 (%)


Type 1 69 31
Type 2 16 84

Table 22: Confusion matrix of the power spectrum-based classification system on the two emotion types of the EMO-DB dataset

Emotions Type 1 (%) Type 2 (%)


Type 1 100 0
Type 2 6 94

Based on our proposed two-stage architecture, we developed two systems. In the first
stage of both systems, we used the binary classifiers mentioned above to decide
whether a given test utterance belongs to a type 1 or a type 2 emotion. The second
stage is based on the combined features (MFCC and epoch features): one of the systems
uses the proposed Euclidean distance-based technique and the other uses the proposed
KL distance-based technique. On the IEMOCAP dataset, the accuracies are 60% with the
Euclidean distance-based technique in the second stage and 64.4% with the KL
distance-based technique in the second stage (Table 23). While the former outperforms
the one-stage system (based on the power spectral features) by 2.5%, the latter
outperforms that system by 6.9%.

Table 23: Recognition accuracies of the two-stage systems (phase 3) for the four emotions on the IEMOCAP dataset

Second Stage Classifier Recognition Accuracy (%)


Euclidean 60
KL 64.4

The corresponding confusion matrices have been shown in Tables 24 and 25.
It can be observed that while the first stage resulted in a dramatic reduction in
the confusion between type 1 and type 2 emotions, there is also a slight reduction
of intra-type confusions. This has increased overall accuracies for both systems.

Table 24: Confusion matrix of the two-stage system (phase 3) on the IEMOCAP dataset with the Euclidean distance-based technique in the second stage

Emotions Anger (%) Happiness (%) Neutral (%) Sadness (%) TPR (%) FNR (%)
Anger 55 35 1 9 55 45
Happiness 14 59 1 26 59 41
Neutral 3 2 63 32 63 37
Sadness 1 9 29 61 61 39

Table 25: Confusion matrix of the two-stage system (phase 3) on the IEMOCAP dataset with the KL distance-based technique in the second stage

Emotions Anger (%) Happiness (%) Neutral (%) Sadness (%) TPR (%) FNR (%)
Anger 63 34 1 2 63 37
Happiness 20 56 1 23 56 44
Neutral 6 1 71 22 71 29
Sadness 2 4 34 60 60 40

For the EMO-DB corpus, the accuracy of the two-stage system with the Euclidean
distance-based technique in the second stage is 83.4%, which is better than that of
the one-stage system (based on the power spectral features) by 2.4%. For the system
with the KL distance-based technique in the second stage, the accuracy is 85.6%,
which is better by 4.6% (Table 26).

Table 26: Recognition accuracies of the two-stage systems (phase 3) for the four emotions on the EMO-DB dataset

Second Stage Classifier Recognition Accuracy (%)


Euclidean 83.4
KL 85.6

The confusion matrices of the two systems are shown in Tables 27 and 28. We observe
the same trends as for the IEMOCAP dataset.

Table 27: Confusion matrix of the two-stage system (phase 3) based on the EMO-DB dataset with the Euclidean distance-based technique in the second stage

Emotions Anger (%) Happiness (%) Neutral (%) Sadness (%) TPR (%) FNR (%)
Anger 88 12 0 0 88 12
Happiness 29 71 0 0 71 29
Neutral 1 2 79 18 79 21
Sadness 1 1 4 94 94 6

Table 28: Confusion matrix of the two-stage system (phase 3) based on the EMO-DB dataset with the KL distance-based technique in the second stage

Emotions Anger (%) Happiness (%) Neutral (%) Sadness (%) TPR (%) FNR (%)
Anger 97 3 0 0 97 3
Happiness 35 65 0 0 65 35
Neutral 3 2 84 11 84 16
Sadness 2 0 10 88 88 12

5 Final remarks

After a detailed exposition of the methods used and the results obtained in all the
phases, let us now summarize the results and present the overall performance of the
systems of all the phases. We take the system based on only MFCCs (in phase 1) as
the baseline and compare the improved systems of the later phases against it. For
each dataset, there are two candidate baselines: one based on the Euclidean
distance-based technique and the other on the KL distance-based technique. For each
dataset, we choose the better-performing of the two as the baseline.
The summarized performance is shown in Tables 29 and 30. It can be seen from both
tables that when the power spectral features with SVMs are used in the first stage
and the combined features with the KL distance-based technique are used in the second
stage, we obtain dramatic improvements: as high as 10.6% and 21.8% (absolute) over
the baseline classifiers for the IEMOCAP and EMO-DB datasets, respectively. Moreover,
our best system (last row in Table 30) outperforms the recent work by Kadiri et al.
[7] on the EMO-DB corpus by 9.6%.

Table 29: Performance comparison of the systems developed for the IEMOCAP dataset

Phase First stage classifier Second stage classifier Accuracy (%) Improvement(%)
1 MFCC-based (Euclidean) NA 53.8 0 (baseline)
1 Combined features (Euclidean) NA 54.4 0.6
1 Combined features (KL) NA 58.1 4.3
2 Prosody (Sad vs others) Combined features (Euclidean) 59.6 5.8
2 Prosody (Sad vs others) Combined features (KL) 61.2 7.4
3 Power spectrum (all four emotions) NA 57.5 3.7
3 Power spectrum (type 1 vs type 2) Combined features (Euclidean) 60 6.2
3 Power spectrum (type 1 vs type 2) Combined features (KL) 64.4 10.6

Table 30: Performance comparison of the systems developed for the EMO-
DB dataset

Phase First stage classifier Second stage classifier Accuracy (%) Improvement (%)
1 MFCC (KL) NA 63.8 0 (baseline)
1 Combined features (Euclidean) NA 64.1 0.3
1 Combined features (KL) NA 65.8 2
2 Prosody (Sad vs others) Combined features (Euclidean) 76.7 12.9
2 Prosody (Sad vs others) Combined features (KL) 77 13.2
3 Power spectrum (all four emotions) NA 81 17.2
3 Power spectrum (type 1 vs type 2) Combined features (Euclidean) 83.4 19.6
3 Power spectrum (type 1 vs type 2) Combined features (KL) 85.6 21.8

Since the power spectrum features turn out to be promising, we intend to explore these features further in a future work.

6 Summary and conclusions

We started this work by exploring the effectiveness of features related to GCI in
identifying emotions from speech. For this purpose, we proposed two distance-based
classification techniques, which were applied to the IEMOCAP and EMO-DB datasets;
four emotions were considered in this study. To improve our SER systems, we coupled
the epoch features with the state-of-the-art MFCC features. This was phase 1 of our
work. In phase 2, we developed hierarchical SER systems with two stages. The first
stage was based on prosody and identified whether a given speech segment belongs to
the sad emotion. The second stage was based
on our phase 1 techniques trained on all the other emotions, excluding sadness.
The performance of these systems was considerably better than that of the phase 1
systems. However, we observed that the confusion between happiness and anger was
still high. Therefore, in phase 3, we developed another set of two-stage SER systems
in which the first stage was based on power spectral features and identified whether
a given speech utterance belongs to a type 1 (anger and happiness) or a type 2
(neutral and sadness) emotion. Then, in the second stage, our phase 1 classifiers
further classified the utterance into a distinct emotion using the emotion-type
information obtained from the first stage. The performance of these phase 3 systems
improved dramatically: the best of them outperformed the best of our phase 1 systems
by about 10% for the IEMOCAP corpus and by about 20% for the EMO-DB corpus, and it
also outperformed a very recent and related work [7] by 9.6% for the EMO-DB corpus.
The power spectral features thus seem quite promising, and we plan to explore them
further in the future.

7 Data Availability

Two datasets have been used in this work. One of these is the German EMO-DB
corpus. It is freely available at http://emodb.bilderbar.info/index-1280.html.
The other one is the English IEMOCAP dataset, which is freely available at https://sail.usc.edu/iemocap/.

8 Funding and/or Conflicts of interests/Competing interests

The authors declare that they have no known competing financial interests or
personal relationships that could have appeared to influence the work reported in
this paper.

References

1. A. Satt, S. Rozenberg, and R. Hoory, “Efficient emotion recognition from speech using
deep learning on spectrograms.” in Interspeech, pp. 1089–1093, 2017.
2. G. Fant, “The source filter concept in voice production,” STL-QPSR, vol. 1, no. 1981, pp.
21–37, 1981.
3. S. G. Koolagudi, R. Reddy, and K. S. Rao, “Emotion recognition from speech signal using
epoch parameters,” in 2010 international conference on signal processing and communi-
cations (SPCOM). IEEE, pp. 1–5, 2010.
4. P. Gangamohan, S. R. Kadiri, S. V. Gangashetty, and B. Yegnanarayana, “Excitation
source features for discrimination of anger and happy emotions,” in Fifteenth Annual
Conference of the International Speech Communication Association, 2014.
5. S. R. Kadiri, P. Gangamohan, S. V. Gangashetty, and B. Yegnanarayana, “Analysis of ex-
citation source features of speech for emotion recognition,” in Sixteenth Annual Conference
of the International Speech Communication Association, 2015.
6. A. Haque and K. S. Rao, “Modification of energy spectra, epoch parameters and prosody
for emotion conversion in speech,” International Journal of Speech Technology, vol. 20,
no. 1, pp. 15–25, 2017.
7. S. R. Kadiri, P. Gangamohan, S. V. Gangashetty, P. Alku, and B. Yegnanarayana, “Ex-
citation features of speech for emotion recognition using neutral speech as reference,”
Circuits, Systems, and Signal Processing, vol. 39, no. 9, pp. 4459–4481, 2020.
8. V. K. Mittal and B. Yegnanarayana, “Effect of glottal dynamics in the production of
shouted speech,” The Journal of the Acoustical Society of America, vol. 133, no. 5, pp.
3050–3061, 2013.
9. C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee,
and S. S. Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,”
Language resources and evaluation, vol. 42, no. 4, pp. 335–359, 2008.
10. F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, “A database of
german emotional speech,” in Ninth European Conference on Speech Communication and
Technology, 2005.
11. S. G. Koolagudi and K. S. Rao, “Emotion recognition from speech: a review,” International
journal of speech technology, vol. 15, no. 2, pp. 99–117, 2012.
12. K. E. Cummings and M. A. Clements, “Analysis of the glottal excitation of emotionally
styled and stressed speech,” The Journal of the Acoustical Society of America, vol. 98,
no. 1, pp. 88–98, 1995.
13. S. G. Koolagudi, R. Reddy, and K. S. Rao, “Emotion recognition from speech signal using
epoch parameters,” in 2010 international conference on signal processing and communi-
cations (SPCOM). IEEE, pp. 1–5, 2010.
14. P. Gangamohan, S. R. Kadiri, and B. Yegnanarayana, “Analysis of emotional speech at
subsegmental level.” in INTERSPEECH, 2013.
15. M. Lugger and B. Yang, “The relevance of voice quality features in speaker independent
emotion recognition,” in 2007 IEEE International Conference on Acoustics, Speech and
Signal Processing-ICASSP’07, vol. 4. IEEE, pp. IV–17, 2007.
16. J. H. Jeon, R. Xia, and Y. Liu, “Sentence level emotion recognition based on decisions
from subsentence segments,” in 2011 IEEE international conference on acoustics, speech
and signal processing (ICASSP). IEEE, pp. 4940–4943, 2011.
17. T. L. Nwe, F. S. Wei, and L. C. De Silva, “Speech based emotion classification,” in Proceed-
ings of IEEE Region 10 International Conference on Electrical and Electronic Technology.
TENCON 2001 (Cat. No. 01CH37239), vol. 1. IEEE, pp. 297–301, 2001.
18. C. M. Lee, S. Yildirim, M. Bulut, A. Kazemzadeh, C. Busso, Z. Deng, S. Lee, and
S. Narayanan, “Emotion recognition based on phoneme classes,” in Eighth International
Conference on Spoken Language Processing, 2004.
19. D. Bitouk, R. Verma, and A. Nenkova, “Class-level spectral features for emotion recogni-
tion,” Speech communication, vol. 52, no. 7-8, pp. 613–625, 2010.
20. S. G. Koolagudi, A. Barthwal, S. Devliyal, and K. S. Rao, “Real life emotion classification
using spectral features and gaussian mixture models,” Procedia engineering, vol. 38, pp.
3892–3899, 2012.
21. V. Petrushin, “Emotion in speech: Recognition and application to call centers,” in Pro-
ceedings of artificial neural networks in engineering, vol. 710. Citeseer, p. 22, 1999.
22. C. M. Lee, S. Narayanan, and R. Pieraccini, “Recognition of negative emotions from the
speech signal,” in IEEE Workshop on Automatic Speech Recognition and Understanding,
2001. ASRU’01. IEEE, pp. 240–243, 2001.
23. Y.-h. Kao and L.-s. Lee, “Feature analysis for emotion recognition from mandarin speech
considering the special characteristics of chinese language,” in Ninth International Con-
ference on Spoken Language Processing, 2006.
24. T. Iliou and C.-N. Anagnostopoulos, “Statistical evaluation of speech features for emotion
recognition,” in 2009 Fourth International Conference on Digital Telecommunications.
IEEE, pp. 121–126, 2009.
25. S. G. Koolagudi, S. Devliyal, B. Chawla, A. Barthwal, and K. S. Rao, “Recognition of
emotions from speech using excitation source features,” Procedia engineering, vol. 38, pp.
3409–3417, 2012.
26. J. Nicholson, K. Takahashi, and R. Nakatsu, “Emotion recognition in speech using neural
networks,” Neural computing & applications, vol. 9, no. 4, pp. 290–296, 2000.
27. E. Bozkurt, E. Erzin, Ç. E. Erdem, and A. T. Erdem, “Improving automatic emotion
recognition from speech signals,” in Tenth Annual Conference of the International Speech
Communication Association, 2009.
28. V. Rozgic, S. Ananthakrishnan, S. Saleem, R. Kumar, A. N. Vembu, and R. Prasad, “Emo-
tion recognition using acoustic and lexical features,” in Thirteenth Annual Conference of
the International Speech Communication Association, 2012.
29. O. Martin, I. Kotsia, B. Macq, and I. Pitas, “The enterface’05 audio-visual emo-
tion database,” in 22nd International Conference on Data Engineering Workshops
(ICDEW’06). IEEE, 2006.
30. P. Jackson and S. Haq, “Surrey audio-visual expressed emotion (savee) database,” Uni-
versity of Surrey: Guildford, UK, 2014.
31. H. M. Fayek, M. Lech, and L. Cavedon, “Towards real-time speech emotion recognition
using deep neural networks,” in 2015 9th international conference on signal processing
and communication systems (ICSPCS). IEEE, pp. 1–5, 2015.
32. Y. Zhao, X. Jin, and X. Hu, “Recurrent convolutional neural network for speech process-
ing,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, pp. 5300–5304, 2017.
33. S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion recognition using
recurrent neural networks with local attention,” in 2017 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 2227–2231, 2017.
34. P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou, “End-to-end
multimodal emotion recognition using deep neural networks,” IEEE Journal of Selected
Topics in Signal Processing, vol. 11, no. 8, pp. 1301–1309, 2017.
35. M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Torres Torres, S. Scherer,
G. Stratou, R. Cowie, and M. Pantic, “Avec 2016: Depression, mood, and emotion recog-
nition workshop and challenge,” in Proceedings of the 6th international workshop on au-
dio/visual emotion challenge, pp. 3–10, 2016.
36. J. Zhao, X. Mao, and L. Chen, “Learning deep features to recognise speech emotion using
merged deep cnn,” IET Signal Processing, vol. 12, no. 6, pp. 713–721, 2018.
37. S. Sahu, R. Gupta, G. Sivaraman, W. AbdAlmageed, and C. Espy-Wilson, “Adversarial
auto-encoders for speech based emotion recognition,” arXiv preprint arXiv:1806.02146,
2018.
38. S. E. Eskimez, Z. Duan, and W. Heinzelman, “Unsupervised learning approach to feature
analysis for automatic speech emotion recognition,” in 2018 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, pp. 5099–5103,
2018.
39. L. Li, Y. Zhao, D. Jiang, Y. Zhang, F. Wang, I. Gonzalez, E. Valentin, and H. Sahli,
“Hybrid deep neural network–hidden markov model (dnn-hmm) based speech emotion
recognition,” in 2013 Humaine association conference on affective computing and intelli-
gent interaction. IEEE, pp. 312–317, 2013.
40. B. Xia and C. Bao, “Speech enhancement with weighted denoising auto-encoder.” in In-
terspeech, pp. 3444–3448, 2013.
41. Z. Zhang, F. Ringeval, J. Han, J. Deng, E. Marchi, and B. Schuller, “Facing realism in
spontaneous emotion recognition from speech: Feature enhancement by autoencoder with
lstm neural networks,” in Proceedings INTERSPEECH 2016, 17th Annual Conference of
the International Speech Communication Association (ISCA), pp. 3593–3597, 2016.
42. N. Cummins, S. Amiriparian, G. Hagerer, A. Batliner, S. Steidl, and B. W. Schuller,
“An image-based deep spectrum feature representation for the recognition of emotional
speech,” in Proceedings of the 25th ACM international conference on Multimedia, pp.
478–484, 2017.
43. S. Steidl, Automatic classification of emotion related user states in spontaneous children’s
speech. Logos-Verlag, 2009.
44. F. Weninger, F. Eyben, and B. Schuller, “Single-channel speech separation with memory-
enhanced recurrent neural networks,” in 2014 IEEE international conference on acoustics,
speech and signal processing (ICASSP). IEEE, pp. 3709–3713, 2014.
45. F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers,
J. Epps, P. Laukka, S. S. Narayanan et al., “The geneva minimalistic acoustic parameter
set (gemaps) for voice research and affective computing,” IEEE transactions on affective
computing, vol. 7, no. 2, pp. 190–202, 2015.
46. F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in opensmile,
the munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM
international conference on Multimedia, pp. 835–838, 2013.
47. S. Amiriparian, M. Gerczuk, S. Ottl, N. Cummins, S. Pugachevskiy, and B. Schuller,
“Bag-of-deep-features: Noise-robust deep feature representations for audio analysis,” in
2018 International Joint Conference on Neural Networks (IJCNN). IEEE, pp. 1–7,
2018.
48. K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from speech signals,” IEEE
Transactions on Audio, Speech, and Language Processing, vol. 16, no. 8, pp. 1602–1613,
2008.
49. K. S. Rao, S. M. Prasanna, and B. Yegnanarayana, “Determination of instants of signifi-
cant excitation in speech using hilbert envelope and group delay function,” IEEE Signal
Processing Letters, vol. 14, no. 10, pp. 762–765, 2007.
