
Expert Systems With Applications 69 (2017) 149–158

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

A new hybrid PSO assisted biogeography-based optimization for emotion and stress recognition from speech signal

Yogesh C.K. a,∗, M. Hariharan b, Ruzelita Ngadiran a, Abdul Hamid Adom b, Sazali Yaacob c, Chawki Berkai b, Kemal Polat d

a School of Computer and Communication Engineering, Universiti Malaysia Perlis (UniMAP), Campus Pauh Putra, 02600 Arau, Perlis, Malaysia
b School of Mechatronic Engineering, Universiti Malaysia Perlis (UniMAP), Campus Pauh Putra, 02600 Arau, Perlis, Malaysia
c Universiti Kuala Lumpur Malaysian Spanish Institute, Kulim Hi-Tech Park, 09000 Kulim, Kedah, Malaysia
d Department of Electrical and Electronics Engineering, Faculty of Engineering and Architecture, Abant Izzet Baysal University, 14280 Bolu, Turkey

a r t i c l e   i n f o

Article history:
Received 11 May 2016
Revised 25 August 2016
Accepted 16 October 2016
Available online 17 October 2016

Keywords:
Speech signals
Emotions
Feature extraction
Feature selection and emotion recognition

a b s t r a c t

Speech signals and glottal signals convey the speaker's emotional state along with linguistic information. Recognizing a speaker's emotion and responding to it expressively is very important for human-machine interaction. To develop a subject independent speech emotion/stress recognition system that identifies the speaker's emotion from the voice, this work proposes features from the OpenSmile toolbox, higher order spectral features and a feature selection algorithm. Feature selection plays an important role in overcoming the challenge of dimensionality in several applications. This paper proposes a new particle swarm optimization assisted biogeography-based algorithm for feature selection. The simulations were conducted using the Berlin Emotional Speech Database (BES), the Surrey Audio-Visual Expressed Emotion Database (SAVEE) and Speech under Simulated and Actual Stress (SUSAS), and were also validated using eight benchmark datasets of different dimensions and numbers of classes. In total, eight different experiments were conducted, and recognition rates in the ranges of 90.31%–99.47% (BES database), 62.50%–78.44% (SAVEE database) and 85.83%–98.70% (SUSAS database) were obtained. The obtained results convincingly prove the effectiveness of the proposed feature selection algorithm when compared with previous works and other metaheuristic algorithms (BBO and PSO).

© 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Human beings use different forms of gestures, facial expressions, body language and speech for communication. These communications convey both messages and the emotional states of the speakers (Gangamohan, Kadiri, & Yegnanarayana, 2016). Human beings have a natural ability to understand speakers' emotions from their speech signals. A robust speech emotion/stress recognition system (ERS) aims at automatically identifying the human emotional state from his or her voice. A speech signal carries the speaker's linguistic information and also his/her gender, age, origin and emotional state (Garvin & Ladefoged, 1963). Such systems have made several potential impacts on human computer interfaces (HCI) (Calvo & D'Mello, 2010; El Ayadi, Kamel, & Karray, 2011). An automatic ERS has been used in various real time applications to analyse and detect callers' emotions in call centres (Neiberg & Elenius, 2008), to treat mental disorders in diagnostic settings (Kostoulas et al., 2012) and to detect Parkinson's and Alzheimer's diseases (Lopez-de-Ipiña et al., 2013; Zhao, Rudzicz, Carvalho, Márquez-Chin, & Livingstone, 2014). Further, ERS is used in applications like lie detection systems, learning environments, and the development of educational, entertainment, and games software (Petrushin, 2000).

ERS usually comprises three stages: pre-processing, feature extraction and classification/recognition of emotion from speech signals. Stage two is the most important, since the extracted features should have a strong impact and the ability to represent the different emotions present in speech signals (Cowie & Cornelius, 2003). To develop an ERS, features are usually extracted from both speech and glottal waveforms (Eyben, Batliner, Schuller, Seppi, & Steidl, 2010; Iliev, Scordilis, Papa, & Falcão, 2010; Sun, Moore, & Torres, 2009; Sundberg, Patel, Björkner, & Scherer, 2011).

∗ Corresponding author.
E-mail addresses: yyogesh61@gmail.com (Y. C.K.), hari@unimap.edu.my (M. Hariharan), ruzelita@unimap.edu.my (R. Ngadiran), dhamid@unimap.edu.my (A.H. Adom), sazali.yaacob@unikl.edu.my (S. Yaacob), chawki.berkai@gmail.com (C. Berkai), kpolat@ibu.edu.tr (K. Polat).
http://dx.doi.org/10.1016/j.eswa.2016.10.035
0957-4174/© 2016 Elsevier Ltd. All rights reserved.

Researchers have extracted prosody features (statistical measures of fundamental frequency), spectral features (Mel Frequency Cepstral Coefficients (MFCCs) and Linear Prediction Cepstral Coefficients (LPCCs)) and voice quality features (pitch, jitter, shimmer, normalized amplitude quotient) (Busso, Lee, & Narayanan, 2009; Cairns & Hansen, 1994; Gobl & Ní, 2003; Tahon & Devillers, 2016; Teager, 1980; Vayrynen, Kortelainen, & Seppanen, 2013).

Feature extraction is carried out at three different levels of the speech signal, namely frame, segment and utterance (Devillers & Vidrascu, 2006; He, Lech, & Allen, 2010; Hübner, Vlasenko, Grosser, & Wendemuth, 2010). There are standard toolkits for extracting speech features, such as PRAAT, APARAT, OpenSMILE and OpenEAR (Boersma & van Heuven, 2001; Eyben, Wöllmer, & Schuller, 2009, 2010). The extracted features are labelled with discrete numbers for each emotion, and different classifiers are employed to recognize the emotion. Classifiers such as the Hidden Markov Model (HMM), Gaussian Mixture Model (GMM), Support Vector Machine (SVM), Artificial Neural Network (ANN) and Extreme Learning Machine (ELM) have mostly been used to recognize different emotions/stresses from speech signals (Amir, Kerret, & Karlinski, 2001; Hassan & Damper, 2010; Hübner et al., 2010).

Though researchers have applied various features and classifiers to recognize emotions and distinguish multi-class emotional or stressed states in a subject invariant (independent) manner, the task remains challenging, because there is no theoretical basis relating the features that directly represent the characteristics of the human voice to the different emotions (Luengo, Navas, & Hernáez, 2010). The performance of an ERS depends on the features extracted from the speech signals, which should be invariant to the speaker and his/her language.

In this work, Higher Order Spectral Analysis (HOSA) based non-linear features were extracted from speech signals. HOS is the spectral representation of higher order statistics, and it has several advantages, such as the ability to detect the non-Gaussianity, non-stationarity and non-linearity present in the speech signal (Chua, Chandran, Acharya, & Lim, 2010). HOSA has been used to distinguish normal and pathological voices (Lee, 2012; Wszołek & Kłaczyński, 2010). In this method, the features from speech and its glottal waveform are extracted separately by applying inverse filtering and linear predictive analysis (Naylor, Kounoudes, Gudnason, & Brookes, 2007; Veeneman & BeMent, 1985; Wong, Markel, & Gray, 1979). From the speech and glottal waveforms, 28 non-linear third order spectral measures called bispectral features (BSFs) (14 from the speech waveform and 14 from the glottal waveform) and 22 non-linear bicoherence features (BCFs) (11 from the speech waveform and 11 from the glottal waveform) were extracted, and these 50 features (BSFs + BCFs) were tested on three databases, namely the Berlin Emotional Speech Database (BES), the Surrey Audio-Visual Expressed Emotion Database (SAVEE) and Speech Under Simulated and Actual Stress (SUSAS). The proposed feature sets were combined with the standard Interspeech 2010 feature set (1582 features) to further improve the performance of the emotion recognition system.

Feature Selection (FS) refers to choosing a feature subset that is sufficient to achieve the required target (Kira & Rendell, 1992). The main goal of FS is to remove both irrelevant and redundant features; its function in machine learning is to reduce the feature dimensionality and the cost of the classification/learning algorithm. In this work, a new Particle Swarm Optimization assisted Biogeography-based Optimization (PSOBBO) is proposed for feature selection and used in the present ERS model. Further, the proposed optimization is also tested on standard benchmark databases for feature selection.

The developed ERS model contains three stages, namely feature extraction in Stage 1, feature selection using the PSOBBO technique in Stage 2, and classification and recognition of the different emotions in Stage 3. To evaluate the performance of the proposed model, experiments such as Subject Dependent (SD), Subject Independent (SI), Gender Dependent Male (GD-male), Gender Dependent Female (GD-female), Text Independent Multi-Style Speech Classification (TIDMSS) and Text Independent Pairwise Stress Classification (TIDPS) were conducted. The ELM classifier is employed in this work to distinguish the different emotions. The paper is organized as follows: the related previous works are discussed in Section 2, and the materials and methods used in the experiments are explained in Section 3; Section 4 describes the classifier, Section 5 presents the results and discussion, and Section 6 concludes the work.

2. Previous work

The previous works on emotion/stress recognition systems, the features extracted and the learning algorithms used are given in Table 1. Since the proposed method considers the BES, SAVEE and SUSAS databases, the review is confined to works pertaining to these databases.

In (Alonso, Cabrera, Medina, & Travieso, 2015; Cao et al., 2015; Henríquez, Alonso, Ferrer, Travieso, & Orozco-Arroyave, 2014; Luengo et al., 2010; Muthusamy, Polat, & Yaacob, 2015a; Sidorov, Brester, Minker, & Semenkin, 2014; Stuhlsatz et al., 2011; Sun, Wen, & Wang, 2015; Wang et al., 2015), spectral, prosody, voice quality, Interspeech 2009 and 2010, wavelet packet energy and non-linear features were extracted from the speech signals of the BES database. SVM, ELM and neural networks (NN) were used as learning algorithms. Both subject dependent and subject independent experiments were conducted, and the accuracies attained were in the ranges 75.4%–98.98% (SD) and 78.3%–97.24% (SI) respectively. In (Mao, Dong, Huang, & Zhan, 2014; Muthusamy et al., 2015a; Sidorov et al., 2014; Sun et al., 2015), spectrogram, spectral, prosody, voice quality, Interspeech 2009 and 2010, wavelet packet energy and non-linear features were extracted from the speech signals of the SAVEE database. The authors reported accuracies in the ranges 48.4%–97.60% (SD) and 50%–77.92% (SI) using different classifiers such as SVM, ELM and NN. For speech emotion/stress recognition on SUSAS, the authors in Deb and Dandapat (2015), Shahin and Ba-Hutair (2015), Shukla, Dandapat, and Prasanna (2016) and Stuhlsatz et al. (2011) extracted various features such as acoustic, MFCC, breathiness and dimensional features. HMM and NN were used to attain accuracies of 53.6% (SI) and 72.8%–93.89% (SD) respectively.

In previous works, it was reported that ERS exhibited a large confusion between the angry and happiness emotions. Further, it is noted that the accuracies attained in subject dependent experiments are higher than in subject independent experiments, because of the speaker-specific variations in the extracted features as well as in voice quality.

From the literature review, it is evident that features are usually extracted from speech signals only. In this research, features from the glottal waveforms were also extracted, and both speech and glottal features were combined to improve the performance of the ERS. Further, the dimensionality of the resulting feature set is higher than in previous works, which increases the computational cost, leads to over-fitting of the objective and degrades the performance of the ERS. To overcome this problem, a new Particle Swarm Optimization assisted Biogeography Based Optimization (PSOBBO) is proposed for feature selection.

3. Materials and method

This section describes the databases used and the BSFs and BCFs extracted from the speech and glottal waveforms. It also describes the proposed feature selection algorithm and the classifier used. The overall block diagram of the proposed work is represented in Fig. 1.

Table 1
Review on some of the speech emotion recognition systems developed using BES, SAVEE and SUSAS database.

Reference Database Emotions Features extracted Classifier Recognition rate

(Alonso et al., 2015) BES Happy, Angry, Sad and Boredom Spectral, Prosody and Pitch SVM 94.9% (SD)
(Luengo et al., 2010) BES Anxiety, Disgust, Happiness, Boredom, Neutral, Sadness and Anger Spectral, Prosodic and Voice quality SVM 78.3% (SI)
(Cao, Verma, & Nenkova, 2015) BES Anxiety, Disgust, Happiness, Boredom, Neutral, Sadness and Anger Spectral and Prosody Ranking SVM 82.1% (SD)
(Stuhlsatz et al., 2011) BES Anxiety, Disgust, Happiness, Boredom, Neutral, Sadness and Anger 6552 acoustic features Deep Neural Network 81.9% (SI)
(Sidorov et al., 2014) BES Anxiety, Disgust, Happiness, Boredom, Neutral, Sadness and Anger INTERSPEECH 2009 Probabilistic Neural Network 71.46% (SD)
(Wang, An, Li, Zhang, & Li, 2015) BES Happiness, Boredom, Neutral, Sad, Angry and Anxiety Prosody and Zero crossing rate SVM 88.8% (SI)
(Henríquez et al., 2014) BES Angry, Neutral and Fear Non linear Dynamic Features Neural Network 75.4% (SD)
(Sun et al., 2015) BES Anxiety, Disgust, Happiness, Boredom, Neutral, Sadness and Anger INTERSPEECH 2010 and Hu Moments SVM 89.32% (SD), 81.7% (SI)
(Muthusamy et al., 2015a) BES Anxiety, Disgust, Happiness, Boredom, Neutral, Sadness and Anger Wavelet Packet Energy and Entropy ELM 98.98% (SD), 97.24% (SI)
(Mao et al., 2014) SAVEE Anxiety, Disgust, Fear, Neutral, Sadness, Surprise and Happiness Spectrogram Neural Network 86.7% (SD), 73.6% (SI)
(Sidorov et al., 2014) SAVEE Anxiety, Disgust, Fear, Neutral, Sadness, Surprise and Happiness INTERSPEECH 2009 Probabilistic Neural Network 48.4% (SD)
(Sun et al., 2015) SAVEE Anxiety, Disgust, Fear, Neutral, Sadness, Surprise and Happiness INTERSPEECH 2010 and Hu moments SVM 75.60% (SD), 50% (SI)
(Muthusamy et al., 2015a) SAVEE Anxiety, Disgust, Fear, Neutral, Sadness, Surprise and Happiness Wavelet Packet energy and entropy ELM 97.60% (SD), 77.92% (SI)
(Stuhlsatz et al., 2011) SUSAS Actual stress 6552 acoustic features Deep Neural Network 53.6% (SI)
(Shahin & Ba-Hutair, 2015) SUSAS Angry, Lombard, Loud, Soft and Neutral MFCC HMM 91.5% (SD)
(Shukla et al., 2016) SUSAS Neutral, Angry, Clear, Cond50, Cond70, Fast, Loud, Question and Soft 13 Dimensional features HMM 93.89% (SD)
(Deb & Dandapat, 2015) SUSAS Angry, Happy, Lombard, Loud, Soft Breathiness feature and MFCC HMM 72.8% (SD)

Table 2
Description of BES and SAVEE database.

Databases Emotions

Anger Disgust Fear Neutral Happiness Sadness Boredom Surprise

BES 127 45 70 70 71 62 81 NA
SAVEE 60 60 60 120 60 60 NA 60

NA – not applicable

Table 3
Description of simulated domain from SUSAS database.

Database Multi-style of speech

Angry Lombard Loud Neutral

SUSAS 630 (9 Speaker x 35 words x 2 utterance) 630 630 631

3.1. Speech emotional database

In this work, three databases are used to test the robustness of the proposed PSOBBO. The BES database is collected from 10 German native speakers and contains seven emotions (Angry, Sad, Anxiety, Disgust, Happiness, Boredom and Neutral) (Burkhardt, Paeschke, Rolfes, Sendlmeier, & Weiss, 2005). The SAVEE database is collected from four native English male speakers and also contains seven emotional states (Happiness, Surprise, Disgust, Anxiety, Fear, Neutral and Sadness) (Haq, Jackson, & Edge, 2008). The SUSAS database contains actual stress and simulated multi-style speaking (Hansen, Bou-Ghazale, Sarikaya, & Pellom, 1997); in this work, six words uttered in four speaking styles (Angry, Lombard, Loud and Neutral) of the simulated domain are considered. The SD, SI, GD-Male and GD-Female experiments are conducted on BES, the SD and SI experiments on SAVEE, and the TIDPS and TIDMSS experiments on the SUSAS simulated multi-style speech. The descriptions of BES, SAVEE and SUSAS (simulated) are tabulated in Tables 2 and 3.

3.2. Feature extraction

To identify the different emotional states and speaking styles from speech signals, their salient features need to be extracted. All speech signals were down-sampled to 8 kHz, since the recorded signals of the databases have different sampling rates. The speech signals were segmented into non-overlapping frames of 256 samples (32 ms). Based on the energy present in each frame, the unvoiced portions (frames with low energy) were removed before the feature extraction process; a low-energy frame is removed by setting a threshold value, which is determined for each database separately (Ozdas, Shiavi, Silverman, Silverman, & Wilkes, 2004). The remaining voiced frames were concatenated, and glottal waveforms were extracted by applying the inverse filtering and linear predictive analysis method. Then, a first order pre-emphasis filter was used to spectrally flatten the speech and glottal waveforms (Muthusamy, Polat, & Yaacob, 2015b; Rabiner & Juang, 1993). The filtered signals were segmented into frames with an overlap of 50%, and each frame was windowed with a Hamming window, which reduces signal discontinuity and spectral distortion. Bispectral and bicoherence features were extracted for each frame and averaged over all frames.
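As a concrete illustration of the pre-processing just described, the following minimal Python sketch (not the authors' code) performs the framing, energy-based voiced-frame selection, first order pre-emphasis and Hamming windowing. The energy threshold and the pre-emphasis coefficient of 0.97 are assumptions made for illustration; the paper determines the threshold per database.

import numpy as np

def voiced_frames(signal, frame_len=256, energy_thresh=1e-4):
    # Split an 8 kHz signal into non-overlapping 256-sample (32 ms) frames and
    # keep only the higher-energy (voiced) frames; the threshold is a placeholder.
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    return frames[energy > energy_thresh]

def pre_emphasis(x, alpha=0.97):
    # First order pre-emphasis filter y[n] = x[n] - alpha * x[n-1]
    return np.append(x[0], x[1:] - alpha * x[:-1])

def analysis_frames(x, frame_len=256, overlap=0.5):
    # Re-segment with 50% overlap and apply a Hamming window to each frame
    hop = int(frame_len * (1 - overlap))
    win = np.hamming(frame_len)
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([x[s:s + frame_len] * win for s in starts])

if __name__ == "__main__":
    sig = np.random.randn(8000)              # 1 s of synthetic signal at 8 kHz
    voiced = voiced_frames(sig).ravel()      # concatenate the voiced frames
    frames = analysis_frames(pre_emphasis(voiced))
    print(frames.shape)                      # (num_frames, 256)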

Fig. 1. Block diagram of the proposed method.

3.2.1. Bispectral and bicoherence features

Speech signals are non-Gaussian by nature, since they contain repetitive patterns within short intervals of time. HOS have a high signal-to-noise ratio, and they can detect the deviation of a signal from the Gaussian model (Naylor et al., 2007). The important characteristics of a signal are preserved only when its phase is retained; HOS have the ability to preserve the phase of the signal during the phase reconstruction process, and noise added to the signal is also suppressed during this process. Therefore, it is more advantageous to analyse the speech signal with HOS than with power spectra and correlations (Muthuswamy, Sherman, & Thakor, 1999).

The spectral representation of the higher order cumulants of a random process is defined as the Higher Order Spectra. The third order cumulant spectrum is called the bispectrum. The bispectrum is the 2D Fourier transform of the third order cumulant function (Chua et al., 2010); it is a function of two frequencies, unlike the power spectrum, which is a function of only one frequency variable. The normalized bispectrum is called the bicoherence of the signal. Eqs. (1) and (2) define the bispectrum and the bicoherence:

B(f1, f2) = E[X(f1) X(f2) X∗(f1 + f2)]   (1)

where B(f1, f2) is the bispectrum at the bi-frequency (f1, f2), X(f) denotes the Fourier transform of the given signal, ∗ represents the complex conjugate and E[·] denotes the expectation operator. The phase at the frequency f1 + f2 is generated either fully or partially by the non-linearity present in the signal (Muthuswamy et al., 1999).

Table 4
Description of the features extracted from the non-redundant region (Ω) of the bispectrum. In all equations, i and j denote the frequency bin indices within Ω, B(i, j) (equivalently B(f1, f2)) is the bispectrum, and n is the number of points within Ω.

(3)–(4) Weighted Centre of Bispectrum (WCOB) (Chua et al., 2010): f1m = Σ i·B(i, j) / Σ B(i, j) and f2m = Σ j·B(i, j) / Σ B(i, j), where the sums run over Ω.
(5)–(6) WCOB of the absolute bispectrum: f3m = Σ i·|B(i, j)| / Σ |B(i, j)| and f4m = Σ j·|B(i, j)| / Σ |B(i, j)|.
(7)–(9) Normalized bispectral entropies, describing the regularity or irregularity present in bio-signals (Chua et al., 2010): entk = −Σ pi log pi, with pi = |B(f1, f2)| / Σ |B(f1, f2)| for ent1, pi = |B(f1, f2)|² / Σ |B(f1, f2)|² for ent2 and pi = |B(f1, f2)|³ / Σ |B(f1, f2)|³ for ent3.
(10) Sum of logarithmic amplitudes of the bispectrum: H1 = Σ log(|B(f1, f2)|).
(11) Sum of logarithmic amplitudes of the diagonal elements of the bispectrum: H2 = Σ log(|B(fd, fd)|).
(12) First order spectral moment of the amplitudes of the diagonal elements: H3 = Σ d·log(|B(fd, fd)|), d = 1, …, N.
(13) Second order spectral moment of the amplitudes of the diagonal elements: H4 = Σ (d − H3)²·log(|B(fd, fd)|), d = 1, …, N.
(14) H5 = Σ √(i² + j²)·|B(i, j)|, with i, j the frequency bin indices in Ω.
(15) Phase entropy: entPh = Σ p(ψk) log p(ψk), where p(ψk) = (1/n) Σ 1(φ(B(f1, f2)) ∈ ψk), ψk = {φ | −π + 2πk/N ≤ φ < −π + 2π(k+1)/N}, k = 0, 1, …, N−1, φ denotes the phase angle of the bispectrum and 1(·) returns 1 when φ lies within bin ψk.
(16) Mean magnitude of the bispectrum: mAmp = (1/n) Σ |B(f1, f2)|.
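A few of the Table 4 measures can be computed directly from a bispectrum matrix restricted to the region of interest. The snippet below is a minimal sketch of Eqs. (3)–(4), (7) and (10) only, assuming the bispectrum has already been estimated and cropped to Ω; taking the real part of the complex WCOB ratio is an assumption made here to obtain scalar features, and the remaining measures follow the same pattern.

import numpy as np

def bispectral_features(B):
    # B: 2-D complex bispectrum restricted to the non-redundant region.
    mag = np.abs(B)
    i_idx, j_idx = np.meshgrid(np.arange(B.shape[0]),
                               np.arange(B.shape[1]), indexing="ij")

    # Eqs. (3)-(4): weighted centre of bispectrum
    c = np.sum(B)
    f1m = (np.sum(i_idx * B) / c).real
    f2m = (np.sum(j_idx * B) / c).real

    # Eq. (7): entropy of the normalized bispectral magnitudes
    p = mag / np.sum(mag)
    ent1 = -np.sum(p[p > 0] * np.log(p[p > 0]))

    # Eq. (10): sum of logarithmic bispectral amplitudes
    h1 = np.sum(np.log(mag[mag > 0]))
    return f1m, f2m, ent1, h1

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    B = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
    print(bispectral_features(B))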
bic(f1, f2) = |B(f1, f2)| / [P(f1) P(f2) P(f1 + f2)]   (2)

where P(f1) and P(f2) denote the power spectrum at the frequencies f1 and f2. The bicoherence quantifies the extent of phase coupling between two frequency components (Muthuswamy et al., 1999).

The bispectrum and the bicoherence of a signal contain redundant data; hence the BSFs and BCFs were extracted from the non-redundant region (Ω) as described in (Acharya, Chua, Chua, Min, & Tamura, 2010). From the non-redundant region (Ω) of B(f1, f2), 14 BSFs each were extracted from the speech waveform and the glottal waveform. Similarly, from the non-redundant region (Ω) of bic(f1, f2), 11 BCFs were extracted from the speech waveform and 11 BCFs from the glottal waveform. The details of the extracted features are tabulated in Table 4: Eqs. (3)–(16) give the 14 BSFs derived from the bispectrum, and Eqs. (3), (4), (7)–(14) and (16) give the 11 BCFs derived from the bicoherence. In total, 50 (28 BSFs + 22 BCFs) bispectral and bicoherence features (BSBCFs) were extracted.

These features were derived from each frame. The number of voiced portions varies between speech signals, since the recording duration varies; the features were therefore extracted from each frame first and then averaged over all frames.

3.2.2. OpenSmile toolbox

The Interspeech 2010 feature set was computed directly with the OpenSmile toolbox (Eyben, Wöllmer et al., 2010). The details of the features are tabulated in Table 5; in total, 1582 features were extracted with the toolbox.

Table 5
Description of the features extracted from the OpenSmile toolbox (INTERSPEECH 2010 feature set).

Feature group                            Number of features
PCM loudness                             42
MFCC                                     630
LOG MEL FREQ. BAND 0–7                   336
LINE SPECTRAL PAIRS FREQ. 0–7            336
F0 and F0 ENVELOPE                       82
VOICING PROBABILITY                      42
JITTER Local and CONSEC FRAME PAIRS      76
SHIMMER LOCAL                            38
Total                                    1582

In this work, the 50 BSBCFs are combined with the 1582 OpenSmile features, giving in total 1632 OpenSmile plus Bispectral and Bicoherence Features (OSBSBCFs) extracted from each speech signal. Since the number of features is high, it raises the curse-of-dimensionality problem; hence feature selection is carried out before classification, and the OSBSBCFs are passed to the proposed PSOBBO for feature selection.

3.3. Feature selection

A high feature set dimension can cause over-fitting of the machine learning algorithm, which leads to performance degradation (Alelyani, Tang, & Liu, 2013). The main goal of FS is to remove irrelevant and redundant features; its motivation in machine learning is to reduce the feature dimensionality and the cost of the classification/learning algorithm. FS algorithms are broadly divided into filter, wrapper and embedded categories, and each algorithm has its own advantages and disadvantages (Yazdani, Shanbehzadeh, & Aminian, 2013). Since the wrapper approach uses the actual target inside the learning algorithm, it provides better recognition rates; in this work, wrapper-based feature selection with an ELM classifier is used.
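To make the wrapper idea concrete, the sketch below evaluates a candidate feature subset (a binary mask) by training a basic ELM, in the spirit of Huang, Zhu, and Siew (2006), random input weights plus least-squares output weights, and measuring hold-out accuracy. It is a simplified stand-in for the authors' wrapper: the hidden-layer size, the sigmoid activation and the single random split are assumptions made for illustration.

import numpy as np

def elm_fit(X, y, n_hidden=100, rng=None):
    # Basic ELM: random hidden layer, output weights by least squares on one-hot targets
    rng = np.random.default_rng(rng)
    W = rng.uniform(-1, 1, (X.shape[1], n_hidden))
    b = rng.uniform(-1, 1, n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    T = np.eye(int(y.max()) + 1)[y]
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def elm_predict(model, X):
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)

def wrapper_fitness(mask, X, y, test_frac=0.3, rng=0):
    # Fitness of one habitat/particle: accuracy of an ELM trained on the selected columns only
    if not mask.any():
        return 0.0
    rng_ = np.random.default_rng(rng)
    idx = rng_.permutation(len(y))
    n_test = int(len(y) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    Xs = X[:, mask]
    model = elm_fit(Xs[train], y[train], rng=rng)
    return np.mean(elm_predict(model, Xs[test]) == y[test])

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.standard_normal((200, 50))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy two-class labels
    mask = np.zeros(50, dtype=bool)
    mask[:5] = True
    print(wrapper_fitness(mask, X, y))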

Generating a subset from the main feature set for evaluating the fitness is an important step for any feature selection algorithm. Finding a suitable subset is an NP-hard problem (Kohavi & John, 1997), and exploring every possible subset in a high-dimensional space is a challenge. Generally, heuristic search is employed to solve such problems, but it converges to a local optimum. Therefore, randomized feature selection algorithms such as PSO, the genetic algorithm (GA) and tabu search (TS) have been employed (Yazdani et al., 2013). In 2008, Simon developed BBO, a metaheuristic algorithm for global optimization; the BBO algorithm has outperformed random feature selection algorithms on several problems (Simon, 2008).

3.3.1. BBO

Biogeography-based optimization is an algorithm based on the geographical distribution of a group of biological organisms in an isolated environment. An organism in BBO is called a species; species can migrate from one island to another, and each island is called a habitat. Each habitat has a Habitat Suitability Index (HSI), which is similar to the fitness in general optimization algorithms, and its Suitability Index Variables (SIVs) describe the habitability of the habitat. Habitats with a good HSI share their features with other islands in order to create a good population for the next generation. The emigration rate (μi) and the immigration rate (λi) of the habitats are controlled by the fitness:

λi = I · (1 − k(i)/n)   (17)

μi = E · k(i)/n   (18)

where k(i) denotes the rank of the habitat in the population, calculated based on fitness, and n denotes the total population size. E and I are the maximum emigration and immigration rates. The migration operation is followed by a mutation operation, which is performed to modify the SIVs. BBO has proved successful in optimizing many engineering problems, namely aircraft maintenance sensor selection, Yagi–Uda antenna design, parameter estimation of chaotic systems, minimizing non-productive time during hole-making processes and optimal operation of reservoir systems (Haddad, Hosseini-Moghari, & Loáiciga, 2015; Simon, 2008; Singh, Tayal, & Sachdeva, 2012; Tamjidy, Paslar, Baharudin, Hong, & Ariffin, 2015; Wang & Xu, 2011).

3.3.2. PSO

Particle Swarm Optimization (PSO) is a population-based stochastic optimization technique proposed by Kennedy and Eberhart. The technique emerged from the behaviour of non-guided animals moving in a group or swarm, as in bird flocking and fish schooling. Each individual, or solution, in a PSO population has its own velocity and position. In an n-dimensional search space, each particle flies towards its best position depending on (1) pBest, the best solution it has achieved so far, and (2) gBest, the global best value. The algorithm updates each particle's velocity and position until the best solution is reached (Kaur & Kaur, 2015). PSO has recently been applied to improve emotion recognition in speech and glottal signals (Muthusamy et al., 2015b). Further, PSO has been employed to optimize communication networks, engines, motors, entertainment and metallurgy applications (Poli, 2007).

3.3.3. PSOBBO

In order to develop an efficient ERS model, the PSOBBO is developed. The proposed PSOBBO enhances the basic BBO. The basic BBO is good at exploiting current population information through its migration operation, but it is slow in exploring the global search space (Gong, Cai, & Ling, 2010). In order to improve its exploration capacity, a modified PSO velocity and position update of the particles is incorporated and applied to the worst half of the population. Further, the proposed method helps to increase the diversity of the population. Algorithm 1 shows the operation of PSOBBO. In Algorithm 1, P represents the population, Pnew represents the habitats after the migration operation and V denotes the velocity of a particle. The constant values c1 = 0.5 and c2 = 2 are used as weight factors.

The migration operation (Steps 10–17) is the basic BBO, which modifies the habitats within the population. The proposed update (Steps 18–29) is carried out after the basic migration operation. Step 20 contains the modified velocity equation: the first part denotes the current velocity; in the second part, since only a minor portion of the population in pBest becomes qualified, a different move-away procedure is adopted by retaining the same population without modifying its habitat; and the third term attracts the particle towards the global best position.

4. Classification

In this work, the ELM classifier is used to distinguish the emotions.

4.1. ELM

The Extreme Learning Machine (ELM) was proposed by Huang, Zhou, Ding, and Zhang (2012) and Huang, Zhu, and Siew (2006) for single hidden layer feedforward networks. ELM can be used for feature mapping, regression and multi-class classification. ELM has lower computational complexity than the Least Squares Support Vector Machine (LS-SVM), and it imposes fewer constraints on the optimization algorithm than LS-SVM. Hence, in this work, the ELM classifier is used for emotion recognition and feature selection.

5. Results and discussion

To develop an efficient ERS from speech signals, the OSBSBCFs are used and the PSOBBO algorithm is proposed to reduce the feature size. The proposed method is tested with three different databases: BES, SAVEE and SUSAS multi-style speech.

The subject independent recognition rate is used as the fitness function for BES (SI), SAVEE (SI) and SUSAS (TIDMSS). The population size is set to 50 and the maximum number of iterations is set to 100. In total, 20 independent runs are performed for each experiment. The maximum recognition rate obtained over all independent runs is noted, and its corresponding features are used for the other, dependent experiments.

The maximum, minimum, mean and standard deviation of the maximum accuracies obtained from the independent runs are tabulated below. Further, the mean values of the number of selected features corresponding to the maximum accuracies of the independent runs are also tabulated.

Table 6 illustrates the recognition rates for the BES database. It is evident from the table that PSOBBO has outperformed BBO and PSO by attaining accuracies of 90.31%, 99.47%, 98.94% and 92.98% for SI, SD, GD-Female and GD-Male respectively, with only 177 features on average.

The maximum recognition rate of the SI experiment is 90.31%, and the individual recognition rates of the seven BES emotions are Anxiety – 84.29%, Disgust – 91.11%, Happiness – 83.10%, Boredom – 91.36%, Neutral – 91.14%, Sadness – 93.55% and Anger – 97.64%. It can be noticed that Anxiety and Happiness show high confusion compared to the other emotions, because anxiety, happiness and anger belong to the high-arousal type (level of physical response). Though anger also has high arousal, it additionally has a high pitch value.

Algorithm 1 The PSOBBO framework.

1 Randomly initialize the population of P habitats


2 Calculate the fitness for each habitat
3 Sort the habitats in descending order based on the fitness
4 Update gBest
5 for m = 1 to Maximum_Iteration
6 for i = 1 to P
7 Update λi and μi
8 end
9 // Perform Migration operations
10 for p = 1 to P
11 for j = 1 to Number_Of_features
12 if rand () < λi
13 Select a habitat Pp with probability μi
14 Pnewp ← Pp
15 end if
16 end for
17 end for
18 for p = round(length(P)/2) to P
19 for j = 1 to Number_Of_features
20 V(p, j) = rand ∗ V(p, j) + c1 ∗ rand ∗ P(p, j) + c2 ∗ rand ∗ (gBest(j) − P(p, j))
21 S = abs((2/π) ∗ atan((π/2) ∗ V(p, j)))
22 if rand () < S
23 Pnew(p, j) = 1
24 else
25 Pnew(p, j) = 0
26 end if
27 end for
28 end for
29 P = Pnew
30 Calculate the fitness for each habitat
31 Sort the habitats in descending order based on the fitness
32 Update gBest
33 end for
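The listing below is a compact Python rendering of one PSOBBO generation following Algorithm 1: BBO-style migration over binary habitats (Steps 10–17), then the modified velocity/position update applied to the worst half of the population with the arctan-based transfer function (Steps 18–29). It is an illustrative sketch under stated assumptions (I = E = 1 in Eqs. (17)–(18), c1 = 0.5, c2 = 2, binary encoding, an arbitrary fitness callable, and the parenthesization of Step 20 interpreted so that the third term attracts the particle towards gBest); it is not the authors' released code.

import numpy as np

def psobbo_generation(P, V, fitness, c1=0.5, c2=2.0, rng=None):
    # One generation of PSOBBO on a binary population P (pop_size x n_features).
    rng = np.random.default_rng(rng)
    pop, nfeat = P.shape

    # Sort habitats by fitness (descending) and derive migration rates, Eqs. (17)-(18)
    fit = np.array([fitness(h) for h in P])
    order = np.argsort(-fit)
    P, V = P[order], V[order]
    k = np.arange(pop, 0, -1)        # best habitat gets the highest rank k = pop
    lam = 1.0 - k / pop              # immigration rate, Eq. (17) with I = 1
    mu = k / pop                     # emigration rate,  Eq. (18) with E = 1
    gbest = P[0].copy()

    # Steps 10-17: migration (copy SIVs from habitats chosen via emigration rates)
    Pnew = P.copy()
    for p in range(pop):
        for j in range(nfeat):
            if rng.random() < lam[p]:
                src = rng.choice(pop, p=mu / mu.sum())
                Pnew[p, j] = P[src, j]

    # Steps 18-29: PSO-style update for the worst half of the population
    for p in range(pop // 2, pop):
        for j in range(nfeat):
            V[p, j] = (rng.random() * V[p, j]
                       + c1 * rng.random() * P[p, j]
                       + c2 * rng.random() * (gbest[j] - P[p, j]))
            s = abs((2 / np.pi) * np.arctan((np.pi / 2) * V[p, j]))
            Pnew[p, j] = 1 if rng.random() < s else 0

    return Pnew, V, gbest

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    P = rng.integers(0, 2, (10, 20))
    V = np.zeros((10, 20))
    # Toy fitness: prefer habitats that select few features
    P, V, gbest = psobbo_generation(P, V, fitness=lambda h: -h.sum(), rng=3)
    print(P.shape, gbest.sum())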

Table 6
Recognition rates of SI, SD, GD-female, and GD-male experiments to recognize multi-class (7 emotions) of BES database.
Bold values are obtained from proposed method.

Experiments Method Maximum (%) Minimum (%) Mean (%) STD No. of selected features mean

SI BBO 77.14 74.91 76.13 0.55 1062.65


PSO 73.84 71.36 72.63 0.67 817.05
PSOBBO 90.31 86.79 88.36 0.86 177.8
SD BBO 99.38 95.05 96.96 1.79 1062.65
PSO 99.42 90.76 96.23 2.47 817.05
PSOBBO 99.47 94.99 97.54 1.49 177.8
GD female BBO 93.57 89.54 91.12 1.35 1062.65
PSO 92.81 85.99 90.05 2.26 817.05
PSOBBO 98.94 91.27 94.79 2.30 177.8
GD male BBO 88.56 63.23 77.00 6.90 1062.65
PSO 82.89 67.62 75.62 5.78 817.05
PSOBBO 92.98 74.18 84.14 5.43 177.8

Table 7
Recognition rates of SI and SD experiments to recognize multi-class (7 emotions) of SAVEE database.

Experiments Method Maximum (%) Minimum (%) Mean (%) STD No. of selected features mean

SI BBO 58.81 54.64 56.57 0.011 1046.25


PSO 53.57 48.69 50.55 0.011 817.15
PSOBBO 62.50 56.55 59.63 0.015 336.75
SD BBO 62.47 47.31 54.63 4.688 1046.25
PSO 61.04 44.43 56.02 4.594 817.15
PSOBBO 78.44 61.70 69.75 4.989 336.75

It can be clearly noticed that the recognition rate of anger is considerably higher than those of the other emotions.

Table 7 illustrates the recognition rates for the SAVEE database. It shows that, with 336 features, the PSOBBO has outperformed BBO and PSO; accuracies of 62.5% and 78.44% are attained for the SI and SD experiments of the SAVEE database.

The maximum emotion recognition rate of the SAVEE SI experiment is 62.5%, and the individual emotion recognition rates are Anxiety – 73.33%, Disgust – 53.33%, Fear – 33.33%, Neutral – 85.83%, Sadness – 55%, Surprise – 71.67% and Happiness – 65%. The emotions Fear, Disgust and Sadness show higher confusion compared to Neutral and Anxiety. Fear and Disgust have low valence (level of pleasantness); further, Fear, Disgust and Sadness produce a lower heartbeat and more salivation, and the corresponding speech signal contains low pitch and frequency values, which makes these emotions correlated and increases the confusion between them.

The SUSAS (TIDMSS) results and the pairwise classification between neutral and the other three speech styles (Angry, Lombard and Loud) are reported in Table 8. From these tables, it can be seen that among the three methods the PSOBBO has performed better, with maximum accuracies of 85.83%, 97.96%, 92.96% and 98.70% respectively, using 258 features.

Table 8
Recognition rates of TIDMSS and TIDPS experiments to recognize multi-class (4 speech style) of SUSAS Database. Bold values are
obtained from proposed method.

Experiments Method Maximum (%) Minimum (%) Mean (%) STD No. of selected features mean

TIDMSS BBO 78.43 75.83 77.11 0.006 853.15


PSO 75.00 72.87 73.64 0.006 801.45
PSOBBO 85.83 82.59 84.12 0.010 258.75
TIDPS Angry Vs. Neutral BBO 96.11 96.11 96.11 0 853.15
PSO 95.74 95.74 95.74 0 801.45
PSOBBO 97.96 97.96 97.96 0 258.75
TIDPS Lombard Vs. Neutral BBO 89.63 89.63 89.63 0 853.15
PSO 87.59 87.59 87.59 0 801.45
PSOBBO 92.96 92.96 92.96 0 258.75
TIDPS Loud Vs. Neutral BBO 94.63 94.63 94.63 0 853.15
PSO 92.41 92.41 92.41 0 801.45
PSOBBO 98.70 98.70 98.70 0 258.75

Table 9
Summary of benchmark dataset.

Dataset Number of samples Number of classes Number of features

Glass 214 6 9
Dermatology 358 6 34
Heart 270 2 13
Parkinson 195 2 22
Sonar 208 2 60
9 tumours 174 9 5726
11 tumours 174 11 12,533
14 tumours 308 26 15,009

The maximum recognition rate of the TIDMSS experiment is 85.83% (Angry – 86.67%, Lombard – 71.11%, Loud – 84.44% and Neutral – 97.04%). Since the stressed speech styles Angry, Lombard and Loud share common characteristics such as higher pitch, higher frequency and higher arousal, it can be noticed that the neutral style shows much less confusion than the other styles of speech.

To develop a real time ERS, the performance of the model should be evaluated based on the SI experiment, as this experiment is conducted independently of the subjects and their emotions; in the SD experiment, a part of each subject's emotions is used in the training phase. These two experiments are conducted to evaluate the proposed ERS. From the results of the SI, SD and TIDMSS experiments on the three databases, tabulated in Tables 6–8, it can be concluded that the proposed BSBCFs and the PSOBBO feature selection algorithm attain higher recognition rates than BBO and PSO.

Further, to prove the efficiency of the proposed feature selection algorithm, PSOBBO is tested with publicly available standard benchmark datasets, namely Glass, Dermatology, Heart, Parkinson, Sonar (Frank & Asuncion, 2010), 9 tumours, 11 tumours and 14 tumours (Statnikov & Tsamardinos, 2005). Table 9 gives a short summary of these benchmark datasets; detailed descriptions can be found in (Liew, Seera, Loo, & Lim, 2015; Shen et al., 2016; Yazdani et al., 2013). These datasets are chosen to cover categories such as small samples with few features, small samples with a large number of features, and two-class as well as multi-class problems. The PSOBBO is compared with BBO and PSO in Table 10.

For the benchmark datasets, the accuracy of 1 × 10-fold cross-validation is used as the fitness function. The population size is set to 50 and the maximum number of iterations is set to 100. In total, 10 independent runs are conducted for each dataset.

Table 10 shows the accuracies attained on the benchmark datasets. The results confirm that the proposed PSOBBO has always attained higher classification accuracy with a smaller number of selected features.

A paired t-test is performed among the three methods; a p-value less than 0.05 is considered statistically significant. From Table 11, it can clearly be seen that PSOBBO has produced significantly better results than BBO and PSO for most of the experiments. The proposed method has the following salient features:

1. The PSOBBO feature selection algorithm was validated using three emotional speech databases (BES, SAVEE and SUSAS) and also eight benchmark datasets.
2. The application of these HOSA-based bispectrum and bicoherence features to ERS is novel.
3. Subject independent experiments were conducted using the utterances of the BES, SAVEE and SUSAS databases and obtained maximum recognition rates of 90.31%, 62.5% and 85.83% respectively. The results are significantly better than the results reported in previous works (Table 1).
4. A limitation of the proposed method is that the recognition rate may vary for different races and cross-cultural utterances.

6. Conclusion

A subject independent, multi-style speech emotion and stress recognition system (ERS) is a very useful tool in applications like HCI, health care systems and criminal investigation. Further, it has shown potential impact on lie detection systems, learning environments, and the development of educational, entertainment, and games software. In this work, BSBCFs and Interspeech 2010 features are extracted from speech and glottal signals, and a PSOBBO metaheuristic algorithm based on BBO and PSO is proposed to solve the feature selection problem. The proposed feature selection algorithm is designed to reduce the feature size and remove irrelevant features. The PSOBBO method is tested with three speech emotion/stress databases and eight different benchmark datasets; the chosen datasets have different numbers of classes and dimensionalities. Further, PSOBBO is compared with BBO and PSO. Overall, PSOBBO reduces the feature size by 92%, 48.5% and 63.2% for the BES, SAVEE and SUSAS databases respectively, from the original 1632 features.

Table 10
Comparison of PSOBBO with BBO and PSO for benchmark datasets. Bold values are obtained from proposed method.

Dataset name Method Maximum (%) Minimum (%) Mean (%) STD No. of selected features mean

Glass BBO 74.77 72.90 73.79 0.559 6.8


PSO 74.30 73.83 74.11 0.241 6.4
PSOBBO 74.77 73.83 74.21 0.369 6.1
Dermatology BBO 99.16 98.60 98.88 0.186 18.7
PSO 99.16 98.88 99.13 0.088 20.2
PSOBBO 99.44 98.88 99.08 0.189 12.4
Heart BBO 87.41 85.56 86.26 0.591 6.4
PSO 87.78 86.30 87.04 0.579 7.3
PSOBBO 86.30 86.30 86.30 0.000 3.8
Parkinson BBO 97.95 97.44 97.74 0.265 13.7
PSO 98.46 97.95 98.10 0.248 13.2
PSOBBO 98.46 97.95 98.21 0.270 10.5
Sonar BBO 95.19 92.79 93.80 0.697 27.7
PSO 97.12 95.19 96.25 0.591 33
PSOBBO 97.60 95.67 96.44 0.724 15.5
9 tumours BBO 81.67 80.00 80.33 0.745 3897
PSO 78.33 75.00 77.00 1.394 2853.4
PSOBBO 95 88.33 90.00 2.887 480
11 tumours BBO 99.43 98.85 99.31 0.257 7248.2
PSO 99.43 98.85 99.08 0.315 6267.2
PSOBBO 99.43 99.43 99.43 0.000 2286
14 tumours BBO 81.17 79.87 80.52 0.513 9469
PSO 81.17 79.55 80.32 0.673 7461.8
PSOBBO 83.77 79.87 82.34 1.727 3281
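The statistical comparison reported in Table 11 below is a paired t-test on the per-run maximum accuracies of two methods. A minimal SciPy sketch is given here; the accuracy arrays are made-up placeholders standing in for the actual per-run results summarized in Tables 6–10.

import numpy as np
from scipy import stats

# Hypothetical per-run maximum accuracies (20 independent runs) for two methods
acc_psobbo = np.array([90.3, 88.1, 89.4, 87.9, 88.6, 90.0, 89.2, 88.8, 87.5, 89.9,
                       88.4, 90.1, 89.0, 88.2, 87.8, 89.5, 88.9, 90.2, 88.0, 89.1])
acc_bbo    = np.array([77.1, 76.0, 75.4, 76.8, 75.9, 76.5, 77.0, 75.2, 76.1, 76.9,
                       75.8, 76.3, 75.5, 76.7, 75.1, 76.4, 76.0, 75.7, 76.2, 75.6])

t_stat, p_value = stats.ttest_rel(acc_psobbo, acc_bbo)   # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")             # p < 0.05 => significant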

Table 11
Paired t-test for PSOBBO against BBO and PSO in terms of the maximum classification accuracy attained over the independent runs. Metric: t value (significance). Bold values are obtained from the proposed method.

Dataset name    PSOBBO vs. BBO    PSOBBO vs. PSO
BES (SI)        48.749 (0.000)    70.262 (0.000)
SAVEE (SI)      6.722 (0.000)     22.876 (0.000)
SUSAS (SI)      24.411 (0.000)    43.504 (0.000)
Glass           1.784 (0.108)     0.688 (0.509)
Dermatology     2.090 (0.066)     −0.802 (0.443)
Heart           0.198 (0.847)     −4.045 (0.003)
Parkinson       3.857 (0.004)     1 (0.343)
Sonar           7.822 (0.000)     0.688 (0.509)
9 tumours       9.94 (0.001)      13.377 (0.000)
11 tumours      1 (0.374)         2.44 (0.040)
14 tumours      2.108 (0.103)     1.901 (0.130)

Relative increases in recognition rate of almost 0.05%–13.17% (BES), 3.63%–18.97% (SAVEE) and 1.85%–7.4% (SUSAS) are achieved in comparison with the next best performing feature selection method. The results illustrate the effectiveness of the proposed feature selection algorithm over BBO and PSO. In terms of expert and intelligent systems and their applications, the proposed ERS can extract features from both speech and glottal signals, and the PSOBBO method would be of benefit for selecting the most relevant features in emotion recognition/classification tasks. Further, the proposed approach can be extended to a real time emotion recognition system that automatically detects different stresses/emotions from natural speech.

In future work, the proposed ERS model will be tested using various types of emotions and cross-cultural or cross-linguistic, naturalistic and larger corpora. Further, the proposed ERS system will be extended to identify multiple/overlapping emotions present in the utterances. Within the ERS feature selection framework, other contemporary heuristic searches such as Firefly, the artificial bee colony algorithm (ABC) and Fruit fly will also be incorporated and compared.

References

Acharya, U. R., Chua, E. C.-P., Chua, K. C., Min, L. C., & Tamura, T. (2010). Analysis and automatic identification of sleep stages using higher order spectra. International Journal of Neural Systems, 20, 509–521.
Alelyani, S., Tang, J., & Liu, H. (2013). Feature selection for clustering: A review. Data Clustering: Algorithms and Applications, 29, 110–121.
Alonso, J. B., Cabrera, J., Medina, M., & Travieso, C. M. (2015). New approach in quantification of emotional intensity from the speech signal: Emotional temperature. Expert Systems with Applications, 42, 9554–9564.
Amir, N., Kerret, O., & Karlinski, D. (2001). Classifying emotions in speech: A comparison of methods. In INTERSPEECH (pp. 127–130).
Boersma, P., & van Heuven, V. (2001). Speak and unSpeak with PRAAT. Glot International, 5, 341–347.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005). A database of German emotional speech. In INTERSPEECH: Vol. 5 (pp. 1517–1520).
Busso, C., Lee, S., & Narayanan, S. (2009). Analysis of emotionally salient aspects of fundamental frequency for emotion detection. IEEE Transactions on Audio, Speech, and Language Processing, 17, 582–596.
Cairns, D. A., & Hansen, J. H. (1994). Nonlinear analysis and classification of speech under stressed conditions. The Journal of the Acoustical Society of America, 96, 3392–3400.
Calvo, R., & D'Mello, S. (2010). Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 1, 18–37.
Cao, H., Verma, R., & Nenkova, A. (2015). Speaker-sensitive emotion recognition via ranking: Studies on acted and spontaneous speech. Computer Speech & Language, 29, 186–202.
Chua, K. C., Chandran, V., Acharya, U. R., & Lim, C. M. (2010). Application of higher order statistics/spectra in biomedical signals—A review. Medical Engineering & Physics, 32, 679–689.
Cowie, R., & Cornelius, R. R. (2003). Describing the emotional states that are expressed in speech. Speech Communication, 40, 5–32.
Deb, S., & Dandapat, S. (2015). A novel breathiness feature for analysis and classification of speech under stress. In Communications (NCC), 2015 Twenty First National Conference on (pp. 1–5). IEEE.
Devillers, L., & Vidrascu, L. (2006). Real-life emotions detection with lexical and paralinguistic cues on human-human call center dialogs. In INTERSPEECH.
El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44, 572–587.
Eyben, F., Batliner, A., Schuller, B., Seppi, D., & Steidl, S. (2010). Cross-corpus classification of realistic emotions—some pilot experiments. In Proc. 3rd International Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect (pp. 77–82).
Eyben, F., Wöllmer, M., & Schuller, B. (2009). OpenEAR—introducing the Munich open-source emotion and affect recognition toolkit. In Affective Computing and Intelligent Interaction and Workshops, 2009. ACII 2009. 3rd International Conference on (pp. 1–6). IEEE.
Eyben, F., Wöllmer, M., & Schuller, B. (2010). Opensmile: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the International Conference on Multimedia (pp. 1459–1462). ACM.
Frank, A., & Asuncion, A. (2010). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/. Retrieved on 12/12/2015.
Gangamohan, P., Kadiri, S. R., & Yegnanarayana, B. (2016). Analysis of emotional speech—A review. In Toward Robotic Socially Believable Behaving Systems—Volume I (pp. 205–238). Springer.
Garvin, P. L., & Ladefoged, P. (1963). Speaker identification and message identification in speech recognition. Phonetica, 9, 193–199.
Gobl, C., & Ní, A. (2003). The role of voice quality in communicating emotion, mood and attitude. Speech Communication, 40, 189–212.
Gong, W., Cai, Z., & Ling, C. X. (2010). DE/BBO: A hybrid differential evolution with biogeography-based optimization for global numerical optimization. Soft Computing, 15, 645–665.
Haddad, O. B., Hosseini-Moghari, S.-M., & Loáiciga, H. A. (2015). Biogeography-based optimization algorithm for optimal operation of reservoir systems. Journal of Water Resources Planning and Management, 142, 04015034.
Hansen, J. H., Bou-Ghazale, S. E., Sarikaya, R., & Pellom, B. (1997). Getting started with SUSAS: A speech under simulated and actual stress database. Eurospeech, 97, 1743–1746.
Haq, S., Jackson, P. J., & Edge, J. (2008). Audio-visual feature selection and reduction for emotion classification. In Proc. International Conference on Auditory-Visual Speech Processing (AVSP'08).
Hassan, A., & Damper, R. I. (2010). Multi-class and hierarchical SVMs for emotion recognition.
He, L., Lech, M., & Allen, N. (2010). On the importance of glottal flow spectral energy for the recognition of emotions in speech. In Interspeech 2010 (pp. 2346–2349). International Speech Communication Association.
Henríquez, P., Alonso, J. B., Ferrer, M. A., Travieso, C. M., & Orozco-Arroyave, J. R. (2014). Nonlinear dynamics characterization of emotional speech. Neurocomputing, 132, 126–135.
Huang, G.-B., Zhou, H., Ding, X., & Zhang, R. (2012). Extreme learning machine for regression and multiclass classification. Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on, 42, 513–529.
Huang, G.-B., Zhu, Q.-Y., & Siew, C.-K. (2006). Extreme learning machine: Theory and applications. Neurocomputing, 70, 489–501.
Hübner, D., Vlasenko, B., Grosser, T., & Wendemuth, A. (2010). Determining optimal features for emotion recognition from speech by applying an evolutionary algorithm. In Proceedings of Interspeech (pp. 2358–2361).
Iliev, A. I., Scordilis, M. S., Papa, J. P., & Falcão, A. X. (2010). Spoken emotion recognition through optimum-path forest classification using glottal features. Computer Speech & Language, 24, 445–460.
Kaur, A., & Kaur, M. (2015). A review of parameters for improving the performance of particle swarm optimization. International Journal of Hybrid Information Technology, 8.
Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. In Proceedings of the Ninth International Workshop on Machine Learning (pp. 249–256).
Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97, 273–324.
Kostoulas, T., Mporas, I., Kocsis, O., Ganchev, T., Katsaounos, N., Santamaria, J. J., et al. (2012). Affective speech interface in serious games for supporting therapy of mental disorders. Expert Systems with Applications, 39, 11072–11079.
Lee, J. Y. (2012). A two-stage approach using Gaussian mixture models and higher-order statistics for a classification of normal and pathological voices. EURASIP Journal on Advances in Signal Processing, 2012, 1–8.
Liew, W. S., Seera, M., Loo, C. K., & Lim, E. (2015). Affect classification using genetic-optimized ensembles of fuzzy ARTMAPs. Applied Soft Computing, 27, 53–63.
Lopez-de-Ipiña, K., Alonso, J. B., Solé-Casals, J., Barroso, N., Henriquez, P., Faundez-Zanuy, M., et al. (2013). On automatic diagnosis of Alzheimer's disease based on spontaneous speech analysis and emotional temperature. Cognitive Computation, 7, 44–55.
Luengo, I., Navas, E., & Hernáez, I. (2010). Feature analysis and evaluation for automatic emotion identification in speech. Multimedia, IEEE Transactions on, 12, 490–501.
Mao, Q., Dong, M., Huang, Z., & Zhan, Y. (2014). Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, 16, 2203–2213.
Muthusamy, H., Polat, K., & Yaacob, S. (2015a). Improved emotion recognition using Gaussian mixture model and extreme learning machine in speech and glottal signals. Mathematical Problems in Engineering, 2015, 1–13.
Muthusamy, H., Polat, K., & Yaacob, S. (2015b). Particle swarm optimization based feature enhancement and feature selection for improved emotion recognition in speech and glottal signals. PloS One, 10, e0120344.
Muthuswamy, J., Sherman, D. L., & Thakor, N. V. (1999). Higher-order spectral analysis of burst patterns in EEG. Biomedical Engineering, IEEE Transactions on, 46, 92–99.
Naylor, P., Kounoudes, A., Gudnason, J., & Brookes, M. (2007). Estimation of glottal closure instants in voiced speech using the DYPSA algorithm. IEEE Transactions on Audio, Speech, and Language Processing, 15, 34–43.
Neiberg, D., & Elenius, K. (2008). Automatic recognition of anger in spontaneous speech. In INTERSPEECH (pp. 2755–2758).
Ozdas, A., Shiavi, R. G., Silverman, S. E., Silverman, M. K., & Wilkes, D. M. (2004). Investigation of vocal jitter and glottal flow spectrum as possible cues for depression and near-term suicidal risk. IEEE Transactions on Biomedical Engineering, 51, 1530–1540.
Petrushin, V. A. (2000). Emotion recognition in speech signal: Experimental study, development, and application. Studies, 3, 4.
Poli, R. (2007). An analysis of publications on particle swarm optimization applications. Essex, UK: Department of Computer Science, University of Essex.
Rabiner, L., & Juang, B.-H. (1993). Fundamentals of speech recognition.
Shahin, I., & Ba-Hutair, M. N. (2015). Talking condition recognition in stressful and emotional talking environments based on CSPHMM2s. International Journal of Speech Technology, 18, 77–90.
Shen, L., Chen, H., Yu, Z., Kang, W., Zhang, B., Li, H., et al. (2016). Evolving support vector machines using fruit fly optimization for medical data classification. Knowledge-Based Systems.
Shukla, S., Dandapat, S., & Prasanna, S. M. (2016). A subspace projection approach for analysis of speech under stressed condition. Circuits, Systems, and Signal Processing, 35, 4486–4500.
Sidorov, M., Brester, C., Minker, W., & Semenkin, E. (2014). Speech-based emotion recognition: Feature selection by self-adaptive multi-criteria genetic algorithm. In International Conference on Language Resources and Evaluation (LREC).
Simon, D. (2008). Biogeography-based optimization. Evolutionary Computation, IEEE Transactions on, 12, 702–713.
Singh, S., Tayal, S., & Sachdeva, G. (2012). Evolutionary performance of BBO and PSO algorithms for Yagi-Uda antenna design optimization. In Information and Communication Technologies (WICT), 2012 World Congress on (pp. 861–865). IEEE.
Statnikov, A., & Tsamardinos, I. (2005). Gene Expression Model Selector. http://www.gems-system.org/. Retrieved on 10/12/2015.
Stuhlsatz, A., Meyer, C., Eyben, F., Zielke, T., Meier, G., & Schuller, B. (2011). Deep neural networks for acoustic emotion recognition: Raising the benchmarks. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (pp. 5688–5691). IEEE.
Sun, R., Moore, E., & Torres, J. F. (2009). Investigating glottal parameters for differentiating emotional categories with similar prosodics. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on (pp. 4509–4512). IEEE.
Sun, Y., Wen, G., & Wang, J. (2015). Weighted spectral features based on local Hu moments for speech emotion recognition. Biomedical Signal Processing and Control, 18, 80–90.
Sundberg, J., Patel, S., Björkner, E., & Scherer, K. R. (2011). Interdependencies among voice source parameters in emotional speech. Affective Computing, IEEE Transactions on, 2, 162–174.
Tahon, M., & Devillers, L. (2016). Towards a small set of robust acoustic features for emotion recognition: Challenges. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 24, 16–28.
Tamjidy, M., Paslar, S., Baharudin, B. H. T., Hong, T. S., & Ariffin, M. (2015). Biogeography based optimization (BBO) algorithm to minimise non-productive time during hole-making process. International Journal of Production Research, 53, 1880–1894.
Teager, H. M. (1980). Some observations on oral air flow during phonation. IEEE Transactions on Acoustics, Speech and Signal Processing, 28, 599–601.
Vayrynen, E., Kortelainen, J., & Seppanen, T. (2013). Classifier-based learning of nonlinear feature manifold for visualization of emotional speech prosody. IEEE Transactions on Affective Computing, 4, 47–56.
Veeneman, D. E., & BeMent, S. L. (1985). Automatic glottal inverse filtering from speech and electroglottographic signals. IEEE Transactions on Acoustics, Speech and Signal Processing, 33, 369–377.
Wang, K., An, N., Li, B. N., Zhang, Y., & Li, L. (2015). Speech emotion recognition using Fourier parameters. IEEE Transactions on Affective Computing, 6, 69–75.
Wang, L., & Xu, Y. (2011). An effective hybrid biogeography-based optimization algorithm for parameter estimation of chaotic systems. Expert Systems with Applications, 38, 15103–15109.
Wong, D. Y., Markel, J. D., & Gray, A. H., Jr. (1979). Least squares glottal inverse filtering from the acoustic speech waveform. IEEE Transactions on Acoustics, Speech and Signal Processing, 27, 350–355.
Wszołek, W., & Kłaczyński, M. (2010). Analysis of Polish pathological speech by higher order spectrum. Acta Physica Polonica A, 118, 190–191.
Yazdani, S., Shanbehzadeh, J., & Aminian, E. (2013). Feature subset selection using constrained binary/integer biogeography-based optimization. ISA Transactions, 52, 383–390.
Zhao, S., Rudzicz, F., Carvalho, L. G., Márquez-Chin, C., & Livingstone, S. (2014). Automatic detection of expressed emotion in Parkinson's disease. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on (pp. 4813–4817). IEEE.
