Final Year Project Report
Approved By:
Pramathesh Bhattacharyya
Director
Dibrugarh University Institute of Engineering and Technology
Dibrugarh University
Dibrugarh
Assam-786004
CERTIFICATE
Forwarded By:
Mr. Hemerjit Singh
Department in Charge
Department of Electronics and Communication Engineering
DUIET, Dibrugarh University
Date: ……………., 2020
EXAMINER
……………………. …………………….
(Internal) (External)
ACKNOWLEDGEMENT
We also want to thank all our B.Tech classmates and juniors for their
valuable support during the whole project work.
Special thanks to our parents and family members for their support and
encouragement all throughout. Finally, we would like to thank the
Almighty for all we have been given.
INTRODUCTION
The expression of, or the ability to express, thoughts and feelings by articulate sounds is known as speech. The fundamental purpose of speech is communication, i.e., the transmission of messages. Speech is the primary mode of communication among human beings and the most natural and efficient form of exchanging information. This fact has motivated researchers to think of speech as a fast and efficient method of interaction between human and machine.
The emotion in speech may be considered a kind of stress imposed on all sound events across the speech. Emotional speech exhibits a particular prosody. The prosodic rules of a language evolve with the culture of a community over ages. In addition, speakers also have their own speaker-dependent style, i.e., a characteristic articulation rate, intonation habit and loudness. Hence, the emotion expressed and inferred in speech depends upon the speaker's community, culture and language, gender, age, education, social status, health, physical engagements, etc.
When a speaker is in a 'quiet room' with no task obligations and without any illness, the speech produced is 'neutral'. When a speaker perceives the environment as different from 'normal', an emotional arousal occurs, and this in turn causes a change in physiological parameters. The emotional arousal sets the speaker in an emotional state, in which the speaker normally produces a kind of stressed speech, called emotional speech. A particular degree of emotional arousal causes a particular activation level, valence or evaluation level, orientation, etc. Full-blown emotion is generally short-lived and intense, occurring when the emotional strength has crossed a certain limit, e.g. archetypal emotions such as anger, disgust, fear, happiness, sadness and surprise.
Speech data collected from real-life situations are much more relevant than acted ones. A famous example is recordings of radio news broadcasts of major events. Such recordings contain utterances with very naturally conveyed emotions. Unfortunately, there may be legal and moral issues that prohibit their use for research purposes. Alternatively, emotional sentences can be elicited in sound laboratories, as in the majority of the existing databases. It has always been criticized that acted emotions are not the same as real ones. Most of the databases share the following emotions: anger, joy, sadness, surprise, and neutral.
(1.2) BASICS OF SPEECH PROCESSING
In speech production, as well as in many human-engineered
electronic communication systems, the information to be transmitted
is encoded in the form of a continuously varying (analog) waveform
that can be transmitted, recorded, manipulated, and ultimately
decoded by a human listener. In the case of speech, the fundamental
analog form of the message is an acoustic waveform, which we call
the speech signal.
laterals), and the presence or absence of voicing, nasalization,
aspiration, or other phonation. For example, the sound for s is
described as a voiceless alveolar fricative; the sound for m is a voiced
bilabial nasal stop. Additional classificatory information may indicate
whether the airstream powering the production of the consonant is
from the lungs (the pulmonary airstream mechanism) or some other
airstream mechanism and whether the flow of air is ingressive or
egressive. The production of consonants may also involve secondary
articulations—that is, articulations additional to the place and manner
of articulation defining the primary stricture in the vocal tract.
(1.2.2) VOICING
The vocal folds may be held against each other at just the right
tension so that the air flowing past them from the lungs will cause
them to vibrate against each other. We call this process voicing.
Sounds which are made with vocal fold vibration are said to be
voiced. Sounds made without vocal fold vibration are said to be
voiceless.
(1.2.3) NASALS
(1.2.4) STOPS
A stop consonant completely cuts off the airflow through the mouth. In
the consonants [t], [d], and [n], the tongue tip touches the alveolar
ridge and cuts off the airflow at that point. In [t] and [d], this means
that there is no airflow at all for the duration of the stop. In [n], there is
no airflow through the mouth, but there is still airflow through the
nose. We distinguish between
• nasal stops, like [n], which involve airflow through the nose, and
• oral stops, like [t] and [d], which do not.
Nasal stops are often simply called nasals. Oral stops are often called
plosives. Oral stops can be either voiced or voiceless. Nasal stops are
almost always voiced. (It is physically possible to produce a voiceless
nasal stop, but English, like most languages, does not use such
sounds.)
(1.2.5) FRICATIVES
In the stop [t], the tongue tip touches the alveolar ridge and cuts off
the airflow. In [s], the tongue tip approaches the alveolar ridge but
doesn't quite touch it, leaving a narrow opening; the air forced through
this constriction becomes turbulent, and this turbulent airstream is what
characterizes a fricative.
(1.2.6) AFFRICATES
An affricate is a single sound composed of a stop portion and a
fricative portion. In English [tʃ], the airflow is first interrupted by a
stop which is very similar to [t] (though made a bit further back). But
instead of finishing the articulation quickly and moving directly into
the next sound, the tongue pulls away from the stop slowly, so that
there is a period of time immediately after the stop where the
constriction is narrow enough to cause a turbulent airstream. In [tʃ], the
period of turbulent airstream following the stop portion is the same as
the fricative [ʃ]. English [dʒ] is an affricate like [tʃ], but voiced.
Chapter 2
THEORETICAL BACKGROUND
(2.1) SPEECH PROCESSING
Short-time energy is the energy of a short speech segment. It is a simple and effective parameter for classifying voiced and unvoiced segments, and energy is also used for detecting the end points of an utterance. Speech is time-varying in nature. The energy associated with voiced speech is large compared to unvoiced speech, and silence has the least or negligible energy compared to unvoiced speech. Hence, short-time energy can be used for voiced, unvoiced and silence classification of speech. For short-time energy computation, speech is considered in terms of short analysis frames whose size typically ranges from 10 to 30 ms.
The total energy of a discrete-time signal is

$E_T = \sum_{n=-\infty}^{\infty} s^2(n)$

where $E_T$ is the total energy and $s(n)$ is the discrete-time signal; the short-time energy of a frame is obtained by restricting this sum to the windowed samples of that frame.
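As an illustration, the following is a minimal MATLAB sketch of a short-time energy computation over non-overlapping 20 ms frames; the file name speech.wav and the frame length are assumptions of the example, not the project's actual code.

% Short-time energy over 20 ms frames (illustrative sketch).
[x, fs] = audioread('speech.wav');   % assumed mono recording
x = x(:, 1);
frameLen = round(0.02 * fs);         % 20 ms frame, no overlap
numFrames = floor(length(x) / frameLen);
E = zeros(numFrames, 1);
for m = 1:numFrames
    frame = x((m - 1) * frameLen + 1 : m * frameLen);
    E(m) = sum(frame .^ 2);          % energy of frame m
end
% Large E(m) suggests voiced speech; small but nonzero E(m) suggests
% unvoiced speech; near-zero E(m) suggests silence.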
(2.2.3) AUTOCORRELATION
(2.2) EMOTION RECOGNITION
Emotion plays a significant role in daily interpersonal human interactions. It is essential to our rational as well as intelligent decisions. It helps us to match and understand the feelings of others by conveying our feelings and giving feedback to others. Research has revealed the powerful role that emotion plays in shaping human social interaction. Emotional displays convey considerable information about the mental state of an individual. This has opened up a new research field called automatic emotion recognition, whose basic goals are to understand and retrieve desired emotions. Several inherent advantages make speech signals a good source for affective computing. For example, compared to many other biological signals (e.g., the electrocardiogram), speech signals usually can be acquired more readily and economically. This is why the majority of researchers are interested in speech emotion recognition (SER).
The figure below shows the basic flow of emotion detection from input speech. First, noise and d.c. components are removed during speech normalization; then feature extraction and selection are carried out. The most important part in the further processing of the input speech signal to detect emotions is the extraction and selection of features from speech.
The speech features are usually derived from analysis of the speech signal in both the time and frequency domains. Then a database is generated for training and testing from the features extracted from the input speech signal. In the last stage, emotions are detected by the classifier. Various pattern recognition algorithms (HMM, GMM) are used in the classifier to detect the emotion.
(2.3) LITERATURE SURVEY OF RELATED
WORKS ON SPEECH EMOTION
RECOGNITION
• Moataz El Ayadi, Mohamed S. Kamel and Fakhri Karray reviewed systems for identifying the emotional content of a spoken utterance in their paper "Survey on Speech Emotion Recognition: Features, Classification Schemes and Databases". The paper is a survey of emotion classification addressing three aspects of the design of a speech emotion recognition system. The first issue is the choice of suitable features for speech representation; the second is the design of an appropriate classification scheme; and the third is the proper preparation of an emotional speech database for evaluating system performance.
• Qingli Zhang, Ning An, Kunxia Wang, Fuji Ren and Lian Li, in their paper "Speech Emotion Recognition Using Combination of Features", describe how the number of speech features and their statistical values impact the recognition accuracy of emotions present in speech. With the help of GMM, two effective features, MFCC and autocorrelation function coefficients, are extracted. Their method achieves an emotion recognition rate of 74.45%, significantly better than the 59.00% achieved previously. They also conduct experiments considering a different set of emotions (anger, boredom, fear, happiness, neutral and sadness) to prove the broad applicability of their method.
• Asma Mansour and Zied Lachiri, in their paper "SVM based Emotional Speaker Recognition using MFCC-SDC Features", compare two feature extraction methods applied to emotional utterances in order to obtain the best accuracies. The first method is the traditional Mel-Frequency Cepstral Coefficients (MFCC) and the second is MFCC combined with Shifted-Delta-Cepstra (MFCC-SDC). Experiments are conducted on the IEMOCAP database using two multiclass SVM approaches: One-Against-One (OAO) and One-Against-All (OAA). The obtained results show that MFCC-SDC features outperform conventional MFCC.
(2.4) CONTRIBUTION OF THE PROJECT
Chapter 3
DATABASE COLLECTION
(3.1) ABOUT THE EMOTIONS
Emotional speech recognition aims at automatically identifying the emotional or
physical state of a human being from his or her voice. The emotional and
physical states of a speaker are known as emotional aspects of speech. Emotion
is often entwined with temperament, mood, personality, motivation, and
disposition. In psychology, emotion is frequently defined as a complex state of
feeling that results in physical and psychological changes. These changes
influence thought and behaviour. In 1884, American psychologist and
philosopher William James proposed a theory of emotion whose influence was
considerable. According to his thesis, the feeling of intense emotion
corresponds to the perception of specific bodily changes. This illustrates the difficulty of agreeing on a definition of this dynamic and complex phenomenon that we call emotion.
Emotions are described in different classes. These classes are (a) categorical and (b) dimensional.
(b) In the dimensional class, the basic emotions are characterized along three dimensions:
• Valence: happiness usually has positive valence, while anger and sadness have negative valence.
• Activation: sadness has low activation, whereas happiness and anger have high activation.
• Dominance: anger is dominant, whereas fear is dominated.
In this project we are mainly focusing on four basic emotions, namely angry, happy, neutral and sad.
(3.2) ZOOM H1n
With the ultraportable ZOOM H1n, creators can record professional-quality audio everywhere they go. The H1n's sleek, compact design and one-touch button controls make it easier than ever to record audio for music, film, interviews and more.
Some features are:
• Brighter, crisper backlit LCD display: The new 1.25" monochromatic
display looks great in virtually all lighting conditions. The layout is clean
and organized, and the backlight pops in the dark or direct light.
(3.3) PRAAT
Praat is a tool designed for speech analysis. It was developed at the University of Amsterdam by Paul Boersma and David Weenink. According to them, Praat is a tool for "doing phonetics by computer". Praat is a free scientific software program for the analysis of speech in phonetics, with which phoneticians can analyze, synthesize, and manipulate speech.
(3.4) WAVESURFER
MFCC is an evident example of a feature set that is extensively used in speech recognition. As the frequency bands are positioned logarithmically in MFCC, it approximates the response of the human auditory system more closely than linearly spaced bands. The main technique for evaluating MFCC is based on short-term analysis, and hence an MFCC vector is evaluated from each frame. To extract the coefficients, the speech sample is taken as input and a Hamming window is applied to minimize the discontinuities of the signal. Then the DFT is computed and the resulting spectrum is passed through the mel filter bank. In mel-frequency warping, the width of the triangular filters varies, so the log total energy in a critical band around the center frequency is included; after the warping, the coefficients are computed. Finally, the inverse discrete Fourier transform is used for the calculation of the cepstral coefficients.
The objective of cepstral analysis is to separate the speech into its source and system components without any prior knowledge about source or system.
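The steps above can be summarized in a MATLAB sketch. This is a minimal, from-scratch illustration of the described pipeline (Hamming window, DFT, mel filter bank, log band energies, inverse transform realized as a DCT), not the project's actual code; the parameter choices are assumptions.

function c = mfccFrame(frame, fs, numFilters, numCoeffs)
% Illustrative MFCC computation for a single frame (sketch).
N = 2 ^ nextpow2(length(frame));
w = hamming(length(frame));              % window to reduce discontinuities
S = abs(fft(frame(:) .* w, N)) .^ 2;     % power spectrum via the DFT
S = S(1 : N/2 + 1);
mel  = @(f) 2595 * log10(1 + f / 700);   % Hz -> mel
imel = @(m) 700 * (10 .^ (m / 2595) - 1);% mel -> Hz
edges = imel(linspace(mel(0), mel(fs/2), numFilters + 2));
bins  = floor(edges / fs * N) + 1;       % filter edges as FFT bin indices
fb = zeros(numFilters, N/2 + 1);         % triangular mel filter bank
for i = 1:numFilters
    for k = bins(i) : bins(i+1)
        fb(i, k) = (k - bins(i)) / max(bins(i+1) - bins(i), 1);
    end
    for k = bins(i+1) : bins(i+2)
        fb(i, k) = (bins(i+2) - k) / max(bins(i+2) - bins(i+1), 1);
    end
end
logE = log(fb * S + eps);                % log energy in each critical band
c = dct(logE);                           % cepstral coefficients
c = c(1:numCoeffs);                      % e.g. keep the first 13
end

A call such as c = mfccFrame(frame, fs, 26, 13) would return a 13-dimensional MFCC vector for one frame.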
Application
The computation of the SDC feature is a relatively simple procedure. First, the
cepstral feature vector is computed.
The SDC features are specified by a set of 4 parameters, (N, d, P, k), where N is the number of cepstral coefficients computed for each frame, d is the spread over which the deltas are computed, P is the gap between successive delta computations, and k is the number of delta blocks that are concatenated.
SDC features are widely used in language identification and speech recognition
fields. SDC feature vectors are an extension of delta-cepstral coefficients.
Figure below describe the extraction procedure of SDC feature vectors.
In the figure, $c$ denotes the MFCC vector and $t$ the frame index. The parameter $d$ represents the spread over which the deltas are computed. The gap between successive delta computations is given by the parameter $P$. The parameter $k$ determines the number of blocks whose delta coefficients are concatenated to obtain the final form of the feature vector. For a given time $t$, an intermediate calculation is done to obtain these $k$ delta coefficients:

$\Delta c(t, i) = c(t + iP + d) - c(t + iP - d), \qquad i = 0, 1, \dots, k-1$

where $k$ is the number of delta blocks being stacked and $P$ is the amount of frame shift. The performance of the SDC can be further improved if it is appended to the basic feature vector. The new feature vector is then the concatenation

$y(t) = [\, c(t)^{T}, \ \Delta c(t,0)^{T}, \ \dots, \ \Delta c(t,k-1)^{T} \,]^{T}$

Empirically, researchers have found that with $N = 7$ (including the zeroth DCT coefficient $c_0$), $d = 1$, $P = 3$ and $k = 7$, the SDC gives quite good performance.
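A minimal MATLAB sketch of this stacking follows, assuming the cepstra are stored as an N-by-T matrix with one frame per column; the clamping of indices at the utterance boundaries is an implementation choice of the sketch.

function sdc = computeSDC(C, d, P, k)
% C : N x T matrix of cepstral coefficients (one column per frame).
% Returns an (N*k) x T matrix of shifted delta coefficients, where
% block i of column t holds c(t + i*P + d) - c(t + i*P - d).
[N, T] = size(C);
sdc = zeros(N * k, T);
for t = 1:T
    for i = 0 : k - 1
        hi = min(t + i*P + d, T);        % clamp indices at the ends
        lo = max(t + i*P - d, 1);
        sdc(i*N + 1 : (i+1)*N, t) = C(:, hi) - C(:, lo);
    end
end
end

With the configuration discussed above, computeSDC(C, 1, 3, 7) stacks k = 7 delta blocks per frame.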
The GMM models the distribution of the spectral feature vectors as

$p(x) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i)$

where $\mathcal{N}(\cdot\,; \mu, \Sigma)$ is the Gaussian density function, and $w_i$, $\mu_i$ and $\Sigma_i$ are the weight, mean and covariance matrix of the $i$-th Gaussian component, respectively. The supervector of a GMM is formed by concatenating the means of the Gaussian components, and it takes the form

$m = [\, \mu_1^{T}, \mu_2^{T}, \dots, \mu_M^{T} \,]^{T}$
For each emotional utterance, a GMM is trained with the extracted spectral
features, and the corresponding supervector is obtained.
This mapping allows the production of features with a fixed dimension for all the emotional utterances. Therefore, we can use the GMM supervectors as input for SVM learning.
The most commonly used and most popular such model is the Gaussian Mixture Model (GMM). Given a sufficient number of Gaussian components, and by adjusting their means, covariance matrices and the coefficients of the linear combination, a GMM can express any continuous distribution. The Gaussian Mixture Model is a commonly used statistical model in speech signal processing.
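As a sketch, a GMM can be fitted to the frame-level features of one utterance with fitgmdist from the Statistics and Machine Learning Toolbox, and its supervector formed by concatenating the component means. The feature matrix X (frames by dimensions), the regularization value and the iteration limit are assumptions of the illustration; for short utterances, fewer components than 256 may be needed for the fit to converge.

% Fit a GMM to one utterance and form its mean supervector (sketch).
M = 256;                                  % number of Gaussian components
gm = fitgmdist(X, M, ...
    'CovarianceType', 'diagonal', ...     % diagonal covariances
    'RegularizationValue', 1e-3, ...      % avoid ill-conditioned fits
    'Options', statset('MaxIter', 200));
supervector = reshape(gm.mu', [], 1);     % stack the M mean vectors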
We have collected 10 dialogues from the famous Assamese novel "অসীমত যাৰ হেৰাল সীমা" (Oximot Jar Heral Xeema) and asked the speakers to read the dialogues in four different emotions, namely angry, happy, neutral and sad. The Assamese dialogues that we have collected are:
4. “আইলেলেৰাৰলগত তলয়া যা, হমাৰ বকিা েৰকাৰ েলল জয়ৰালমই বেি
পাবৰি।”
5. “তই হমাৰলগত আজজ িগললও ে'ি বতলক।”
6. “হিবি ৰাবত িকবৰবি, স াঁজ লগাৰ আগলত উভবতবি।”
7. “আশাকলৰা এইিাৰ েয়লতা লালেলালে সকললা বিবিবি।”
8. “হতাৰ হেবখলিা বিজৰ িুজিৰ ওপৰত অগাধ বিশ্বাস।”
9. “হেবখললা আধবললটাৰ েুলয়াটা বপঠিলয়ই হমাৰপৰা সমাি আোঁতৰত।”
10. “বলখাস্োিৰ িৰ অভাে, গবতলক এলকািাই, বিৰশূিয।”
We have used devices like the ZOOM recorder and a mobile recorder for recording purposes.
Female speakers:
Male speakers:
A spectrogram is plotted for the same sentence spoken with four different emotions: angry, happy, neutral and sad. It represents three dimensions of a signal, i.e., time, frequency and amplitude. From the spectrograms we can observe that in the angry and happy emotions there is more energy in the higher-frequency region, whereas in the sad and neutral emotions there is less energy in the higher-frequency region.
Matlab code for spectrogram:
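A minimal MATLAB sketch of such a spectrogram plot is given below; the file name angry_01.wav and the window settings (20 ms Hamming window, 10 ms overlap) are assumptions of the example.

% Plot the spectrogram of one emotional utterance (sketch).
[x, fs] = audioread('angry_01.wav');      % placeholder file name
x = x(:, 1);
win = hamming(round(0.02 * fs));          % 20 ms analysis window
noverlap = round(0.01 * fs);              % 10 ms overlap
nfft = 1024;
spectrogram(x, win, noverlap, nfft, fs, 'yaxis');
title('Spectrogram of the utterance');
colorbar;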
Result of the spectrogram:
CHAPTER 4
EXPERIMENTAL DETAILS
(4.1) FEATURE EXTRACTION AND SELECTION
Any emotion in the speaker's speech is represented by a large number of parameters contained in the speech, and changes in these parameters result in corresponding changes in emotion. Therefore, extraction of the speech features which represent emotions is an important factor in a speech emotion recognition system. The speech features can be divided into two main categories: long-term and short-term features. The short-term features are short-time-period characteristics like formants, pitch and energy, while long-term features are statistics computed over the digitized speech signal; frequently used long-term features are the mean and standard deviation. The larger the feature set used, the more the classification process can improve. The region of analysis of the speech signal used for feature extraction is an important issue to be considered. The speech signal is divided into small intervals referred to as frames. In this experiment, short frames of the speech signal of about 20 ms are taken, with a frame shift of 20 ms; during this period the audio does not change much, and each frame is described by the same set of features. 13-dimensional MFCC is used in our experiment. Mel-Frequency Cepstral Coefficients (MFCC) are extensively used in speech recognition and speech emotion recognition systems, and the recognition rate obtained with MFCC is very good. In the low-frequency region, better frequency resolution and robustness to noise can be achieved with MFCC than in the high-frequency region. MFCC is a representation of the short-term power spectrum of sound. Along with the MFCC feature, the SDC feature is also used; SDC is obtained from MFCC. For the long-term characteristics of speech signals, Shifted Delta Coefficient (SDC) features are more appropriate, since they capture the dynamic behavior of the speaker along with the prosodic features of the speech signal. SDC is also used for robustness to channel mismatch, manner of speaking and session variability. The combination of the MFCC and SDC features gives a better method to recognize different human emotional states.
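A minimal sketch of this combination, reusing the illustrative helpers mfccFrame and computeSDC from Chapter 3; the file name and the SDC parameters (d, P, k) = (1, 3, 7) are assumptions of the example.

% Build a combined MFCC-SDC feature matrix for one utterance (sketch).
[x, fs] = audioread('speech.wav');        % placeholder file name
x = x(:, 1);
frameLen = round(0.02 * fs);              % 20 ms frames, 20 ms shift
numFrames = floor(length(x) / frameLen);
C = zeros(13, numFrames);                 % 13-dimensional MFCC per frame
for m = 1:numFrames
    frame = x((m - 1) * frameLen + 1 : m * frameLen);
    C(:, m) = mfccFrame(frame, fs, 26, 13);
end
S = computeSDC(C, 1, 3, 7);               % SDC derived from the MFCCs
feat = [C; S];                            % one MFCC-SDC vector per column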
In the speech emotion recognition system, after calculation of the features, the best features are provided to the classifier. A classifier recognizes the emotion in the speaker's speech utterance. Various types of classifiers have been proposed for the task of speech emotion recognition: the Gaussian Mixture Model (GMM), K-Nearest Neighbors (KNN), Hidden Markov Model (HMM), Support Vector Machine (SVM), Artificial Neural Network (ANN), etc. Each classifier has some advantages and limitations over the others.
The Gaussian Mixture Model is well suited to speech emotion recognition when global features are extracted from the training utterances, because the training and testing requirements of a GMM are modest. We have used a 256-component GMM in our experiment.
Along with these standard data, we have also collected data sets manually from 10 male and 10 female speakers, who were asked to utter the 10 Assamese sentences in four different emotions. In total, 1600 utterances were collected manually, among which 1280 are used for training and 320 for testing. As a whole, we have 3695 utterances in the universal data set.
Figure: Sample of Universal data folder
Figure: Sample for UBM code
The training and testing are done using our own manually collected database. The main aim of the training phase is to compute the best parameters to match the distribution of the feature vectors. Training is done using the parameters obtained from the UBM. 80% of the total Assamese data is taken as training data and the remaining 20% as testing data.
During the cluster formation stage, each cluster has some initial mean value. When training is done, the means of all the clusters are updated: adapting the means of the UBM to each emotion's training data yields the emotion-dependent GMMs.
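One standard way to realize this mean adaptation is MAP adaptation of the UBM means. The following MATLAB sketch assumes a trained UBM as a gmdistribution object, a feature matrix X (frames by dimensions) from one emotion's training data, and a relevance factor r = 16; all of these are assumptions of the illustration, not the project's actual code.

% Mean-only adaptation of the UBM to one emotion class (sketch).
r = 16;                                   % relevance factor (assumed)
Pst = posterior(ubm, X);                  % frame-vs-component responsibilities
n = sum(Pst, 1)';                         % soft count per component
Ex = (Pst' * X) ./ max(n, eps);           % responsibility-weighted means
alpha = n ./ (n + r);                     % per-component adaptation weight
muAdapted = alpha .* Ex + (1 - alpha) .* ubm.mu;   % updated means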
For the four different emotions (angry, happy, neutral and sad), four GMMs are created and their corresponding means are recorded. Testing is done using the means of the four GMMs.
The testing phase uses the log-likelihood function: each test utterance is assigned to the emotion whose model gives the maximum likelihood.
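A minimal MATLAB sketch of this train-and-test loop follows, assuming a cell array trainFeats (one frames-by-dimensions matrix per emotion) and a matrix testFeat for the test utterance; the fitting options are assumptions of the example.

% Train one GMM per emotion and classify a test utterance (sketch).
emotions = {'angry', 'happy', 'neutral', 'sad'};
models = cell(1, 4);
for e = 1:4
    models{e} = fitgmdist(trainFeats{e}, 256, ...
        'CovarianceType', 'diagonal', ...
        'RegularizationValue', 1e-3, ...
        'Options', statset('MaxIter', 200));
end
logL = zeros(1, 4);
for e = 1:4
    logL(e) = sum(log(pdf(models{e}, testFeat) + eps));  % total log-likelihood
end
[~, best] = max(logL);
predicted = emotions{best};               % maximum-likelihood emotion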
Figure: Sample of Training data Angry folder
Figure: Sample of Testing data folder
Figure: Sample for test code
(4.5) CONFUSION MATRIX AND ACCURACY
A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values, broken down by each class. It shows the ways in which our classification model is confused when it makes predictions, giving insight not only into the errors being made by the classifier but also into the types of errors being made. Accuracy represents the number of correctly classified data instances over the total number of data instances.
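Both can be computed in MATLAB with confusionmat; in this sketch, trueLabels and predLabels are assumed to be vectors of the true and predicted emotion labels for the test set.

% Confusion matrix and overall accuracy (sketch).
[cm, order] = confusionmat(trueLabels, predLabels);
accuracy = sum(diag(cm)) / sum(cm(:));    % correct / total predictions
disp(order');                             % class order of the matrix
disp(cm);                                 % rows: true, columns: predicted
fprintf('Accuracy: %.2f%%\n', 100 * accuracy);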
CHAPTER 5
RESULTS AND DISCUSSION
After all the experimental work using 13-dimensional MFCC and SDC features and a 256-component GMM, we have obtained the following confusion matrix. The confusion matrix is gender-independent. In the confusion matrix shown in the table, the columns show the emotions that the speakers tried to induce, and the rows are the percentages of output recognized emotions.
Confusion matrix:
The accuracy that we have obtained is 51.25%. A correct classification rate of less than 60% was achieved in this experiment, showing that while MFCC coefficients are popular features in speech recognition, they are not as suitable for emotion recognition in speech. Furthermore, the SDC feature gives more precise results for language identification than for emotion recognition.
CHAPTER 6
CONCLUSION
Research in the fields of speech and machine learning has expanded to a great extent, but a system with very high accuracy is yet to be developed. We believe that this contribution shows important results for emotion recognition with the Gaussian Mixture Model. This project is about analyzing speech for emotional states (angry, sad, neutral, happy) using speech signals. Speech emotion recognition systems based on several classifiers are illustrated here. The important components of a speech emotion recognition system are the signal processing unit, in which appropriate features are extracted from the available speech signal, and the classifier, which recognizes emotions from the speech signal. The confusion tables clearly show that some emotions are often confused with certain others, and some emotions seem to be recognized more easily. This may be due to the fact that all the test sentences were acted emotions and the test persons had difficulties feigning certain emotions. For the frequently confused emotional states, other types of features, such as prosodic and voice-quality features, may also be informative.
Automatic emotion recognition from human speech is receiving increasing attention nowadays because it results in better interaction between human and machine. To improve the emotion recognition process, combinations of the given methods can be used. Also, by extracting more effective speech features, the accuracy of the speech emotion recognition system can be enhanced.
CHAPTER 7
BIBLIOGRAPHY
[1] https://en.wikipedia.org/wiki/Voice_(phonetics)
[9] Yu Zhou, Yanqing Sun and Jianping Zhang & Yonghong Yan, “Speech
emotion recognition using both spectral and prosodic features”.
[10] Qingli Zhang, Ning An, Kunxia Wang, Fuji Ren and Lian Li, “Speech
Emotion Recognition using Combination of Features”.
[15] Hari Krishna Vydana, P. Phani Kumar, K. Sri Rama Krishna and Anil Kumar Vuppala, "Improved Emotion Recognition using GMM-UBM".
[17] Wei-Qiang Zhang, Liang He, Yan Deng, Jia Liu and Michael T. Johnson, "Time-Frequency Cepstral Features and Heteroscedastic Linear Discriminant Analysis for Language Recognition".
[18] Asma Mansour & Zied Lachiri, “SVM based Emotional Speaker
Recognition using MFCC-SDC Features”.
[21] https://www.ieee.org/
[22] http://www.emodb.bilderbar.info/download/
[23] https://www.kaggle.com/barelydedicated/savee-database
[24] https://www.kaggle.com/uwrfkaggler/ravdess-emotional-speech-audio