Feature Extraction of Human Speech In Assamese Language

Based On Emotional Analysis


Dissertation submitted in partial fulfilment of the requirement for the degree of
Bachelor of Technology
In
ELECTRONICS AND COMMUNICATION ENGINEERING
Under the Supervision of
Mrs. Parismita Gogoi
Assistant Professor
Department of Electronics and Communication Engineering
By
Ananya Goswami (ECE-03/16)
Debashree Sharma (ECE-11/16)
Rosy Bordoloi (ECE-39/16)
Snigdha Sarma (ECE-44/16)
To

DIBRUGARH UNIVERSITY INSTITUTE OF ENGINEERING AND TECHNOLOGY


Dibrugarh University
Dibrugarh
Assam-786004
2019-2020
DECLARATION

This is to certify that the Report entitled “A Framework for EMOTION RECOGNITION OF HUMAN SPEECH IN ASSAMESE LANGUAGE”, which is submitted by us in partial fulfilment of the requirement for the award of the degree of B.Tech in Electronics and Communication Engineering to D.U.I.E.T., Dibrugarh University, Dibrugarh, Assam, comprises only our original work, and due acknowledgement has been made in the text to all other materials used.

Date: th June, 2020

Ananya Goswami (ECE-03/16)


Debashree Sharma (ECE-11/16)
Rosy Bordoloi (ECE-39/16)
Snigdha Sarma (ECE-44/16)

Approved By-

Pramathesh Bhattacharyya
Director
Dibrugarh University Institute of Engineering and Technology
Dibrugarh University
Dibrugarh
Assam-786004
CERTIFICATE

This is to certify that the report entitled “A Framework for EMOTION RECOGNITION OF HUMAN SPEECH IN ASSAMESE LANGUAGE”, which is submitted by Ananya Goswami (ECE-03/16), Debashree Sharma (ECE-11/16), Rosy Bordoloi (ECE-39/16) and Snigdha Sarma (ECE-44/16) in partial fulfilment of the requirement for the award of B.Tech in Electronics and Communication Engineering to D.U.I.E.T., Dibrugarh University, Dibrugarh, Assam, is a record of the candidates' own work carried out by them under my supervision. The matter embodied in this report is original and has not been submitted for the award of any degree.

Mrs. Parismita Gogoi


Assistant Professor
Department of Electronics and Communication Engineering
DUIET, Dibrugarh University
Date: 2020

Forwarded By:
Mr. Hemerjit Singh
Department in Charge
Department of Electronics and Communication Engineering
DUIET, Dibrugarh University
Date: ,2020

EXAMINER

……………………. …………………….
(Internal) (External)
ACKNOWLEDGEMENT

The satisfaction that accompanies the successful completion of any task would be incomplete without mentioning the people who made it possible and whose constant guidance and encouragement crowned our efforts. This project is not the work of a single person; rather, it bears the imprint of a number of people who directly or indirectly helped us towards its completion.

First and foremost, we would like to express our gratitude to Mrs. Parismita Gogoi, our project supervisor and Assistant Professor in the Department of Electronics and Communication Engineering, Dibrugarh University Institute of Engineering and Technology (DUIET), Dibrugarh University, for her inspiring advice, supervision and guidance. Her truly scientific intuition has made her a constant oasis of ideas and passion in science, which exceptionally inspired and enriched our growth as students and researchers. As a guide, her enthusiastic attitude inspired us a lot.

We would like to express our sense of gratitude to Pramathesh Bhattacharyya, Director, DUIET, for his kind help and constant encouragement. We would also like to express our sense of gratitude to Mr. Hemerjit Singh, Head of the Department (HOD), Dept. of ECE, DUIET, for his kind help and continuous encouragement.

We also want to thank all our B.Tech classmates and juniors for their
valuable support during the whole project work.

Special thanks to our parents and family members for their support and
encouragement all throughout. Finally, we would like to thank the
Almighty for all we have been given.

Ananya Goswami (ECE-03/16)


Debashree Sharma (ECE-11/16)
Rosy Bordoloi (ECE-39/16)
Snigdha Sarma (ECE-44/16)
ABSTRACT
In human-machine interface applications, emotion recognition from the speech signal has been a research topic for many years. Many systems have been developed to identify emotions from the speech signal. The main purpose of our project is to develop a suitable method for emotion recognition from Assamese speech, based on the Gaussian Mixture Model (GMM) as classifier and Mel-Frequency Cepstral Coefficients (MFCC) and Shifted Delta Coefficients (SDC) as features. We conduct experiments considering four emotions: angry, happy, neutral and sad. The database for the speech emotion recognition system consists of emotional speech samples collected manually from 20 speakers, together with some standard corpora available on the internet. The experiments confirm that angry and happy emotions have high energy in the higher frequency region, whereas neutral and sad emotions have low energy in the higher frequency region. Based on these observations, GMM training and testing have been carried out.
Contents
1. Introduction
   1.1 Application of emotion recognition
   1.2 Basics of speech processing
       1.2.1 Phonetic representation of speech
       1.2.2 Voicing
       1.2.3 Nasals
       1.2.4 Stops
       1.2.5 Fricatives
       1.2.6 Affricates
2. Theoretical Background
   2.1 Speech Processing
   2.2 Basic Speech Processing Methods
       2.2.1 Short-Time Energy (STE)
       2.2.2 Zero Crossing Rate (ZCR)
       2.2.3 Autocorrelation
       2.2.4 Pitch Period or Frequency
   2.3 Emotion Recognition
       2.3.1 End-Begin Point Detection
   2.4 Literature Survey of Related Works on Speech Emotion Recognition
   2.5 Contribution of the Project
3. Database Collection
   3.1 About the Emotions
       3.1.1 Categorization of Emotions
   3.2 About the Device: ZOOM H1n
   3.3 Praat
   3.4 Wavesurfer
   3.5 Mel-Frequency Cepstral Coefficients (MFCC)
   3.6 Shifted Delta Cepstral (SDC)
   3.7 Gaussian Mixture Model (GMM)
   3.8 Database Preparation
   3.9 Data Collection
       3.9.1 Emotion Variability Using Spectrogram
4. Experimental Details
   4.1 Feature Extraction and Selection
   4.2 Classifier Selection
   4.3 Emotional Database
   4.4 Experimental Process
       4.4.1 Universal Background Model (UBM)
       4.4.2 Training Phase and Testing Phase
   4.5 Confusion Matrix and Accuracy
5. Results and Analysis
6. Conclusion
7. Bibliography
Chapter 1

INTRODUCTION

The expression of, or the ability to express, thoughts and feelings by articulate sounds is known as speech. The fundamental purpose of speech is communication, i.e., the transmission of messages. It is the primary mode of communication among human beings and also the most natural and efficient way of exchanging information between humans. This fact has motivated researchers to think of speech as a fast and efficient method of interaction between human and machine.
The emotion in speech may be considered as a similar kind of stress on all sound events across the speech. An emotional speech is characterised by a particular prosody. The prosodic rules of a language evolve with the culture of a community over ages. In addition, speakers also have their own speaker-dependent style, i.e., a characteristic articulation rate, intonation habit and loudness characteristic. Hence, the emotion expressed and inferred in speech depends upon the speaker's community culture and language, gender, age, education, social status, health, physical engagements, etc.

When a speaker is in a ‘quiet room’ with no task obligations and without any
illness, the speech produced by him is ‘neutral’. When a speaker feels his
environment as different from ‘normal’, he perceives an emotional arousal in
him, and this in turn causes a change of his physiological parameters. The
emotional arousal sets the speaker in an emotional state. In such a state a speaker normally produces a kind of stressed speech, which is called emotional speech. A particular degree of emotional arousal causes a particular activation level, valence or evaluation level, orientation, etc. Full-blown emotion is generally short-lived and intense, where the emotional strength has crossed a certain limit, e.g., archetypal emotions such as anger, disgust, fear, happiness, sadness and surprise.

Speech emotion recognition is particularly useful for applications which require


natural man machine interaction such as web movies and computer tutorial
applications where the response of those systems to the user depends on the
detected emotion.

During the process of speech emotion recognition, most work has focused on monolingual emotion classification, making the assumption that there is no cultural difference among the speech data.

Speech data collected from real-life situations are much more relevant than acted ones. A famous example is the recordings of radio news broadcasts of major events. Such recordings contain utterances with very naturally conveyed emotions. Unfortunately, there may be legal and moral issues that prohibit their use for research purposes. Alternatively, emotional sentences can be elicited in sound laboratories, as in the majority of the existing databases. It has always been criticized that acted emotions are not the same as real ones. Most of the databases share the following emotions: anger, joy, sadness, surprise, and neutral.

(1.1) Application of emotion recognition:

• The word emotion is inherently uncertain and subjective. The


term emotion has been used with different contextual meanings
by different people. It is difficult to define emotion objectively,
as it is an individual mental state that arises spontaneously
rather than through conscious effort. Therefore, there is no common objective definition of, or agreement on, the term emotion. This is a fundamental hurdle in proceeding with a scientific approach to this research.

• There are no standard speech corpora for comparing


performance of research approaches used to recognize
emotions. Most emotional speech systems are developed using full-blown emotions, but real-life emotions are pervasive and underlying in nature. Some databases are recorded using experienced artists, whereas some others are recorded using semi-experienced or inexperienced subjects. The research on emotion recognition is limited to 5-6 emotions, as most databases do not contain a wide variety of emotions.

• Emotion recognition systems developed using various features


may be influenced by the speaker and language dependent
information. Ideally, speech emotion recognition systems
should be speaker and language independent.

• An important issue in the development of a speech emotion


recognition system is the identification of suitable features that efficiently characterize different emotions. Along with features, suitable models are to be identified to capture emotion-specific information from extracted speech features.

• Speech emotion recognition systems should be robust enough


to process real life and noisy speech to identify emotions.

(1.2) BASICS OF SPEECH PROCESSING
In speech production, as well as in many human-engineered
electronic communication systems, the information to be transmitted
is encoded in the form of a continuously varying (analog) waveform
that can be transmitted, recorded, manipulated, and ultimately
decoded by a human listener. In the case of speech, the fundamental
analog form of the message is an acoustic waveform, which we call
the speech signal.

(1.2.1) PHONETIC REPRESENTATION OF SPEECH


Vowel: A vowel is one of the two principal classes of speech
sound. Vowels vary in quality, in loudness and also in quantity
(length). They are usually voiced, and are closely involved in
prosodic variation such as tone, intonation and stress. Vowel sounds
are produced with an open vocal tract. The word vowel comes from
the Latin word vocalis, meaning "vocal" (i.e. relating to the voice).
In English, the word vowel is commonly used to refer both to vowel
sounds and to the written symbols that represent them.
There are two complementary definitions of vowel, one phonetic and
the other phonological.
• In the phonetic definition, a vowel is a sound, such as the
English "ah" /ɑː/ or "oh" /oʊ/, produced with an open vocal
tract; it is median (the air escapes along the middle of the
tongue), oral (at least some of the airflow must escape through
the mouth), frictionless and continuant. There is no significant
build-up of air pressure at any point above the glottis.
• In the phonological definition, a vowel is defined as syllabic,
the sound that forms the peak of a syllable.

Semi vowel: A phonetically equivalent but non-syllabic sound is a


semivowel.
Consonant: Consonant, any speech sound, such as that represented
by t, g, f, or z, that is characterized by an articulation with a closure or
narrowing of the vocal tract such that a complete or partial blockage of
the flow of air is produced. Consonants are usually classified
according to place of articulation (the location of the stricture made in
the vocal tract, such as dental, bilabial, or velar), the manner of
articulation (the way in which the obstruction of the airflow is
accomplished, as in stops, fricatives, approximants, trills, taps, and laterals), and the presence or absence of voicing, nasalization,
aspiration, or other phonation. For example, the sound for s is
described as a voiceless alveolar fricative; the sound for m is a voiced
bilabial nasal stop. Additional classificatory information may indicate
whether the airstream powering the production of the consonant is
from the lungs (the pulmonary airstream mechanism) or some other
airstream mechanism and whether the flow of air is ingressive or
egressive. The production of consonants may also involve secondary
articulations—that is, articulations additional to the place and manner
of articulation defining the primary stricture in the vocal tract.

(1.2.2) VOICING

The vocal folds may be held against each other at just the right
tension so that the air flowing past them from the lungs will cause
them to vibrate against each other. We call this process voicing.
Sounds which are made with vocal fold vibration are said to be
voiced. Sounds made without vocal fold vibration are said to be
voiceless.

There are several pairs of sounds in English which differ only in


voicing -- that is, the two sounds have identical places and manners of
articulation, but one has vocal fold vibration and the other doesn't. The
[θ] of thigh and the [ð] of thy are one such pair. The others are: -

The other sounds of English do not come in voiced/voiceless pairs. [h] is voiceless, and has no voiced counterpart. The other English consonants are all voiced: [ɹ], [l], [w], [j], [m], [n], and [ŋ]. This does not mean that it is physically impossible to say a sound that is exactly like, for example, an [n] except without vocal fold vibration. It is simply that English has chosen not to use such sounds in its set of distinctive sounds.

(1.2.3) NASALS

Nasal, in phonetics, speech sound in which the airstream passes


through the nose as a result of the lowering of the soft palate (velum)
at the back of the mouth. In the case of nasal consonants, such as
English m, n, and ng (the final sound in “sing”), the mouth is
occluded at some point by the lips or tongue and the airstream is
expelled entirely through the nose. Sounds in which the airstream is
expelled partly through the nose and partly through the mouth are
classified as nasalized. Nasalized vowels are common in French (e.g.,
in vin “wine,” bon “good,” and enfant “child”), Portuguese, and a
number of other languages.

(1.2.4) STOPS
A stop consonant completely cuts off the airflow through the mouth. In the consonants [t], [d], and [n], the tongue tip touches the alveolar ridge and cuts off the airflow at that point. In [t] and [d], this means that there is no airflow at all for the duration of the stop. In [n], there is no airflow through the mouth, but there is still airflow through the nose. We distinguish between
• nasal stops, like [n], which involve airflow through the nose, and
• oral stops, like [t] and [d], which do not.

Nasal stops are often simply called nasals. Oral stops are often called
plosives. Oral stops can be either voiced or voiceless. Nasal stops are
almost always voiced. (It is physically possible to produce a voiceless
nasal stop, but English, like most languages, does not use such
sounds.)

(1.2.5) FRICATIVES

In the stop [t], the tongue tip touches the alveolar ridge and cuts off
the airflow. In [s], the tongue tip approaches the alveolar ridge but
doesn't quite touch it.

There is still enough of an opening for airflow to continue, but the


opening is narrow enough that it causes the escaping air to become
turbulent (hence the hissing sound of the [s]). In a fricative
consonant, the articulators involved in the constriction approach get
close enough to each other to create a turbulent airstream. The
fricatives of English are [f], [v], [θ], [ð], [s], [z], [ʃ], and [ʒ].

(1.2.6) AFFRICATES
An affricate is a single sound composed of a stop portion and a
fricative portion. In English [tʃ], the airflow is first interrupted by a stop which is very similar to [t] (though made a bit further back). But
instead of finishing the articulation quickly and moving directly into
the next sound, the tongue pulls away from the stop slowly, so that
there is a period of time immediately after the stop where the
constriction is narrow enough to cause a turbulent airstream. In [tʃ], the
period of turbulent airstream following the stop portion is the same as
the fricative [ʃ]. English [dʒ] is an affricate like [tʃ], but voiced.

Chapter 2

THEORETICAL
BACKGROUND

(2.1) SPEECH PROCESSING

Speech processing is the study of speech signals and the processing


methods of signals. The signals are usually processed in a digital
representation, so speech processing can be regarded as a special case
of digital signal processing, applied to speech signals. Aspects of
speech processing include the acquisition, manipulation, storage,
transfer and output of speech signals. The input is called speech
recognition and the output is called speech synthesis. Speech
processing technologies are used for digital speech coding, spoken
language dialog systems, text-to-speech synthesis, and automatic
speech recognition. Information (such as speaker, gender, or language
identification, or speech recognition) can also be extracted from
speech.

(2.2) BASIC SPEECH PROCESSING METHODS

• Short-Time Energy (STE)


• Zero Crossing Rate (ZCR)
• Autocorrelation
• Pitch period or frequency
• Mel-Frequency Cepstrum Coefficients (MFCC)

(2.2.1) SHORT-TIME ENERGY (STE)

The short-time energy is the energy of a short speech segment. Short-time energy is a simple and effective parameter for classifying voiced and unvoiced segments. Energy is also used for detecting the end points of an utterance. Speech is time-varying in nature. The energy associated with voiced speech is large compared to unvoiced speech, and silence has the least or negligible energy compared to unvoiced speech. Hence, short-time energy can be used for voiced, unvoiced and silence classification of speech. For short-time energy computation, speech is considered in terms of short analysis frames whose size typically ranges from 10-30 ms.

Short-Time Energy is derived from the following equation:

E_T = ∑_{n=−∞}^{+∞} s²(n)

where E_T is the total energy and s(n) is the discrete-time signal.
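A minimal MATLAB sketch of frame-wise short-time energy computation (the file name and the 20 ms frame length are assumptions, not values fixed by this report):

[s, fs] = audioread('sample.wav');        % read a speech file
s = s(:,1);                               % keep one channel
frameLen = round(0.02*fs);                % 20 ms analysis frames
frames = buffer(s, frameLen);             % one frame per column
ste = sum(frames.^2, 1);                  % energy of each frame
plot(ste); xlabel('Frame index'); ylabel('Short-time energy');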

(2.2.2) ZERO CROSSING RATE (ZCR)

• The zero-crossing rate is the rate of sign-changes along a


signal, i.e., the rate at which the signal changes from positive to
zero to negative or from negative to zero to positive. This
feature has been used heavily in both speech recognition and
music information retrieval, being a key feature to classify
percussive sounds.
• In some cases, only the "positive-going" or "negative-going" crossings are counted, rather than all the crossings, since, logically, between a pair of adjacent positive zero-crossings there must be one and only one negative zero-crossing.
• For monophonic tonal signals, the zero-crossing rate can be
used as a primitive pitch detection algorithm.

• In the context of discrete-time signals, a zero crossing is said to


occur if successive samples have different algebraic signs. The
rate at which zero crossings occur is a simple measure of the
frequency content of a signal.

Application: Zero crossing rates are used for Voice activity


detection (VAD), i.e., finding whether human speech is present in an
audio segment or not.
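A minimal sketch of the frame-wise zero-crossing rate, assuming a mono signal s and sampling rate fs are already loaded as in the previous snippet:

frameLen = round(0.02*fs);                            % 20 ms frames
frames = buffer(s, frameLen);
signs = sign(frames);                                 % +1, 0 or -1 per sample
zcr = sum(abs(diff(signs, 1, 1)) > 0, 1) / frameLen;  % crossings per frame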

(2.2.3) AUTOCORRELATION

Autocorrelation, also known as serial correlation, is the correlation of


a signal with a delayed copy of itself as a function of delay.
Informally, it is the similarity between observations as a function of
the time lag between them. The analysis of autocorrelation is a
mathematical tool for finding repeating patterns, such as the presence
of a periodic signal obscured by noise, or identifying the missing
fundamental frequency in a signal implied by its harmonic
frequencies. It is often used in signal processing for analyzing
functions or series of values, such as time domain signals. A
commonly used method to estimate pitch is based on detecting the
highest value of the autocorrelation function in the region of interest.
The autocorrelation function of a signal is basically a (noninvertible)
transformation of the signal that is useful for displaying structure in
the waveform.
Thus, for pitch detection, if we assume x(n) is exactly periodic with
period P,
i.e., x(n) = x(n + P)
for all n, then it is easily shown that:
Rx(m) = Rx(m + P)
i.e., the autocorrelation is also periodic with the same period.
Conversely, periodicity in the autocorrelation function indicates periodicity in the signal.
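A minimal sketch of autocorrelation-based pitch estimation for one voiced frame x (the 60-400 Hz search range is an assumption):

[r, lags] = xcorr(x, 'coeff');            % normalized autocorrelation
r = r(lags >= 0);                         % keep non-negative lags
minLag = round(fs/400);                   % highest pitch considered: 400 Hz
maxLag = round(fs/60);                    % lowest pitch considered: 60 Hz
[~, idx] = max(r(minLag+1 : maxLag+1));   % peak inside the search range
pitchHz = fs / (minLag + idx - 1);        % estimated fundamental frequency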

(2.2.4) PITCH PERIOD OR FREQUENCY

Pitch, in speech, the relative highness or lowness of a tone as


perceived by the ear, which depends on the number of vibrations per
second produced by the vocal cords. Pitch is the main acoustic
correlate of tone and intonation.

Pitch period: the time taken to complete one cycle of vibration of the vocal folds; its reciprocal is the fundamental frequency, or F0. The pitch period is measured as the time difference between two major peaks in the voiced speech signal. Pitch is observed only in voiced regions.

(2.3) EMOTION RECOGNITION
Emotion plays a significant role in daily interpersonal human
interactions. This is essential to our rational as well as intelligent
decisions. It helps us to match and understand the feelings of others by
conveying our feelings and giving feedback to others. Research has
revealed the powerful role that emotion plays in shaping human social
interaction. Emotional displays convey considerable information about
the mental state of an individual. This has opened up a new research
field called automatic emotion recognition, having basic goals to
understand and retrieve desired emotions. Several inherent advantages
make speech signals a good source for affective computing. For
example, compared to many other biological signals (e.g.,
electrocardiogram), speech signals usually can be acquired more
readily and economically. This is why the majority of researchers are
interested in speech emotion recognition (SER).

Three key issues need to be addressed for successful SER system,


namely,

(1) choice of a good emotional speech database,

(2) extracting effective features, and

(3) designing reliable classifiers using machine learning algorithms.

In fact, the emotional feature extraction is a main issue in the SER


system. Many researchers have proposed important speech features
which contain emotion information, such as energy, pitch, formant
frequency, Linear Prediction Cepstrum Coefficients (LPCC), Mel-
frequency Cepstrum Coefficients (MFCC). Thus, most researchers
prefer to use combining feature set that is composed of many kinds
of features containing more emotional information.

The figure below shows the basic flow of emotion detection from input speech. First, noise and d.c. components are removed during speech normalization; then feature extraction and selection are carried out. The most important part in the further processing of the input speech signal to detect emotions is the extraction and selection of features from speech.

The speech features are usually derived from analysis of speech signal
in both the time and frequency domains. Then the database is
generated for training and testing of the extracted speech features from
input speech signal. In the last stage emotions are detected by the
classifiers. Various pattern recognition algorithms (HMM, GMM) are
used in classifier to detect the emotion.

(2.3.1) END-BEGIN POINT DETECTION


The problem of locating the beginning and end of a speech utterance
in background of noise is of importance in many areas of speech
processing. In particular, in automatic recognition of isolated words, it
is essential to locate the regions of a speech signal that correspond to
each word. A scheme for locating the beginning and end of a speech
signal can be used to eliminate significant computation in non-real
time systems by making it possible to process only the parts of the
input that correspond to speech. For high signal-to-noise ratio
environments, the energy of the lowest level speech sounds (e.g., weak
fricatives) exceeds the background noise energy, and thus a simple
energy measurement suffices. However, such ideal recording
conditions are not practical for most applications. The wavelet
transform is one of the powerful transforms that are used in the signal
processing fields. The wavelet transform extracts the frequency
contents of the signal similar to the Fourier transform but it relates the
frequency domain with the time domain. This link between the time
and the frequency gives this transform a powerful characteristic for the determination of the boundaries of frequency-band-defined signals such as speech signals.
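A minimal sketch of the simple energy-threshold begin/end point detection described above (the relative threshold is an assumption, and this is not the wavelet-based method):

ste = sum(buffer(s, round(0.02*fs)).^2, 1);   % frame energies as before
thr = 0.1 * max(ste);                         % simple relative threshold
voiced = find(ste > thr);                     % frames above the threshold
beginFrame = voiced(1);                       % first speech frame
endFrame = voiced(end);                       % last speech frame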

(2.4) LITERATURE SURVEY OF RELATED
WORKS ON SPEECH EMOTION
RECOGNITION
• Moataz El Ayadi, Mohamed S. Kamel and Fakhri Karray surveyed systems that identify the emotional content of a spoken utterance in their paper “Survey on Speech Emotion Recognition: Features, Classification Schemes and Databases”.
The paper is a survey of emotion classification addressing 3
aspects of the design of a speech emotion recognition system.
The first issue describes the suitable features for speech
representation. The second issue describes the design of an
appropriate classification scheme & third issue describes the
proper preparation of an emotional speech database for
evaluating system performance.

• Another approach to emotion recognition from speech has been presented by Shashidhar G. Koolagudi and K. Sreenivasa Rao in their paper “Emotion recognition from speech: a review”. In this paper, different types of speech features and models used for recognition of emotion from speech are discussed. 32 representative speech databases are reviewed in this work from the point of view of their language, number of speakers, number of emotions and purpose of collection. Literature on different features used in the task of emotion recognition from speech is presented.

• One important part of reducing the complexity of system


computing is selection of correct parameters in combination
with the classifier. Pavol Partila, Miroslav Voznak and Jaromir
Tovarek in their paper “Pattern Recognition Methods and
Features Selection for Speech Emotion Recognition System”
have discussed the classification methods and feature selection for speech emotion recognition accuracy.
Classification accuracy of artificial neural network, k-nearest
neighbors and GMM is measured considering the selection of
prosodic, spectral and voice quality features.

• Rode Snehal Sudhkar and Manjare Chandraprabha Anil, in their


paper “Analysis of Speech Features for Emotion Detection: A Review” use various modules for performing actions like speech-to-text conversion, feature extraction, feature selection and classification of those features to identify emotions. According to them, the features selected for classification must be salient enough to detect the emotion correctly and should convey a measurable level of emotional modulation.

• In their paper “Emotion recognition from Assamese speeches using MFCC features and GMM classifier”, Aditya Bihar Kandali, Aurobinda Routray and Tapan Kumar Basu use a Gaussian Mixture Model (GMM) classifier with Mel-frequency cepstral coefficients (MFCC) as features for emotion recognition from Assamese speech. The experiments are performed for the cases of (1) text-independent but speaker-dependent and (2) text-independent and speaker-independent recognition.

• Hao Hu, Ming-Xing Xu and Wei Wu in their paper “GMM


Supervector based SVM with spectral features for speech
emotion recognition” have used the GMM supervector based
SVM with some spectral features. A GMM is trained for each
emotional utterance, and the corresponding GMM supervector is
used as the input features for SVM. Experimental results on an
emotional speech database demonstrate that the GMM
supervector based SVM outperforms standard GMM on speech
emotion recognition.

• Dimitrios Ververidis and Constantine Kotropoulos in their


paper “A Review of Emotional Speech Databases” have reviewed thirty-two emotional speech databases. Such a database consists of a corpus of human speech pronounced under different emotional conditions. In this paper they concluded, first, that automated emotion recognition cannot achieve a correct classification that exceeds 50% for the four basic emotions. Second, natural emotions cannot be classified as easily as simulated ones. Third, the most common emotions searched for, in decreasing frequency of appearance, are anger, sadness, happiness, fear, disgust, joy, surprise and boredom.

• Qingli Zhang, Ning An, Kunxia Wang, Fuji Ren and Lian Li in
their paper “Speech Emotion Recognition using combination
of features” describe how speech features number and
statistical values impact recognition accuracy of emotions
present in speech. With the help of GMM, two effective features, MFCC and the autocorrelation function coefficient, are used. Their method achieves an emotion recognition rate of 74.45%, significantly better than the 59.00% achieved previously. They also conduct experiments considering a different set of emotions (anger, boredom, fear, happy, neutral and sad) to prove the broad applicability of their method.

• The paper “Speech Emotion Recognition” by Ashish B. Ingale and D. S. Chaudhari reviews previous technologies which use different classifiers for emotion recognition. These classifiers are used to differentiate emotions such as anger, happiness, sadness, surprise, the neutral state, etc. The database for the speech emotion recognition system consists of emotional speech samples, and the features used are energy, pitch, LPCC, MFCC, etc. The paper discusses the performance and limitations of speech emotion recognition systems.

• Mr. Jangam Shrinivas Suresh and Prof. S.A Throat in their


paper “Language Identification System using MFCC and SDC Feature” work on the study and implementation of a language identification system using a GMM classifier. Here MFCC and SDC are used as features to increase the accuracy of identifying a language.

• Josh R. Calvo, Rafael Farnandez, Gabrial Hernandez in their


paper “Application of Shifted Delta Cepstral Features in Speaker Verification” used SDC as a feature for speaker verification and evaluated its robustness to channel mismatch, manner of speaking and session variability. The results of the experiment reflect superior, or at least similar, performance of SDC compared with delta and delta-delta features in speaker verification.

• Asma Mansour and Zied Lachiri in their paper “SVM based


Emotional Speaker Recognition using MFCC-SDC Features” introduced a methodology for speaker recognition under different emotional states based on the multiclass Support Vector Machine classifier. Two feature extraction methods used to represent emotional speech utterances are compared in order to obtain the best accuracies: the first is the traditional Mel-Frequency Cepstral Coefficients (MFCC) and the second is MFCC combined with Shifted Delta Cepstra (MFCC-SDC). Experiments are conducted on the IEMOCAP database using two multiclass SVM approaches: One-Against-One (OAO) and One-Against-All (OAA). The obtained results show that MFCC-SDC features outperform the conventional MFCC.

(2.5) CONTRIBUTION OF THE PROJECT

• Emotion recognition in speech is one of the most versatile


fields for human interaction.

• The human instinct detects emotions by observing


psycho-visual appearances and voices.

• This helps advertisers and content creators to sell their


products more effectively.

• It utilizes artificial intelligence to predict "attitudes and actions


based on facial expressions".

• It gauges the emotions of autistic children.

Chapter 3

DATABASE
COLLECTION

(3.1) ABOUT THE EMOTIONS
Emotional speech recognition aims at automatically identifying the emotional or
physical state of a human being from his or her voice. The emotional and
physical states of a speaker are known as emotional aspects of speech. Emotion
is often entwined with temperament, mood, personality, motivation, and
disposition. In psychology, emotion is frequently defined as a complex state of
feeling that results in physical and psychological changes. These changes
influence thought and behaviour. In 1884, American psychologist and
philosopher William James proposed a theory of emotion whose influence was
considerable. According to his thesis, the feeling of intense emotion
corresponds to the perception of specific bodily changes. This illustrates the
difficulty of agreeing on a definition of this dynamic and complex phenomenon
that we call emotion.

(3.1.1) CATEGORIZATION OF EMOTIONS

Emotions are described in different classes. These classes are (a) Categorical
and (b) Dimensional.

(a)In Categorical class, Ekman proposed a list of 7 basic emotions:


Anger, Disgust, Fear, Happiness, Sadness, Surprise and Neutral.

(b) In Dimensional class, the basic emotions are again classified into 3
classes:
• Valence: Usually Happiness has positive valence and anger,
sadness has negative valence.
• Activation: Sadness has low activation energy whereas happiness,
anger has high activation energy.
• Dominance: Anger is dominant whereas fear is dominated.

In this project we are mainly focusing on the four basic emotions namely angry,
happy, neutral and sad.

(3.2) ABOUT THE DEVICE: ZOOM H1n

With the ultraportable ZOOM H1n, creators can record professional-quality audio everywhere they go. The H1n's sleek, compact design and one-touch button controls make it easier than ever to record audio for music, film, interviews and more.

Some features are:
• Brighter, crisper backlit LCD display: The new 1.25" monochromatic
display looks great in virtually all lighting conditions. The layout is clean
and organized, and the backlight pops in the dark or direct light.

• Distortion-free audio recording: Auto-Level and onboard limiting


provide up to 120 dB of distortion-free audio — more than enough
headroom to reproduce the most dynamic spoken and musical
performances.
• Unlimited overdubs — great for multitrack audio: The Zoom H1n even lets you overdub audio onto existing recordings. This makes it an essential tool for the touring artist or songwriter on the move.
• Tone generators and timer tools: The H1n is even easier to use than
previous iterations. Slate and test tone generators enable picture-perfect
audio and video sync, while auto-record and self-timer features make sure
you never miss the moment.
• Onboard stereo XY mics plus mic/line jack: A pair of built-in stereo
condenser microphones oriented in a fixed 90° XY pattern captures audio
in perfect phase-coherent stereo. But when the occasion demands, an 1/8"
mic/line input lets you use your studio and shotgun mics. Broadcast-
ready, BWF-compliant 24/96 WAV and space-saving MP3 audio formats
are available to suit your needs.
• 10 hours from 2 AAA batteries: The H1n promises up to 10 hours of
operation using just 2 AAA alkaline batteries. Rechargeable Li-ion and
NiMH AAA batteries may promise even better performance.
• USB connectivity: Connected to a device via USB, the H1n can be used as an audio interface or to drag files to a workstation. Whether recording at home or on location, the Zoom H1n captures clean audio without fuss.

(3.3) PRAAT
Praat is a tool designed for speech analysis. It was developed at the University
of Amsterdam by Paul Boersma and David Weenink. According to them, Praat
is a tool for doing phonetics by computer.

Praat is a free scientific software program for the analysis of speech in phonetics
with which phoneticians can analyze, synthesize, and manipulate speech.

PRAAT is a very flexible tool to do speech analysis. It offers a wide range of


standard and non-standard procedures, including spectrographic analysis and
formant analysis.

With Praat we can-

• Generate waveforms, wide and narrow band spectrogram, intensity


contour and pitch tracks.
• Make recordings, edit a recorded sound and extract individual sounds for
further analysis.
• Get information about pitch, intensity, formants, pulses and enhance
certain frequency regions, segments and label words, syllables or
individual phonemes.

(3.4) WAVESURFER

It is an audio editor widely used for studies of acoustic phonetics. It is a simple


but fairly powerful program for interactive display of sound pressure
waveforms, spectral sections, spectrograms, pitch tracks and transcriptions. It
can read and write a number of transcription file formats used in industrial
speech research including TIMIT.
Wavesurfer provides basic audio editing operations such as, excision, copying,
pasting, zero-crossing adjustment, and effects such as fading, normalization,
echo, inversion, reversal, replacement with silence etc.

(3.5) MEL-FREQUENCY CEPSTRUM COEFFICIENTS


(MFCC)

MFCC is the most evident example of a feature set that is extensively used in speech recognition. As the frequency bands are positioned logarithmically in MFCC, it approximates the response of the human auditory system more closely than other representations. The main technique for evaluating MFCC is based on short-term analysis, and hence an MFCC vector is evaluated from each frame. To extract the coefficients, the speech sample is taken as input and a Hamming window is applied to minimize the discontinuities of the signal. Then the DFT is computed and the Mel filter bank is applied. In Mel-frequency warping, the width of the triangular filters varies, so the log total energy in a critical band around the centre frequency is included, and after the warping the coefficients are computed. Finally, the Inverse Discrete Fourier Transform is used for the calculation of the cepstral coefficients.

The objective of cepstral analysis is to separate the speech into its source and
system components without any prior knowledge about source or system.

Application

▪ MFCCs are commonly used as features in speech recognition systems, such


as the systems which can automatically recognize numbers spoken into a
telephone.

▪ MFCCs are also increasingly finding uses in music information retrieval


applications such as genre classification, audio similarity measures, etc.
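A minimal sketch of MFCC extraction with the Audio Toolbox mfcc function (the file name, window length and the NumCoeffs/LogEnergy options used here are assumptions, not the project's exact settings):

[s, fs] = audioread('sample.wav');  s = s(:,1);
win = hamming(round(0.02*fs), 'periodic');        % 20 ms Hamming window
C = mfcc(s, fs, 'Window', win, 'OverlapLength', 0, ...
         'NumCoeffs', 13, 'LogEnergy', 'Ignore'); % 13 coefficients per frame
% Each row of C is the MFCC vector of one analysis frame.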

(3.6) SHIFTED DELTA CEPSTRAL (SDC)


First proposed by Bielefeld, Shifted Delta Cepstral (SDC) features are obtained by stacking delta cepstra computed across multiple frames of speech into the feature vector. The Shifted Delta Cepstral Coefficients (SDCC) are features incorporating long-range dynamic characteristics of speech signals. The SDCC features for a particular short-time frame consist of delta values between Mel-Frequency Cepstral Coefficients (MFCC) from multiple neighbouring frames.

The computation of the SDC feature is a relatively simple procedure. First, the cepstral feature vector is computed.

The SDC features are specified by a set of 4 parameters, (N, d, P, k), where:

• N: number of cepstral coefficients in each frame, so each frame is represented by a coefficient vector c(t) = [c0 c1 ... ci ... cN−1];
• d: time advance and delay for the delta computation;
• P: time shift between consecutive blocks;
• k: number of blocks whose delta coefficients are concatenated to form the final vector.

SDC features are widely used in language identification and speech recognition
fields. SDC feature vectors are an extension of delta-cepstral coefficients.
Figure below describe the extraction procedure of SDC feature vectors.

In the figure, ci are the MFCC coefficients and t is the frame index. The parameter d represents the spread over which the deltas are computed. The gap between different delta computations is given by the parameter P. The parameter k determines the number of blocks whose delta coefficients are concatenated to obtain the final form of the feature vector. For a given time t, an intermediate calculation is done to obtain these k delta blocks:

∆c(t, i) = c(t + i × P + d) − c(t + i × P − d).

Finally, the SDC vector, consisting of k stacked blocks of N coefficients each, is obtained as:

SDC(t) = [∆c(t, 0), ∆c(t, 1), ..., ∆c(t, k − 1)]

Hence, the SDC coefficients defined above are a stacked version of the MFCC-based delta coefficients, and k×N values are used for each SDC feature vector. The SDC coefficients are able to capture features from long-duration or dynamically changing speech samples, and thus overcome the limitations of the traditional short-time derivation of cepstral features. This technique has been widely successful in language identification (LID) systems using GMMs with high-order (512-1024-2048) mixture models.

In the SDC, a simpler form of the delta cepstrum is used. It is defined as

∆c(t) = c(t + d) − c(t − d)

The SDC is a stack of k frames of this simple delta cepstrum, expressed as

SDC(t) = [∆c(t), ∆c(t + P), ..., ∆c(t + (k − 1)P)]

where k is the number of frames being stacked and P is the amount of frame shift. The performance of the SDC can be further improved if it is appended to the basic feature vector, giving the new feature vector

[c(t), SDC(t)]

Empirically, researchers have found that with N = 7 (including the zeroth DCT coefficient c0), d = 1, P = 3 and k = 7, the SDC gives quite good performance.
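A minimal sketch of SDC computation from an MFCC matrix C (frames in rows), following the definition above; computeSDC is a hypothetical helper written for illustration, not a built-in function:

function S = computeSDC(C, d, P, k)
% Stacks k delta blocks C(t+i*P+d,:) - C(t+i*P-d,:), i = 0..k-1, per frame t
    [T, N] = size(C);
    tStart = d + 1;                       % first frame with a valid t-d
    tEnd = T - (k-1)*P - d;               % last frame with a valid t+(k-1)P+d
    S = zeros(max(tEnd - tStart + 1, 0), N*k);
    for t = tStart:tEnd
        row = zeros(1, N*k);
        for i = 0:k-1
            row(i*N+1 : (i+1)*N) = C(t + i*P + d, :) - C(t + i*P - d, :);
        end
        S(t - tStart + 1, :) = row;
    end
end

For example, computeSDC(C, 1, 3, 7) corresponds to the widely used 7-1-3-7 configuration when C carries 7 coefficients per frame.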

(3.7) GAUSSIAN MIXTURE MODEL (GMM)

Gaussian mixture model is a probabilistic model for density estimation using a


convex combination of multi-variate normal densities. GMMs are very efficient
in modeling multi-modal distributions and their training and testing
requirements are much less than the requirements of a general continuous
HMM. Therefore, GMMs are more appropriate for speech emotion recognition
when only global features are to be extracted from the training utterances.

The density function of a GMM is defined as

p(x) = ∑_{i=1}^{M} w_i N(x; µ_i, Σ_i)

where N(x; µ, Σ) is the Gaussian density function and w_i, µ_i and Σ_i are the weight, mean and covariance matrix of the i-th Gaussian component, respectively. The supervector of a GMM is formed by concatenating the means of the Gaussian components, and it takes the form

m = [µ_1^T, µ_2^T, ..., µ_M^T]^T

For each emotional utterance, a GMM is trained with the extracted spectral
features, and the corresponding supervector is obtained.

The GMM supervector can be considered as a mapping from the spectral


features of an emotional utterance to a high-dimensional feature vector. This mapping allows the production of features with a fixed dimension for all the
emotional utterances. Therefore, we can use the GMM supervectors as input for
SVM learning.

The Gaussian distribution has many important properties, but a single Gaussian is often too limited to describe real data. A combination of several component distributions describes the actual data better; such a model is called a mixture model.

The most commonly used and most popular mixture model is the Gaussian Mixture Model (GMM). Given a sufficient number of Gaussian components, and by adjusting their means, covariance matrices and the coefficients of the linear combination, it can approximate any continuous distribution. The Gaussian Mixture Model is thus a commonly used statistical model in speech signal processing.
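A minimal sketch of GMM training on a feature matrix X (one frame per row) with the Statistics and Machine Learning Toolbox; the number of mixtures and the regularization value are illustrative assumptions:

gm = fitgmdist(X, 8, 'CovarianceType', 'diagonal', ...
               'RegularizationValue', 1e-3, ...
               'Options', statset('MaxIter', 300));
logLik = sum(log(pdf(gm, X)));   % log-likelihood of the data under the model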

(3.8) DATABASE PREPARATION


The field survey consists of recordings made inside the university campus and also outside the campus. A total of 20 persons have been recorded so far, among whom 10 are male and 10 are female. They fall within the age group of 18-26 years. Their native locations are Tinsukia, Dibrugarh, Sivasagar, Jorhat, Tezpur, Guwahati, and Nalbari. The mother tongue of all the persons is Assamese.

We have collected 10 dialogues from the famous Assamese novel “অসীমত যাৰ
হেৰাল সীমা” (Oximot Jar Heral Xeema) and asked them to read the
dialogues in four different emotions namely angry, happy, neutral and sad. The
Assamese dialogues that we have collected are:

1. “মই ভাবিবিললাোঁ, তই েয়লতা আগলত গবলগগ।”


2. “ব'ল, হসৌ পুখুৰীপাৰৰ ঘাোঁেবিডৰাত িলোোঁগগ।”
3. “হমাৰ ক'িগল এলকািাই, কাবলকা-ফাবলকাৰ কথা মই ভিাও িাই।”

4. “আইলেলেৰাৰলগত তলয়া যা, হমাৰ বকিা েৰকাৰ েলল জয়ৰালমই বেি
পাবৰি।”
5. “তই হমাৰলগত আজজ িগললও ে'ি বতলক।”
6. “হিবি ৰাবত িকবৰবি, স াঁজ লগাৰ আগলত উভবতবি।”
7. “আশাকলৰা এইিাৰ েয়লতা লালেলালে সকললা বিবিবি।”
8. “হতাৰ হেবখলিা বিজৰ িুজিৰ ওপৰত অগাধ বিশ্বাস।”
9. “হেবখললা আধবললটাৰ েুলয়াটা বপঠিলয়ই হমাৰপৰা সমাি আোঁতৰত।”
10. “বলখাস্োিৰ িৰ অভাে, গবতলক এলকািাই, বিৰশূিয।”

Inside the university campus we have recorded in our Electronics and Communication Engineering Department as well as in Gyanmalini Studio. Other than that, we have also recorded in Guwahati, Jorhat, Dibrugarh and Tinsukia. One person has sent us his recordings from Rochester.

We have used devices like the Zoom H1n and mobile recorders for recording purposes.

List of names of speakers along with their age and dialects:

Female speakers:

NAME OF SPEAKER AGE DIALECTS OF SPEAKER


Ananya Goswami 23 Assamese
Debashree Sharma 21 Assamese
Khushboo Bordoloi 23 Assamese
Madhusmita Barman 23 Assamese
Nistha Dutta 20 Assamese
Pooja Devi 23 Assamese
Rimpi Borah 23 Assamese
Rosy Bordoloi 22 Assamese
Snigdha Sarma 23 Assamese
Topsira Rahman 23 Assamese

Male speakers:

NAME OF SPEAKER AGE DIALECTS OF SPEAKER


Aakash Pradyut Konwar 23 Assamese
Anshujit Sharma 26 Assamese
Bhargab Sarma 23 Assamese
Bhaskarjyoti Bora 23 Assamese
Hrishikesh Mohan 23 Assamese
Kaustav Choudhury 24 Assamese
Madhujya Pratim Bordoloi 23 Assamese
Roopam Bordoloi 19 Assamese
Uddipta Sharma 23 Assamese
Wasikur Rahman Khan 23 Assamese

(3.9) DATA COLLECTION


A ZOOM H1n device and mobile recorders were used for single-channel recording of emotionally biased utterances of different lengths in each of the 4 emotions, from 10 male and 10 female speakers of the Assamese language, in a closed-room, noise-free environment. For digitization, a sampling frequency of 16000 Hz is used. Speech samples were collected for 3 archetypal (full-blown) emotions and also for the neutral mood. Each speaker was asked to utter a fixed set of 10 short sentences twice, with four different emotions. This required emotional acting by the speakers, and the meaning of the sentences was narrated to them to sufficiently arouse the same emotion in them. The set of utterances was recorded in two different sessions. In both sessions the utterances corresponding to angry, happy and sad were recorded in the same order. The neutral utterances were recorded at the beginning of either of the above sessions.

(3.9.1) EMOTION VARIABILITY USING SPECTROGRAM

A spectrogram is plotted for the same sentence but spoken with four
different emotions – angry, happy, neutral and sad. It represents three
dimensions of a signal, i.e., time, frequency and amplitude.
From the spectrogram we can observe that in angry and happy emotions,
there is more energy in higher frequency region whereas in sad and neutral
emotions there is less energy in higher frequency region.
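A minimal MATLAB sketch of such a spectrogram comparison (file names are assumed placeholders for the four recordings of one sentence):

files = {'angry.wav', 'happy.wav', 'neutral.wav', 'sad.wav'};
for n = 1:numel(files)
    [s, fs] = audioread(files{n});  s = s(:,1);
    subplot(2, 2, n);
    spectrogram(s, hamming(512), 256, 1024, fs, 'yaxis');   % time-frequency plot
    title(files{n});
end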

Figure: MATLAB code for spectrogram

Figure: Result of the spectrogram

CHAPTER 4

EXPERIMENTAL DETAILS

(4.1) FEATURE EXTRACTION AND SELECTION
Any emotion in the speaker's speech is represented by a large number of parameters contained in the speech, and changes in these parameters result in corresponding changes in emotion. Therefore, the extraction of those speech features which represent emotions is an important factor in a speech emotion recognition system. Speech features can be divided into two main categories, long-term and short-term features. Short-term features are short-time characteristics like formants, pitch and energy, while long-term features are statistical measures of the digitized speech signal; frequently used long-term features are the mean and standard deviation. The larger the feature set used, the better the classification process. The region of the speech signal used for feature extraction is another important issue to be considered. The speech signal is divided into small intervals referred to as frames. In this experiment, short frames of about 20 ms with a frame shift of 20 ms are taken; during this period the audio does not change much, so each frame can be treated as quasi-stationary. 13-dimensional MFCC is used in our experiment. Mel-Frequency Cepstral Coefficients (MFCC) are extensively used in speech recognition and speech emotion recognition systems, and the recognition rate obtained with MFCC is very good. MFCC gives better frequency resolution and robustness to noise in the low-frequency region than in the high-frequency region. MFCC is a representation of the short-term power spectrum of sound. Along with the MFCC feature, the SDC feature is also used. SDC is obtained from MFCC. For the long-term characteristics of the speech signal, Shifted Delta Coefficients (SDC) are more appropriate, since they capture the dynamic behaviour of the speaker along with the prosodic features of the speech signal. SDC is also robust to channel mismatch, manner of speaking and session variability. The combination of the MFCC and SDC features gives a better method to recognize different human emotional states.
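A minimal sketch of building the combined MFCC-SDC feature matrix with the parameters stated above; it reuses the hypothetical computeSDC helper from section 3.6 and assumes a mono signal s and sampling rate fs are already loaded:

win = hamming(round(0.02*fs), 'periodic');                % 20 ms frames
C = mfcc(s, fs, 'Window', win, 'OverlapLength', 0, ...
         'NumCoeffs', 13, 'LogEnergy', 'Ignore');         % 13-D MFCC
d = 1; P = 3; k = 7;                                      % assumed SDC setup
S = computeSDC(C, d, P, k);
aligned = C(d+1 : d+size(S,1), :);   % MFCC frames aligned with the SDC rows
features = [aligned, S];             % one combined feature vector per frame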

(4.2) CLASSIFIER SELECTION

In the speech emotion recognition system after calculation of the features, the best
features are provided to the classifier. A classifier recognizes the emotion in the
speaker’s speech utterance. Various types of classifier have been proposed for the task
of speech emotion recognition. Gaussian Mixtures Model (GMM), K-nearest
neighbors (KNN), Hidden Markov Model (HMM) and Support Vector Machine
(SVM), Artificial Neural Network (ANN), etc. are the classifiers used in the speech
emotion recognition system. Each classifier has some advantages and limitations over
the others.

A Gaussian Mixture Model is more suitable for speech emotion recognition when only global features are extracted from the training utterances, because in the case of GMM the training and testing requirements are lower. We have used a 256-mixture GMM in our experiment.

(4.3) EMOTIONAL DATABASE


The quality of the database plays an important role in the performance of emotional speech recognition. The emotional speech corpora selected for this work are Emo-DB (535 audio files), RAVDESS (1440 audio files) and SAVEE (120 audio files). These consist of audio, video and motion-capture recordings of dyadic mixed-gender pairs of actors. The audios include a general emotional theme. The main goal is to have an expression that most closely resembles natural emotional expression. These expressions have been divided into utterances which were manually annotated with categorical labels (angry, happy, sad, neutral, frustrated, excited, fearful, surprised, disgusted) and in terms of three-dimensional axes: valence, activation, and dominance.

Along with these standard data, we have also collected data manually from 10 male and 10 female speakers, who were asked to utter the 10 Assamese sentences in four different emotions. In total, 1600 utterances were collected manually, among which 1280 are used for training and 320 for testing. As a whole, we therefore have 3695 utterances of universal data.

(4.4) EXPERIMENTAL PROCESS

(4.4.1) UNIVERSAL BACKGROUND MODEL (UBM)

A UBM is a model used to represent general, person-independent feature characteristics, to be compared against a model of person-specific feature characteristics. It is commonly used with the Gaussian Mixture Model (GMM), where the authenticity of a sample is determined by comparing user-dependent characteristics with a sample of all other users. The UBM is helpful for speaker-independent modelling and is trained using a large amount of data from diverse speakers. We have created a UBM from some standard databases, namely Emo-DB, SAVEE and RAVDESS. The UBM clusters all the phonemes, and their features are extracted to obtain the corresponding means (m), variances (v) and weights (w).
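A minimal sketch of UBM training as one large GMM over features pooled from the background corpora (allFeatures is an assumed matrix of stacked MFCC-SDC frames from Emo-DB, SAVEE and RAVDESS; 256 mixtures as used in this work):

ubm = fitgmdist(allFeatures, 256, 'CovarianceType', 'diagonal', ...
                'RegularizationValue', 1e-3, ...
                'Options', statset('MaxIter', 300));
% ubm.ComponentProportion, ubm.mu and ubm.Sigma hold the weights (w),
% means (m) and variances (v) of the 256 mixtures.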

Figure: Sample of Universal data folder

Figure: Sample for UBM code

(4.4.2) TRAINING PHASE & TESTING PHASE

The training and testing are done using our own database, which was manually collected. The main aim of the training phase is to compute the best parameters to match the distribution of the feature vectors. Training is done using the parameters obtained from the UBM. 80% of the total Assamese data is taken as training data and the remaining 20% as testing data.
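A minimal sketch of the 80/20 split (fileList is an assumed cell array holding the paths of the 1600 collected utterances):

n = numel(fileList);
idx = randperm(n);                       % random permutation of the files
nTrain = round(0.8 * n);
trainList = fileList(idx(1:nTrain));     % 80% for training
testList  = fileList(idx(nTrain+1:end)); % 20% for testing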

During the cluster formation stage, each cluster has a mean value. When training is done, the means of all the clusters are updated; the adaptation of the means of the UBM leads to the emotion-specific GMM.
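A minimal sketch of relevance-MAP adaptation of the UBM means to one emotion (F is the pooled feature matrix of that emotion's training files; the relevance factor r is an assumed value):

r = 16;                                   % relevance factor (assumption)
post = posterior(ubm, F);                 % frame-wise mixture responsibilities
n_i = sum(post, 1)';                      % soft count per mixture
Ex = (post' * F) ./ max(n_i, eps);        % first-order statistics per mixture
alpha = n_i ./ (n_i + r);                 % adaptation coefficients
emotionMu = alpha .* Ex + (1 - alpha) .* ubm.mu;   % adapted (emotion) means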

For the four different emotions (angry, happy, neutral and sad), four GMMs are created and their corresponding means are recorded. Testing is done using the means of these four GMMs.

The testing phase is based on the log-likelihood function, which is used to determine the most likely model, and hence emotion, for a test utterance.
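A minimal sketch of scoring one test utterance against the four emotion models (gmAngry, gmHappy, gmNeutral and gmSad are assumed gmdistribution objects built from the adapted means, and Ftest is the test utterance's feature matrix):

models = {gmAngry, gmHappy, gmNeutral, gmSad};
names  = {'angry', 'happy', 'neutral', 'sad'};
scores = zeros(1, numel(models));
for m = 1:numel(models)
    scores(m) = sum(log(pdf(models{m}, Ftest)));   % log-likelihood per model
end
[~, best] = max(scores);
predicted = names{best};                           % emotion with highest score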

Figure: Sample of Training data folder

Figure: Sample of Training data Angry folder

Figure: Sample of GMM training code for angry emotion

Figure: Sample of Testing data folder

Figure: Sample of Testing data Angry folder

Figure: Sample for test code

(4.5) CONFUSION MATRIX AND ACCURACY
A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions is summarized with count values and broken down by each class. It shows the ways in which our classification model is confused when it makes predictions, and gives us insight into the errors being made by the classifier.

Accuracy represents the number of correctly classified data instances over the total
number of data instances.
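A minimal sketch of computing the confusion matrix and accuracy from the true and predicted labels of the test utterances (trueLabels and predLabels are assumed cell arrays of label strings):

C = confusionmat(trueLabels, predLabels);   % rows: true class, columns: predicted
accuracy = sum(diag(C)) / sum(C(:)) * 100;  % overall accuracy in percent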

CHAPTER 5

RESULTS AND ANALYSIS

After all the experimental work done using 13-dimensional MFCC with SDC and a 256-mixture GMM, we obtained the following confusion matrix. The confusion matrix is gender-independent. In the confusion matrix shown in the table, the columns show the emotions that the speakers tried to induce, and the rows are the percentages of recognized output emotions.

Confusion matrix:

Let us denote the confusion matrix by C; then

Accuracy = sum(diag(C)) / sum(sum(C)) * 100

The accuracy that we have obtained is 51.25%. A correct classification rate of less than 60% was achieved in this experiment, showing that while MFCC coefficients are popular features in speech recognition, they are not as suitable for emotion recognition in speech. Further, the SDC feature gives more precise results for language identification than for emotion recognition.

CHAPTER 6

CONCLUSION

Research in the field of speech and machine learning has expanded to a great extent, but a system with very high accuracy is yet to be developed. We believe that this contribution shows important results on emotion recognition with the Gaussian Mixture Model. This project is about analyzing speech for emotional states (anger, sad, neutral, happy) using speech signals. Speech emotion recognition systems based on several classifiers have been illustrated. The important issues in a speech emotion recognition system are the signal processing unit, in which appropriate features are extracted from the available speech signal, and the classifier, which recognizes emotions from the speech signal. The confusion tables clearly show that some emotions are often confused with certain others, and that some emotions seem to be recognized more easily. This may be due to the fact that all the test sentences were acted emotions and the test persons had difficulties with feigning certain emotions. For the frequently confused emotional states, other types of features, such as prosodic and voice-quality features, may also need to be considered.

Automatic emotion recognition from human speech is gaining importance nowadays because it results in better interaction between human and machine. To improve the emotion recognition process, combinations of the given methods can be used. Also, by extracting more effective speech features, the accuracy of the speech emotion recognition system can be enhanced.

CHAPTER 7

BIBLIOGRAPHY

[1] https://en.wikipedia.org/wiki/Voice_(phonetics)

[2] Lawrence J. Raphael, Gloria J. Borden, Katherine S. Harris, “Speech


Science Primer”.

[3] K. Sri Rama Murty and B. Yegnanarayana, Senior Member, IEEE,


“Epoch Extraction From Speech Signals”.

[4] Lawrence R. Rabiner and Ronald W. Schafer. “Introduction to Digital


Speech Processing”.

[5] Mary Joe Osberger and Nancy S. McGarr, “Speech Production Characteristics of the Hearing Impaired”.

[6] Rode Snehal Sudhkar and Manjare Chandraprabha Anil, “Analysis of Speech Features for Emotion Detection: A Review”.

[7] Aditya Bihar Kandali, Aurobinda Routray and Tapan Kumar Basu, “Emotion recognition from Assamese speeches using MFCC features and GMM classifier”.

[8] Hao Hu, Ming-Xing Xu and Wei Wu, “GMM supervector based SVM with spectral features for speech emotion recognition”.

[9] Yu Zhou, Yanqing Sun and Jianping Zhang & Yonghong Yan, “Speech
emotion recognition using both spectral and prosodic features”.

[10] Qingli Zhang, Ning An, Kunxia Wang, Fuji Ren and Lian Li, “Speech
Emotion Recognition using Combination of Features”.

[11] Moataz El Ayadi, Mohamed S. Kamel, Fakhri Karray, “Survey on


Speech Emotion Recognition: Features, classification schemes and
databases”.

[12] Shashidhar G. Koolagudi and K. Sreenivasa Rao, “Emotion recognition from speech: a review”.

[13] Dimitrios Ververidis and Constantine Kotropoulos, “A review of emotional speech databases”.

[14] Ashish B. Ingale, D. S. Chaudhari, “Speech Emotion Recognition”.

[15] Hari Krishna Vydana, P. Phani Kumar, K. Sri Rama Krishna and Anil Kumar Vuppala, “Improved Emotion Recognition using GMM-UBM”.

[16] Xueying Zhang, Ying Sun and Shufei Duan, “Progress in speech emotion recognition”.

[17] Wei-Qiang Zhang, Member, IEEE & Liang He, Yan Deng, Jia Liu,
Member, IEEE, and Michael T. Johnson, Senior Member, IEEE, “Time–
Frequency Cepstral Features and Heteroscedastic Linear Discriminant
Analysis for Language Recognition”.

[18] Asma Mansour & Zied Lachiri, “SVM based Emotional Speaker
Recognition using MFCC-SDC Features”.

[19] Saikat Basu, Jaybrata Chakraborty, Arnab Bag and Md. Aftabuddin, “A Review on Emotion Recognition using Speech”.

[20] Rode Snehal Sudhkar & Manjare Chandraprabha Anil, “Analysis of


Speech Features for Emotion Detection: A review”.

[21] https://www.ieee.org/

[22] http://www.emodb.bilderbar.info/download/

[23] https://www.kaggle.com/barelydedicated/savee-database

[24] https://www.kaggle.com/uwrfkaggler/ravdess-emotional-speech-audio

