Minor Project Final Report (Group 8)
of
BACHELOR OF TECHNOLOGY
in
by
Guided by
Dr. ANU SAINI
Assistant Professor
DECEMBER, 2021
We hereby certify that the work which is being presented in the Minor
Project entitled “Speech Emotion Recognition”, in partial fulfillment of the requirements for the
award of the Degree of Bachelor of Technology in Computer Science and
Engineering / Information Technology affiliated to Guru Gobind Singh
Indraprastha University, New Delhi, and submitted to the Department of
Computer Science and Engineering, G. B. Pant Govt. Engineering College, is an
authentic record of our own work carried out during the period from August 2020
to Dec 2020. The matter presented in this report has not been submitted by us
for the award of any other degree of this or any other institute/university.
This is to certify that the above statement made by the candidate is correct to the
best of our knowledge.
Signature of Supervisor
Date Dr. Anu Saini & Designation
1) Title Page
2) Candidate Declaration
3) List of Figures
4) List of Tables
5) List of Abbreviations
6) Acknowledgement
7) Abstract
Chapter 1. INTRODUCTION
1.1) Importance
1.2) Motivation
1.3) SER
1.4) Theories of Emotions
1.4.1) Evolutionary Theory
1.4.2) The James-Lange Theory
1.4.3) The Cannon-Bard Theory
1.4.4) Schachter and Singer’s Two-Factor Theory
1.4.5) Cognitive Appraisal
Chapter 7. REFERENCES
We would like to express our gratitude towards our mentor Dr. Anu Saini for introducing us to
this topic, providing us with research material and, moreover, for being our guiding light
throughout. Furthermore, we would like to thank Dr. Amit R. Khaparde for his valuable suggestions
and our principal for giving us the golden opportunity of making this project.
Finally, we would like to thank our friends and family for being a constant support system
and encouraging force all this while.
As human beings, we often use speech as one of the most natural ways to express ourselves,
because it helps us to do so effectively. In modern times we rely on it so much
that we regularly use emojis and audio messages even while interacting via other forms
of communication such as email and text messages. We are all well aware that
emotions play an essential role in communication, so detecting and analysing them
can be a blessing in today’s digital world of remote communication. Moreover, it
can help immensely in the development of human–computer interaction (HCI)
applications.[1,2]
First we performed an experiment comparing various models from the different
approaches available for sentiment analysis, to determine which is the
most effective and accurate. Then we proceeded with the machine learning approach to
detect the sentiment in the speech, once the speech has been converted into text by the
speech-to-text section.
INTRODUCTION
In today’s environment we are suffering from data overload. Companies usually have mountains
of customer feedback accumulated with no major categorization, and for humans it is
impossible to analyze it manually without some error or bias.
Speech Emotion Recognition (SER) is the answer to this problem. Speech Emotion
Recognition takes human speech as an audio input, converts it into text, analyzes the text,
checks whether the sentiment of the input statement is positive, negative or neutral, and then
gives a graph as output depicting the analyzed emotion. It helps to improve decision
making in various businesses by providing a significant amount of structured data rather than
plain human intuition, which isn’t always right.
Speech Emotion Recognition can be implemented using many different approaches, such as the Deep
Learning approach, the Lexicon Based approach or the Hybrid approach. We will be using the Machine
Learning approach to detect human emotions in speech.
1.1) Importance
In today’s world, communication is the key to expressing oneself, as it is the most common way
of interaction between humans. Humans use most parts of their body to communicate
effectively, and voice plays a major role in this. Body language, hand gestures and tone are
often used together to express one’s sentiments. Because different languages
are spoken in different parts of the world, the verbal form of communication varies,
whereas the non-verbal part of communication used to express one’s sentiments
is most likely to be common to all. Therefore, any advanced technology
developed to understand human emotions can be of great help.
1.2) Motivation
In today’s environment we are suffering from data overload. Companies usually have
mountains of customer feedback accumulated with no major categorization, and for
humans it is impossible to analyze it manually without some error or bias.
Speech Emotion Recognition (SER) is the answer to this problem. It improves decision
making in various businesses by providing a significant amount of structured data rather than
plain human intuition, which isn’t always right and, moreover, can carry traces of bias.
1.3) SER
Speech emotion recognition focuses on identifying, within a few seconds, the emotional state
of a human being from their voice. It is based on a deep analysis of the general
mechanism of the speech signal: extracting the details that carry emotional information
about the speaker’s voice, and drawing out appropriate patterns to recognise or
identify the emotional states. Such as the usual standard recognition systems, the speech
1.4) Theories of Emotions
In psychology, emotion is often considered an intricate state of feeling that results in
physical and psychological changes which influence thoughts and behaviour. Here we
discuss some theories that help us to understand when and why people
experience or show emotions. Emotions can be categorised into three types, as shown in fig 2.
1.4.1) Evolutionary Theory
In the 1870s, Charles Darwin put forward a theory stating that emotions evolved because
they possessed adaptive value. For example, fear helps people to take
actions that increase their chances of survival. [5]
1.4.2) The James-Lange Theory
In the 1880s, two psychologists, William James and Carl Lange, independently proposed
a theory stating that people experience emotions because they perceive their bodies’
physiological reactions to external events. As per this
theory, people do not cry because they are unhappy; rather, they feel unhappy
because they cry.[5]
1.4.3) The Cannon-Bard Theory
This theory refutes the James-Lange theory and states that the experience of emotion occurs
at the same time as physiological arousal. Neither one gives rise to the other. The
brain receives a signal that triggers the experience of emotion at the same time as the autonomic
nervous system receives a signal that triggers physiological arousal.[5]
1.4.4) Schachter and Singer’s Two-Factor Theory
In the 1960s, Stanley Schachter and Jerome Singer proposed a theory that a
person’s experience of emotion depends on two main elements: physiological arousal and
the cognitive interpretation of that arousal. When people notice the physiological signs
of arousal, they look to the environment for an explanation of that arousal.[5]
1.4.5) Cognitive Appraisal
The studies of the psychologist Richard Lazarus have shown that a person’s experience
of emotion depends on their appraisal, or evaluation, of the events around them.[5]
Types: Sentiment Analysis[6,7]
Description: For example, given "this alarm clock battery works slow", an aspect-based
analyser will grasp the expression in the sentence and note it as a negative
opinion about the feature battery life.[7]

Types: Intent Analysis[6]
Description: This is mainly about the action. Its main purpose is to decide what kind
of intent is carried by the message.
Intent analysis is associated with the consumer: whether the buyer
would like to purchase an item or is simply checking out or browsing
the items.
If the buyer decides to purchase an item, then you can easily attract
them with advertisements or deals. If a buyer isn’t ready to buy an item,
RELATED WORK
In this chapter, we present the different material we have referred to and what it is
about. Moreover, we describe how our approach differs from these works.
The research paper [1] covers Speech Emotion Recognition using deep learning
and attention mechanisms, based on deep neural networks (DNN). It provides a
review of recent developments in SER and also examines the impact of different attention
mechanisms on SER performance. The article [2] comprehensively discusses
SER using a machine learning approach. It highlights the three key features
used, namely MFCC (Mel Frequency Cepstral Coefficients), Mel Spectrogram and Chroma,
which were extracted using the Python package Librosa. The research
paper [3] studies various speech emotion detection methods such as Support Vector
Machine (SVM) and Hidden Markov Model (HMM). The article [4] discusses what
emotions are and the various theories about them. It highlights 6 different theories,
from the first one proposed in the 1870s to those proposed a few decades ago.
The article [5] also sheds light on the different theories of emotion and states the 3 major
categories, i.e., physiological, neurological and cognitive theories. The article [6]
discusses the meaning of sentiment analysis and describes its different types,
such as aspect-based and fine-grained analysis. The article [7] comprehensively describes
sentiment analysis, from its fundamentals through different approaches such as machine learning and
rule-based methods to its challenges and applications. It describes how it can be done using NLP. In
the research paper [8], the authors conduct an in-depth study of numerous models such as Naive
Bayes, SVM and N-gram sentiment analysis. They perform an intensive comparison
between the different models and state the advantages and disadvantages of each approach.
In reference [11], logistic regression is used as the text classifier, with TF-IDF
weighting of the features. Instead of accuracy, MRR (mean reciprocal rank) is used for
evaluation, which takes the rank of the first correct answer into consideration. Reference [12]
discusses, with an example, the use of linear_svc (linear support vector classifier) for text
classification and uses a confusion matrix function to assess accuracy. Reference [13] discusses
the use of the fit and transform functions for model training. Dr. Robert Chun et al. [14]
discussed k-means clustering for grouping data and multiple algorithms for emotion recognition
such as logistic regression, naive bayes, linear_svc, decision tree and random forest.
In reference [15], Victor Odumuyiwa and Ukachi
Osisiogu presented a systematic review of hidden Markov models for sentiment analysis. Anna
Jurek et al. [16] discussed a lexicon-based model for sentiment analysis, which involves calculating
the sentiment from the semantic orientation of the words or phrases that occur in a text.
In reference [17], the author discussed and implemented speech-to-text conversion using the Google
API, covering synchronous audio files, asynchronous audio files and real-time speech input.
Maghilnan and Rajesh Kumar M [18] used Sphinx4, Bing Speech API and Google Speech API for
speech recognition, with WWR as the performance metric. For speaker recognition, they used
MFCC as the feature and DTW with various distance computation methods such as Euclidean,
Correlation and Canberra for feature matching; recognition rate was used as the performance
metric. Shivangi Nagdewani and Ashika Jain [19] used HMM and neural networks for speech-to-text
conversion. Nithya Roopa S. et al. [20] used the IEMOCAP corpus database for sentiment
analysis. The authors used the Inception Net v3 model to build an emotion recognition model. Inception
evolved from the GoogLeNet architecture with some enhancements.
Our approach differs from the earlier approaches in the following ways:
• In our approach we use both types of supervised learning, i.e., regression and
classification.
• We provide a 2-way input option, i.e., via speech and via text.
• We have conducted an experiment comparing 3 different models trained over 2
distinct datasets.
In this chapter we discuss the problem statement we have chosen and the
solution we have come up with. We also describe the datasets used and the methodology we
have opted for.
Speech Emotion Recognition takes human speech as an audio input, converts it into text,
analyzes the text, checks whether the sentiment of the input statement is positive, negative
or neutral, and then gives a graph as output depicting the analyzed emotion.
3.2) Dataset
We have used 2 datasets based on Twitter tweets, which contain positive, negative and neutral
reviews. We downloaded them from Kaggle.com [9].
In this chapter we discuss our approach comprehensively, detailing each step
we have performed together with the system design.
To solve this emotion recognition problem, we first convert the speech into text if the input
is real-time speech, or provide a text input box if the user wants to enter the text directly. Then a
machine learning model is used for text classification. This model is trained to analyse
whether the sentiment in the text is positive, negative or neutral.
A deep learning model can be used that takes in audio signals, analyses them and converts them
into the corresponding text. Voice inputs are taken using the Google API and then sent to core
cloud functions. This converts human audio into electronic signals, after which
pre-processing is done. The result is then sent to the Speech to Text API (which applies a deep
learning model to understand what the user is trying to say). Finally it is passed to AutoML NLP,
where it is converted to text format.
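The two input paths described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the third-party SpeechRecognition package (imported as speech_recognition) with its Google web recognizer rather than the cloud pipeline described in this section, and the function names and the option numbering ("1" for speech, "2" for text) are hypothetical.

```python
def transcribe_microphone():
    """Capture one utterance from the default microphone and return its text.

    Assumes the SpeechRecognition and PyAudio packages are installed and the
    machine is online, because recognize_google calls a web service.
    """
    import speech_recognition as sr  # imported lazily so text mode works without it

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate to background noise
        audio = recognizer.listen(source)            # record until the speaker pauses
    return recognizer.recognize_google(audio)


def get_input_text(choice, typed_text=None, transcriber=transcribe_microphone):
    """Return the text to classify for the chosen input mode.

    choice "1" means speech and "2" means typed text (assumed numbering).
    The transcriber argument can be swapped out, e.g. for an audio-file reader.
    """
    if choice == "1":
        return transcriber().strip()
    if choice == "2":
        if not typed_text:
            raise ValueError("typed_text is required for text input")
        return typed_text.strip()
    raise ValueError(f"unknown input option: {choice!r}")
```

Passing the transcriber in as an argument keeps the text path usable on machines without a microphone, and lets either path feed the same classifier.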
Text data is a favourable research object for emotion recognition because it is free and available
everywhere in human life. This feedback is then evaluated using an ML model, which
helps to understand the user's perception.
In English it is usually easy to find the boundaries of words, because we can
split a sentence on spaces and punctuation and what is left is the words. For example, take "Rohan;
friends, lend me your ears". It contains a semicolon, a comma and spaces, and if we
split on those we get words that are ready for further analysis, such as "Rohan", "friends" and
so forth, as shown in fig.4.
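A minimal sketch of this splitting, using only Python's standard re module. The function name and the exact set of punctuation characters are illustrative assumptions.

```python
import re


def split_words(sentence):
    """Split a sentence on whitespace and common punctuation, dropping empty pieces."""
    return [piece for piece in re.split(r"[\s,;.!?]+", sentence) if piece]


print(split_words("Rohan; friends, lend me your ears."))
# ['Rohan', 'friends', 'lend', 'me', 'your', 'ears']
```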
a) Cleaning: In data cleaning we remove punctuation and symbols from the text data.
b) Stop words Removal: A stop word is a commonly used word that a search engine is programmed
to ignore, both when indexing entries for searching and when retrieving them as the
result of a search query, as shown in fig.5.
We remove these words so that they do not take up space in our database or
valuable processing time. We can remove them by storing a list of the words that we consider to be
stop words and then comparing each word we want to check with the words in this list. NLTK
(Natural Language Toolkit) in Python has a list of stop words stored in 16 different languages.
You can find them in the nltk data directory.
c) Tokenization: The process of splitting an input text into meaningful chunks is called
tokenization, and each chunk is called a token. A token can be thought of as a useful unit
for further processing.
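The three preprocessing steps listed above can be sketched with the standard library alone. The stop-word list here is a tiny hand-picked stand-in for NLTK's much longer list, and the function names are hypothetical. Note that this sketch tokenizes before filtering stop words, since the comparison against the list is done word by word.

```python
import string

# Tiny illustrative stop-word list (an assumption); NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "and", "of", "in", "this"}


def clean(text):
    """Lower-case the text and replace ASCII punctuation and symbols with spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)


def tokenize(text):
    """Split cleaned text into tokens on whitespace."""
    return text.split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in stop_words]


def preprocess(text):
    """Run cleaning, tokenization and stop-word removal in sequence."""
    return remove_stop_words(tokenize(clean(text)))


print(preprocess("This is a GREAT movie, and the cast is superb!"))
# ['great', 'movie', 'cast', 'superb']
```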
Here we convert the tokens into features, and the first way to do that is the bag of words. We
look for marker words such as “beautiful” or “embarrassing”, detect those words
and make decisions based on the presence or absence of each word. To see how this works, let us
take three example reviews: “good movie”, “not a good movie” and “did not like”.
We take all the possible words, or tokens, that occur in our documents, and for each such token we
introduce a new feature, or column, that corresponds to that word. This gives a large matrix
of numbers in which each text is translated into one row.
Take the review “good movie” as an example. The word “good” appears in our text, so we put
a one in the column corresponding to that word. The word “movie” also appears, so we put a one in
the second column. We have no other words, so all the
other entries are zero. The result is a very long vector that is sparse, in the sense that it
contains a lot of zeros. The review “not a good
movie” will have four ones and all the other entries zero, and so on. This process is called text
vectorization, because we convert the text into a large vector of numbers in which each dimension
corresponds to a specific token in our dataset.
This representation has some problems.
The first is that we lose the order of the words: we can shuffle the words and the
representation will always stay the same. That is why it is called a bag of words: a
bag is not ordered, so the words can come in any order, as shown in fig.6.
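The vectorization just described can be illustrated in a few lines of plain Python. This is a sketch with hypothetical function names; a library vectorizer such as scikit-learn's CountVectorizer would normally be used instead, and it orders and filters the columns differently.

```python
def build_vocabulary(documents):
    """Map each distinct token to a column index, in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for token in document.lower().split():
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def vectorize(document, vocabulary):
    """Return a presence/absence (1 or 0) vector with one entry per vocabulary token."""
    vector = [0] * len(vocabulary)
    for token in document.lower().split():
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector


reviews = ["good movie", "not a good movie", "did not like"]
vocabulary = build_vocabulary(reviews)   # columns: good, movie, not, a, did, like
matrix = [vectorize(review, vocabulary) for review in reviews]

for review, row in zip(reviews, matrix):
    print(f"{review!r:20} {row}")

# Word order is lost: a shuffled review maps to the same vector.
print(vectorize("movie good", vocabulary) == vectorize("good movie", vocabulary))
```

With these three reviews the rows come out as [1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0] and [0, 0, 1, 0, 1, 1], and the final comparison prints True.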
In a world where machine learning and artificial intelligence surround almost everything
around us, classification and prediction are among the most important aspects of machine
learning.
Naive Bayes is a simple but surprisingly powerful algorithm for predictive analysis. It is a
classification technique based on Bayes' theorem with an assumption of independence among
predictors. Its name has two parts, naive and Bayes. In simple terms, a naive Bayes
classifier assumes that the presence of a particular feature in a class is unrelated to the presence
of any other feature, even if these features depend on each other or upon the existence of the
other features. For example, a fruit may be described by properties such as its colour, shape and
size. All of these properties independently contribute to the probability that the
fruit is an apple, an orange or a banana, and that is why the method is known as naive.
A naive Bayes model is easy to build and particularly useful for very large datasets. In probability
theory and statistics, Bayes' theorem, alternatively known as Bayes' law or Bayes'
rule, describes the probability of an event based on prior knowledge of the conditions that
might be related to the event.
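For reference, the rule can be written out as follows. The second expression is the form a naive Bayes text classifier works with, under the independence assumption described above; the symbols c (a sentiment class) and w_1, ..., w_n (the tokens of a text) are notation introduced here for illustration.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\qquad\Longrightarrow\qquad
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The predicted class is the c with the largest value of the right-hand side; the denominator P(B) is the same for every class and can be ignored.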
Bayes' theorem is a way to work out conditional probability. The conditional probability is the
probability of an event happening given that it has some relationship to one or more other
events. We have a dataset which is a list of tweets from Twitter. For example, here is a tweet
that says "Great job to coach walker and all the players". That is a positive tweet and it is labelled
positive. There are about 172215 of these examples in our dataset and we're going to go
We then split the data into training and test sets, using 80% for training and a 20% holdout for testing.
Next we train the naive Bayes classifier. The great thing about naive Bayes is that it is really
fast and needs only one pass over the data, as shown in fig.8.
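The 80/20 holdout can be sketched without any library, as below. The function name, the fixed seed and the toy rows are assumptions made for illustration; in practice train_test_split from scikit-learn does the same job.

```python
import random


def split_80_20(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (text, label) rows standing in for the tweet dataset.
rows = [(f"tweet {i}", "positive" if i % 2 else "negative") for i in range(10)]
train, test = split_80_20(rows)
print(len(train), len(test))  # 8 2
```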
The first step is handling the data: we load it from the CSV file and split it into training
and test datasets. The second step is summarizing the data: we summarize the properties of the training
dataset so that we can calculate probabilities and make predictions. The third step is making a
single prediction, for which we use the summaries of the dataset. After that we generate
predictions for the whole test dataset given the summarized training dataset, and finally we
evaluate the accuracy of the predictions made for the test dataset as the percentage that are
correct out of all the predictions made. We then tie all of this together to form our naive Bayes
classification model. [10]
There is a Python library called scikit-learn, which helps in creating naive Bayes models in
Python. The scikit-learn library provides three types of naive Bayes models: Gaussian,
multinomial and Bernoulli. Here we use the multinomial model, which is used for discrete
counts. Multinomial naive Bayes falls under supervised learning and is a
classification algorithm with which we can classify text. Text classification means
that, given data in the form of text or statements, our job is to tell which specific
class a particular statement belongs to. Here we want to do sentiment analysis, i.e., we want
to know whether a specific review that a user posts is a positive, negative
or neutral review. We can do this kind of work with the help of multinomial naive Bayes,
as shown in fig.9.
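To make the idea concrete, here is a small from-scratch sketch of a multinomial naive Bayes text classifier with add-one (Laplace) smoothing. It is not scikit-learn's MultinomialNB; the class name and the toy training sentences are assumptions (only the "Great job to coach walker" tweet comes from the text above).

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):  # a single pass over the data
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior of the class
            for word in text.lower().split():
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


texts = [
    "great job to coach walker and all the players",
    "what a great win",
    "this was a terrible game",
    "i hate this awful weather",
    "the match starts at noon",
    "the meeting is on monday",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = TinyMultinomialNB().fit(texts, labels)
print(model.predict("great players"))        # positive
print(model.predict("terrible awful game"))  # negative
print(model.predict("meeting at noon"))      # neutral
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.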
Logistic regression is a classification technique that can be used to solve a binary classification problem.
The result is usually defined as 0 or 1 in the binary case. In the data set below,
assessment is made by applying binary classification with logistic regression on the data split
into training and test data. First, standardization is applied for pre-processing, and then the
model is trained on the training data with fit() and used to evaluate the test data using the predict()
method. When you instantiate the logistic regression module, you can change the values of
`solver`, `penalty` and `C`, and even specify whether it should be treated as a multi-class classification
problem (one-versus-all or multinomial).[11] As shown in fig.10.
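A hedged sketch of the fit() and predict() workflow with scikit-learn follows. It assumes scikit-learn is installed, swaps in TF-IDF features for the pre-processing step because the inputs here are raw text, and uses a handful of made-up sentences as training data; the `solver` and `C` values shown are simply the library defaults written out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training sentences (assumption); the report trains on tweet datasets.
texts = [
    "what a great win",
    "i love this team",
    "this was a terrible game",
    "i hate this awful weather",
    "the match starts at noon",
    "the meeting is on monday",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# TF-IDF turns each text into a weighted bag-of-words vector for the classifier.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000),
)
model.fit(texts, labels)

predictions = model.predict(["what a great win", "the meeting is at noon"])
probabilities = model.predict_proba(["what a great win"])
print(predictions)
print(dict(zip(model.classes_, probabilities[0].round(3))))
```

A smaller `C` means stronger regularization; with real data it would normally be tuned on a validation set rather than left at its default.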
SVM is a supervised machine learning algorithm that can be used for both classification and
regression challenges. Classification predicts the label / group and regression predicts a
continuous value. SVM performs classification by finding a hyper-plane that separates the
classes plotted in the n-dimensional space. The Linear SVC method uses a linear kernel
function to perform classification and works well with large numbers of samples. If
we compare this model with the SVC model (support vector classifier), Linear SVC
has additional parameters, such as the penalty normalization, which applies l1 or l2, and the
loss function. The kernel of this model cannot be changed, because it is based on the linear
kernel method itself.[12]
We load the required libraries and use the LinearSVC class from sklearn.svm. We split x and
y into train and test parts, using 20% of the data as the test set. We then fit the
model on the training data and check the training score. We can also apply a cross-validation method
to check the training score.
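The steps above might look like the following sketch, assuming scikit-learn is installed. The toy sentences are generated from made-up templates so that the example is self-contained, and TF-IDF features are an assumption about how the text is vectorized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data built from templates; the report trains on tweet datasets.
subjects = ["movie", "meal", "phone", "hotel", "game", "book", "song", "trip", "show", "app"]
templates = {
    "positive": "i really love this {}",
    "negative": "i really hate this {}",
    "neutral": "i will review this {} tomorrow",
}
x = [template.format(subject) for template in templates.values() for subject in subjects]
y = [label for label in templates for _ in subjects]

# Hold out 20% of the data for testing, keeping the classes balanced.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(x_train, y_train)

train_score = model.score(x_train, y_train)
test_score = model.score(x_test, y_test)
cv_scores = cross_val_score(model, x_train, y_train, cv=3)  # 3-fold cross-validation

print("training score:", train_score)
print("test score:", test_score)
print("cross-validation scores:", cv_scores)
```

Cross-validating the whole pipeline, rather than the classifier alone, refits the vectorizer inside each fold so that no information from the held-out fold leaks into the features.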
In this chapter, we discuss the results obtained after training the models and after the
full execution of the code, together with the comparison table and comparison graphs for proper
analysis.
Initially, to implement the Speech Emotion Recognition system we have to collect input
samples, whether in the form of text, audio files or real-time voice input. These are used to train the
model. The audio input is usually in WAV or MP3 format.
PYTHON LIBRARY: After data collection we need to represent these input files in numeric
form in order to analyse them more extensively. This is called feature extraction, where numeric
values for different features of the input are determined. The speech_recognition module in Python
was used for this, which provides functions for short-term feature extraction. After this step, each
input was represented in a CSV file as a row with 9 columns depicting different features. The
libraries which were used are:
• NumPy
• Matplotlib
• Pandas
• Seaborn
We then visualize some of the data so as to get a better understanding of the problem and
the type of solution to be built.
df.sample(10): This statement in Python can be used to see some of the rows of the dataset.
The next step is to plot a graph of the distribution of the number of tweets per label,
as shown in fig.12.
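A small pandas sketch of these two inspection steps is given below. The DataFrame is a made-up stand-in for the tweet dataset and the column names "text" and "label" are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the tweet dataset; the column names are assumptions.
df = pd.DataFrame({
    "text": [f"tweet number {i}" for i in range(30)],
    "label": ["positive", "negative", "neutral"] * 10,
})

preview = df.sample(10, random_state=0)    # ten randomly chosen rows
label_counts = df["label"].value_counts()  # number of tweets per sentiment label

print(preview)
print(label_counts)
# label_counts.plot(kind="bar") would draw the distribution when Matplotlib is installed.
```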
Training a machine learning model is quite a simple task in Python if we have a proper dataset. A
proper dataset has properties such as clean data, clear labelling, no useless data and no
useless columns. The data can be split for training using the train_test_split function in the
sklearn.model_selection module.
5.4) Result
After training all three models on our datasets, we make a single
common testing function for all the models in order to find the best among the three.
Following are the results we got after implementing our model using the various techniques:
5.4.1) Models
a) Multinomial Naïve Bayes Model
As shown in fig.13, the support is the same, 34534, for all three classes.
This means our data is balanced. Hence, accuracy does a pretty
good job of summarising specificity and sensitivity, recall and precision. Our model gives an
accuracy of 86% for dataset 1 in fig.13 and an accuracy of 87% for dataset 2 in fig.14.
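The quantities in such a report (support, precision, recall and accuracy) can be computed as in the sketch below. The function name and the six toy labels are illustrative assumptions and are unrelated to the figures reported above; scikit-learn's classification_report produces the same kind of summary.

```python
from collections import Counter


def classification_summary(y_true, y_pred):
    """Return overall accuracy plus per-class precision, recall and support."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)    # number of true examples per class
    predicted = Counter(y_pred)  # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    per_class = {}
    for label in labels:
        precision = correct[label] / predicted[label] if predicted[label] else 0.0
        recall = correct[label] / support[label] if support[label] else 0.0
        per_class[label] = {
            "precision": precision,
            "recall": recall,
            "support": support[label],
        }
    accuracy = sum(correct.values()) / len(y_true)
    return accuracy, per_class


# Toy labels (assumption): every class has the same support, so the data is balanced.
y_true = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
y_pred = ["positive", "negative", "negative", "negative", "neutral", "positive"]

accuracy, per_class = classification_summary(y_true, y_pred)
print(f"accuracy: {accuracy:.2f}")
for label, scores in per_class.items():
    print(label, scores)
```

When every class has the same support, as here, accuracy equals the unweighted mean of the per-class recalls, which is why it is a fair summary for balanced data.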
b) Logistic Regression Model
As explained above, the data is balanced. The accuracy of the model is 95% for dataset 1, as
shown in fig.15, and 95% for dataset 2, as shown in fig.16.
After training and testing all three models on the 2 datasets, namely dataset-1 and dataset-2, as
shown above, we created the following table for comparison.
The above table shows that the Linear_svc model gives the best accuracy, 96%, on both datasets.
Hence, we conclude that this is the most efficient of the three models we used for speech
emotion recognition. The graphs in fig.21 and fig.22 below depict this
conclusion.
Now that we know that the best accuracy is given by the linear svc model, we take some input
data and check its sentiment to confirm whether our code is working properly.
Two options are given: either type some text whose sentiment is to be analysed,
or give real-time speech input by speaking into a microphone.
To show a sample result, we select option 2, which is input by text, as shown in
fig.19.
Once we have entered 2 and pressed enter, a text box is displayed. Here we
insert the text whose sentiment we want to analyse. For example, we enter: “We take
pride in the quality of our food and hope that we can continue delivering great-tasting meals. If there is
anything else we can do for you, please let us know.” This is a positive review, as shown in fig.20.
At times, it is hard to understand human emotions because of their subjective nature. One person’s
expressed emotions can be perceived differently by two or more different individuals.
Moreover, in recent times people often make sarcastic remarks, which are
hard to interpret and require good judgement to understand properly. In addition
to this, some people consider showing one’s emotions to be a sign of weakness, and hence
they often try to hide them.
In our project, the steps for developing a SER system were discussed and analysed
comprehensively, and some experiments as well as research work were carried out to
understand the effect of each step. Initially, the small number of datasets with positive,
negative as well as neutral reviews available over the internet made it hard to properly train a model. In
the next step, various approaches to feature extraction had been proposed in earlier studies,
so numerous experiments had to be conducted to select the best approach. In the
final step, classifier selection for the SER system is performed to determine the
strengths and weaknesses of every classifying algorithm.[14] In the end, improving the accuracy
of the code turned out to be quite a challenging task, and an enormous amount of time and effort
went into increasing it above 90%.
In this chapter, we discuss the conclusion and the future scope of our SER project.
6.1) Conclusion
After training all three models, namely multinomial naive bayes, logistic
regression and linear_svc, on our datasets, we found the following results. Multinomial naive bayes gives an
average accuracy of 87%, logistic regression gives an average accuracy of 95% and linear_svc
gives an average accuracy of 96%. Hence, we can conclude that linear_svc is the most efficient
of the three models and gives the best accuracy for emotion analysis.
In future, this project can be further improved in terms of efficiency, usability
and accuracy. We can integrate a GUI for better and easier usability. We can
include additional features such as taking pre-recorded audio files as input, and can improve its
effectiveness on sarcastic statements and human dialogue. Facial expression
recognition can also be added to this project, which could be a great help in
detecting human emotions as accurately as possible. Therefore, this speech emotion
recognition system can have numerous applications in the near future.
In this chapter, we list the references from which we have taken guidance for our project.
1) Eva Lieskovská, Maroš Jakubec, Roman Jarina and Michal Chmulík (2021), “A Review
on Speech Emotion Recognition Using Deep Learning and Attention Mechanism” -
https://doi.org/10.3390/electronics10101163
2) https://www.analyticsinsight.net/speech-emotion-recognition-ser-through-machine-learning/
3) Aastha Joshi, Rajneet Kaur (2013), “A Study of Speech Emotion Recognition Methods”
- IJCSMC, Vol. 2, Issue 4
4) https://www.verywellmind.com/theories-of-emotion-2795717
5) http://web.archive.org/web/20210422223951/http://www.sparknotes.com/psychology/psych101/emotion/section1/
6) http://web.archive.org/web/20210120172412/https://www.analyticsinsight.net/types-of-sentiment-analysis-and-how-brands-perform-them/
7) https://monkeylearn.com/sentiment-analysis/
8) Devika M D, Sunitha C, Amal Ganesh (2016), “Sentiment Analysis: A Comparative Study
On Different Approaches”
9) https://www.kaggle.com/talhaaljunaid170221/emotion-negative-positive-neutral-text-data
10) https://www.subtitlelist.com/en/Naive-Bayes-Classifier-in-Python-Naive-Bayes-Algorithm-Machine-Learning-Algorithm-Edureka-219828
11) https://kavita-ganesan.com/news-classifier-with-logistic-regression-in-python/#.Yct5z2hBy3A
12) https://www.datatechnotes.com/2020/07/classification-example-with-linearsvm-in-python.html
13) https://towardsdatascience.com/fit-vs-transform-in-scikit-libraries-for-machine-learning-3c70e6300ded
14) Sundarprasad, Neethu (2018), “Speech Emotion Detection Using Machine
Learning Techniques” - Master's Projects. 628.
15) Victor Odumuyiwa, Ukachi Osisiogu (2019), “A Systematic Review on Hidden Markov
Models For Sentimental Analysis”
16) Anna Jurek, Maurice D. Mulvenna, Yaxin Bi (2015), “Improved lexicon-based sentiment
analysis for social media analytics”
17) https://towardsdatascience.com/easy-speech-to-text-with-python3df0d973b426
18) Maghilnan S, Rajesh Kumar M (2017), “Sentiment Analysis on Speaker Specific Speech Data” -
School of Electronic Engineering, VIT University, Tamil Nadu, India
19) Shivangi Nagdewani, Ashika Jain (2020), “A Review on Methods for Speech-to-Text
and Text-to-Speech Conversion” - International Research Journal of
Engineering and Technology (IRJET)