
SPEECH EMOTION RECOGNITION

A Minor Project Report

in partial fulfillment for the award of the degree

of

BACHELOR OF TECHNOLOGY
in

COMPUTER SCIENCE & ENGINEERING

by

SHAURYA LALWALIA, 04020902718


AASTHA GAUTAM, 00120902718
MUDIT PANDEY, 02920902718

Guided by
Dr. ANU SAINI
Assistant Professor

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


G.B. PANT GOVT. ENGINEERING COLLEGE, NEW DELHI.
(AFFILIATED TO GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DELHI)

DECEMBER, 2021

Speech Emotion Recognition (2021) 1


CANDIDATE’S DECLARATION

We hereby certify that the work which is being presented in the Minor
Project entitled “Speech Emotion Recognition” in partial fulfillment for the
award of the Degree of Bachelor of Technology in Computer Science and
Engineering / Information Technology affiliated to Guru Gobind Singh
Indraprastha University, New Delhi and submitted to the Department of
Computer Science and Engineering, G. B. Pant Govt. Engineering College, is an
authentic record of our own work carried out during a period from August 2020
to Dec 2020. The matter represented in this report has not been submitted by us
for award of any other degree of this or any other institute/university.

Name – Shaurya Lalwalia (04020902718)
Mudit Pandey (02920902718)
Aastha Gautam (00120902718)

Date – 28th Dec, 2021

This is to certify that the above statement made by the candidate is correct to the
best of our knowledge.

Signature of Supervisor
Date Dr. Anu Saini & Designation

TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

1) Title Page 1
2) Candidate Declaration 2
3) List of Figures 5
4) List of Tables 6
5) List of Abbreviations 7
6) Acknowledgement 8
7) Abstract 9

Chapter 1. INTRODUCTION 10 - 17
1.1) Importance 10
1.2) Motivation 11
1.3) SER 11
1.4) Theories of Emotions 12
1.4.1) Evolutionary Theory
1.4.2) The James-Lange Theory
1.4.3) The Cannon-Bard Theory
1.4.4) Schachter and Singer’s Two-Factor Theory
1.4.5) Cognitive Appraisal

1.5) Types of Sentimental Analysis 14


1.6) Approaches of Sentimental Analysis 16

Chapter 2. RELATED WORK 18 – 20


2.1) Related Work 18
2.2) How Our Approach Is Different 20

Chapter 3. PROBLEM DESCRIPTION AND SPECIFICATION 21 - 22

3.1) Problem Statement 21
3.2) Dataset 21
3.3) Methodology/Our Approach 21

Chapter 4. SYSTEM DESIGN 23 - 29


4.1) Speech to Text 23
4.2) Text Preprocessing 24
4.3) Feature Extraction 25
4.4) Sentiment Classification using Naive Bayes Classifier 26
4.5) Training & Testing 27
4.6) Models of Classification 28

Chapter 5. RESULT & IMPLEMENTATION 30 - 39


5.1) Data Collection 30
5.2) Data Visualization 30
5.3) Training the Models 31
5.4) Results 32

Chapter 6. CONCLUSION & FUTURE SCOPE 40


6.1) Conclusion 40
6.3) Future Scope 40

Chapter 7. REFERENCES 41 - 42

LIST OF FIGURES

1) Fig 1: A basic outline of the Speech Emotion Recognition


2) Fig 2: Categories of Emotion
3) Fig 3: Step by step methodology
4) Fig 4: Speech to Text Code
5) Fig 5: Stop words Removal
6) Fig 6: Feature Extraction
7) Fig 7: Dataset Rows and Columns
8) Fig 8: Training and Testing
9) Fig 9: Multinomial Naive Bayes Model Training
10) Fig 10: Training using Logistic Regression Model
11) Fig 11: Training using Linear_svc model
12) Fig 12: Dataset Analysis
13) Fig 13: Multinomial Naïve Bayes on Dataset-1
14) Fig 14: Multinomial Naïve Bayes on Dataset-2
15) Fig 15: Logistic Regression on Dataset-1
16) Fig 16: Logistic Regression on Dataset-2
17) Fig 17: Linear_svc on Dataset-1
18) Fig 18: Linear_svc on Dataset-2
19) Fig 19: Input Options
20) Fig 20: Sample Output
21) Fig 21: Graph Depicting accuracy of all the three models on dataset-1
22) Fig 22: Graph Depicting accuracy of all the three models on dataset-2

LIST OF TABLES

1) Table 1 : Different types of Sentimental Analysis


2) Table 2 : Different Approaches of Sentimental Analysis
3) Table 3 : Comparison Table

LIST OF ABBREVIATIONS

1) SER – Speech Emotion Recognition


2) HCI – Human Computer Interaction
3) AI – Artificial Intelligence
4) SVM – Support Vector Machine
5) SVC – Support Vector Classifier

ACKNOWLEDGEMENT

We would like to express our gratitude towards our mentor Dr. Anu Saini for exposing us to
this topic, providing us with research material and moreover for being the guiding light all this
while. Furthermore we would like to thank Dr Amit R. Khaparde for his valuable suggestions
and to our principal for presenting us with this golden opportunity of making this project.
Finally, we would like to thank our friends and family for being the constant support system
and encouraging force all this while.

ABSTRACT

As human beings, we often use speech as one of the most natural ways to express ourselves,
because it lets us do so effectively. In modern times we rely on it so much that we
regularly use emojis and audio messages even while interacting through other forms of
communication such as email and text messages. Emotions play an essential role in
communication, so their detection and analysis can be of great value in today’s digital
world of remote communication. Moreover, it can help immensely in the development of
human–computer interaction (HCI) applications [1,2].

Because emotions are subjective, emotion detection is a challenging task. There is no
universally accepted method or consensus on how to measure or categorize them. We
therefore use different SER methodologies to determine the sentiments embedded in one’s
speech. An SER system can find use in a wide variety of application areas, such as
caller-agent conversations, interactive voice-based assistants and customer review
analysis. In our project we aim to detect the embedded sentiments in one’s speech.

First, we performed an experiment comparing various models from the different approaches
available for sentimental analysis, to determine which is the most effective and
accurate. We then proceeded with the machine learning approach to detect the sentiments
in speech, once that speech has been converted into text in the speech-to-text stage.

CHAPTER-1

INTRODUCTION

In today’s environment we are suffering from data overload: companies usually have mountains
of customer feedback accumulated with no major categorization. For humans, it is
impossible to analyze it manually without some error or bias.

Speech Emotion Recognition (SER) is the answer to this problem. Speech Emotion
Recognition takes human speech as audio input, converts it into text, analyzes the text to
check whether the sentiment of the input statement (for example, a customer’s feedback) is
positive, negative or neutral, and then gives a graph as output depicting the analyzed
emotion. It helps improve decision making in various businesses by providing a significant
amount of structured data rather than relying on plain human intuition, which isn’t always
right.

Speech Emotion Recognition can be implemented using many different approaches, such as the
deep learning approach, the lexicon-based approach or the hybrid approach. We use the
machine learning approach to detect human emotions in speech.

1.1) Importance

In today’s world, communication is the key to expressing oneself, as it is the most common
way humans interact. Humans use most parts of their body to communicate effectively, and
voice plays a major role. Body language, hand gestures and tone are often used together to
express one’s sentiments. Because different languages are spoken in different parts of the
world, the verbal form of communication varies, whereas the non-verbal part, which also
expresses one’s sentiments, is more likely to be common to all. Therefore, any technology
developed to understand human emotions can be of great help.

In addition, speech emotion recognition is important for understanding the true context of
speech. People often use sarcastic statements, which on a plain reading convey a different
meaning than when they are analysed in depth. SER therefore plays a vital role in detecting
the meaning along with its context.

1.2) Motivation

In today’s environment we are suffering from data overload: companies usually have
mountains of customer feedback accumulated with no major categorization. For humans, it is
impossible to analyze it manually without some error or bias.

Speech Emotion Recognition (SER) is the answer to this problem. It improves decision
making in various businesses by providing a significant amount of structured data rather
than relying on plain human intuition, which isn’t always right and can carry traces of bias.

Sentimental analysis is extremely important because it allows businesses to understand the
sentiments of their customers towards their brand, products or services. It analyzes whether
the customer’s feedback/statement is positive, negative or neutral by automatically sorting
the sentiments behind social media conversations and reviews, which eventually helps
businesses make productive decisions.

1.3) SER

Speech emotion recognition aims to identify, within a few seconds, the emotional state of a
human being from their voice. It is based on analysing the speech signal, extracting the
details that carry emotional information about the speaker’s voice, and finding an
appropriate pattern with which to recognise the emotional state. Like typical pattern
recognition systems, a speech emotion recognition system includes four major elements:
speech input, feature extraction, SVM-based classification, and emotion output, as shown in
fig 1. The general structure of an SER system has the three basic steps shown below: [3]
i. A speech processing system extracts the relevant quantities from the signal, in
particular pitch or energy.
ii. These quantities are summarised into a reduced set of features.
iii. A classifier learns from example data how to link the features to the emotions.

Fig 1 : A basic outline of the Speech Emotion Recognition [3]
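The first two of the three steps listed above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the signals are synthetic sine tones rather than recorded speech, the zero-crossing rate stands in as a crude pitch cue, and the function names and frame sizes are hypothetical. Step iii would pass the resulting feature vectors to a classifier such as an SVM.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (a crude pitch cue)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def utterance_features(samples, frame_len=400, hop=200):
    """Step i: per-frame quantities. Step ii: reduce them to a small feature vector."""
    frames = frame_signal(samples, frame_len, hop)
    energies = [short_time_energy(f) for f in frames]
    zcrs = [zero_crossing_rate(f) for f in frames]
    return [sum(energies) / len(energies), max(energies), sum(zcrs) / len(zcrs)]

# Synthetic stand-ins for recorded speech (assumption): a quiet low tone
# and a loud high tone, one second each at 8 kHz.
sample_rate = 8000
calm = [0.2 * math.sin(2 * math.pi * 120 * t / sample_rate) for t in range(sample_rate)]
excited = [0.9 * math.sin(2 * math.pi * 300 * t / sample_rate) for t in range(sample_rate)]

calm_features = utterance_features(calm)        # [mean energy, peak energy, mean ZCR]
excited_features = utterance_features(excited)
```

The louder, higher tone yields larger energy and zero-crossing features than the quiet, lower one, which is the kind of difference a classifier could learn to associate with emotional states.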

1.4) Theories of Emotions[4,5]

In psychology, emotion is often described as a complex state of feeling that results in
physical and psychological changes which influence thought and behaviour. Here we discuss
some theories that help explain when people experience or show emotions. Emotions can be
categorised into three types, as shown in fig 2.

Fig 2 : Categories of Emotion[4]

1.4.1) Evolutionary Theory

In the 1870s, Charles Darwin put forward the theory that emotions evolved because they
have adaptive value. For example, fear helps people to act in ways that increase their
chances of survival. [5]

1.4.2) The James-Lange Theory

In the 1880s, two psychologists, William James and Carl Lange, independently proposed a
theory stating that people experience emotions because they perceive their bodies’
physiological reactions to external events. According to this theory, people do not cry
because they feel unhappy; rather, they feel unhappy because they cry. [5]

1.4.3) The Cannon-Bard Theory

This theory refutes the James-Lange theory and states that the experience of emotion occurs
at the same time as physiological arousal. Neither one gives rise to the other. The brain
receives a signal that triggers the experience of emotion at the same time as the autonomic
nervous system receives a signal that triggers the physiological arousal. [5]

1.4.4) Schachter and Singer’s Two-Factor Theory

In the 1960s, Stanley Schachter and Jerome Singer proposed a theory stating that a person’s
experience of emotion depends on two main factors: physiological arousal and the cognitive
interpretation of that arousal. When people notice physiological signs of arousal, they
look to their environment for an explanation of it. [5]

1.4.5) Cognitive Appraisal

The studies of the psychologist Richard Lazarus showed that a person’s experience of emotion
depends on how they appraise or evaluate the events around them. [5]

1.5) Types of Sentimental Analysis

Types: Description

Fine-grained Sentiment Analysis [6]:
This sentiment analysis model helps you determine polarity with greater precision. The
analysis can cover the following polarity categories: more positive, positive, neutral,
negative, or more negative. Fine-grained analysis helps with closer scrutiny of reviews and
ratings. On a rating scale of 1 to 5, 1 is treated as negative and 5 as positive. On a scale
of 1 to 10, 1-2 are treated as more negative and 9-10 as more positive. [6]

Emotion detection [6]:
As the name indicates, emotion detection helps you observe or identify different emotions,
such as sadness, anger, happiness, frustration, despair, worry or panic. Emotion detection
systems typically use a lexicon, i.e., a collection of words that carry various emotions.
Some advanced classifiers also use robust machine learning (ML) algorithms.
ML is preferred over a lexicon because people express emotions in diverse ways. Take the
line “this person will hurt me”, which may express fear and fright. By contrast, “this
person gets hurt for me” carries a different emotion and meaning. If the word “hurt” is
associated only with fear or fright in the lexicon, the emotion in the second line will be
detected incorrectly. [6]

Aspect-based Sentiment Analysis [6,7]:
Aspect-based analysis goes deeper than just determining the polarity. It identifies the
particular aspects that people are talking about. [6]
For example, given "this alarm clock battery works slow", an aspect-based classifier will
understand the sentence and record a negative opinion about the battery life. [7]

Intent Analysis [6]:
This is mainly about action. Its aim is to determine what kind of intent a message carries.
Intent analysis identifies the consumer’s intent: whether the buyer intends to purchase an
item or is simply checking out or browsing the items.
If the buyer decides to purchase an item, you can target them with advertisements or deals.
If a buyer isn’t ready to buy an item, you can save your energy and resources by not
advertising to them. [6]

Multilingual sentiment analysis:
Multilingual sentiment analysis can be difficult. It is used to analyse the sentiments of
statements that are multilingual in nature. The language is detected first, and then the
further processing continues.

Table 1 : Different types of Sentimental Analysis
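The weakness of a purely lexicon-based emotion detector, described in the emotion detection entry of Table 1, can be illustrated with a minimal sketch. The lexicon below is a hypothetical toy one invented for this example, not one used in the project.

```python
# Hypothetical toy lexicon: each word maps to exactly one emotion.
EMOTION_LEXICON = {"hurt": "fear", "cry": "sadness", "happy": "happiness"}

def detect_emotions(sentence, lexicon=EMOTION_LEXICON):
    """Return the emotions of all lexicon words found in the sentence."""
    words = sentence.lower().replace(".", " ").split()
    return [lexicon[w] for w in words if w in lexicon]

first = detect_emotions("This person will hurt me.")
second = detect_emotions("This person gets hurt for me.")
```

Both sentences are labelled with fear because the lookup sees only the word "hurt" and ignores its context, which is the limitation that motivates using a trained model instead.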

1.6) Approaches of Sentimental Analysis

Approaches: Description, Classification, Advantage, Disadvantage

Machine Learning Approach [8] (Naive Bayes, Linear Regression, SVM, Deep Learning)
Description: This method is trained on a known dataset and determines sentiments using
several learning algorithms. [8]
Classification: Supervised and unsupervised learning.
Advantage: A dictionary is not necessary. It demonstrates high classification accuracy.
Disadvantage: Generally, a classifier trained on texts in one domain does not work with
other domains.

Rule Based Approach [8]
Description: This approach identifies opinion words in a text and then classifies the text
on the basis of the number of negative and positive words. Classification is performed
based on different rules such as negation words, dictionary polarity, booster words,
idioms, mixed opinions, emoticons etc. [8]
Classification: Supervised and unsupervised learning.
Advantage: Sentiment classification done at sentence level gives better results than at
word level.
Disadvantage: Efficiency and accuracy depend on the defined rules.

Lexical Based Approach [8] (VADER, TextBlob, Sentiwordnet)
Description: In the lexical based approach, the semantic orientation of words or sentences
in a review is used to calculate the sentiment polarity of the review. The subjectivity and
opinion in a text are measured via “Semantic Orientation”. [8]
Classification: Unsupervised learning.
Advantage: A learning procedure and labelled data are not required.
Disadvantage: It demands strong linguistic resources, which are not always accessible or
available.

Hybrid Approach
Description: This approach is the combination of the lexical based approach and machine
learning.
Classification: Supervised and unsupervised learning.
Advantage: It inherits high accuracy from machine learning and stability from the lexicon
based approach.

Table 2 : Different Approaches of Sentimental Analysis
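To make the rule based approach in Table 2 more concrete, here is a minimal sketch that counts positive and negative words and applies two of the listed rules, negation words and booster words. The word lists and weights are hypothetical and far smaller than any real rule set.

```python
# Hypothetical miniature word lists (assumptions made for illustration).
POSITIVE = {"good", "great", "love", "nice"}
NEGATIVE = {"bad", "poor", "hate", "slow"}
NEGATIONS = {"not", "never", "no"}
BOOSTERS = {"very": 2.0, "really": 1.5}

def rule_based_polarity(text):
    """Sum signed word scores; a negation flips and a booster scales the next opinion word."""
    score = 0.0
    flip = False
    weight = 1.0
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATIONS:
            flip = True
        elif word in BOOSTERS:
            weight = BOOSTERS[word]
        elif word in POSITIVE or word in NEGATIVE:
            value = weight if word in POSITIVE else -weight
            score += -value if flip else value
            flip, weight = False, 1.0  # rules apply to one opinion word only
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

examples = {
    "The food was very good!": rule_based_polarity("The food was very good!"),
    "This is not a good movie": rule_based_polarity("This is not a good movie"),
    "I went to the market.": rule_based_polarity("I went to the market."),
}
```

The sketch also shows the disadvantage noted in the table: the result depends entirely on which rules and words have been defined.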

CHAPTER-2

RELATED WORK

In this chapter, we present the different material we have referred to and what it is about.
We also describe how our approach differs from it.

2.1) Related Work

The research paper [1] performs speech emotion recognition using deep learning and an
attention mechanism, based on a deep neural network (DNN). It reviews recent developments
in SER and examines the impact of varying attention mechanisms on SER performance. The
article [2] comprehensively discusses SER using a machine learning approach. It highlights
the three key features used, namely MFCC (Mel Frequency Cepstral Coefficients), Mel
Spectrogram and Chroma, which were extracted with the Python package Librosa. The research
paper [3] studies various speech emotion detection methods such as Support Vector Machine
(SVM) and Hidden Markov Model (HMM). The article [4] discusses what emotions are and the
various theories about them. It highlights 6 different theories, from the first one
proposed in the 1870s to ones proposed a few decades ago.

The article [5] also sheds light on the different theories of emotion and states the 3 major
categories, i.e., physiological, neurological and cognitive theories. The article [6]
discusses the meaning of sentimental analysis and describes its different types, such as
aspect based and fine grained. The article [7] comprehensively describes sentimental
analysis, from its fundamentals and different approaches, such as machine learning and rule
based, to its challenges and applications. It also describes how it can be done using NLP.
The research paper [8] presents an in-depth study of numerous models such as Naive Bayes,
SVM and N-gram sentimental analysis. It compares the models in detail and states the
advantages and disadvantages of each approach.

We chose a Twitter-based dataset [9] that contains negative, positive and neutral tweets,
and after splitting it we used it for both training and testing. The reference [10] is a
video that comprehensively describes the Naive Bayes method: how to implement it, how it
differs from other methods, and its different types.

The reference [11] uses logistic regression as the text classifier, with TF-IDF weighting
for feature weighting. Instead of accuracy, it uses MRR (mean reciprocal rank) for
evaluation, which takes the rank of the first correct answer into consideration. The
reference [12] discusses, with an example, the use of linear_svc (support vector
classifier) for text classification and uses a confusion matrix function to assess
accuracy. The reference [13] discusses the use of the fit and transform functions for model
training. Dr. Robert Chun et al. [14] discuss k-means clustering for grouping data, and
multiple algorithms for emotion recognition such as logistic regression, naive Bayes,
linear_svc, decision tree and random forest. In reference [15], Victor Odumuyiwa and Ukachi
Osisiogu present a systematic review of the hidden Markov model for sentiment analysis. Anna
Jurek et al. [16] discuss a lexicon based model for sentiment analysis, which calculates the
sentiment from the semantic orientation of the words or phrases that occur in a text.

In reference [17], the author discusses and implements speech to text conversion using the
Google API, covering synchronous audio files, asynchronous audio files and real-time speech
input. Maghilnan and Rajesh Kumar M [18] used Sphinx4, Bing Speech API and Google Speech API
for speech recognition, with WWR as the performance metric. For speaker recognition, they
used MFCC as the feature and DTW with various distance computation methods, such as
Euclidean, Correlation and Canberra, for feature matching, with recognition rate as the
performance metric. Shivangi Nagdewani and Ashika Jain [19] used HMM and a neural network
for speech to text conversion. Nithya Roopa S. et al. [20] used the IEMOCAP corpus database
for sentiment analysis. The authors used the Inception Net v3 model to build an emotion
recognition model. Inception evolved from the GoogLeNet architecture with some enhancements.

2.2) How Our Approach Is Different

Our approach is different from the earlier used approaches in the following ways :-

• In our approach we are using both types of supervised learning, i.e., regression and
classification.
• We have provided a 2-way input option, i.e., via speech and via text.
• We have conducted an experiment comparing 3 different models trained over 2
distinct datasets.

CHAPTER-3
PROBLEM DESCRIPTION AND
SPECIFICATION

In this chapter we discuss the problem statement we have chosen and the solution we have
come up with. We also briefly describe the dataset used and the methodology we have opted
for.

3.1) Problem Statement

Speech Emotion Recognition takes human speech as audio input, converts it into text,
analyzes the text to check whether the sentiment of the input statement is positive,
negative or neutral, and then gives a graph as output depicting the analyzed emotion.

3.2) Dataset

We have used 2 Twitter-based datasets which contain positive, negative and neutral
tweets. We downloaded them from Kaggle.com [9].

3.3) Methodology/ Our Approach


This project collects voice input from the user in real time, converts it into text and
then analyzes whether the user’s statement is positive, negative or neutral.
Recognition of such human sentiments is called “Speech Emotion Recognition/Sentiment
Analysis”. It is implemented using machine learning and natural language processing, along
with other related technologies.
In SER the source of input is real-time human speech; however, we have also included an
additional input option, i.e., text. We thus provide a 2-way input acceptor. Depending upon
the input, we proceed as follows: -
Case I : If the input is in text format, machine learning is applied to it directly and the
emotions within the text are analyzed.
Case II : If the input is in speech format, our model first converts it into text and then
the emotion analysis is carried out, as shown in fig 3.
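The two cases above amount to a small dispatch step in front of the classifier. The sketch below shows that control flow only; `analyze_input`, `fake_transcribe` and `fake_classify` are hypothetical names introduced for illustration, and the two stand-in functions replace the real speech-to-text service and the trained model so that the example runs on its own.

```python
def analyze_input(user_input, is_speech, transcribe, classify):
    """Route the input through Case I (text) or Case II (speech)."""
    text = transcribe(user_input) if is_speech else user_input
    return classify(text)

# Hypothetical stand-ins so the sketch runs without a microphone or a trained model.
def fake_transcribe(audio_bytes):
    return audio_bytes.decode("utf-8")

def fake_classify(text):
    return "positive" if "great" in text.lower() else "neutral"

text_result = analyze_input("Great job, team", False, fake_transcribe, fake_classify)
speech_result = analyze_input(b"an ordinary day", True, fake_transcribe, fake_classify)
```

Passing the transcriber and classifier in as arguments keeps the routing independent of any particular speech API or model.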

We compared different models corresponding to different approaches to determine the one
with the best results. The models we compared include Logistic Regression, Linear SVC,
VADER, TextBlob and Naïve Bayes. We ran these models on the same dataset as part of our
experiment to find the best model, and we have added a graph that presents the conclusion
of that comparison. Finally, for the model that displays the best accuracy, we have added a
graphical output: a graph depicts the analyzed emotion in our SER.

Fig 3 : Step by step methodology

CHAPTER-4
SYSTEM DESIGN

In this chapter we discuss our approach comprehensively, detailing each step we have
performed together with the system design.

To solve this emotion recognition problem, we first convert the speech into text if the
input is real-time speech, and we provide a text input box if the user wants to enter the
text directly. A machine learning model is then used for text classification. This model is
trained to analyse whether the sentiment in the text is positive, negative or neutral.

4.1) Speech to Text

A deep learning model can be used that takes in audio signals, analyses them and converts
them into the corresponding text. Voice input is taken using the Google API and sent to core
cloud functions, where the human audio is converted into electronic signals and
pre-processed. It is then sent to the Speech to Text API, which applies a deep learning
model to understand what the user is trying to say. Finally, it is passed to AutoML NLP,
where it is converted to text format.

Text data is a favourable research object for emotion recognition because it is free and
available everywhere in human life. This feedback is further evaluated using an ML model
that helps to understand the user's perception.

In English it is usually easy to find the boundaries of words, because we can split a
sentence on spaces and punctuation and what is left is words. For example, take "Rohan;
friends, lend me your ears." It contains a semicolon, commas and spaces, and if we split on
those we get words that are ready for further analysis, such as Rohan, friends and so
forth. As shown in fig.4.

Fig 4: Speech to Text Code
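The word-boundary example from this section can be reproduced with Python's standard `re` module. This is a minimal sketch of the idea, not the project's own code, and the set of delimiter characters is our assumption.

```python
import re

sentence = "Rohan; friends, lend me your ears."

# Split on runs of whitespace and common punctuation, then drop empty strings.
words = [w for w in re.split(r"[\s;,.!?]+", sentence) if w]
```

Here `words` holds the six words of the sentence with the semicolon, comma and full stop removed.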

4.2) Text Pre-processing

a) Cleaning: In data cleaning we remove punctuation and symbols from the text data.

b) Stop words Removal: A stop word is a commonly used word that a search engine has been
programmed to ignore, both when indexing entries for searching and when retrieving them as
the result of a search query. As shown in the fig.5.

Fig 5: Stop words Removal

We remove these words so that they do not take up space in our database or valuable
processing time. We can remove them by storing a list of words that we consider to be stop
words and then comparing each word we want to check against this list. NLTK (Natural
Language Toolkit) in Python has lists of stop words stored for 16 different languages.
You can find them in the nltk data directory.
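The cleaning and stop-word steps can be sketched with the standard library alone. The stop list below is a hypothetical, deliberately tiny one written for this example; in practice a much larger list, such as the one NLTK provides, would be used.

```python
import string

# Hypothetical, deliberately tiny stop list (assumption made for illustration).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "all"}

def clean(text):
    """Lowercase the text and drop punctuation and symbols."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Clean the text, split it on whitespace and keep only non-stop words."""
    return [w for w in clean(text).split() if w not in stop_words]

kept = remove_stop_words("Great job to coach Walker and all the players!")
```

Only the content-bearing words survive, which is what the classifier is later trained on.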

c) Tokenization: The process of splitting an input text into meaningful chunks is called
tokenization, and each chunk is called a token. You can think of a token as a useful unit
for further semantic processing; it can be a word, a sentence, a paragraph or anything
else. A whitespace tokenizer splits the input sequence on white space, which could be a
space or any other character that is not visible. Such a whitespace tokenizer is available
in the Python library nltk.
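As a minimal illustration of whitespace tokenization, Python's built-in `str.split()` with no argument already splits on any run of invisible characters (spaces, tabs, newlines). The sample text is our own.

```python
text = "The movie was great.\tI would watch it again,\nno doubt."

# With no argument, split() breaks on any run of whitespace characters.
tokens = text.split()
```

Note that punctuation stays attached to the neighbouring word (for example `great.`), which is one reason the cleaning step is applied as well.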

d) Stemming/Lemmatization: The process of normalizing words is called stemming or
lemmatization. Stemming is the process of removing and replacing suffixes to get to the
root form of the word, which is called the 'stem'. Lemmatization usually refers to doing
things properly, with the use of a vocabulary and morphological analysis. It returns the
base or dictionary form of a word, which is known as the lemma.
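The difference between the two can be seen in a toy sketch. The suffix list and the lemma table below are hypothetical and tiny, and the stemmer only strips suffixes without replacing them; real stemmers and lemmatizers are considerably more elaborate.

```python
# Hypothetical suffix list, checked in order; far from complete.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical miniature vocabulary mapping inflected forms to their lemmas.
LEMMAS = {"wolves": "wolf", "talked": "talk", "better": "good", "studies": "study"}

def toy_lemmatize(word):
    """Look the word up in the vocabulary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

stems = [toy_stem(w) for w in ("talked", "wolves", "studies")]
lemmas = [toy_lemmatize(w) for w in ("talked", "wolves", "studies")]
```

The stemmer produces "wolv" and "stud", which are not dictionary words, while the vocabulary-based lookup returns "wolf" and "study".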

4.3) Feature Extraction

Here we convert the tokens into features, and the first way to do that is a bag of words. We
want to detect marker words, such as "beautiful" or "embarrassing", and make decisions
based on the absence or presence of each such word. To see how this can work, take the
example of three reviews: "good movie", "not a good movie" and "I did not like it".

We take all the possible words, or tokens, that we have in our documents, and for each such
token we introduce a new feature, or column, that corresponds to that word. This gives a
large matrix of numbers, and each text is translated into a vector, or row, of that matrix.
Take the review "good movie" as an example. The word "good" is present in our text, so we
put a one in the column corresponding to that word. The word "movie" also appears, so we
put a one in the second column. We have no other words, so all the other columns are zero.
The result is a very long vector that is sparse, in the sense that it has a lot of zeros.
"Not a good movie" will have four ones and all the rest zeros, and so on. This process is
called text vectorization, because we convert text into a large vector of numbers, and each
dimension of that vector corresponds to a specific token in our dataset.

This representation has some problems. The first is that we lose the order of the words: we
can shuffle the words and the representation will always be the same. That is why it is
called a bag of words: a bag is not ordered, so the words can come in any order. As shown
in the fig.6.

Fig 6: Feature Extraction
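The bag-of-words vectorization described in this section can be sketched in plain Python. The three short reviews are an assumption based on the example in the text, lowercased for simplicity; a library vectorizer would normally be used instead.

```python
# Example reviews (assumption: lowercased versions of the reviews discussed above).
reviews = ["good movie", "not a good movie", "i did not like it"]

# One column per distinct token, in order of first appearance.
vocabulary = []
for review in reviews:
    for token in review.split():
        if token not in vocabulary:
            vocabulary.append(token)

def vectorize(text):
    """Count how often each vocabulary token occurs in the text."""
    tokens = text.split()
    return [tokens.count(term) for term in vocabulary]

matrix = [vectorize(r) for r in reviews]
```

With this vocabulary, "good movie" has ones in the first two columns and "not a good movie" has four ones. Calling `vectorize("movie good")` returns the same row as `vectorize("good movie")`, which shows how word order is lost.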

4.4) Sentiment Classification using Naive Bayes Classifier

In a world where machine learning and artificial intelligence surround almost everything
around us, classification and prediction are among the most important aspects of machine
learning.

Naive Bayes is a simple but surprisingly powerful algorithm for predictive analysis. It is a
classification technique based on Bayes' theorem with an assumption of independence among
predictors. The name comprises two parts, naive and Bayes. In simple terms, a naive Bayes
classifier assumes that the presence of a particular feature in a class is unrelated to the
presence of any other feature, even if these features depend on each other or upon the
existence of the other features. For example, when a fruit is described by properties such
as colour, shape and size, all of these properties are taken to contribute independently to
the probability that the fruit is an apple, an orange or a banana, and that is why the
method is known as naive.

The naive Bayes model is easy to build and particularly useful for very large datasets. It
rests on Bayes' theorem from probability theory and statistics, which is alternatively
known as Bayes' law or Bayes' rule. The theorem describes the probability of an event based
on prior knowledge of the conditions that might be related to the event.
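As a small worked example of the rule, suppose we want the probability that a tweet is positive given that it contains the word "great". The counts below are hypothetical numbers chosen for the arithmetic, not figures from our dataset.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical counts from an imagined corpus of 1000 tweets.
total = 1000
positive = 400            # tweets labelled positive
great_in_positive = 120   # positive tweets containing "great"
great_overall = 150       # all tweets containing "great"

p_positive = positive / total                        # P(positive) = 0.4
p_great_given_positive = great_in_positive / positive  # P("great" | positive) = 0.3
p_great = great_overall / total                      # P("great") = 0.15

p_positive_given_great = posterior(p_positive, p_great_given_positive, p_great)
```

The result is about 0.8, which agrees with the direct count: 120 of the 150 tweets containing "great" are positive.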

Bayes' theorem is a way to figure out conditional probability. The conditional probability
is the probability of an event happening given that it has some relationship to one or more
other events. We have a dataset which is a list of tweets from Twitter. For example, one
tweet says "Great job to coach walker and all the players". That is a positive tweet and it
is labelled positive. There are about 172215 such examples in our dataset, each with a
positive or negative label. We use those labels and the tweet text to train a classifier
that helps us distinguish a positive review from a negative review. As shown in the fig.7.

Fig 7: Dataset Rows and Columns

4.5) Training and Testing

We do a train and test split, using 80% of the data for training and a 20% holdout for
testing. We then train the naive Bayes classifier. The great thing about naive Bayes is that
it is really fast and needs only one pass over the data. As shown in fig.8.

Fig 8: Training and Testing
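
A minimal sketch of the 80-20 split is given below, assuming scikit-learn's train_test_split
function. The toy texts, the labels and the random_state value are placeholders introduced for
illustration and are not our actual data.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the tweets and their labels.
texts = [f"sample tweet {i}" for i in range(10)]
labels = ["positive", "negative"] * 5

# test_size=0.2 holds out 20% of the rows for testing;
# random_state fixes the shuffle so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```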

The first step is to handle the data: we load it from the CSV file and split it into training
and test datasets. The second step is to summarise the data: we summarise the properties of the
training dataset so that we can calculate probabilities and make predictions. The third step is
to make a single prediction using the summary of the dataset. We then generate predictions for
the whole test dataset, given the summarised training dataset. Finally, we evaluate the accuracy
of the predictions made for the test dataset as the proportion of predictions that are correct,
and tie these steps together to form our naive Bayes classifier model. [10]
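
The steps above can be sketched from scratch as follows. This is only an illustrative
multinomial-style naive Bayes over word counts with add-one smoothing, written under our own
assumptions; the function names and the toy rows are hypothetical and this is not the code
used in the project.

```python
import math
from collections import Counter, defaultdict


def summarise(train_rows):
    """Step 2: count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in train_rows:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def predict_one(text, summary):
    """Step 3: return the class with the highest log posterior for one text."""
    class_counts, word_counts, vocabulary = summary
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(test_rows, summary):
    """Steps 4 and 5: predict every test row and return the fraction correct."""
    correct = sum(predict_one(text, summary) == label for text, label in test_rows)
    return correct / len(test_rows)


# Step 1: toy rows standing in for the CSV data, already split into train and test.
train_rows = [
    ("great job team", "positive"),
    ("great win today", "positive"),
    ("terrible loss today", "negative"),
    ("bad job terrible game", "negative"),
]
test_rows = [("great game", "positive"), ("terrible job", "negative")]

summary = summarise(train_rows)
print(accuracy(test_rows, summary))
```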

4.6) Models for classifications

a) Multinomial Naive Bayes Model

There is a Python library called scikit-learn, which helps in creating naive Bayes models in
Python. The scikit-learn library provides three types of naive Bayes model: Gaussian,
multinomial and Bernoulli. Here we use the multinomial model, which is used for discrete
counts. Multinomial naive Bayes falls under supervised learning and is a classification
algorithm with which we can classify text. Text classification means that, given data in the
form of text or statements, our job is to tell which class a particular statement belongs to.
Here we want to do sentiment analysis, i.e., we want to know whether the review that a user
posts is a positive, negative or neutral review. This kind of work can be done with multinomial
naive Bayes, as shown in fig.9.

Fig 9: Multinomial Naive Bayes Model Training
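
A minimal sketch of multinomial naive Bayes on word counts is given below, assuming
scikit-learn's CountVectorizer and MultinomialNB. The six training sentences are invented
for illustration, and alpha=1.0 (add-one smoothing) is simply the library default rather
than a tuned value from our experiments.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy sentences, two per class.
train_texts = [
    "great job to the coach and players",
    "what a wonderful performance",
    "this was a terrible game",
    "awful and disappointing result",
    "the match starts at noon",
    "the team plays on friday",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# Turn each sentence into a vector of discrete word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# alpha is the additive smoothing parameter.
model = MultinomialNB(alpha=1.0)
model.fit(X_train, train_labels)

# New text must be transformed with the same fitted vectorizer.
X_new = vectorizer.transform(["wonderful job"])
predicted = model.predict(X_new)[0]
print(predicted)
```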

b) Logistic Regression model

Logistic regression is a classification algorithm that can be used to solve a binary
classification problem. In the binary model the result is usually defined as 0 or 1. Here,
binary classification with logistic regression is applied to the data assigned for training and
testing. First, standardization is applied for pre-processing; the model is then trained on the
training data with fit() and evaluated on the test data using the predict() method. When you
instantiate the logistic regression class, you can change the values of `solver`, `penalty` and
`C`, and even specify whether a multi-class problem should be treated as one-versus-all or
multinomial.[11] As shown in fig.10.

Fig 10: Training Logistic Regression Model
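
The standardize, fit() and predict() sequence described above can be sketched as follows,
assuming scikit-learn's StandardScaler and LogisticRegression. The numeric toy features and
the chosen solver and C values are assumptions made for illustration only.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy binary problem with two numeric features per sample.
X_train = [[1.0, 20.0], [2.0, 25.0], [3.0, 30.0],
           [8.0, 70.0], [9.0, 80.0], [10.0, 90.0]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[1.5, 22.0], [9.5, 85.0]]

# Standardization first, then logistic regression.
# C is the inverse regularization strength; the default penalty is L2.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", C=1.0),
)

model.fit(X_train, y_train)            # train on the training data
predictions = model.predict(X_test)    # evaluate on unseen data
print(list(predictions))
```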

c) Linear Support Vector Model (SVC)

SVM is a supervised machine learning algorithm that can be used for both classification and
regression challenges. Classification predicts the label / group and regression predicts a
continuous value. SVM performs classification by finding a hyper-plane that separates the
classes in the n-dimensional feature space. The LinearSVC method uses a linear kernel
function to perform classification and is well suited to working with large numbers of samples.
Compared with the SVC (support vector classifier) model, LinearSVC has additional parameters
such as the penalty normalization, which applies l1 or l2, and the loss function. The kernel of
this model cannot be changed because it is based on the linear kernel method itself.[12]

We load the required libraries and use the LinearSVC class from sklearn.svm. We split x and y
into train and test parts, using 20% of the data as test data. We fit the model on the train
data and check the training score. We can also apply a cross-validation method to check the
training score.

Following is the code, as shown in fig.11:

Fig 11: Training using Linear_svc model
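
A hedged sketch of these steps is given below, assuming scikit-learn's LinearSVC and
cross_val_score. The data here is synthetic (generated with make_classification) and the
values of C, max_iter and cv are illustrative assumptions, so the printed scores say nothing
about our datasets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data: 200 samples with 10 numeric features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out 20% of the data as test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# LinearSVC always uses a linear kernel; penalty and loss can be configured.
model = LinearSVC(C=1.0, max_iter=10000)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)                 # training accuracy
cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # 5-fold cross validation
test_score = model.score(X_test, y_test)                    # held-out accuracy

print(train_score, cv_scores.mean(), test_score)
```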

CHAPTER-5
RESULT & IMPLEMENTATION

In this chapter, we discuss the results obtained after training the models and after the full
execution of the code, along with a comparison table and comparison graphs for proper
analysis.

5.1) Data Collection

Initially, to implement the Speech Emotion Recognition system we have to collect input
samples, be it in the form of text, audio files or real-time voice input. These will be used to
train the model. The audio input is usually in wav or mp3 format.

PYTHON LIBRARY: After data collection we need to represent these input files in numeric
form in order to do more extensive analysis on them. This is called feature extraction, where
numeric values for different features of the input are determined. The Speech_recognition
module in Python was used for this, which provides functions for short-term feature extraction.
After this step, each input was represented in a CSV file as a row with 9 columns depicting
varying features. The libraries which were used are:

• Numpy
• Matplotlib
• Pandas
• Seaborn

5.2) Data Visualisation

We then visualize some of the data to get a better understanding of the problem and the type
of solution to be built.
df.sample(10): This statement in Python can be used to see some randomly chosen rows of the
dataset.

ANALYSING DATASET:
First, we extract the unique labels of the dataset using the unique() method on the label
column. This shows that there are three different labels, namely positive, negative and neutral.
The next step is to count the number of tweets with each label in order to analyse the
distribution of labels in the dataset. This is done using the value_counts() function.

The next step is to plot a graph of the number of tweets per label, as shown in fig.12.

Fig 12: Dataset Analysis
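
The inspection calls mentioned in this section can be sketched as follows, assuming pandas.
The small DataFrame and its column names "text" and "label" are assumptions made for
illustration; the real dataset is loaded from a CSV file.

```python
import pandas as pd

# Hypothetical miniature dataset with assumed column names.
df = pd.DataFrame({
    "text": ["great job", "bad game", "match at noon",
             "lovely win", "awful loss", "team plays friday"],
    "label": ["positive", "negative", "neutral",
              "positive", "negative", "neutral"],
})

# Look at a few randomly chosen rows.
print(df.sample(3, random_state=0))

# Distinct labels present in the dataset.
unique_labels = df["label"].unique()
print(unique_labels)

# Number of rows per label, used to check whether the classes are balanced.
label_counts = df["label"].value_counts()
print(label_counts)
```

The resulting counts can then be passed to a plotting library such as Matplotlib or Seaborn to
draw the bar chart of the label distribution.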

5.3) Training the Model

Training a machine learning model is quite a simple task in Python if we have a proper dataset.
A proper dataset has features such as clean data, clear labelling, no useless data and no
useless columns. The data is divided for training and testing using the train_test_split
function in the sklearn.model_selection module.

Following are the steps involved in it:


1. First of all, we need to split our dataset into two parts, i.e., one for training our model and
one for testing, using an 80-20 split. That means 80% of the data in the dataset will be used for
training the model and the remaining 20% for testing whether our model works properly
or not.
2. The next step is to implement the feature extraction process. Feature extraction means
converting textual content to a numerical representation for sentiment analysis. This is done
using the TfidfVectorizer class in the sklearn.feature_extraction.text module.
3. The next step is the removal of stop words from the tweets, i.e., words which carry very
little meaning for us and would not contribute much to the training of the model.
4. One major step is model fitting using the fit() function. Model fitting is a measure of how
well a machine learning model generalizes to data similar to that on which it was trained. A
model that is well-fitted produces more accurate outcomes. A model that is overfitted matches
the data too closely. A model that is underfitted doesn't match closely enough.
5. Now we use the transform() function. For an imputer, for instance, transform() replaces
the NaNs in a column with the value calculated by fit() and returns the new dataset; in the
same way, the transform() method of TfidfVectorizer converts text into numeric vectors using
what fit() learned from the training data. The fit_transform() method does both things
internally and makes it easy for us by exposing one single method. But there are instances
where you want to call only the fit() method or only the transform() method.[13]
6. Now we use three models to train on our data and compare which of them is the better
model. The three models are the multinomial naive Bayes model, the logistic regression
model and the linear support vector classifier model.
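
The six steps above can be put together roughly as follows. This is a sketch under our own
assumptions, using scikit-learn's TfidfVectorizer with its built-in English stop-word list and
default model settings; the twelve toy sentences are invented, so the printed accuracies are
not meaningful results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Invented toy data, four sentences per class.
texts = [
    "great job by the team", "what a wonderful win",
    "lovely performance today", "really great and happy result",
    "terrible game by the team", "what an awful loss",
    "bad performance today", "really sad and poor result",
    "the match is on friday", "the team plays at noon",
    "tickets are sold at the gate", "the game starts today",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

# Step 1: 80-20 split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0
)

# Steps 2 and 3: TF-IDF feature extraction with stop-word removal.
vectorizer = TfidfVectorizer(stop_words="english")

# Steps 4 and 5: learn the vocabulary on the training text only (fit),
# then convert both splits to vectors (transform).
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Step 6: train the three models and compare their test accuracy.
models = {
    "Multinomial NB": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVC": LinearSVC(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    scores[name] = model.score(X_test_vec, y_test)

print(scores)
```

Calling fit_transform() on the training text and only transform() on the test text keeps the
test data from influencing the learned vocabulary.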

5.4) Result

After training on our dataset using all three models, we make a single common testing
function for all models in order to find the best among the three. Following are the results
we got after implementing our model using the various techniques:

5.4.1) Models
a) Multinomial Naïve Bayes Model
As shown in fig.13, the support is the same for all three classes: 34534 for each class. This
means our data is balanced. Hence, accuracy does a pretty good job of blending specificity and
sensitivity, recall and precision. Here our model gives an accuracy of 86% for dataset 1 in
fig.13 and an accuracy of 87% for dataset 2 in fig.14.
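
To make the reported quantities concrete, the sketch below computes accuracy and the per-class
precision, recall, f1-score and support from true and predicted labels. It is a plain-Python
illustration written under our own assumptions, with a hypothetical function name and invented
labels, and is not the testing function used to produce the figures.

```python
from collections import Counter


def classification_summary(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall, F1 and support."""
    pairs = list(zip(y_true, y_pred))
    support = Counter(y_true)  # number of true samples per class
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": support[label],
        }
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, report


# Invented labels: two samples per class, so the classes are balanced.
y_true = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
y_pred = ["positive", "negative", "negative", "negative", "neutral", "positive"]

accuracy, report = classification_summary(y_true, y_pred)
print(accuracy)
for label, row in report.items():
    print(label, row)
```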

Fig.13: Multinomial Naïve Bayes on Dataset-1

Fig.14: Multinomial Naïve Bayes on Dataset-2

b) Logistic Regression Model

As explained above, the data is balanced. The accuracy of the model is 95% for dataset-1, as
shown in fig.15, and 95% for dataset-2, as shown in fig.16.

Fig.15: Logistic Regression on Dataset-1

Fig.16: Logistic Regression on Dataset-2

c) Linear Support Vector Classifier Model
The accuracy of the model is 96% on dataset-1, as shown in fig.17, and 96% on dataset-2, as
shown in fig.18.

Fig.17: Linear_svc on Dataset-1

Fig.18: Linear_svc on Dataset-2

5.4.2) Comparison Table & Graphs

After training and testing the two datasets, namely dataset-1 and dataset-2, on all three
models as shown above, the following comparison table was created.

Model                                    Precision   Recall   f1-Score   Accuracy

Multinomial NB on Dataset-1              86%         87%      87%        87%
Logistic Regression Model on Dataset-1   95%         95%      95%        95%
Linear SVC on Dataset-1                  96%         96%      96%        96%
Multinomial NB on Dataset-2              88%         88%      88%        87%
Logistic Regression Model on Dataset-2   95%         95%      95%        95%
Linear SVC on Dataset-2                  96%         96%      96%        96%

Table-3: Comparison Table

The above table shows that the Linear_svc model gives the best accuracy, 96%, on both
datasets. Hence, we conclude that this is the most efficient among the three models we used
for speech emotion recognition. The graphs in fig.21 and fig.22 below depict this conclusion.

Fig.21: Graph Depicting accuracy of all the three models on dataset-1

Fig.22: Graph Depicting accuracy of all the three models on dataset-2

5.4.3) Example

Now that we know that the best accuracy is given by the linear svc model, we take some input
data and check its sentiment to confirm whether our code is working properly or not. Two
options are given: you can either type some text whose sentiment you wish to analyse, or give
some real-time speech input by speaking through a microphone. To show a sample result, we
select option 2, which is input by text, as shown in fig.19.

Fig.19: Input Options

Once we have entered 2 and pressed enter, a text box is displayed. Here we need to insert the
text whose sentiment we want to analyse. For example, we enter: “We take pride in the quality
of our food and hope that we can continue delivering great-tasting meals. If there is anything
else we can do for you, please let us know.” This is a positive review, as shown in fig.20.

Fig.20: Sample Output
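
The text-input path can be sketched as below, assuming a scikit-learn pipeline of
TfidfVectorizer and LinearSVC trained on a few invented sentences. The helper name
predict_sentiment is hypothetical; for the speech option the spoken input would first have to
be converted to text, which is not shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented training sentences standing in for the real dataset.
train_texts = [
    "great food and friendly service", "we loved the tasty meals",
    "terrible food and rude service", "the meal was cold and awful",
    "the restaurant opens at noon", "the menu is printed on paper",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# Vectorizer and classifier chained so raw text can be passed in directly.
classifier = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
classifier.fit(train_texts, train_labels)


def predict_sentiment(text):
    """Return the predicted sentiment label for one piece of text."""
    return classifier.predict([text])[0]


review = ("We take pride in the quality of our food and hope that we can "
          "continue delivering great-tasting meals.")
result = predict_sentiment(review)
print(result)
```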

5.4.4) Limitations

At times, it is hard to understand human emotions because of their subjective nature. An
emotion expressed by one person can be perceived differently by two or more different
individuals. Moreover, in recent times people frequently make sarcastic remarks, which are
troublesome to interpret and require good intellect to understand properly. In addition, some
people consider showing one's emotions a sign of weakness, and hence often try to hide them.

In our project, the steps for developing a SER system were discussed and analysed
comprehensively, and some experiments as well as research work were carried out to
understand the effect of each step. Initially, the availability of only a few datasets with
positive, negative as well as neutral reviews over the internet made it hard to properly train a
model. In the next step, various approaches to feature extraction had been proposed in earlier
studies; hence, numerous experiments had to be conducted to select the best approach. In the
final step, classifier selection is performed for the SER system to determine the strengths and
weaknesses of every classification algorithm.[14] In the end, improving the accuracy of the
code turned out to be quite a challenging task, and an enormous amount of time and effort
went into increasing it above 90%.

CHAPTER-6
CONCLUSION & FUTURE SCOPE

In this chapter, we will be discussing the conclusion and the future scope of our SER project.

6.1) Conclusion

After training on our datasets with all three models, namely multinomial naive Bayes, logistic
regression and linear_svc, we found the following results. Multinomial naive Bayes gives an
average accuracy of 87%, logistic regression gives an average accuracy of 95% and linear_svc
gives an average accuracy of 96%. Hence, we can conclude that linear_svc is the most efficient
among the three models and gives the best accuracy for emotion analysis.

6.3) Future Scope

For future improvements, this project can be further improved in terms of efficiency, usability
and accuracy. We can integrate a GUI for better and easier usability. We can include
additional features like taking pre-recorded audio files as input, and can improve its
effectiveness on sarcastic statements and human dialogue. Facial expression recognition can
also be added to this project, which could help further in detecting human emotions as fully
as possible. Therefore, this speech emotion recognition system can have numerous
applications in the near future.

CHAPTER-7
REFERENCES

In this chapter, we will list the references from which we have taken guidance for our project.

1) Eva Lieskovská, Maroš Jakubec, Roman Jarina and Michal Chmulík, (2021), “A Review
on Speech Emotion Recognition Using Deep Learning and Attention Mechanism” -
https://doi.org/10.3390/electronics10101163
2) https://www.analyticsinsight.net/speech-emotion-recognition-ser-through-machine-learning/
3) Aastha Joshi, Rajneet Kaur, (2013), “ A Study of Speech Emotion Recognition Methods ”
- IJCSMC, Vol. 2, Issue. 4
4) https://www.verywellmind.com/theories-of-emotion-2795717
5) http://web.archive.org/web/20210422223951/http://www.sparknotes.com/psychology/psych101/emotion/section1/
6) http://web.archive.org/web/20210120172412/https://www.analyticsinsight.net/types-of-sentiment-analysis-and-how-brands-perform-them/
7) https://monkeylearn.com/sentiment-analysis/
8) Devika M D, Sunitha C, Amal Ganesh (2016) ; “Sentiment Analysis: A Comparative Study
On Different Approaches”.
9) https://www.kaggle.com/talhaaljunaid170221/emotion-negative-positive-neutral-text-data
10) https://www.subtitlelist.com/en/Naive-Bayes-Classifier-in-Python-Naive-Bayes-Algorithm-Machine-Learning-Algorithm-Edureka-219828
11) https://kavita-ganesan.com/news-classifier-with-logistic-regression-in-python/#.Yct5z2hBy3A
12) https://www.datatechnotes.com/2020/07/classification-example-with-linearsvm-in-python.html
13) https://towardsdatascience.com/fit-vs-transform-in-scikit-libraries-for-machine-learning-3c70e6300ded
14) Sundarprasad, Neethu, (2018),"SPEECH EMOTION DETECTION USING MACHINE
LEARNING TECHNIQUES" - Master's Projects. 628.
15) Victor Odumuyiwa, Ukachi Osisiogu (2019) : A Systematic Review on Hidden Markov
Models For Sentimental Analysis.
16) Anna Jurek, Maurice D. Mulvenna, Yaxin Bi (2015): Improved lexicon-based sentiment
analysis for social media analytics.
17) https://towardsdatascience.com/easy-speech-to-text-with-python3df0d973b426
18) Maghilnan S, Rajesh Kumar M, Senior IEEE, Member School of Electronic Engineering
(2017), Sentiment Analysis on Speaker Specific Speech Data , VIT University,Tamil Nadu,
India.
19) Shivangi Nagdewani, Ashika Jain (2020): A REVIEW ON METHODS FOR SPEECH-TO-TEXT
AND TEXT-TO-SPEECH CONVERSION, International Research Journal of Engineering and
Technology (IRJET)

20) Dhanush Kumar S, Lavanya S, Madhumita G, Mercy Rajaselvi V (2018): Journal of Speech
to Text Conversion, International Journal of Advance Research, Ideas and Innovations in
Technology.
21) Nithya Roopa S., Prabhakaran M, Betty.P (2018): Speech Emotion Recognition using Deep
Learning, International Journal of Recent Technology and Engineering.
