
Audio-Visual Based Emotion Recognition - A New Approach

Authors: Mingli Song, Jiajun Bu, Chun Chen, Nan Li


Problem statement: Emotion recognition is one of the latest challenges in intelligent
human/computer communication. Most previous work on emotion recognition focused on
extracting emotions from visual or audio information separately.
Aim: To develop an approach that recognizes human emotion from video clips using both visual
and audio information.
Solution: Facial Animation Parameters (FAPs)-compliant facial feature tracking based on an
active appearance model is performed on the video to generate two vector streams, one
representing the expression features and the other the visual speech features. Alongside the
visual vectors, an audio vector is extracted in terms of low-level features. A tripled Hidden
Markov Model (THMM) is then introduced to perform the recognition; it allows state asynchrony
between the audio and visual observation sequences while preserving their natural correlation
over time.
Results: They developed a computer fusion system that automatically recognizes the subject's
emotions based on audio-visual analysis. Both audio and visual signals are used as input to the
system so that the subject's emotion can be obtained more precisely, and a tripled HMM is
designed and trained to recognize the principal emotion from the audio and video observation
sequences. Unlike a conventional HMM, the THMM allows for asynchrony between the audio
and visual states while preserving the natural dependency of the audio and video signals.
Furthermore, with the tripled HMM the audio and the two visual sequences are treated
separately, so there is no need to concatenate the observations, which is often a challenging
problem. The advantage of the THMM is confirmed by the experimental results. This model can
be applied to a variety of human/machine systems.
Recommendation: In future work, the current model should be improved to increase its
efficiency. Besides this, research on recognizing emotion intensity will be carried out through the
analysis of audio features, which differs from the present approach. In addition, more effort will
be devoted to applying this system to intelligent human/machine systems such as robotics,
intelligent home appliances, etc.
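As a rough illustration of HMM-based audio-visual emotion recognition (not the paper's tripled HMM, which couples the streams and models their asynchrony jointly), the sketch below trains one Gaussian HMM per emotion and per stream with hmmlearn and combines the per-stream log-likelihoods at decision time. The emotion labels, feature names, and dimensions are assumptions for illustration only.

```python
# Minimal sketch: per-stream Gaussian HMMs as a stand-in for the paper's
# tripled HMM (the THMM couples the streams; this naive baseline does not).
# Feature dimensions and emotion labels are illustrative assumptions.
import numpy as np
from hmmlearn import hmm

EMOTIONS = ["happy", "sad", "angry", "neutral"]                 # assumed label set
STREAMS = {"audio": 12, "expression": 10, "visual_speech": 8}   # assumed dims

def train_models(train_data):
    """train_data[emotion][stream] -> list of (T_i, dim) feature sequences."""
    models = {}
    for emo in EMOTIONS:
        models[emo] = {}
        for stream in STREAMS:
            seqs = train_data[emo][stream]
            X = np.vstack(seqs)                      # stack all sequences
            lengths = [len(s) for s in seqs]         # per-sequence lengths
            m = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                                n_iter=50, random_state=0)
            m.fit(X, lengths)
            models[emo][stream] = m
    return models

def classify(models, sample):
    """sample[stream] -> (T, dim) sequence; picks the emotion whose models
    give the highest summed per-stream log-likelihood (naive late fusion)."""
    scores = {emo: sum(models[emo][s].score(sample[s]) for s in STREAMS)
              for emo in EMOTIONS}
    return max(scores, key=scores.get)
```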

An Investigation on Visual and Audiovisual Stimulus Based Emotion Recognition Using EEG
Authors: Ramachandran Nagarajan
Aim: To investigate the possibility of using visual and audio-visual stimuli for detecting human
emotion by measuring the electroencephalogram (EEG).
Solution: An approach to classify human emotions using EEG signals. An acquisition protocol
based on audio-visual and visual stimuli is designed to acquire emotional data. From these data,
multiresolution analysis (MRA) with the 'db4' wavelet function is used to extract two statistical
features in the time-frequency domain. The extracted features are then classified using an
MLP-BP neural network. They compared the results of the two statistical features of the EEG
data using the 'db4' wavelet function and the MLP-BP neural network. The neural-network-based
classification performs well in distinguishing the different emotional states of the subjects, so the
wavelet-based feature extraction of the EEG signal in the alpha band has proved successful in
distinguishing emotions from EEG signals.
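A minimal sketch of this kind of wavelet-based feature extraction, using PyWavelets; the sampling rate, decomposition level, and the two statistics per sub-band are assumptions for illustration, since the paper's exact feature definitions are not reproduced here.

```python
# Hedged sketch: 'db4' wavelet decomposition of one EEG channel plus two simple
# statistics per sub-band. Sampling rate, level, and statistics are assumptions.
import numpy as np
import pywt

def wavelet_features(eeg, wavelet="db4", level=5):
    """eeg: 1-D array for one EEG channel. Returns two statistics per sub-band."""
    coeffs = pywt.wavedec(eeg, wavelet, level=level)  # [cA_L, cD_L, ..., cD_1]
    feats = []
    for c in coeffs:
        power = np.mean(c ** 2)        # mean sub-band power
        spread = np.std(c)             # sub-band standard deviation
        feats.extend([power, spread])
    return np.array(feats)

# Example: a 5 s synthetic signal sampled at 128 Hz stands in for real EEG.
rng = np.random.default_rng(0)
signal = rng.standard_normal(128 * 5)
print(wavelet_features(signal).shape)
```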
Results: They proposed an emotion recognition system to recognize the human affective state
from visual and audio-visual stimuli. The statistical features are successfully extracted from the
EEG signals for classifying emotions. For the energy attribute, the audio-visual stimulus gives
the better result, indicating that the audio-visual stimulus performs better even though the
difference is small. For the power attribute, it can also be concluded that the audio-visual
stimulus performs well, as the deviation is large (67.33%). However, stronger emotional video
clips could be used to evoke the emotions and further improve the classification accuracy of the
audio-visual stimulus over the visual stimulus. Hence, the experimental results confirm that the
audio-visual stimulus performs better than the visual stimulus.

A Comparative Analysis of Modeling and Predicting Perceived and Induced Emotions in Sonification
Authors: Faranak Abri, Luis Felipe Gutiérrez, Prerit Datta, David R. W. Sears, Akbar Siami
Namin and Keith S. Jones
Sonification is the utilization of sounds to convey information about data or events. There are
two types of emotions associated with sounds: (1) “perceived” emotions, in which listeners
recognize the emotions expressed by the sound, and (2) “induced” emotions, in which listeners
feel emotions induced by the sound. Although listeners may widely agree on the perceived
emotion for a given sound, they often do not agree about the induced emotion of a given sound,
so it is difficult to model induced emotions.
Problem Statement: In particular, the music industry has extensively studied the effects of
soundtracks on individuals’ emotions. Conventionally, emotion recognition models can be
categorical or dimensional. Categorical models consider emotions with discrete labels (such as
happiness, sadness, anger, fear, surprise, and disgust ), whereas dimensional models characterize
emotions along one or more dimensions (such as arousal and valence). The Geneva Emotional
Music Scales (GEMS) model has been widely used for measuring emotions induced by music,
and the arousal–valence dimensional model has been used in studies of perceived and induced
emotions. To the authors' knowledge, there is no comprehensive study of the performance of the
prediction of perceived and induced emotions from acoustic features.
Aim: This paper describes the development of several machine and deep learning models that
predict the perceived and induced emotions associated with certain sounds, and it analyzes and
compares the accuracy of those predictions.
Dataset used: In this paper, they explore emotion recognition using two datasets, IADSE and
EmoSoundscape, each of which represents emotions in a two-dimensional space (i.e., arousal and
valence). IADSE is a set of sounds for which induced emotions have been measured, while
EmoSoundscape is a set of sounds for which perceived emotions have been measured.

Methodology: In previous work with the IADS and EmoSoundscape datasets [62], they reported
that Random Forest outperformed other models in arousal/valence (A/V) prediction using a 1D
psycho(acoustic) feature set, while other models mostly suffered from overfitting. This result is
somewhat expected because ensemble models, which combine the predictions of several base
models, reduce the risk of overfitting. Therefore, they chose Random Forest as one of the
prediction models for these datasets in this paper. Random Forest (RF) is an ensemble method
that averages the predictions of several decision trees. To compare the predictions of the
ensemble model (RF) with deep models, they developed a multilayer perceptron model and a 1D
convolutional neural network model. For all of the models, they used 30% of the data as test data
and also applied 5-fold cross-validation (CV). To compute the training and testing errors, they
averaged the RMSE values over the 5 folds.
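A hedged sketch of the evaluation protocol described above: hold out 30% of the data as a test set, run 5-fold CV on the training portion, and report mean RMSE. The synthetic features and the Random Forest / MLP settings are assumptions, not the paper's configuration.

```python
# Hedged sketch of the 30% hold-out + 5-fold CV + RMSE protocol.
# Synthetic regression data stands in for the acoustic feature sets.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_regression(n_samples=600, n_features=40, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

models = {
    "RandomForest": RandomForestRegressor(n_estimators=200, random_state=0),
    "MLP": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0),
}
for name, model in models.items():
    # cross_val_score returns negative RMSE; flip the sign and average the folds.
    scores = cross_val_score(model, X_tr, y_tr, cv=5,
                             scoring="neg_root_mean_squared_error")
    print(f"{name}: mean CV RMSE = {-scores.mean():.2f}")
```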

Title: EMOTION DETECTION THROUGH AUDIO USING MACHINE LEARNING

Authors: V. Kranthi Sai Reddy


The purpose of this study is to detect emotions from a person's voice by analyzing frequency,
pitch, energy, and speaking rate. Speech signals were captured with microphones from 300
people, both male and female, in the age group of 20-30 years. All the voice samples for emotion
detection were recorded under everyday circumstances, which means the samples may contain
noise such as that of a fan or other common background noise. For emotion detection, people
were asked to speak in four different emotions, i.e., 'happy', 'normal', 'sad', and 'angry'. The Praat
tool was used to collect the voice samples, and pitch, energy, and speaking rate were extracted
from them. For classification, the MLP machine learning algorithm was used, as well as
AdaBoost with C4.5.
MLP did not perform well for emotion detection: its recognition rate was low (51.2%), since
MLP performs well when the number of input units is small, whereas emotion detection involves
a large number of input units. The C4.5 algorithm, which performs the classification task
quickly, was then applied and achieved a 76.25% recognition rate. To enhance the classification
performance of C4.5, the AdaBoost boosting algorithm was used, yielding very good results with
an average accuracy of 93.12%. Moreover, it was concluded that emotion varies with gender;
from the experiments, it is more difficult to detect the emotions of females than of males.

C4.5 & ADABOOST

C4.5 is a machine learning algorithm used to generate a decision tree, developed by Ross Quinlan.
C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can
be used for classification, and for this reason C4.5 is often referred to as a statistical classifier.
AdaBoost stands for Adaptive Boosting; it is a machine learning meta-algorithm formulated by
Yoav Freund and Robert Schapire. It can be used in conjunction with many other types of
learning algorithms to improve performance. AdaBoost is mainly used to boost the performance
of decision trees on classification tasks rather than regression. In this project, AdaBoost is used
to boost the performance of C4.5.
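A hedged sketch of boosting a decision tree in the spirit of AdaBoost + C4.5, using scikit-learn (whose trees are CART rather than C4.5, so this is an approximation); the synthetic data stands in for the pitch/energy/speaking-rate features, and the parameter naming follows recent scikit-learn releases.

```python
# Hedged sketch: a single shallow tree vs. AdaBoost over shallow trees.
# scikit-learn uses CART, not C4.5, and the data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=4, n_clusters_per_class=1,
                           random_state=0)  # 4 classes ~ happy/normal/sad/angry

single_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
boosted = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                             n_estimators=100, random_state=0)

for name, clf in [("single tree", single_tree), ("AdaBoost + tree", boosted)]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```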

Title: Speech Emotion Recognition using Support Vector Machines


Authors: Aaron Don M. Africa, Anna Rovia V. Tabalan, Mharela Angela A. Tan

Recognition technology has been developed continuously over the years, and its applications in a
wide variety of fields open up massive opportunities to bridge the gap between humans and
computers. Although computers are designed to make everyday life easier, there is still an
undeniable lack of deep understanding because computers lack knowledge of the complex
emotions present in human beings, and this often prevents them from offering help that is
specific and suitable for the user. Therefore, it is important to further develop today's technology,
and one promising way to accomplish this is to use speech recognition to recognize and classify
emotions as well. This way, the computer essentially understands the user well enough to give
valuable aid instead of just preset actions. The Support Vector Machine is one of the leading
classification algorithms today, boasting the highest accuracy rate, which makes it the most
viable option for this field of study.
THEORETICAL CONSIDERATION
The main feature is the Support Vector Machine. SVM is a machine learning algorithm that uses
structural risk minimization. SVM works by mapping an N-dimensional input into a higher-
dimensional feature space using various kernel functions. Afterward, the algorithm tries to find
the best possible generalization that separates the classes with hyperplanes. Several methods can
be used to train a support vector machine; one promising way is to use a k-nearest neighbor with
a Gaussian kernel. A study done in 2008 compared various methods such as LS-SVM, FLS-SVM,
and LS+k-NN-SVM with clustered KSVM and concluded that the latter had the best accuracy
out of all the standard methods. Factors such as the size of the dataset or the complexity of the
hyperplane or hypersurface can increase the number of support vectors that will be required. This
is illustrated in Figure 2, which shows how the required number of support vectors changes and
how performance changes along with it.
Fig. 2: Performance of SVM with respect to the number of support vectors
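As a minimal illustration of the SVM-with-Gaussian-kernel idea described above (not the clustered KSVM or LS-SVM variants from the cited study), the sketch below fits an RBF-kernel SVM with scikit-learn and reports the number of support vectors; the synthetic data and hyperparameters are assumptions.

```python
# Minimal RBF-kernel SVM sketch; synthetic 2-class data stands in for speech
# features, and the hyperparameters are illustrative, not tuned.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_tr, y_tr)

svm = clf.named_steps["svc"]
print("test accuracy:", clf.score(X_te, y_te))
print("support vectors per class:", svm.n_support_)  # grows with data/boundary complexity
```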

The Automation System Censor Speech for the Indonesian Rude Swear Words Based on
Support Vector Machine and Pitch Analysis
Authors: S N Endah, D M K Nugraheni, S Adhy and Sutikno

According to Law No. 32 of 2002 and Indonesian Broadcasting Commission Regulations No.
02/P/KPI/12/2009 and No. 03/P/KPI/12/2009, broadcast programs must not scold with harsh
words or harass, insult, or demean minorities and marginalized groups. However, there are no
suitable tools to censor those words automatically, so research into intelligent software that
censors the words automatically is needed. To perform the censoring, the system must be able to
recognize the words in question. This research proposes classifying speech into two classes using
a Support Vector Machine (SVM): the first class is a set of rude words and the second class is a
set of proper words. The pitch values of the speech are used as the SVM input for developing the
system for Indonesian rude swear words. The experimental results show that SVM works well
for this system.
Proposed system:
They propose intelligent software that automatically censors rude swear words in Indonesian
speech using SVM. The input to the SVM is the pitch value of the speech. Experiments were
conducted to distinguish male and female voices, and each voice category also distinguished
between swear words consisting of a single word and those consisting of a phrase. Each
experiment used different word data, and the training data were also different from the testing
data. The positive class consists of words or phrases categorized as rude or curse words, while
the negative class consists of proper words.
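A hedged sketch of the pitch-plus-SVM idea: extract a pitch contour with librosa and feed simple pitch statistics to a binary SVM. The pitch estimator, the chosen statistics, and the file names/labels are assumptions for illustration; the paper's exact pitch analysis is not reproduced here.

```python
# Hedged sketch: pitch statistics per utterance -> binary SVM (rude vs. proper).
# The pyin-based pitch extraction and the chosen statistics are assumptions.
import numpy as np
import librosa
from sklearn.svm import SVC

def pitch_features(path):
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C7"), sr=sr)
    f0 = f0[~np.isnan(f0)]                     # keep voiced frames only
    if f0.size == 0:
        return np.zeros(3)
    return np.array([f0.mean(), f0.std(), f0.max() - f0.min()])

# Hypothetical file list; labels: 1 = rude/curse word, 0 = proper word.
train_files = [("rude_01.wav", 1), ("proper_01.wav", 0)]  # ...and so on
X = np.array([pitch_features(p) for p, _ in train_files])
y = np.array([label for _, label in train_files])

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.predict(pitch_features("unknown.wav").reshape(1, -1)))
```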
Building a Vocal Emotion Sensor with Deep Learning
Author: Alex Muhr

Problem Statement: Voice recognition software has advanced greatly in recent years. This
technology now does an excellent job of recognizing phonetic sounds and piecing these together
to reproduce spoken words and sentences. However, simply translating speech to text does not
fully encapsulate a speaker’s message. Facial expressions and body language aside, text is highly
limited in its capacity to capture emotional intent compared to audio.

Data
The datasets I used to build the emotion classifier were RAVDESS, TESS, and SAVEE, which
are all freely available to the public (SAVEE requires a very simple registration). These datasets
contain audio files across seven common categories: neutral, happy, sad, angry, fearful, disgusted,
and surprised. Combined, I had access to over 160 minutes of audio across 4,500 labeled audio
files produced by 30 actors and actresses. The files generally consist of the actor or actress
speaking a short, simple phrase with a specific emotional intent.
Takeaways
This blog post may make it seem as though building, training, and testing the model was simple
and straightforward. I can assure you that this was very much not the case. Before achieving 83%
accuracy, there were many versions of the model that performed quite poorly. In one iteration I
did not scale my inputs correctly which led to predicting nearly every file in the test set as
‘surprised’. So what did I learn from this experience?
First off, this project was a great demonstration of how simply collecting more data can greatly
improve results. The first successful iteration of the model used only the RAVDESS dataset,
about 1,400 audio files. The best accuracy achieved with this dataset alone was 67%. To get to
83% accuracy, all I did was increase the size of the dataset to 4,500 files.
Second, I learned that for audio classification, data preprocessing is critical. Raw audio, and even
short-time Fourier transforms, are almost completely useless as inputs. Failure to remove silence
is another simple pitfall. Once the audio has been properly transformed into informative features,
building and training a deep learning model is comparatively easy.
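A minimal sketch of that kind of preprocessing, assuming librosa: trim leading/trailing silence, compute MFCCs, and summarize them into a fixed-length feature vector. The exact features the post used are not specified here, so this is an illustrative pipeline, not a reproduction.

```python
# Hedged preprocessing sketch: silence trimming + MFCC summary features.
# The MFCC count, trim threshold, and summary statistics are assumptions.
import numpy as np
import librosa

def preprocess(path, sr=22050, n_mfcc=40):
    y, sr = librosa.load(path, sr=sr)
    y, _ = librosa.effects.trim(y, top_db=25)        # drop leading/trailing silence
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Summarize the time axis into a fixed-length vector for a dense network.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

features = preprocess("some_clip.wav")   # hypothetical file
print(features.shape)                    # (80,) with the defaults above
```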

Emotion Recognition Using Deep Learning Approach from Audio-Visual Emotional Big Data

Authors: M. Shamim Hossain and Ghulam Muhammad
Abstract
This paper proposes an emotion recognition system using a deep learning approach applied to
emotional Big Data. The Big Data comprises speech and video. In the proposed system, a
speech signal is first processed in the frequency domain to obtain a Mel-spectrogram, which can
be treated as an image. This Mel-spectrogram is then fed to a convolutional neural network
(CNN). For video signals, some representative frames from a video segment are extracted and
fed to the CNN. The outputs of the two CNNs are fused using two consecutive extreme learning
machines (ELMs). The output of the fusion is given to a support vector machine (SVM) for the
final classification of the emotions. The proposed system is evaluated using two audio-visual
emotional databases, one of which is Big Data. Experimental results confirm the effectiveness of
the proposed system involving the CNNs and the ELMs.
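To make the "Mel-spectrogram as an image" step concrete, here is a hedged sketch using librosa; the FFT size, hop length, and number of Mel bands are assumptions rather than the paper's settings.

```python
# Hedged sketch: turn a speech clip into a log-Mel-spectrogram "image" that a
# 2D CNN could consume. The parameters below are illustrative assumptions.
import numpy as np
import librosa

def mel_image(path, sr=16000, n_mels=128, n_fft=1024, hop_length=256):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)     # log scale, image-like
    # Normalize to [0, 1] so it behaves like a grayscale image for the CNN.
    return (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)

img = mel_image("utterance.wav")     # hypothetical file
print(img.shape)                     # (n_mels, time_frames)
```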
Problem Statement:
Though there are several previous works on audio-visual emotion recognition in the literature,
most of them suffer from low recognition accuracies. One of the main reasons is the way features
are extracted from these two signals and the fusion between them [14]. In most cases, some
handcrafted features are extracted, and the features from the two signals are combined using a
weight.

Paper Overview:
This paper proposes an audio-visual emotion recognition system that uses one deep network to
extract features and another deep network to fuse them. These two networks ensure a fine
non-linearity in fusing the features. The final classification is done using a support vector
machine (SVM). Deep learning is extensively used nowadays in different applications such as
image, speech, and video processing. The accuracies achieved with deep learning vary with the
structure of the deep model and the availability of huge amounts of data [15]. The contributions
of this paper are:
(i) the proposed system is trained using Big Data of emotion, and therefore the deep
networks are trained well;
(ii) the use of two layers of an extreme learning machine (ELM) during fusion, one for
gender separation and another for emotion classification, which increases the accuracy
of the system;
(iii) the use of a two-dimensional convolutional neural network (CNN) for audio signals and
a three-dimensional CNN for video signals, together with a sophisticated technique for
selecting key frames; and
(iv) the use of the local binary pattern (LBP) image and the interlaced derivative pattern
(IDP) image together with the gray-scale image of the key frames in the three-dimensional
CNN, so that different informative patterns of the key frames are given to the CNN for
feature extraction.
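A heavily hedged sketch of the two-stream idea in PyTorch: a 2D CNN over the Mel-spectrogram "image" and a 3D CNN over a stack of key frames, with the concatenated features passed to a simple linear head standing in for the paper's ELM fusion and SVM stage. All layer sizes and input shapes are assumptions, not the paper's architecture.

```python
# Hedged two-stream sketch (not the paper's exact architecture): 2D CNN for the
# Mel-spectrogram, 3D CNN for stacked key frames, linear head as a stand-in for
# the ELM fusion + SVM stage. Shapes and channel counts are assumptions.
import torch
import torch.nn as nn

class TwoStreamEmotionNet(nn.Module):
    def __init__(self, num_emotions=6):
        super().__init__()
        self.audio_cnn = nn.Sequential(              # input: (B, 1, 128, 128)
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.video_cnn = nn.Sequential(              # input: (B, 3, 16, 64, 64)
            nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32 + 32, num_emotions)  # fusion stand-in

    def forward(self, mel, frames):
        a = self.audio_cnn(mel).flatten(1)            # (B, 32) audio features
        v = self.video_cnn(frames).flatten(1)         # (B, 32) video features
        return self.head(torch.cat([a, v], dim=1))

model = TwoStreamEmotionNet()
mel = torch.randn(2, 1, 128, 128)        # batch of spectrogram "images"
frames = torch.randn(2, 3, 16, 64, 64)   # batch of 16 RGB key frames each
print(model(mel, frames).shape)          # torch.Size([2, 6])
```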
Deep learning framework for subject independent emotion detection using wireless signals
Authors: Ahsan Noor Khan, Achintha Avin Ihalage, Yihan Ma, Baiyang Liu, Yujie Liu,
Yang Hao
Recognition of emotion states using wireless signals is an emerging area of research that has an
impact on neuroscientific studies of human behavior and well-being monitoring. Currently,
standoff emotion detection is mostly reliant on the analysis of facial expressions and/or eye
movements acquired from optical or video cameras. Meanwhile, although machine learning
approaches have been widely accepted for recognizing human emotions from multimodal data,
they have mostly been restricted to subject-dependent analyses, which lack generality.
In this paper, we report an experimental study which collects heartbeat and breathing signals of
15 participants from radio frequency (RF) reflections off the body followed by novel noise
filtering techniques. We propose a novel deep neural network (DNN) architecture based on the
fusion of raw RF data and the processed RF signal for classifying and visualizing various
emotion states. The proposed model achieves high classification accuracy of 71.67% for
independent subjects with 0.71, 0.72 and 0.71 precision, recall and F1-score values respectively.
We have compared our results with those obtained from five different classical ML algorithms
and it is established that deep learning offers a superior performance even with limited amount of
raw RF and post-processed time-sequence data. The deep learning model has also been validated
by comparing our results with those from ECG signals. The results indicate that using wireless
signals for standoff emotion state detection is a better alternative to other technologies, offering
high accuracy and much wider applications in future studies of behavioral sciences.

Title: MACHINE LEARNING APPROACH FOR EMOTION RECOGNITION IN SPEECH
Authors: Martin Gjoreski, Hristijan Gjoreski
Abstract
This paper presents a machine learning approach to the automatic recognition of human emotions
from speech. The approach consists of three steps. First, numerical features are extracted from
the sound database using an audio feature extractor. Then, a feature selection method is used to
select the most relevant features. Finally, a machine learning model is trained to recognize seven
universal emotions: anger, fear, sadness, happiness, boredom, disgust, and neutral. A thorough
ML experimental analysis is performed for each step. The results showed that 300 (out of 1582)
features, as ranked by the gain ratio, are sufficient for achieving 86% accuracy when evaluated
with 10-fold cross-validation. SVM achieved the highest accuracy compared to KNN and
Naive Bayes. We additionally compared the accuracy of the standard SVM (with default
parameters) and the one enhanced by Auto-WEKA (optimized algorithm parameters) using the
leave-one-speaker-out technique. The results showed that the SVM enhanced with Auto-WEKA
achieved significantly better accuracy than the standard SVM, i.e., 77% versus 73%. Finally, the
results achieved with 10-fold cross-validation are comparable to those achieved by a human, i.e.,
86% accuracy in both cases. Moreover, low-energy emotions (boredom, sadness, and disgust) are
better recognized by our machine learning approach than by humans.

DESIGN APPROACH
 Emotional speech database
For this research the Berlin emotional speech database [12] is used, which is one of the most
widely used databases for speech emotion analysis. It consists of 535 audio files in which 10
actors (5 male and 5 female) pronounce 10 sentences (5 short and 5 long). The sentences are
chosen so that all 7 emotions being analyzed can be expressed.

 Feature Extraction
The feature extraction tool used in this research is openSMILE (Open Speech and Music
Interpretation by Large Space Extraction). It is a commonly used tool for signal processing and
feature extraction when an ML approach is applied to sound data. openSMILE provides
configuration files that can be used to extract predefined features. For this research the
'emobase2010' configuration file is used, which extracts a total of 1582 features [14]. openSMILE
computes low-level descriptors (LLDs) from basic speech features (pitch, loudness, voice
quality) or representations of the speech signal (cepstrum, linear predictive coding). Functionals
are then applied to these LLDs to compute static feature vectors, so static classifiers can be used.
The functionals applied are: extremes (position of the max/min value), statistical moments (first
to fourth), percentiles (e.g., the first quartile), duration (e.g., percentage of time the signal is
above a threshold), and regression (e.g., the offset of a linear approximation of the contour).
After feature extraction the feature vectors are standardized so that each feature has a mean of 0
and a standard deviation of 1. This puts all features on a comparable scale and prevents features
with larger values from having more influence when creating the ML model. This is an important
step in ML, especially for classification algorithms that do not have a mechanism for feature
standardization.
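A hedged sketch of this step using the opensmile Python package and scikit-learn; note that the package's built-in 'emobase' functional set is related to, but smaller than, the 'emobase2010' configuration mentioned above, so the feature count will differ, and the file names are hypothetical.

```python
# Hedged sketch: acoustic functionals via the opensmile Python package, then
# z-score standardization. 'emobase' here is smaller than 'emobase2010'.
import opensmile
from sklearn.preprocessing import StandardScaler

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Hypothetical file list; process_files returns one row of functionals per file.
files = ["wav/utterance_01.wav", "wav/utterance_02.wav"]
features = smile.process_files(files)          # pandas DataFrame

# Standardize each feature to mean 0 and standard deviation 1.
X = StandardScaler().fit_transform(features.values)
print(features.shape, X.mean(axis=0).round(3)[:3], X.std(axis=0).round(3)[:3])
```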
 Feature Selection
Feature selection is the process of selecting a subset of relevant features for use in model
construction. The central assumption when using a feature selection technique is that the data
contain many redundant or irrelevant features. Redundant features provide no more information
than the currently selected features, and irrelevant features provide no useful information in any
context. To deal with this issue, a feature selection method was used: features were ranked with a
feature-ranking algorithm, and experiments were performed with a varying number of top-ranked
features. For ranking the features, the well-known gain ratio algorithm is used. Gain ratio is the
ratio of the information gain to the entropy of the feature itself; it is used to avoid the
overestimation of multi-valued features (a drawback of information gain). The algorithm is used
as implemented in the Orange ML toolkit.
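As a small worked illustration of the gain ratio (not Orange's implementation), the sketch below discretizes a continuous feature into quantile bins and computes the information gain divided by the feature's own entropy; the binning choice and toy data are assumptions.

```python
# Hedged sketch of gain-ratio ranking: information gain of a (discretized)
# feature divided by the feature's own entropy. Binning is an assumption.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(feature, y, n_bins=10):
    bins = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    x = np.digitize(feature, bins)               # discretize the feature
    h_y = entropy(y)                             # H(Y)
    h_y_given_x = sum((x == v).mean() * entropy(y[x == v]) for v in np.unique(x))
    info_gain = h_y - h_y_given_x                # H(Y) - H(Y|X)
    split_info = entropy(x)                      # the feature's own entropy H(X)
    return info_gain / split_info if split_info > 0 else 0.0

# Toy example: one informative feature, one noise feature, 3 classes.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
informative = y + 0.3 * rng.standard_normal(300)
noise = rng.standard_normal(300)
print(gain_ratio(informative, y), gain_ratio(noise, y))  # informative ranks higher
```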
 Classification
Once the features are extracted, standardized, and selected, they are used to form the feature
vector database. Each data sample in the database is an instance, i.e., a feature vector, used for
classification. Because each instance is labeled with the appropriate emotion, supervised
classification algorithms are used. In our experiments three commonly used classification
algorithms were tested: K-Nearest Neighbors (KNN), Naive Bayes, and Support Vector Machine
(SVM). We performed thorough experiments with each of the classification models, and once we
selected the one with the highest recognition accuracy, we further enhanced its accuracy with
Auto-WEKA. Auto-WEKA is an ML tool for automated parameter optimization of classification
algorithms; it searches the huge space of algorithm parameters to find a well-performing
configuration.
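A minimal sketch of this classification step on standardized feature vectors, comparing KNN, Naive Bayes, and SVM with 10-fold cross-validation in scikit-learn; Auto-WEKA's parameter search is not reproduced, and synthetic data stands in for the Berlin-database features.

```python
# Hedged sketch: compare KNN, Naive Bayes, and SVM with 10-fold CV on
# standardized features. Synthetic data replaces the openSMILE feature vectors.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=535, n_features=300, n_informative=40,
                           n_classes=7, n_clusters_per_class=1, random_state=0)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(kernel="rbf", C=1.0, gamma="scale"),
}
for name, clf in classifiers.items():
    pipe = make_pipeline(StandardScaler(), clf)   # standardize, then classify
    acc = cross_val_score(pipe, X, y, cv=10).mean()
    print(f"{name}: mean 10-fold CV accuracy = {acc:.3f}")
```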
Audio-Textual Emotion Recognition Based on Improved Neural Networks
Authors: Linqin Cai, Yaxin Hu, Jiangong Dong, and Sitong Zhou

Problem Statement: With the rapid development of social media, single-modal emotion
recognition can hardly satisfy the demands of current emotion recognition systems. Aiming to
optimize the performance of the emotion recognition system, a multimodal emotion recognition
model from speech and text was proposed in this paper.

Speech Emotion Recognition using Deep Learning Techniques: A Review

Authors: Ruhul Amin Khalil, Edward Jones, Mohammad Inayatullah Babar, Tariqullah Jan,
Mohammad Haseeb Zafar, and Thamer Alhussain

Deep learning techniques have recently been proposed as an alternative to traditional techniques
in speech emotion recognition (SER). This paper presents an overview of deep learning
techniques and discusses some recent literature where these methods are utilized for speech-based
emotion recognition. The review covers the databases used, the emotions extracted, the
contributions made toward speech emotion recognition, and the related limitations.
