Lit Review
Methodology: In previous work with the IADS and EmoSoundscape datasets [62], the authors reported that Random Forest outperformed other models in arousal/valence (A/V) prediction using a 1D psychoacoustic feature set, while the other models mostly suffered from overfitting. This result is somewhat expected, because ensemble models combine the predictions of several base models and thereby reduce the risk of overfitting. They therefore chose Random Forest as one of the prediction models for these datasets in this paper. Random Forest (RF) is an ensemble method that averages the predictions of several decision trees. To compare the predictions of the ensemble model (RF) with deep models, they developed a multilayer perceptron model and a 1D convolutional neural network model. For all of the models, they held out 30% of the data as test data and applied 5-fold cross-validation (CV); training and testing errors were computed by averaging the RMSE values over these 5 folds.
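The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic regression data stands in for the 1D psychoacoustic feature set, and all hyperparameters are assumptions.

```python
# Sketch: Random Forest regression with a 70/30 train/test split and
# 5-fold CV, averaging RMSE over the folds (data is synthetic).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

X, y = make_regression(n_samples=300, n_features=20, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

fold_rmse = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_train):
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X_train[train_idx], y_train[train_idx])
    pred = rf.predict(X_train[val_idx])
    fold_rmse.append(mean_squared_error(y_train[val_idx], pred) ** 0.5)

print("mean CV RMSE:", np.mean(fold_rmse))
```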
C4.5 is a machine learning algorithm developed by Ross Quinlan that generates a decision tree. It is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier.
AdaBoost stands for Adaptive Boosting; it is a machine learning meta-algorithm formulated by Yoav Freund and Robert Schapire. It can be used in conjunction with many other types of learning algorithms to improve performance, as long as each base learner performs above random chance. AdaBoost is mainly used to boost the performance of decision trees on binary classification problems rather than regression. In our project, AdaBoost is used to boost the performance of C4.5.
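The boosting setup above can be sketched as follows. Note this is an analogous illustration, not the project's pipeline: scikit-learn's trees are CART rather than C4.5, and the dataset and parameters are invented.

```python
# Sketch: AdaBoost boosting a weak decision-tree base learner
# (a depth-1 "stump") on a synthetic binary classification task.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)        # weak learner
boosted = AdaBoostClassifier(stump, n_estimators=50, random_state=0)
boosted.fit(X_tr, y_tr)
print("boosted accuracy:", boosted.score(X_te, y_te))
```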
Recognition technology has been developed continuously over the years, and its applications in a wide variety of fields open up massive opportunities to bridge the gap between humans and computers. Although computers are designed to make everyday life easier, there is still an undeniable lack of deep understanding: computers have no knowledge of the complex emotions present in human beings, and this often prevents them from offering help tailored to the user. It is therefore important to develop today's technology further, and one promising way to accomplish this is to use speech recognition to recognize and classify emotions as well. In this way the computer understands the user well enough to give valuable aid instead of just preset actions. The Support Vector Machine is one of the leading classification algorithms today, and its consistently high accuracy makes it a strong candidate for this field of study.
THEORETICAL CONSIDERATION
The main component is the Support Vector Machine. SVM is a machine learning algorithm that uses structural risk minimization. SVM works by mapping an N-dimensional input into a higher-dimensional feature space using various kernel functions. The algorithm then tries to find the best possible generalization, separating the classes with a hyperplane. Several methods can be used to train a support vector machine; one promising approach is to combine k-nearest neighbors with a Gaussian kernel. A study from 2008 compared methods such as LS-SVM, FLS-SVM, and LS+k-NN-SVM with clustered KSVM and concluded that the latter had the best accuracy of all the standard methods. Factors such as the size of the dataset and the complexity of the hyperplane or hypersurface can increase the number of support vectors required. This is illustrated in Figure 2, which shows how the required number of support vectors changed and the performance with respect to that change.
Fig 2: Performance of SVM with respect to the number of support vectors
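The relationship between data complexity and the number of support vectors can be sketched as below. This is a hedged illustration on synthetic data, not a reproduction of the figure: the dataset sizes and separation values are assumptions.

```python
# Sketch: train an RBF-kernel SVM on well-separated vs. overlapping
# synthetic data and inspect how many support vectors each model keeps.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

sv_counts = {}
for class_sep in (2.0, 0.5):  # well-separated vs. overlapping classes
    X, y = make_classification(n_samples=300, n_features=5,
                               class_sep=class_sep, random_state=0)
    clf = SVC(kernel="rbf").fit(X, y)
    sv_counts[class_sep] = clf.support_vectors_.shape[0]
    print(f"class_sep={class_sep}: {sv_counts[class_sep]} support vectors")
```

Harder, more overlapping data generally forces the SVM to retain more support vectors, which echoes the trend the figure describes.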
The Automation System Censor Speech for the Indonesian Rude Swear Words Based on
Support Vector Machine and Pitch Analysis
S N Endah, D M K Nugraheni, S Adhy and Sutikno
Law No. 32 of 2002 and Indonesian Broadcasting Commission Regulations No. 02/P/KPI/12/2009 and No. 03/P/KPI/12/2009 state that broadcast programs must not scold with harsh words, or harass, insult, or demean minorities and marginalized groups. However, there are no suitable tools to censor those words automatically, so research is needed to develop intelligent software that censors them automatically. To perform the censoring, the system must be able to recognize the words in question. This research proposes classifying speech into two classes using a Support Vector Machine (SVM): the first class is the set of rude words and the second is the set of proper words. The pitch values of the speech are used as the input to the SVM in developing the system for Indonesian rude swear words. The results of the experiment show that SVM works well for this system.
Proposed system:
They propose an intelligent software system to automatically censor rude swear words in Indonesian speech using SVM. The input to the SVM is the pitch value of the speech. Experiments were conducted separately for male and female voices. Each voice category also distinguished between swear words consisting of a single word and those consisting of a phrase. Each experiment used different words as data, and the training data was kept separate from the testing data. The positive class consists of words or phrases categorized as rude or curse words, while the negative class consists of words categorized as proper.
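The pitch-to-SVM idea can be sketched as follows. This is a hypothetical illustration, not the paper's system: the pitch contours are synthetic, and the assumption that rude words here have higher, more variable pitch is invented purely so the toy classes are separable.

```python
# Sketch: fixed-length vectors of pitch values (Hz) per utterance are
# fed to an SVM that separates rude words (1) from proper words (0).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# 40 utterances x 20 pitch samples each; values are invented.
proper = rng.normal(140, 10, size=(40, 20))
rude = rng.normal(190, 25, size=(40, 20))
X = np.vstack([proper, rude])
y = np.array([0] * 40 + [1] * 40)   # 0 = proper, 1 = rude

clf = SVC(kernel="rbf").fit(X, y)
print("training accuracy:", clf.score(X, y))
```

A real system would replace the synthetic arrays with pitch values extracted from recorded speech.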
Building a Vocal Emotion Sensor with Deep Learning
Author: Alex Muhr
Problem Statement: Voice recognition software has advanced greatly in recent years. This
technology now does an excellent job of recognizing phonetic sounds and piecing these together
to reproduce spoken words and sentences. However, simply translating speech to text does not
fully encapsulate a speaker’s message. Facial expressions and body language aside, text is highly
limited in its capacity to capture emotional intent compared to audio.
Data
The datasets I used to build the emotion classifier were the RAVDESS, TESS, and SAVEE which
are all freely available to the public (SAVEE requires a very simple registration). These datasets
contain audio files across seven common categories: neutral, happy, sad, angry, fearful, disgusted,
and surprised. Combined, I had access to over 160 minutes of audio across 4,500 labeled audio
files produced by 30 actors and actresses. The files generally consist of the actor or actress
speaking a short simple phrase with a specific emotional intent.
Takeaways
This blog post may make it seem as though building, training, and testing the model was simple
and straightforward. I can assure you that this was very much not the case. Before achieving 83%
accuracy, there were many versions of the model that performed quite poorly. In one iteration I
did not scale my inputs correctly which led to predicting nearly every file in the test set as
‘surprised’. So what did I learn from this experience?
First off, this project was a great demonstration of how simply collecting more data can greatly
improve results. The first successful iteration of the model used only the RAVDESS dataset,
about 1400 audio files. The best accuracy achieved with this dataset alone was 67%. To get to
83% accuracy, all I did was increase the size of the dataset to 4,500 files.
Second, I learned that for audio classification data preprocessing is critical. Raw audio, and even
short-time Fourier transforms, are almost completely useless. Failure to remove silence is another
simple pitfall. Once audio has been properly transformed into informative features, building and
training a deep learning model is comparatively easy.
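Two of the pitfalls mentioned above, leaving silence in the audio and not scaling inputs, can be sketched with plain NumPy. This is an illustrative sketch, not the blog's pipeline: the frame length and energy threshold are assumptions, and the "speech" signal is synthetic.

```python
# Sketch: drop near-silent frames by energy threshold, then standardize
# feature columns so no single feature dominates the model's inputs.
import numpy as np

def drop_silence(signal, frame_len=512, threshold=1e-3):
    """Keep only frames whose mean energy exceeds the threshold."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return frames[energy > threshold].ravel()

def standardize(features):
    """Scale each feature column to mean 0, standard deviation 1."""
    return (features - features.mean(axis=0)) / features.std(axis=0)

rng = np.random.default_rng(0)
speech = np.concatenate([rng.normal(0, 0.1, 4096),   # "voiced" segment
                         np.zeros(2048)])            # trailing silence
trimmed = drop_silence(speech)
print(len(speech), "->", len(trimmed), "samples after silence removal")
```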
Paper Overview:
This paper proposes an audio-visual emotion recognition system that uses one deep network to extract features and another deep network to fuse them. These two networks provide a fine-grained non-linear fusion of the features, and the final classification is done with a support vector machine (SVM). Deep learning is now used extensively in applications such as image processing, speech processing, and video processing; the accuracies achieved vary with the structure of the deep model and the availability of large amounts of data [15]. The contributions of this paper are
(i) the proposed system is trained on big emotional data, so the deep networks are trained well;
(ii) the use of two layers of an extreme learning machine (ELM) during fusion, one for gender separation and another for emotion classification, which increases the accuracy of the system;
(iii) the use of a two-dimensional convolutional neural network (CNN) for audio signals and a three-dimensional CNN for video signals, together with a sophisticated technique for selecting key frames; and
(iv) the use of the local binary pattern (LBP) image and the interlaced derivative pattern (IDP) image together with the gray-scale image of the key frames in the three-dimensional CNN, so that different informative patterns of the key frames are given to the CNN for feature extraction.
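A shape-level sketch of the audio-visual branch structure may help. This assumes PyTorch is available, and every layer size, input shape, and channel count below is invented; the paper's actual architectures, the ELM fusion layers, and the SVM classifier are only indicated in comments.

```python
# Sketch: a 2D CNN extracts audio features from a spectrogram and a
# 3D CNN extracts video features from a stack of gray-scale key frames;
# the two feature vectors are concatenated for downstream fusion
# (ELM layers and an SVM in the paper, omitted here).
import torch
import torch.nn as nn

audio_net = nn.Sequential(nn.Conv2d(1, 8, 3), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten())
video_net = nn.Sequential(nn.Conv3d(1, 8, 3), nn.ReLU(),
                          nn.AdaptiveAvgPool3d(1), nn.Flatten())

spectrogram = torch.randn(2, 1, 64, 64)   # batch of 2 audio spectrograms
frames = torch.randn(2, 1, 8, 32, 32)     # 8 key frames per sample
fused = torch.cat([audio_net(spectrogram), video_net(frames)], dim=1)
print(fused.shape)  # one fused feature vector per sample
```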
Deep learning framework for subject independent emotion detection using wireless signals
Authors: Ahsan Noor Khan, Achintha Avin Ihalage, Yihan Ma, Baiyang Liu, Yujie Liu,
Yang Hao
Emotion state recognition using wireless signals is an emerging area of research with an impact on neuroscientific studies of human behavior and on well-being monitoring. Currently, standoff emotion detection relies mostly on the analysis of facial expressions and/or eye movements acquired from optical or video cameras. Meanwhile, although machine learning approaches have been widely accepted for recognizing human emotions from multimodal data, they have mostly been restricted to subject-dependent analyses, which lack generality. In this paper, we report an experimental study that collects heartbeat and breathing signals of 15 participants from radio-frequency (RF) reflections off the body, followed by novel noise filtering techniques. We propose a novel deep neural network (DNN) architecture based on the fusion of raw RF data and the processed RF signal for classifying and visualizing various emotion states. The proposed model achieves a high classification accuracy of 71.67% for independent subjects, with precision, recall, and F1-score values of 0.71, 0.72, and 0.71, respectively. We compared our results with those obtained from five different classical ML algorithms and established that deep learning offers superior performance even with a limited amount of raw RF and post-processed time-sequence data. The deep learning model was also validated by comparing our results with those from ECG signals. The results indicate that using wireless signals for standoff emotion state detection is a better alternative to other technologies, with high accuracy and much wider applications in future studies of the behavioral sciences.
DESIGN APPROACH
Emotional speech database
For this research the Berlin emotional speech database [12] is used, one of the most widely exploited databases for speech emotion analysis. It consists of 535 audio files in which 10 actors (5 male and 5 female) pronounce 10 sentences (5 short and 5 long). The sentences are chosen so that all 7 emotions under analysis can be expressed.
Feature Extraction
The feature extraction tool used in this research is openSMILE (Speech and Music Interpretation by Large-space Extraction), a commonly used tool for signal processing and feature extraction when ML approaches are applied to sound data. openSMILE provides configuration files that can be used for extracting predefined features; this research uses the 'emobase2010' configuration file, which extracts a total of 1582 features [14]. openSMILE computes low-level descriptors (LLDs) from basic speech features (pitch, loudness, voice quality) or from representations of the speech signal (cepstrum, linear predictive coding). Functionals are then applied to these LLDs to compute static feature vectors, so static classifiers can be used. The functionals applied are: extremes (position of the max/min value), statistical moments (first to fourth), percentiles (e.g. the first quartile), duration (e.g. the percentage of time the signal is above a threshold), and regression (e.g. the offset of a linear approximation of the contour). After feature extraction, the feature vectors are standardized so that each feature has mean 0 and standard deviation 1. This puts all features on a comparable scale and prevents features with larger values from having more influence when building the ML model. This is an important step in ML, especially for classification algorithms that have no built-in mechanism for feature standardization.
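The functionals-over-LLDs idea can be sketched in a few lines. This is a toy illustration, not openSMILE's implementation: the "pitch contour" is synthetic and the small set of functionals below merely samples the categories listed above.

```python
# Sketch: reduce a variable-length LLD contour (a fake pitch track) to
# a fixed-length static feature vector by applying functionals.
import numpy as np

lld = np.abs(np.random.default_rng(0).normal(150, 20, 100))  # fake pitch (Hz)

features = {
    "argmax_pos": np.argmax(lld) / len(lld),        # extreme: position of max
    "mean": lld.mean(),                             # 1st statistical moment
    "std": lld.std(),                               # spread of the contour
    "q1": np.percentile(lld, 25),                   # percentile: first quartile
    "above_thresh": (lld > 150).mean(),             # duration above a threshold
    "slope": np.polyfit(np.arange(len(lld)), lld, 1)[0],  # linear regression
}
print(len(features), "static features from", len(lld), "LLD samples")
```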
Feature Selection
Feature selection is the process of selecting a subset of relevant features for use in model
construction. The central assumption when using a feature selection technique is that the data
contains many redundant or irrelevant features. Redundant features are those which provide no
more information than the currently selected features, and irrelevant features provide no useful
information in any context. To deal with this issue, we used a method for feature selection. Features were ranked with a feature-ranking algorithm, and experiments were performed with a varying number of top-ranked features. The well-known gain ratio algorithm is used for ranking the features. Gain ratio is the ratio of the information gain to the entropy of a feature; it is used to avoid overestimating multi-valued features (the drawback of information gain). The algorithm is used as implemented in the Orange ML toolkit.
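The gain-ratio definition above can be made concrete. This is a from-scratch sketch on an invented toy dataset, not the Orange toolkit's implementation.

```python
# Sketch: gain ratio = information gain / entropy of the feature itself,
# computed for one discrete feature on a tiny labeled dataset.
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(feature, labels):
    h_before = entropy(labels)
    vals, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    h_after = sum(w * entropy(labels[feature == v])
                  for v, w in zip(vals, weights))
    info_gain = h_before - h_after          # reduction in label entropy
    split_info = entropy(feature)           # entropy of the feature itself
    return info_gain / split_info

feature = np.array(["a", "a", "b", "b", "b", "b"])
labels = np.array([0, 0, 1, 1, 1, 1])
print(gain_ratio(feature, labels))  # feature perfectly predicts the label
```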
Classification
Once the features are extracted, standardized, and selected, they are used to form the feature-vector database. Each sample in the database is an instance, i.e., a feature vector, used for classification. Because each instance is labeled with the appropriate emotion, supervised classification algorithms are used. In our experiments, three commonly used classification algorithms were tested: K-Nearest Neighbors (KNN), Naïve Bayes, and Support Vector Machine (SVM). We performed thorough experiments with each of the classification models, and once we had selected the one with the highest recognition accuracy, we further enhanced its accuracy with Auto-WEKA. Auto-WEKA is an ML tool for parameter optimization of classification algorithms: it searches the huge space of algorithm parameters to find a well-performing configuration.
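The model-selection step described above can be sketched in scikit-learn. This is a minimal stand-in, not the actual experiments: the feature vectors are synthetic and the default hyperparameters are assumptions.

```python
# Sketch: compare KNN, Naive Bayes and SVM with 5-fold CV on the same
# (synthetic) feature vectors, then pick the highest-accuracy model.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)
models = {"KNN": KNeighborsClassifier(), "NB": GaussianNB(), "SVM": SVC()}
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```

The winning model would then go on to hyperparameter tuning (Auto-WEKA in the paper).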
Audio-Textual Emotion Recognition Based on Improved Neural Networks
Authors: Linqin Cai , Yaxin Hu , Jiangong Dong , and Sitong Zhou
Problem Statement: With the rapid development of social media, single-modal emotion recognition can no longer satisfy the demands of current emotion recognition systems. Aiming to optimize the performance of the emotion recognition system, this paper proposes a multimodal emotion recognition model that combines speech and text.
Deep Learning techniques have been recently proposed as an alternative to traditional techniques
in SER. This paper presents an overview of Deep Learning techniques and discusses some recent
literature where these methods are utilized for speech-based emotion recognition. The review
covers databases used, emotions extracted, contributions made toward speech emotion
recognition and limitations related to it.